Sr. Site Reliability Engineer

Jobgether · US

Company

Jobgether

Location

US

Type

Full Time

Job Description

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer in the United States.

This role provides a high-impact opportunity to ensure the stability, scalability, and reliability of critical cloud services across a large-scale production environment. You will combine hands-on technical expertise with strategic ownership, driving automation, monitoring, and incident response to deliver consistently high-performing systems. Working closely with engineering, product, and operations teams, you will influence system design, embed reliability practices, and lead cross-functional initiatives that reduce operational toil. The ideal candidate thrives in a collaborative, fast-paced environment, enjoys solving complex problems, and has deep experience with modern cloud infrastructure, automation, and distributed systems.

Accountabilities:

  • Own and drive the availability, durability, and performance of key services across all production environments
  • Lead complex technical projects from discovery to resolution, demonstrating high-level ownership
  • Define, implement, and enforce service health standards, including SLIs, SLOs, and error budget policies
  • Lead incident response and post-incident reviews, and implement long-term reliability improvements and architectural enhancements
  • Mentor team members and act as a subject matter expert in ITIL/OSS processes, including incident, change, problem, and capacity management
  • Architect and deploy scalable automation solutions to reduce manual tasks and improve operational efficiency
  • Maintain and improve monitoring, logging, and alerting frameworks, as well as CI/CD pipelines, using tools like Prometheus, Grafana, ELK, Terraform, Ansible, and Jenkins
  • Collaborate with engineering, product, and operations teams on resilient system design, capacity planning, disaster recovery, and vendor management
  • Develop and maintain operational playbooks, runbooks, and documentation to promote a reliability-first culture

Requirements:

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
  • 8+ years of progressive experience in site reliability, systems engineering, or operations
  • Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems
  • Expert-level Linux administration and advanced troubleshooting skills
  • Proficiency in at least one modern scripting/programming language (Python or Go strongly preferred)
  • Experience with container orchestration platforms (Kubernetes, Docker) and microservices architecture
  • Expertise with infrastructure-as-code and HashiCorp tools (Terraform, Vault, Nomad)
  • Strong understanding of incident response, root cause analysis, and operational best practices
  • Knowledge of ITIL/OSS practices, SLIs/SLOs, and cloud platforms (AWS, GCP, Azure)
  • Excellent problem-solving, collaboration, and communication skills, with a proactive approach to operational improvements

Benefits:

  • Competitive salary range of $150,000 – $200,000, plus RSU grants and an ESPP
  • Comprehensive healthcare coverage, including dental and vision
  • Flexible vacation policy, maternity/paternity leave, and childcare bonuses
  • MacBook Pro and generous stipend to personalize your workstation
  • Fertility treatment support and learning & development programs
  • Commuter benefits and a culture supporting a healthy work-life balance
  • Opportunities to work in a diverse, inclusive, and globally distributed team

Date Posted

03/25/2026
