Sr. Site Reliability Engineer at Jobgether

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer in the United States.

This role provides a high-impact opportunity to ensure the stability, scalability, and reliability of critical cloud services across a large-scale production environment. You will combine hands-on technical expertise with strategic ownership, driving automation, monitoring, and incident response to deliver consistently high-performing systems. Working closely with engineering, product, and operations teams, you will influence system design, embed reliability practices, and lead cross-functional initiatives that reduce operational toil. The ideal candidate thrives in a collaborative, fast-paced environment, enjoys solving complex problems, and has deep experience with modern cloud infrastructure, automation, and distributed systems.

Accountabilities:

Own and drive the availability, durability, and performance of key services across all production environments
Lead complex technical projects from discovery to resolution, demonstrating high-level ownership
Define, implement, and enforce service health standards, including SLIs, SLOs, and error budget policies
Lead incident response and post-incident reviews, and implement long-term reliability improvements and architectural enhancements
Mentor team members and act as a subject matter expert in ITIL/OSS processes, including incident, change, problem, and capacity management
Architect and deploy scalable automation solutions to reduce manual tasks and improve operational efficiency
Maintain and improve monitoring, logging, and alerting frameworks, as well as CI/CD pipelines, using tools such as Prometheus, Grafana, ELK, Terraform, Ansible, and Jenkins
Collaborate with engineering, product, and operations teams on resilient system design, capacity planning, disaster recovery, and vendor management
Develop and maintain operational playbooks, runbooks, and documentation to promote a reliability-first culture

Requirements:

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
8+ years of progressive experience in site reliability, systems engineering, or operations
Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems
Expert-level Linux administration and advanced troubleshooting skills
Proficiency in at least one modern scripting/programming language (Python or Go strongly preferred)
Experience with containers and container orchestration (Docker, Kubernetes) and microservices architecture
Expertise with infrastructure-as-code and HashiCorp tools (Terraform, Vault, Nomad)
Strong understanding of incident response, root cause analysis, and operational best practices
Knowledge of ITIL/OSS practices, SLIs/SLOs, and cloud platforms (AWS, GCP, Azure)
Excellent problem-solving, collaboration, and communication skills, with a proactive approach to operational improvements

Benefits:

Competitive salary range of $150,000 – $200,000, plus RSU grants and an ESPP
Comprehensive healthcare coverage, including dental and vision
Flexible vacation policy, maternity/paternity leave, and childcare bonuses
MacBook Pro and generous stipend to personalize your workstation
Fertility treatment support and learning & development programs
Commuter benefits and a culture supporting a healthy work-life balance
Opportunities to work in a diverse, inclusive, and globally distributed team
