Job Description
Team: IT
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer in the United States.
This role provides a high-impact opportunity to ensure the stability, scalability, and reliability of critical cloud services across a large-scale production environment. You will combine hands-on technical expertise with strategic ownership, driving automation, monitoring, and incident response to deliver consistently high-performing systems. Working closely with engineering, product, and operations teams, you will influence system design, embed reliability practices, and lead cross-functional initiatives that reduce operational toil. The ideal candidate thrives in a collaborative, fast-paced environment, enjoys solving complex problems, and has deep experience with modern cloud infrastructure, automation, and distributed systems.
Accountabilities:
- Own and drive the availability, durability, and performance of key services across all production environments
- Lead complex technical projects from discovery to resolution, demonstrating high-level ownership
- Define, implement, and enforce service health standards, including SLIs, SLOs, and error budget policies
- Lead incident response, post-incident reviews, and implement long-term reliability improvements and architectural enhancements
- Mentor team members and act as a subject matter expert in ITIL/OSS processes, including incident, change, problem, and capacity management
- Architect and deploy scalable automation solutions to reduce manual tasks and improve operational efficiency
- Maintain and improve monitoring, logging, alerting frameworks, and CI/CD pipelines using tools like Prometheus, Grafana, ELK, Terraform, Ansible, and Jenkins
- Collaborate with engineering, product, and operations teams on resilient system design, capacity planning, disaster recovery, and vendor management
- Develop and maintain operational playbooks, runbooks, and documentation to promote a reliability-first culture
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
- 8+ years of progressive experience in site reliability, systems engineering, or operations
- Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems
- Expert-level Linux administration and advanced troubleshooting skills
- Proficiency in at least one modern scripting/programming language (Python or Go strongly preferred)
- Experience with container orchestration platforms (Kubernetes, Docker) and microservices architecture
- Expertise with infrastructure-as-code and HashiCorp tools (Terraform, Vault, Nomad)
- Strong understanding of incident response, root cause analysis, and operational best practices
- Knowledge of ITIL/OSS practices, SLIs/SLOs, and cloud platforms (AWS, GCP, Azure)
- Excellent problem-solving, collaboration, and communication skills, with a proactive approach to operational improvements
Benefits:
- Competitive salary range of $150,000 – $200,000, plus RSU grants and ESPP program
- Comprehensive healthcare coverage, including dental and vision
- Flexible vacation policy, maternity/paternity leave, and childcare bonuses
- MacBook Pro and generous stipend to personalize your workstation
- Fertility treatment support and learning & development programs
- Commuter benefits and a culture supporting a healthy work-life balance
- Opportunities to work in a diverse, inclusive, and globally distributed team
Date Posted
03/25/2026