Staff Site Reliability Engineer at Jobgether

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Site Reliability Engineer in India.

In this high-impact role, you will help define and scale reliability standards for mission-critical SaaS platforms used by global enterprise customers. You will work at the intersection of engineering, infrastructure, and operations, ensuring systems remain highly available, performant, and resilient at scale. The role goes beyond traditional operations, focusing on building automated, self-healing systems and reliability platforms. You will lead incident response for critical outages while driving long-term architectural improvements. Acting as a technical force multiplier, you will influence engineering practices across teams and mentor engineers on SRE excellence. This is a highly collaborative environment where reliability, observability, and automation are core engineering principles. You will play a central role in enabling 99.99% uptime objectives in a fast-scaling global organization.

Accountabilities

Lead the design and implementation of reliability frameworks and self-service platforms that enable engineering teams to own the stability of their services, including “You Build It, You Run It” models.
Act as Incident Commander during high-severity incidents, coordinating cross-functional response efforts, ensuring rapid resolution, and driving effective blameless postmortems.
Architect and enhance observability solutions using modern tooling such as Prometheus, Grafana, and OpenTelemetry to improve detection and performance insights.
Drive automation and AIOps initiatives to enable proactive detection, diagnostics, and remediation of system failures.
Establish reliability engineering best practices across teams through production readiness reviews, design collaboration, and operational standards.
Mentor engineers across SRE and product teams, strengthening technical capabilities and promoting a culture of operational excellence.
Continuously improve system scalability, resilience, and performance across distributed cloud environments.

Requirements

The ideal candidate brings extensive experience in Site Reliability Engineering or DevOps within high-scale SaaS environments. You should have strong programming or scripting skills in a language such as Python or Go, and hands-on experience with cloud infrastructure such as Amazon Web Services, Google Cloud Platform, or Microsoft Azure. A deep understanding of distributed systems, networking fundamentals (TCP/IP, DNS, HTTP/S), and Kubernetes is essential. You should have proven experience managing production incidents, leading postmortems, and improving system reliability. Strong observability experience with logging, monitoring, and tracing tools is expected. Excellent communication skills, problem-solving ability, and calm decision-making under pressure are critical. Prior experience mentoring engineers and influencing architecture decisions in complex environments is highly valued.

Benefits

Competitive compensation aligned with senior-level impact and market benchmarks
Fully remote-friendly and flexible work arrangements
Opportunity to work on large-scale, globally distributed systems
Strong focus on learning, mentorship, and technical growth
Exposure to cutting-edge SRE, observability, and automation practices
Inclusive and collaborative engineering culture
Health, wellness, and employee support programs (vary by location)
