Staff Site Reliability Engineer

Jobgether · India

Company: Jobgether

Location: India

Type: Full Time

Job Description

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Site Reliability Engineer in India.

In this high-impact role, you will help define and scale reliability standards for mission-critical SaaS platforms used by global enterprise customers. You will work at the intersection of engineering, infrastructure, and operations, ensuring systems remain highly available, performant, and resilient at scale. The role goes beyond traditional operations, focusing on building automated, self-healing systems and reliability platforms. You will lead incident response for critical outages while driving long-term architectural improvements. Acting as a technical force multiplier, you will influence engineering practices across teams and mentor engineers on SRE excellence. This is a highly collaborative environment where reliability, observability, and automation are core engineering principles. You will play a central role in enabling 99.99% uptime objectives in a fast-scaling global organization.

Accountabilities:

  • Lead the design and implementation of reliability frameworks and self-service platforms that enable engineering teams to own the stability of their services, including “You Build It, You Run It” models.
  • Act as Incident Commander during high-severity incidents, coordinating cross-functional response efforts, ensuring rapid resolution, and driving effective blameless postmortems.
  • Architect and enhance observability solutions using modern tooling such as Prometheus, Grafana, and OpenTelemetry to improve detection and performance insights.
  • Drive automation and AIOps initiatives to enable proactive detection, diagnostics, and remediation of system failures.
  • Establish reliability engineering best practices across teams through production readiness reviews, design collaboration, and operational standards.
  • Mentor engineers across SRE and product teams, strengthening technical capabilities and promoting a culture of operational excellence.
  • Continuously improve system scalability, resilience, and performance across distributed cloud environments.

Requirements:

The ideal candidate brings extensive experience in Site Reliability Engineering or DevOps within high-scale SaaS environments. You should have strong programming or scripting skills in languages such as Python, Go, or similar, and hands-on experience with cloud infrastructure such as Amazon Web Services, Google Cloud Platform, or Microsoft Azure. Deep understanding of distributed systems, networking fundamentals (TCP/IP, DNS, HTTP/S), and Kubernetes is essential. You should have proven experience managing production incidents, leading postmortems, and improving system reliability. Strong observability experience with logging, monitoring, and tracing tools is expected. Excellent communication skills, problem-solving ability, and calm decision-making under pressure are critical. Prior experience mentoring engineers and influencing architecture decisions in complex environments is highly valued.

Benefits:

  • Competitive compensation aligned with senior-level impact and market benchmarks
  • Fully remote-friendly and flexible work arrangements
  • Opportunity to work on large-scale, globally distributed systems
  • Strong focus on learning, mentorship, and technical growth
  • Exposure to cutting-edge SRE, observability, and automation practices
  • Inclusive and collaborative engineering culture
  • Health, wellness, and employee support programs (vary by location)

Date Posted: 05/18/2026