Job Description
Team: IT
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in India.
This role sits at the core of a rapidly scaling observability platform that powers how engineering teams monitor, debug, and optimize complex distributed systems. You will be responsible for ensuring the reliability, scalability, and performance of a large-scale SaaS infrastructure that processes massive volumes of observability data. The environment is highly technical and deeply hands-on, requiring strong instincts for diagnosing production issues and preventing them at scale. You will work closely with platform, data, and product engineering teams to maintain and evolve a petabyte-scale system built on modern cloud-native technologies. The role involves owning uptime, performance, and operational excellence across Kubernetes-based infrastructure and high-throughput data pipelines. This is an opportunity to shape the backbone of a globally used open-source product trusted by thousands of engineering teams.
Accountabilities:
In this role, you will own the operational stability and scalability of a large distributed observability platform while continuously improving system performance, reliability, and automation. You will:
- Design, operate, and improve large-scale Kubernetes infrastructure including upgrades, scaling, networking, and multi-tenancy
- Ensure system reliability through strong SRE practices including SLOs, SLIs, error budgets, incident response, and on-call optimization
- Scale and maintain high-throughput ingestion pipelines handling petabyte-scale observability data
- Operate, tune, and optimize data systems such as ClickHouse for performance, cost efficiency, and reliability
- Build automation and tooling using infrastructure-as-code and CI/CD to improve deployment and operational efficiency
- Monitor, debug, and resolve complex production issues across distributed systems
- Improve observability of the platform itself using modern monitoring, logging, and tracing practices
Requirements:
This role requires strong experience in building and operating large-scale distributed systems with a deep focus on reliability and performance. You should bring:
- 5–8 years of experience in SRE, infrastructure, platform engineering, or backend systems roles
- Deep hands-on expertise with Kubernetes in production-scale environments
- Strong understanding of distributed systems, failure modes, performance tuning, and capacity planning
- Experience working with high-scale data systems (ClickHouse, Kafka, or similar) is highly desirable
- Proficiency in at least one programming language (Go strongly preferred) with a focus on automation and system reliability
- Familiarity with observability concepts and tools such as OpenTelemetry, metrics, logs, and traces
- Strong problem-solving skills with the ability to debug complex production issues
- Excellent communication skills with the ability to write clear documentation and runbooks
- Experience in fast-paced, high-ownership, remote-first environments
- Open-source contributions or strong engagement with OSS ecosystems is a plus
Benefits:
- Competitive salary package ranging from ₹50L to ₹1Cr annually
- Fully remote, India-based role with flexible, async-friendly working culture
- High ownership role with direct impact on a globally used open-source platform
- Opportunity to work on petabyte-scale distributed systems and cutting-edge observability infrastructure
- Strong engineering culture focused on shipping, reliability, and continuous improvement
- Exposure to modern cloud-native technologies including Kubernetes, ClickHouse, and OpenTelemetry
- Collaborative, high-caliber team environment with strong technical peers
- Opportunity to contribute to a fast-growing open-source ecosystem used by thousands of engineering teams
Date Posted
06/24/2026