Senior Site Reliability Engineer, Tenant Services: Geo
Job Description
Team: IT
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, Tenant Services: Geo in India.
This role is focused on ensuring the reliability, scalability, and operational excellence of large-scale distributed systems that support data replication and disaster recovery workflows for enterprise customers. You will join a high-impact SRE team responsible for executing and improving complex migration and cutover processes in a SaaS environment. The position blends deep infrastructure engineering with hands-on operational work, including incident response, automation, and observability. You will help ensure that critical customer data migrations are safe, repeatable, and increasingly low-risk over time. Working in a fully remote, global setup, you will collaborate closely with multiple engineering, support, and infrastructure teams. The environment is fast-paced, highly collaborative, and driven by strong engineering values and automation-first thinking.
Accountabilities:
- Execute end-to-end data migrations and cutovers, including planning, validation, execution, and post-cutover verification and cleanup activities.
- Participate in on-call rotations and shift coverage to handle incidents, ensure system availability, and support live migration events across global time zones.
- Operate and improve replication and migration systems, including data hygiene checks, validation workflows, and escalation handling.
- Design and maintain automation, tooling, and runbooks to reduce operational complexity and make processes repeatable and reliable.
- Build and enhance observability systems, including monitoring, alerting, dashboards, and SLO tracking for migrations and system health.
- Collaborate with multiple engineering and support teams to improve reliability, capacity planning, and disaster recovery processes.
- Contribute to incident response, post-incident reviews, and root cause analysis, ensuring learnings are converted into long-term improvements.
- Continuously reduce operational toil through automation and process optimization.
Requirements:
- Strong experience operating large-scale, highly available distributed systems in a SaaS or cloud environment.
- Hands-on experience with major cloud platforms, including networking, compute, storage, and managed services.
- Solid Kubernetes experience, including deployment, troubleshooting, and ecosystem tooling such as Helm.
- Proficiency with infrastructure as code and configuration tools such as Terraform, Ansible, or Chef.
- Strong programming ability in at least one language (preferably Go or Ruby) plus scripting skills in Python or Shell.
- Experience with observability stacks such as Prometheus, Grafana, and logging systems for troubleshooting and performance analysis.
- Exposure to data replication, backup/restore, or migration scenarios where data integrity and downtime risk are critical.
- Experience working in on-call environments and handling production incidents under pressure.
- Strong communication skills with the ability to engage customers during migrations and incidents.
- Ability to work independently in a remote, asynchronous environment with a strong ownership mindset.
- Clear problem-solving skills with a focus on long-term system improvements and not just short-term fixes.
Benefits:
- Flexible Paid Time Off
- Equity compensation and Employee Stock Purchase Plan
- Growth and Development Fund
- Parental leave
- Home office support
- Team Member Resource Groups
- Global remote-first working environment
- Inclusive and values-driven culture
Date Posted
04/10/2026