Job Description
Team: IT
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer II based in Brazil.
This role is focused on ensuring the reliability, scalability, and performance of large-scale production systems that power a global cloud storage platform. You will play a key part in maintaining service health, improving observability, and driving automation to reduce operational toil across critical infrastructure. The environment is highly collaborative: you will work closely with engineering, product, and operations teams to embed reliability practices throughout the software lifecycle. You will contribute directly to incident response, postmortems, and long-term system improvements that enhance customer experience at global scale. This is a hands-on engineering role in which you will work with modern cloud-native technologies, CI/CD pipelines, and infrastructure-as-code practices. It is an opportunity to help shape resilient systems used by hundreds of thousands of customers worldwide.
Accountabilities:
- Ensure the availability, reliability, and durability of critical production services across distributed environments.
- Monitor system health using SLIs, SLOs, and error budgets, proactively identifying risks to service performance.
- Participate in on-call rotations, incident response, root cause analysis, and post-incident reviews to drive continuous service improvement.
- Build automation to reduce operational toil and improve efficiency of recurring infrastructure and support tasks.
- Develop and maintain observability systems, including monitoring, logging, and alerting frameworks.
- Work with CI/CD pipelines, infrastructure-as-code tools, and configuration management systems to support reliable deployments.
- Write scripts and tooling (e.g., Python, Bash, Go) to improve system operations and reliability.
- Collaborate with engineering and operations teams to design and maintain resilient systems and improve operational maturity.
- Contribute to capacity planning, disaster recovery planning, and vendor/SLA management activities.
- Document systems, create runbooks, and help foster a reliability-first engineering culture.
Requirements:
- 2–4 years of experience in Site Reliability Engineering, systems engineering, DevOps, or production operations roles.
- Strong Linux systems administration and troubleshooting skills in production environments.
- Solid understanding of reliability engineering principles, including monitoring, alerting, incident response, and root cause analysis.
- Proficiency in at least one scripting language such as Python, Bash, or Go.
- Experience with containers and orchestration technologies such as Docker and Kubernetes.
- Familiarity with CI/CD tools and infrastructure automation (e.g., Terraform, Ansible, Jenkins).
- Understanding of distributed systems and microservices architectures.
- Experience working with cloud platforms such as AWS, GCP, or Azure is highly valued.
- Knowledge of ITIL/OSS practices, including incident, change, and problem management, is a plus.
- Strong problem-solving skills, ownership mindset, and ability to work independently in complex environments.
- Excellent collaboration and communication skills for cross-functional work with technical and non-technical teams.
Benefits:
- 💼 Full-time remote position available across Latin America
- 💵 Competitive compensation aligned with experience
- 📈 Opportunity to work on large-scale, globally distributed cloud infrastructure
- 🧠 Exposure to modern SRE practices, tools, and high-availability systems
- 🏠 Flexible work arrangements and remote-friendly culture
- 🌎 International, diverse, and collaborative engineering environment
- 🚀 Strong focus on learning, ownership, and career growth
- 🤝 Inclusive culture with emphasis on belonging, diversity, and equity
Date Posted: 06/24/2026