Senior DevOps / Platform Reliability Engineer at Jobgether

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior DevOps / Platform Reliability Engineer in the United States.

This role sits at the intersection of platform engineering, SRE, and AI-driven operations, supporting a next-generation intelligent automation platform used by enterprise-scale customers. You will be responsible for building and evolving the infrastructure backbone that powers production AI and multi-agent systems at scale. The environment is highly technical and fast-moving, requiring strong ownership of CI/CD, cloud infrastructure, observability, and security. You will work closely with engineering teams to ensure safe, reliable, and scalable deployments across complex distributed systems. A key aspect of the role involves integrating modern AI tools into DevOps workflows to reduce operational toil and improve system intelligence. This is a high-impact position where your work directly shapes platform reliability, developer velocity, and production safety.

Accountabilities:

Own and evolve CI/CD pipelines using modern tools such as GitHub Actions, ensuring safe, scalable, and reversible deployments for microservices and AI workloads
Design and manage Infrastructure as Code solutions using Terraform and CloudFormation to automate provisioning and environment consistency
Operate and scale Kubernetes-based infrastructure (EKS + Argo CD), including autoscaling, ingress, security controls, and multi-tenant isolation
Manage cloud networking and edge infrastructure including Cloudflare, AWS networking services, API gateways, load balancers, and DNS configurations
Oversee data and event infrastructure such as Aurora MySQL, Redis, S3, and Kafka (MSK), ensuring reliability, backups, and disaster recovery readiness
Build and maintain serverless and event-driven systems using AWS Lambda where appropriate
Develop observability platforms using Prometheus, Grafana, and OpenTelemetry, including telemetry for AI/LLM systems and agentic workflows
Strengthen security and compliance posture (SOC 2, HIPAA) through IAM design, secrets management, scanning, and policy-as-code enforcement
Drive FinOps initiatives including cost optimization, workload attribution, and LLM usage cost control
Partner with engineering teams to define deployment standards, operational SLOs, and platform best practices
Improve system reliability through monitoring, incident response, automation, and continuous infrastructure improvements
Document infrastructure, processes, and operational standards to enable scalability and knowledge sharing

Requirements:

5+ years of experience in DevOps, SRE, or Platform Engineering supporting production systems on AWS
Strong hands-on experience with CI/CD systems such as GitHub Actions, GitLab CI, Jenkins, or CircleCI
Deep experience operating Kubernetes environments (EKS preferred), including scaling, upgrades, and production operations
Strong AWS networking knowledge including VPC design, routing, security groups, load balancing, and DNS management
Proficiency with Terraform and Infrastructure as Code practices, ideally using OIDC-based authentication
Experience with production databases and storage systems including Aurora/RDS MySQL, Redis, and S3
Strong observability expertise using Prometheus, Grafana, and OpenTelemetry
Experience with Argo CD for GitOps-based deployments
Strong understanding of Cloudflare and AWS edge/networking services
Experience with Kafka/MSK and event-driven architectures
Strong scripting skills in Python and Bash, with fluency in Linux environments
Solid understanding of security practices including IAM, KMS, secrets management, and supply chain security
Experience with compliance and vulnerability scanning tools
Ability to work independently while collaborating effectively in high-ownership engineering teams

Benefits:

Competitive compensation package
100% employer-covered employee health premiums
75%–80% coverage for dependent health, dental, and vision plans
401(k) retirement plan
Paid parental leave
Unlimited PTO policy
Fully remote work flexibility across the United States
Up to $200/month co-working space reimbursement
Home office setup stipend of up to $500
Monthly $100 stipend for internet, phone, and related expenses
Opportunity to work on cutting-edge AI-native infrastructure and agentic systems
High-autonomy engineering culture focused on ownership and innovation
