Senior DevOps / Platform Reliability Engineer

Company

Jobgether

Location

US

Type

Full Time

Job Description

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior DevOps / Platform Reliability Engineer in the United States.

This role sits at the intersection of platform engineering, SRE, and AI-driven operations, supporting a next-generation intelligent automation platform used by enterprise-scale customers. You will build and evolve the infrastructure backbone that powers production AI and multi-agent systems at scale. The environment is highly technical and fast-moving, and it requires strong ownership of CI/CD, cloud infrastructure, observability, and security. You will work closely with engineering teams to ensure safe, reliable, and scalable deployments across complex distributed systems. A key part of the role is integrating modern AI tools into DevOps workflows to reduce operational toil and improve system intelligence. This is a high-impact position where your work directly shapes platform reliability, developer velocity, and production safety.

Accountabilities:

  • Own and evolve CI/CD pipelines using modern tools such as GitHub Actions, ensuring safe, scalable, and reversible deployments for microservices and AI workloads
  • Design and manage Infrastructure as Code solutions using Terraform and CloudFormation to automate provisioning and environment consistency
  • Operate and scale Kubernetes-based infrastructure (EKS + Argo CD), including autoscaling, ingress, security controls, and multi-tenant isolation
  • Manage cloud networking and edge infrastructure including Cloudflare, AWS networking services, API gateways, load balancers, and DNS configurations
  • Oversee data and event infrastructure such as Aurora MySQL, Redis, S3, and Kafka (MSK), ensuring reliability, backups, and disaster recovery readiness
  • Build and maintain serverless and event-driven systems using AWS Lambda where appropriate
  • Develop observability platforms using Prometheus, Grafana, and OpenTelemetry, including telemetry for AI/LLM systems and agentic workflows
  • Strengthen security and compliance posture (SOC 2, HIPAA) through IAM design, secrets management, scanning, and policy-as-code enforcement
  • Drive FinOps initiatives including cost optimization, workload attribution, and LLM usage cost control
  • Partner with engineering teams to define deployment standards, operational SLOs, and platform best practices
  • Improve system reliability through monitoring, incident response, automation, and continuous infrastructure improvements
  • Document infrastructure, processes, and operational standards to enable scalability and knowledge sharing

Requirements:

  • 5+ years of experience in DevOps, SRE, or Platform Engineering supporting production systems on AWS
  • Strong hands-on experience with CI/CD systems such as GitHub Actions, GitLab CI, Jenkins, or CircleCI
  • Deep experience operating Kubernetes environments (EKS preferred), including scaling, upgrades, and production operations
  • Strong AWS networking knowledge including VPC design, routing, security groups, load balancing, and DNS management
  • Proficiency with Terraform and Infrastructure as Code practices, ideally using OIDC-based authentication
  • Experience with production databases and storage systems including Aurora/RDS MySQL, Redis, and S3
  • Strong observability expertise using Prometheus, Grafana, and OpenTelemetry
  • Experience with Argo CD for GitOps-based deployments
  • Strong understanding of Cloudflare and AWS edge/networking services
  • Experience with Kafka/MSK and event-driven architectures
  • Strong scripting skills in Python and Bash, and comfort working in Linux environments
  • Solid understanding of security practices including IAM, KMS, secrets management, and supply chain security
  • Experience with compliance and vulnerability scanning tools
  • Ability to work independently while collaborating effectively in high-ownership engineering teams

Benefits:

  • Competitive compensation package
  • 100% employer-covered employee health premiums
  • 75%–80% coverage for dependent health, dental, and vision plans
  • 401(k) retirement plan
  • Paid parental leave
  • Unlimited PTO policy
  • Fully remote work flexibility across the United States
  • Up to $200/month co-working space reimbursement
  • Home office stipend up to $500 for setup
  • Monthly $100 stipend for internet, phone, and related expenses
  • Opportunity to work on cutting-edge AI-native infrastructure and agentic systems
  • High-autonomy engineering culture focused on ownership and innovation

Date Posted

05/12/2026

© 2026 Job Transparency. All rights reserved.