Principal Site Reliability Engineer at Jobgether

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Principal Site Reliability Engineer in Canada.

This role offers the opportunity to own and enhance the reliability, scalability, and security of a complex cloud infrastructure supporting mission-critical workloads. You will work hands-on across multi-region AWS/EKS environments, partnering with engineering leads, ML and simulation teams, and customer-facing teams to drive operational excellence. This position requires deep technical expertise, strong problem-solving skills, and the ability to take end-to-end ownership of large-scale infrastructure projects. You will lead incident response, implement automated remediation, and guide cloud architecture decisions while optimizing performance, security, and cost. The role is ideal for someone who thrives in a fast-paced, high-autonomy environment and enjoys shaping infrastructure strategies that directly impact customer success.

Accountabilities:

Own and evolve cloud infrastructure, ensuring high availability, reliability, and scalability across AWS and EKS environments
Manage Kubernetes clusters, including node pool strategy, AMI lifecycle, autoscaling, and workload health monitoring
Lead incident response and root cause analysis, and implement systemic fixes to reduce MTTR
Oversee cloud security and access management, including IAM governance and compliance readiness
Collaborate with cross-functional teams to drive infrastructure design, cost optimization, and next-generation deployment strategies
Support CI/CD pipelines, GitOps workflows, and developer experience to enable efficient deployment and troubleshooting
Manage networking, including VPC design, DNS, load balancing, and cross-region connectivity, to support enterprise workloads
Provide guidance on infrastructure automation, observability, and monitoring systems

Requirements:

5+ years in SRE, DevOps, or infrastructure engineering roles
Strong AWS experience (EKS, EC2, IAM, S3, VPC, CloudFront, KMS) and Kubernetes expertise (cluster operations, node pools, RBAC, Helm, autoscaling)
Infrastructure-as-code proficiency, preferably with Terraform, including state management and multi-environment patterns
Experience with GitOps and CI/CD pipelines (ArgoCD, GitHub Actions, Jenkins), and with monitoring and observability tools (Prometheus, Grafana, Elasticsearch)
Solid networking fundamentals, including CIDR design, security groups, DNS, VPN, load balancing, and cross-region connectivity
Proficiency in Python and Bash for automation and tooling; familiarity with Linux and Windows environments
Strong ownership, problem-solving, and critical thinking skills; ability to prioritize and execute in a high-impact environment
Excellent communication skills to collaborate with engineering, product, and customer teams

Benefits:

Competitive salary and total compensation package
Flexible remote work within Canada
Health and wellness support, including medical and dental coverage
Generous paid time off and parental leave programs
Professional growth opportunities through mentorship, learning, and challenging projects
Exposure to cutting-edge cloud infrastructure, SRE practices, and high-performance computing workloads
