Senior Site Reliability Engineer

Jobgether · Canada

Company

Jobgether

Location

Canada

Type

Full Time

Job Description

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in Canada.

This role sits at the core of a fast-scaling, AI-driven intelligence platform, where reliability is not just operational support but a strategic enabler of product innovation. You will design and own the foundations that ensure large-scale, mission-critical systems remain observable, resilient, and performant under demanding AI and data workloads. Acting as a senior individual contributor, you will shape reliability standards, SLO frameworks, and multi-region architecture while directly influencing engineering decisions across the organization. The environment is highly technical, collaborative, and innovation-focused, with a strong emphasis on AI-native systems and automation-first thinking. You will work across software, AI engineering, and platform teams to ensure seamless delivery of complex services. This is a hands-on leadership role for someone who wants to define how modern AI infrastructure operates at scale.

Accountabilities

  • You will define and own service reliability standards, including SLOs, SLIs, and error budgets, ensuring consistent performance across all production systems.
  • You will design and implement reliability patterns for AI agent pipelines, including observability, failure detection, and safe degradation mechanisms.
  • You will architect and improve multi-region infrastructure strategies, driving high availability, disaster recovery readiness, and blast radius control.
  • You will lead incident response and postmortem processes, ensuring durable fixes and continuous improvement of system resilience.
  • You will serve as the primary reliability partner for engineering and AI teams, influencing architecture, deployment strategies, and system design decisions.
  • You will own observability and platform tooling, including service catalog management, Datadog configuration, and AI workload monitoring.
  • You will develop CI/CD standards and enable self-service developer platforms to improve deployment velocity and system reliability.
  • You will contribute to FinOps initiatives by improving cost visibility and optimizing infrastructure efficiency across cloud environments.

Requirements

  • You bring 6–8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering, with senior-level technical ownership responsibilities.
  • You have deep expertise in AWS and distributed systems architecture, including multi-region, high-availability environments.
  • You are highly skilled in Kubernetes, Docker, Terraform, and GitOps practices, with strong infrastructure-as-code experience.
  • You have hands-on experience with observability platforms such as Datadog, including SLO monitoring, alerting, tracing, and log analytics.
  • You are proficient in scripting and development (Python and/or Bash), with a solid understanding of microservices architectures.
  • You have strong experience designing and optimizing CI/CD pipelines (e.g., GitHub Actions, Bitbucket Pipelines).
  • You understand reliability challenges in large-scale systems and can translate complex technical risks into actionable engineering solutions.
  • You have strong communication and collaboration skills, with the ability to influence cross-functional teams and mentor engineers.
  • Experience with AI/ML infrastructure, LLM systems, or agent-based architectures is a strong advantage.

Benefits

  • Competitive compensation in the range of $125,200 – $132,500 CAD.
  • Comprehensive benefits package including health, dental, vision, and wellness coverage.
  • RRSP matching and annual fitness reimbursement.
  • Flexible vacation policy and remote-first work arrangement within Canada.
  • Access to professional training, development programs, and high-growth career opportunities.
  • Wellness resources and employee support programs.
  • Inclusive, diverse, and accessibility-focused work environment.
  • Opportunities to work on cutting-edge AI and large-scale data infrastructure systems.

Date Posted

04/29/2026
