Staff Site Reliability Engineer (SRE) | Dev Ops Engineer

Location: Remote
Type: Full Time

Job Description

Menlo Park, CA, USA
Hybrid
169K-224K Annually
Senior level
Artificial Intelligence • Big Data • Healthtech • Machine Learning • Software • Biotech
GRAIL is a healthcare company whose mission is to detect cancer early, when it can be cured.
The Role
Lead the design and operation of fault-tolerant cloud infrastructure, implement infrastructure-as-code, manage Kubernetes reliability, and mentor engineers.
Summary Generated by Built In
Our mission is to detect cancer early, when it can be cured. We are working to change the trajectory of cancer mortality and bring stakeholders together to adopt innovative, safe, and effective technologies that can transform cancer care.

We are a healthcare company pioneering new technologies to advance early cancer detection. We have built a multi-disciplinary organization of scientists, engineers, and physicians, and we are using the power of next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer science and data science to overcome one of medicine’s greatest challenges.

GRAIL is headquartered in the Bay Area of California, with locations in Washington, D.C., North Carolina, and the United Kingdom. It is supported by leading global investors and pharmaceutical, technology, and healthcare companies.

For more information, please visit grail.com.

GRAIL is seeking a Staff Site Reliability / DevOps Engineer to lead the reliability, scalability, and security of our cloud-native platform. This role operates at the intersection of infrastructure engineering, platform strategy, and organizational leadership, supporting systems that power large-scale data processing and cutting-edge cancer detection technologies.

You will define and drive infrastructure standards across teams, represent reliability and performance in architecture decisions, and build systems that scale well beyond your direct ownership. This is a highly technical, high-impact role combining hands-on engineering with cross-functional influence and mentorship.

Flexible – MPK or RTP (3 days in office)
This is a hybrid role based in either Menlo Park, CA (moving to Sunnyvale, CA in Fall 2026) or Durham, NC. Our current flexible work arrangement policy requires that a minimum of 60%, or 24 hours, of your total work week be on-site. Your specific schedule, determined in collaboration with your manager, will align with team and business needs and could exceed the 60% requirement for the site.

 

Responsibilities

    • Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
    • Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
    • Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
    • Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
    • Establish and evolve observability platforms (metrics, logs, traces) and define SLO/SLI frameworks across teams
    • Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
    • Optimize infrastructure for cost, performance, and scalability, partnering closely with engineering and finance stakeholders
    • Define and enforce DevOps, reliability, and security best practices across the organization
    • Partner cross-functionally with engineering, data, QA, security, and IT teams to design resilient systems
    • Mentor engineers and contribute to technical leadership through design reviews, standards, and knowledge sharing

    This list summarizes the role’s primary responsibilities and is not exhaustive. Responsibilities may change at the company’s discretion.

What Success Looks Like in Your First Year

    • Conduct a comprehensive assessment of the current infrastructure, drive infrastructure-as-code adoption to 95%+ across critical systems, and establish clear health and reliability baselines for the Kubernetes platform
    • Standardize observability using modern tooling and implement an SLO/SLI framework adopted across multiple product teams, including defined SLAs for critical data systems
    • Strengthen security and compliance posture across cloud environments by implementing consistent baselines, launching a compliance-as-code framework, and reducing mean time to resolution (MTTR) for production incidents
    • Define, document, and drive adoption of engineering standards, best practices, and operational guidelines across platform and product teams
    • Develop and align stakeholders on a forward-looking platform reliability and infrastructure roadmap
    • Demonstrate measurable mentorship and technical leadership impact across the engineering organization
    • Evaluate and provide recommendations on emerging infrastructure needs, including support for AI/ML and advanced data workloads

Required Qualifications

    • BS in Computer Science, Engineering, or related field, or equivalent experience
    • 8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
    • Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
    • Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
    • Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
    • Hands-on experience with Kubernetes and containerized systems in production environments
    • Proficiency in scripting or programming for automation (e.g., Python, Go, Bash, or PowerShell)
    • Experience with observability and monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)
    • Strong understanding of networking, security, and distributed systems fundamentals
    • Experience working in regulated environments and familiarity with frameworks such as ISO 27001, NIST, SOC 2, or HIPAA

Preferred Qualifications

    • 10+ years of experience in SRE, DevOps, or infrastructure engineering
    • Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
    • Familiarity with GitOps practices (e.g., ArgoCD, Flux)
    • Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake, BigQuery)
    • Experience implementing SLO/SLI frameworks and reliability practices across multiple teams
    • Strong background in cloud security, including IAM, zero-trust architecture, and secrets management
    • Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
    • Exposure to AI/ML or large-scale data infrastructure workloads
    • Experience in healthcare, biotech, or other regulated industries
    • Relevant cloud or Kubernetes certifications (e.g., AWS DevOps, CKA/CKS, GCP DevOps)

Physical Demands and Working Environment

  • Standard office environment with hybrid flexibility
  • Participation in on-call rotation and after-hours support for critical systems may be required
  • Frequent collaboration with cross-functional and senior stakeholders
  • Fast-paced, dynamic environment with emphasis on reliability, scalability, and innovation

Adaptability and Growth Expectation

    As the organization evolves, responsibilities may expand or shift to meet business needs. This may include:
    • Taking on additional technical or leadership responsibilities
    • Participating in cross-functional initiatives and strategic projects
    • Adapting to new technologies, tools, and methodologies
    • Supporting other teams during periods of high demand

The expected full-time annual base pay scale for this position is $169K - $224K for Durham, NC. Actual base pay will consider skills, experience, and location.

This role may be eligible for other forms of compensation, including an annual bonus and/or incentives, subject to the terms of the applicable plans and Company discretion. This range reflects a good-faith estimate of the range that the Company reasonably expects to pay for the position upon hire; the actual compensation offered may vary depending on factors such as the candidate’s qualifications. Employees in this role are also eligible for GRAIL’s comprehensive and competitive benefits package, offered in accordance with our applicable plans and policies. This package currently includes flexible time-off or vacation; a 401(k) retirement plan with employer match; medical, dental, and vision coverage; and carefully selected mindfulness programs.

GRAIL is an equal employment opportunity employer, and we are committed to building a workplace where every individual can thrive, contribute, and grow. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, gender, gender identity, sexual orientation, age, disability, status as a protected veteran, or any other class or characteristic protected by applicable federal, state, and local laws. Additionally, GRAIL will consider for employment qualified applicants with arrest and conviction records in a manner consistent with applicable law and provide reasonable accommodations to qualified individuals with disabilities. Please contact us at [email protected] if you require an accommodation to apply for an open position.

GRAIL maintains a drug-free workplace. We welcome job-seekers from all backgrounds to join us!

Top Skills

Ansible
AWS
Azure
Bash
CloudFormation
Datadog
GCP
GitHub Actions
GitLab CI
Go
Grafana
Jenkins
Kubernetes
OpenTelemetry
PowerShell
Prometheus
Python
Terraform


The Company
HQ: Menlo Park, CA
918 Employees
Year Founded: 2016

What We Do

GRAIL is a healthcare company whose mission is to detect cancer early, when it can be cured. GRAIL is using the power of high-intensity sequencing, population-scale clinical studies, and state-of-the-art computer science and data science to enhance the scientific understanding of cancer biology and to develop and commercialize pioneering products.

Why Work With Us

Everything we do is guided by our mission to detect cancer early, when it can be cured. It’s the reason we’re here, and it’s no small task. The right people make all the difference. That’s why we’re looking for those who strive to share their knowledge, contribute their skills, inspire each other, and commit to something bigger than themselves.

GRAIL Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

GRAIL has a variety of work types depending on the role. Some roles are onsite, like a lab role; some are fully remote, like our Galleri Sales Consultant roles. Others are hybrid, with 2-3 days onsite, typically Tuesday and Thursday.

Typical time on-site: 2 days a week
HQ: Menlo Park, CA
London, GB
Raleigh, NC
Washington, DC


Date Posted

04/22/2026
