Staff Platform Reliability Engineer

Location

Remote

Type

Full Time

Job Description

Hiring Remotely in US
Remote or Hybrid
185K-230K Annually
Senior level
Artificial Intelligence • Machine Learning
Unleash data science one innovation at a time.
The Role
Own and modernize Domino's Tempest scale-testing platform; build repeatable automated validation, sizing guidance, and cloud-scale test automation; partner with platform teams to enable multi-cloud scale testing and improve test reliability and reporting.

Who we are

At Domino, we build software that helps the largest AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility, all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers (like Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA, and the US Navy) are using our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake, and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented, and we want to be the ones building it. For more information, visit www.domino.ai

What we are building

The Automation Team at Domino acts as a force multiplier for engineering, building the tools and systems that enable teams to ship code confidently and consistently. A core part of this mission is Tempest, an in-house platform that orchestrates realistic, long-duration workloads against live Kubernetes clusters and validates the results against real observability data. Today, when scale testing surfaces a bottleneck, a resource misconfiguration, or a regression in system behavior, the team can identify and report the issue, but we need someone who can take the next step: profiling services, tracing root causes through Prometheus and New Relic data, and partnering with platform engineers to drive durable fixes. Focused on iteration and continuous improvement, the team looks for targeted enhancements that create outsized impact, and this role will close the gap between detection and resolution at the infrastructure level.

What your impact will be

In your first year, you will:

  • Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs
  • Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing, working directly with platform and infrastructure teams to ship fixes, not just file tickets
  • Deliver accurate, data-driven sizing recommendations for customer-facing documentation, based on rigorous empirical testing across deployment sizes
  • Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation, making it faster to pinpoint root causes during and after multi-day load runs
  • Establish and operationalize scale testing on cloud platforms, ensuring appropriate sizing and configuration guidance for this increasingly divergent product line
  • Partner with platform teams to enable effective scale and reliability testing across additional cloud providers, helping position Domino for future multi-cloud success
  • Increase the efficiency and leverage of a small team by building infrastructure automation that scales operationally as the product and customer base grow

What we look for in this role

  • Background in SRE, platform engineering, or infrastructure, with hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments
  • Strong proficiency in Python and comfort working in a large, modular codebase that spans orchestration, infrastructure automation, and systems integration
  • Experience with observability stacks (Prometheus, Grafana, New Relic, or similar): writing queries, building dashboards, and using metrics to diagnose performance and reliability issues at the systems level
  • Demonstrated ability to go beyond detection to resolution: profiling services, identifying resource bottlenecks, and working with engineering teams to ship durable fixes
  • Familiarity with performance and load testing methodologies (e.g., Locust, k6, or similar) as part of a broader infrastructure or reliability practice
  • Clear ownership mindset: self-directed, accountable, and able to communicate priorities and status effectively in a remote, async environment

What we value

  • We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success
  • We believe in individuals who seek truth, speak the truth, and can be their whole selves at work
  • We value all of you who believe improving is always possible. At Domino, everything is a work in progress; we can do better at everything
  • We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and the company
  • We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply

The annual US base salary range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Additional benefits for this role may include: equity, company bonus, or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.

Compensation Range
$185,000-$230,000 USD

Top Skills

CI Systems
Cloud Platforms
Cloud-Native Tooling
End-To-End Frameworks
Kubernetes
Multi-Cloud
Performance/Load Testing Frameworks
Python
Tempest

The Company
HQ: San Francisco, CA
200 Employees
Year Founded: 2013

What We Do

Domino Data Lab powers model-driven businesses with its leading Enterprise AI platform, trusted by over 20% of the Fortune 100. Domino accelerates the development and deployment of data science work while increasing collaboration and governance. With Domino, enterprises worldwide can develop better medicines, grow more productive crops, build better cars, and much more. Founded in 2013, Domino is backed by Coatue Management, Great Hill Partners, Highland Capital, Sequoia Capital, and other leading investors. For more information, visit www.domino.ai

Why Work With Us

We’re looking for sharp, scrappy people who crave a high degree of ownership, are laser-focused on personal growth, and can stick the landing between high standards and low ego. In our fast-paced environment, you’ll find all the white space and opportunity you need to thrive.

Domino Data Lab Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

Typical time on-site: Flexible
HQ: San Francisco, CA
London, UK
Argentina (Remote Hub)

Date Posted

04/03/2026
