Senior Manager, DevOps

Location

Remote

Type

Full Time

Job Description

Hiring Remotely in San Francisco, CA, USA
In-Office or Remote
150K-220K Annually
Senior level
Fintech • Machine Learning • Payments • Social Impact • Software • Financial Services
TrueML is a fintech company building software to create positive experiences for consumers seeking financial health.
The Role
Lead infrastructure and platform engineering for cloud architecture, CI/CD standards, and scalability of machine learning products while managing a team of DevOps engineers.
TrueML Products is seeking a highly experienced and strategic Sr. Manager, DevOps to lead our infrastructure and platform engineering efforts. This role is critical in driving our cloud architecture strategy, establishing elite CI/CD standards, and ensuring the scalability and reliability of our machine learning-driven products.
 
Reporting to the Sr. Director, Program & Operations, you will lead the evolution of our internal developer platform and infrastructure-as-code (IaC) architecture. The ideal candidate is a hands-on leader with a "systems-thinking" mindset. We are looking for a visionary who thrives on solving complex distributed systems challenges and for whom leveraging GenAI and AIOps tooling to optimize system performance and automation is second nature.

What You'll Do (Technical Leadership & Strategy):

  • Define and execute the long-term strategic vision for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
  • Lead the design and implementation of self-service internal platforms to reduce developer cognitive load, enabling feature teams to deploy and manage services with minimal friction at increased velocity.
  • Act as the primary stakeholder for cloud spend (AWS); drive cost-optimization initiatives and lead contract negotiations for the DevOps tool stack and third-party vendors.
  • Ensure the infrastructure architecture supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols, maintaining system integrity across multiple regions.
  • Oversee the implementation and evolution of comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.
  • Champion security by design by integrating automated vulnerability scanning, secret management, and compliance checks directly into the automated build pipelines.
  • Serve as the ultimate escalation point for major production outages, facilitating blameless post-mortem reviews that focus on systemic improvements rather than individual error.
  • Maintain deep technical currency in container orchestration (Kubernetes), serverless patterns, and modern automation frameworks to provide meaningful mentorship and architectural guidance to senior engineering staff.

What You'll Do (Hands-On Engineering & Technical Execution):

  • Maintain the ability to write and review high-quality code in languages like Python, Go, or Bash to automate complex operational tasks and system integrations.
  • Develop Terraform Infrastructure as Code hands-on for resource provisioning.
  • Directly architect and troubleshoot complex CI/CD workflows (GitHub Actions, ArgoCD, Atlantis), ensuring build-and-deploy cycles are optimized for speed and reliability.
  • Proactively manage and tune container orchestration environments, including hands-on configuration of Ingress controllers, declarative GitOps workflows, and cluster autoscaling.
  • Lead from the front during critical incidents by conducting deep-dive technical analysis across the EKS stack, troubleshooting node-level kernel panics, VPC CNI networking bottlenecks, and RDS performance constraints to minimize MTTR.
  • Conduct hands-on audits of cloud configurations and IAM policies, implementing "least privilege" access controls and automated remediation scripts.
  • Directly manage the integration and API configurations between various tools in the DevOps stack (e.g., connecting Jira, VictorOps, Slack, and Observe for seamless incident flow).

What You'll Do (People Leadership & Engineering Collaboration):

  • Recruit, hire, and develop a world-class team of DevOps Engineers; provide career pathing and technical mentorship to foster a culture of continuous learning.
  • Partner closely with Engineering Managers to align infrastructure deliverables with the product roadmap, ensuring DevOps is an accelerator rather than a bottleneck.
  • Collaborate with Quality Engineering and Security leadership to define and enforce "Definition of Done" standards that include automated testing and security gates.
  • Set clear, measurable goals (KPIs and OKRs) for the team, conducting regular performance reviews and providing feedback to drive individual and collective excellence.
  • Lead internal Brunch & Learns to educate the broader engineering organization on modern cloud-native patterns and self-service capabilities.

Who You Are (Qualifications):

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  • 10+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering; 5+ years of experience managing engineers.
  • Expert-level mastery of AWS and experience managing multi-region, high-availability deployments.
  • Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in a production environment.
  • Proficiency in Terraform to drive consistency and automation across all infrastructure layers. Experience with Atlantis is a plus.
  • Deep experience designing and maintaining complex pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.
  • Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).
  • Experience acting as an Incident Commander for high-severity outages and fostering a "blameless" post-mortem culture.
  • Demonstrated ability to influence executive leadership and collaborate cross-functionally with Product, Engineering, and Security teams.
  • Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into the engineering workflow to accelerate delivery.

Ways to "Stand Out":

  • Experience leading organizational platform migrations, including the development of rollback strategies, stakeholder communication plans, and post-migration validation.
  • Prior experience working with high-velocity, product-driven, early-to-mid-stage technology companies where reliability, extensibility, and availability were mission-critical to success.
  • AWS or Kubernetes certifications are a plus, but not in lieu of hands-on experience with the same in production environments.
  • Notable contributions to open source projects or communities.

Top Skills

ArgoCD
Atlantis
AWS
Bash
Datadog
GitHub Actions
Go
Kubernetes
Observe
Python
Terraform

The Company
450 Employees
Year Founded: 2013

What We Do

TrueML makes financial technology that prioritizes customer experience and revolutionizes the experience of consumers seeking financial health. We’re a team of inspired data scientists, financial services industry experts, and customer experience fanatics creating experiences that serve people in a way that recognizes their unique needs and preferences as human beings, and endeavoring to ensure nobody gets locked out of the financial system. After more than 10 years in business, TrueML is excited to be expanding its footprint internationally. We are a growing, geographically diverse team with employees in 30 U.S. states and 7 different countries, with our key talent hub in LATAM. If you’re looking for an opportunity to do impactful work, join TrueML and make a difference alongside hundreds of other inspired individuals.

Why Work With Us

Our functional teams are a diverse mix of employees from different backgrounds and geographies, with each individual bringing unique perspectives and experiences that encourage increased innovation in our products and services. Join TrueML and make a difference alongside hundreds of other inspired individuals doing impactful work.

TrueML Offices

Remote Workspace

Employees work remotely.

TrueML is excited to be a remote-first company with team members across the US, Canada, and several countries in LATAM (Mexico, Argentina, Dominican Republic, and Costa Rica). Our teams frequently collaborate and socialize digitally across borders.

Typical time on-site:
US
Argentina (Remote Hub)
Mexico (Remote Hub)
Dominican Republic (Remote Hub)
San Francisco CA
Costa Rica (Remote Hub)

Date Posted

04/21/2026
