Job Description
IMPORTANT: Please be aware scammers may try to impersonate Zello by reaching out regarding job opportunities. We will never ask you for bank account information, checks, or other sensitive information as part of our hiring process. All correspondence will come from the zello.com email domain. If you’re unsure, please email [email protected] with questions.
About Zello
Zello is a voice-first communication platform powered by our industry-leading push-to-talk technology to improve collaboration and productivity for desk-less workers. With over 175 million users, we’re the #1 rated push-to-talk app in the world, delivering 9 billion (yes, with a B) messages a month.
At Zello, our company values are at the heart of what we do every day. We’re proud to serve the frontline, we’re privileged to connect people in times of crisis across the globe, and we’re honored to support first responders.
And this is where you come in.
We're seeking a Senior Site Reliability Engineer who can own our data tier at high availability while also pulling weight across the broader platform. As Zello scales, the line between "database problem" and "platform problem" keeps blurring. We want someone who can sit on either side of it. This hire owns our data tier reliability (MySQL, MongoDB, ScyllaDB, Elasticsearch, Redis) and contributes to monitoring, on-call, and our ongoing cloud modernization efforts.
About Zello
Zello is the leading push-to-talk communication platform, enabling instant voice communication for frontline workers across hospitality, logistics, transportation, construction, and public safety. When a hotel manager radios housekeeping or a trucker calls dispatch, they're on Zello, and they need it to work every time. The Platform team builds and operates the infrastructure that makes that possible. Databases sit at the center of that promise: every channel, every message, every login depends on them.
The Role
You'll join the Platform team and report to the Director of Platform Engineering. You'll own the reliability of our MySQL and MongoDB footprint across Google Cloud, work alongside application engineers on performance and schema decisions, and contribute to the broader platform: observability with Prometheus, Loki, and Tempo; on-call; and incident response. This role suits someone who likes operating real production systems, doesn't get stage fright in incidents, and writes the runbook for the next person who hits the same problem.
We're investing in AI to compress incident response, build agents and tooling that speed up root-cause analysis, and lift developer productivity across engineering. We want someone curious about what that looks like for an SRE and excited to help shape it.
Operated Zello's MySQL and MongoDB clusters to documented availability targets, with automated backups, regularly tested restores, and failover the on-call team trusts under real incident pressure.
Cut latency or capacity cost on at least one critical database workload through measurable performance work: index strategy, query tuning, schema changes, or sharding.
Extended our observability coverage so incidents are diagnosed in minutes rather than hours, with dashboards and alerts the team actually uses.
Owned a slice of the Platform on-call rotation and led postmortems that turned recurring incidents into permanent fixes.
Design, deploy, and operate highly available MySQL and MongoDB clusters across our cloud environments: replication, sharding, backups, point-in-time recovery, upgrades, and disaster recovery.
Tune query performance, schema, and index strategy in partnership with application engineers, and push fixes upstream into the application when that's the right answer.
Extend our observability stack (Prometheus, Loki, and Tempo) so the data tier is as well instrumented as the application tier and traces actually reach the root cause.
Participate in the Platform on-call rotation, lead incident response for data-tier issues, and write postmortems that drive durable change.
Improve disaster recovery, security posture, and compliance for our database footprint: encryption, access control, audit logging, and backup integrity.
Evaluate and operate ScyllaDB/Cassandra and Elasticsearch where they fit the workload, and bring an opinion on when they don't.
Write the automation, tooling, and operators that take repetitive work off the team's plate.
Use AI to compress incident response and root-cause analysis by building agents, automation, and developer-enablement tooling that scale the team's reliability work.
You've operated highly available MySQL and MongoDB in production at scale: replication, sharding, backups, point-in-time recovery, and failover drills you've actually run, not just designed on paper.
You diagnose database performance end-to-end (query plans, indexes, locking, OS, storage, network) and can point to specific incidents where you found and fixed a root cause that others had missed.
You've shipped meaningful work on at least two of: bare metal Linux, containerized workloads (Docker, Kubernetes, or similar), and a major cloud (GCP preferred; AWS or Azure equivalent is fine).
You instrument what you build. You've used Prometheus, OpenTelemetry, or comparable systems to close real incidents, and you've written the dashboard the next on-call engineer will actually open.
You write code that runs in production: Python, Go, Bash, or similar for automation, tooling, or operators. You don't hand off scripting to someone else.
You communicate clearly under pressure and after the fact. Your postmortems are blameless, specific, and lead to changes that stick, and the people you've worked with describe collaborating with you as straightforward.
You bring an opinion on managed vs. self-managed databases and can defend the trade-off based on availability, cost, and operational burden.
7+ years in SRE, DevOps, platform infrastructure, or database reliability roles, with at least 3 years owning production databases.
BSc in Computer Science or equivalent practical experience.
ScyllaDB/Cassandra or Elasticsearch experience is a plus.
You've used AI tooling (copilots, agents, or custom automation) to expedite incident response, root-cause analysis, or developer workflows.
We hire for potential, passion for our mission, and a knack for solving difficult problems over checking every qualification box. We offer competitive pay and equity with significant upside, and we intentionally design our benefits to encourage healthy and well-balanced employees, with flexible schedules and time off. We even offer a sabbatical after every five years of service so you’re able to pursue and enjoy what matters most to you. And of course, we wouldn’t be a technology company without a ping-pong table and free snacks in our break room. Join us!
Zello provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
All Zello personnel are required to comply with defined security, privacy, and compliance requirements applicable to their role, along with requirements that are applicable to all Zello personnel.
Skills Required
- 7+ years in SRE, DevOps, platform infrastructure, or database reliability roles
- 3+ years owning production databases
- BSc in Computer Science or equivalent practical experience
- Experience with MySQL and MongoDB clusters
- Experience with Docker and Kubernetes
- Experience with ScyllaDB/Cassandra or Elasticsearch
Zello Compensation & Benefits Highlights
- Healthcare Strength: Available information suggests employer-paid employee health coverage is a core strength, with dental and vision included. Mentions of HSA contributions and established carrier plans reinforce the robustness of medical benefits.
- Leave & Time Off Breadth: Available information suggests open/unlimited PTO combines with paid sick days, company holidays, and a multi-week paid sabbatical after five years. This breadth indicates meaningful flexibility for short breaks and extended rest.
- Parental & Family Support: Available information suggests paid parental leave is offered for both primary and secondary caregivers from the start of employment. This indicates tangible support for family needs during significant life events.
Zello Insights
What We Do
We started as a company that turned phones into walkie-talkies. Today, we modernize instant voice communication with our industry-leading push-to-talk technology to help mobile workers meet quickly changing, urgent, real-world challenges. We have the highest-rated walkie-talkie app, with over 8 billion messages sent per month and 170 million users in industries such as transportation, retail, construction, hospitality, healthcare, and more. We’re proud to serve frontline workers, we’re privileged to connect people in times of crisis across the globe, and we’re honored to support first responders. As demand for our app continues to rise, we’ve evolved from a startup to a scale-up, and we’re still growing rapidly, which is where you come in.
Why Work With Us
If you strive to work on technology with purpose, technology that actually changes how people communicate and work, then come talk to us. We like people who take pride in their work and deliver with consistency and quality. We're collaborative, sometimes serious, sometimes not, but we're all in 110%.
Zello Offices
Hybrid Workspace
Employees engage in a combination of remote and on-site work.
Zello is a hybrid workplace where Austin employees typically work in the office on Tuesdays, Wednesdays, and Thursdays.
Date Posted
05/13/2026