Site Reliability Engineer - Toronto, San Francisco, New York
Job Description
Cockroach Labs is the team behind CockroachDB, the most highly evolved cloud-native, distributed SQL database on the planet. We created CockroachDB and our self-service, fully managed cloud offerings of CockroachDB (Dedicated and Serverless) to deliver the ability to build and scale apps with fewer obstacles, more freedom, and greater efficiency. Today, Cockroach Labs helps companies of all sizes—and the apps they develop—scale fast, survive disaster, and thrive everywhere. Join us on our mission to make data easy and help developers build what they dream of without having to worry about their database ever again.
About the Role
CockroachDB is the backbone of storing global services. As a Site Reliability Engineer, you’ll help manage and scale our CockroachDB Cloud services, which span multiple cloud providers. You will oversee our production systems, ensuring that we can provide stable and scalable infrastructure as we deliver CockroachDB to our customers. Roughly half of your time will be spent on greenfield development work, with an emphasis on developing tooling and driving automation. In this role, you will work across multiple teams within CockroachDB Cloud as well as development and product teams working on CockroachDB.
You Will
- Manage the infrastructure for cloud services, including running internal production systems and hosting CockroachDB for our external customers.
- Design, write and deliver software and systems to increase product reliability and operational efficiency.
- Develop custom tools as necessary.
- Keep a complex system running and solve problems relating to mission-critical services.
- Design, implement, operate, and troubleshoot the automation and monitoring of production clusters to maximize performance and availability.
- Drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
- Participate in an on-call rotation for our production systems and hosted services.
In your first 30 days, you will onboard and be exposed to our current internal and customer-facing production systems. Working with our existing SRE and engineering teams, you will pair on production operations and build out runbooks for the operation of different systems. We believe that it's essential for you to take this first month to become familiar with our technology and our company.
After 3 months, you'll be fully integrated into the team. You will develop and own tooling for reliability, automation, and other issues related to CockroachDB Cloud’s stability and scalability. You will identify new opportunities for automating processes, streamlining delivery, deploying new core functionality, and building great tools. You will help make CockroachDB Cloud the best platform to host CockroachDB on by bringing your expertise to our database.
You Have
- Expertise in analyzing, monitoring, and troubleshooting large-scale distributed systems.
- Experience in software development using one or more of the following: Go, C, C++, Python, Java.
- Proficiency in working with algorithms, data structures, and production troubleshooting.
- Expertise in working with major cloud providers (AWS, Azure, GCP, etc.) and Cloud APIs.
- Experience debugging and optimizing code to automate routine tasks.
- Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.)
- Previous on-call experience, with a sense of urgency.
- Experience building collaborative relationships with your colleagues. You enjoy being part of the code review process and partnering with your teammates on complex problems.
- Ideally 5+ years of experience.
Tom Schmidt - Site Reliability Engineering Manager
Tom recently joined Cockroach Labs as manager of Site Reliability Engineering and has taken responsibility for Cockroach Cloud’s production operations. Tom joined Cockroach Labs after 15 years at IBM, where he initially contributed in a wide variety of technical leadership roles, generally focusing on quality and automation across compiler development, test frameworks, CI/CD, and more. Over the past 5 years, Tom has become an enthusiastic advocate of the Site Reliability Engineering discipline, presenting on the topic at conferences, developing certification curriculum, and securing multiple patents. Tom was also a primary contributor to the establishment of IBM’s formal SRE profession and was recognized as one of the first three SRE Thought Leaders within the company. Most recently, Tom transitioned into a management role where he introduced Site Reliability Engineering to the IBM Business Analytics organization, building an SRE team from the ground up, eventually managing over 20 individuals across 3 unique project areas while establishing practices that now guide over 80 engineers internationally. Cockroach Labs presented a new and unique opportunity to gain experience in a fast-paced startup environment, laying the foundation for scalable reliability as we prepare for the rapid growth of our Cockroach Cloud offering. Beyond the business, Tom is blessed to call himself a proud father of a 2-year-old boy, and otherwise enjoys finding a balance between spending time in nature (hiking, camping, exploring) and testing his mettle in competitive gaming.
Yandu Oppacher - Director of Engineering
Yandu works across multiple parts of CockroachCloud to ensure that our infrastructure and teams are robust and scalable. Yandu joined Cockroach Labs after nearly 8 years at Shopify, where he started on the data platform team and helped it grow from 4 DB nodes to several hundred Hadoop nodes running over petabytes of data in Google Cloud. In his last 2 years at Shopify, he led the Production Engineering teams responsible for all of the compute runtime resources that power Shopify’s mission-critical services. Joining CockroachCloud and Cockroach Labs allows him to get back to his first love, databases, while applying his Production Engineering skills to help build our DBaaS platform. Outside of Cockroach Labs, Yandu can be found reading or, more likely, chasing after his 3 young kids and exploring the outdoors with them.
Our Benefits
- Competitive Health Insurance Coverage (for you & your dependents!)
- Paid parental leave (with baby bucks)
- Flex Fridays
- Flexible time off & flexible hours
- Education reimbursement
- Relocation support or home office allowance
Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at [email protected].
The annual anticipated base salary range for U.S. candidates for this role is USD $165,000 to $225,000 plus commission if a sales role. We set standard ranges for all U.S.-based roles based on function, level, and geographic location, benchmarked against similar stage growth companies. In order to be compliant with local legislation, as well as to provide greater transparency to candidates, we share salary ranges on all job postings regardless of desired hiring location. Actual salaries may vary and fall outside of this range depending on factors such as a candidate’s qualifications, geographic location, skills, experience, and competencies. In addition, we are often open to a wide variety of profiles, and recognize that the person we hire may be less experienced (or more senior) than this job description as posted. Salary is one component of the Cockroach Labs total rewards package, which includes stock options, health insurance, life and disability insurance, funds towards professional development resources, flexible PTO, paid holidays, and parental leave, to name a few! Salaries for candidates outside the U.S. will vary based on local compensation structures.
Date Posted
05/20/2023