Senior Site Reliability Engineer at Censys

Censys knows the Cloud and the Internet better than anyone else. Attack Surface Management provides customers with an attacker-centric view of all externally facing Cloud and Internet assets so they can extend visibility, then prioritize and remediate the most critical risk exposures that will actually lead to a breach. Our daily IPv4 scans and the world’s largest SSL/TLS certificate database give customers the most accurate and continuously updated attack surfaces. Enterprise security teams leverage Censys to keep pace with the speed of the business and gain an advantage over rapidly evolving cyber-attack threats.

We are a rapidly growing cybersecurity startup based in Ann Arbor, Michigan, with a fully remote team. Our innovation is fueled by the team’s global perspectives and diverse backgrounds. We welcome healthy debate, constructive conversations, and outside-the-box thinking to ensure we are moving fast, learning, and iterating quickly.

As a Senior Site Reliability Engineer on the Platform team, you will help design, build, and deploy systems used to empower our development teams and production applications. We’re looking for talented engineers who want to help grow our operational maturity and who equally enjoy mastering cloud-native technologies to build tools and systems for service growth and reliability. As an SRE, you will be responsible for helping design infrastructure that is both reliable and resilient, deploying services to production, and working with development teams to ensure that applications can meet the demands of our customers.

What you will do:

Build and maintain Kubernetes clusters and other cloud-based infrastructure and services, such as Elasticsearch and Kafka, and ensure their performance, uptime, and availability.
Work with development teams to help them build, ship, and deploy services and applications with ease and confidence, and promote service resilience and reliability.
Help ensure smooth operation of our production environments, and work with developers to debug complex issues as they arise. This includes capturing and monitoring the four golden signals.
Participate in a shared on-call rotation. We believe in end-to-end service ownership, so both development teams and SREs participate in on-call.

What we're looking for:

Required

Experience building, managing, and debugging Kubernetes clusters and Kubernetes-based environments, preferably in the cloud. This includes full cluster life-cycle management, such as version upgrades, API deprecation management, and adopting and optimizing new features.
Experience deploying and managing applications in a Kubernetes environment with Helm, including creating Helm Charts from scratch.
Experience with infrastructure-as-code tools, such as Terraform and Ansible.
Experience with tools and solutions used to monitor the four golden SRE signals (latency, traffic, errors, and saturation), including Prometheus and OpenTelemetry.
Experience with CI/CD pipeline management, with a desire to achieve continuous deployment.

Preferred

Experience with Google Cloud Platform and/or AWS.
Experience with a service mesh, such as Istio or Kuma.
Experience with managing and monitoring Elasticsearch clusters.
Experience with managing and monitoring Kafka clusters.
Familiarity and comfort with Linux-based environments.

Qualities

Have a passion for clean, concise architecture and enjoy working in a GitOps-based environment.
Comfortable with projects that have a large degree of uncertainty and risk.
Experienced distributed systems debugger who can identify root cause and remediate issues in production under pressure.
Desire to collaborate with and advise product management and leadership to balance long-term maintainability of software against rapid development, and to clearly communicate business continuity and disaster recovery (BCDR) implications.
Understands and practices the principles of continuous delivery to ensure quick, safe, and sustainable development in the face of changing priorities and uncertainty.

What will make you stand out:

Understanding of infrastructure operations, including switching, routing, and VPC design. We operate several data center environments across the globe in addition to our cloud infrastructure.
Cloud Governance and Policy management, including BCDR.
Not afraid to dig into code to understand how our applications work, whether to support testing, integration, and development environments, to help instrument metrics, or to improve service reliability.
Deep understanding of how to optimize and support web-based applications.

Our target salary range for this role is $145,000 to $175,000 USD, plus bonus eligibility and equity.

We are located in Ann Arbor, Michigan; however, we are open to hiring for this position on a fully remote basis, with travel opportunities to meet customers and connect with colleagues.

Don't meet every single requirement? Studies have shown that women and people of color are less likely to apply to jobs unless they feel they meet every qualification. At Censys, we are dedicated to building a diverse, inclusive, and authentic workplace, so if you're excited about this role but your past experience doesn't align perfectly with every listed requirement in the job description, we encourage you to apply anyway. You may be exactly who we need to fill this role or others!

Censys is an equal opportunity employer.
