Staff Site Reliability Engineer

Globality, Inc. · Peninsula

Type

Full Time

Job Description

Joel Hyatt and Lior Delgo founded Globality with a vision to create prosperous and healthy economies, companies, communities, and individuals. In this new era of the Autonomous Enterprise, Globality is on a mission to unleash productivity and purpose through autonomous sourcing and procurement. Leveraging our sophisticated AI, Globality empowers leading global companies to automate their purchasing processes and optimize how they spend their money – improving their profits, advancing their objectives, and extending their impact. Our customers love Globality. You will too.

Our culture is built on our values: Trust, Collaboration, and Innovation. Our goal is to create an environment where each person feels valued and experiences a natural sense of belonging. Not only have we been recognized for our transformational technology, but we’re also humbled to be recognized for the workplace culture we’ve built here. So we encourage you to bring your work and your life experiences. Bring your problem-solving skills, sure, but don’t forget your joy and passion. Bring the talent that makes you stand out, but also bring the communities that ground and support you. We are a greater, more resilient world through the power of us.

Role Summary:

Site Reliability Engineers (SREs) are responsible for keeping all customer-facing services and other Globality production systems running smoothly as a unit within the broader Production Engineering team.

SREs are a blend of pragmatic technical operators and tooling craftspeople who apply sound engineering principles, operational discipline, and mature automation to our production environment and the Globality codebase. We have a DevOps-driven culture, with a particular team interest in improving insight into our product stack, automation tooling, and scalability.

Globality's product stack is unique and brings unique challenges: it is ground-breaking technology built on a world-class, AI-powered microservices platform that revolutionizes how businesses buy and sell services. The experience of our team feeds back into other engineering groups within the company, perpetuating product improvement. We are an open, inclusive, and diverse organization, and our employees are at the heart of the great products we create.

As an SRE you will:

Be part of the team responsible for managing an enterprise-grade AI-driven data and messaging platform.
Protect the health of the Production environment.
Respond to Globality availability incidents and provide support for other customer-impacting incidents.
Work to prevent incidents from happening in the first place.
Run our infrastructure with tools like Spacelift, Harness, and Kubernetes.
Help make monitoring and alerting alert on precursor symptoms and not on
Document every action so your findings turn into repeatable actions…and then into automation.
Work with the QA/TestEng team to make the deployment process as efficient and boring as possible.
Design, build, and maintain core production infrastructure pieces.
Work with devs to implement the baseline technologies, policies, and practices to build a high-velocity, high-security, strong-compliance platform that allows Globality to scale in support of exponential growth.
Keep a keen eye on security issues in every project you work on, and contribute to improving the security of systems already in place.
Debug production issues across services and levels of the stack.
Help plan the growth of Globality's infrastructure.
Establish strong relationships with other teams in order to positively influence them in their pursuit of automation and toil reduction, and to keep the rest of our team apprised of upcoming initiatives.

You may be a fit for this role if you:

Think deeply about edge cases, points of failure, failure modes, and systemic behaviors.
Embrace a DevOps philosophy.
Know your way around Linux and the command line.
Feel comfortable working toward delivering an end-to-end seamless CI/CD pipeline, with a goal of delivering code into production as swiftly as possible, while working with the QA/TestEng and Infrastructure teams to ensure that code is production worthy.
Have strong programming skills in Python, Go, Rust, Ruby, or similar languages.
Maintain “production grade” adherence to best practices for the lowliest tools and scripts.
Embrace collaboration and are comfortable with communicating asynchronously.
Are driven to document, document, document so you don't need to learn (or teach) the same thing twice.
Have an enthusiastic, driven, go-for-it attitude. Are compelled to fix broken things and improve less-than-ideal things.
Have experience with Drone.io, Kubernetes, Ansible, or similar technologies.
Have experience using the advanced tools of AWS, GCP, Azure, or other cloud providers.

Projects you could work on:

Improve production infrastructure automation.
Improve Metrics collection scope / improve our metrics-driven Monitoring story.
Work with the QA / Test Engineering team to fully pipeline our internal tools.
Work with Test Engineering on scale testing initiatives.
Reduce the noise-to-signal ratio in our alerting.
Develop a relationship with a product group, define their SLOs, help analyze our metrics data on those SLOs and improve their reliability.

Leveling of Site Reliability Engineers at Globality

Areas of expertise/contribution for up-leveling:

Technical:

Use Ansible to efficiently manage our infrastructure
Further our "Infrastructure as Code" mission using Terraform and CI/CD-focused automation
Administration of a variety of high-availability clusters.
Firm grasp of Metrics and Monitoring systems and Grafana visualization.
Implementation and delivery of well-targeted alerting with Slack/PagerDuty integrations.
Logging infrastructure (we use Loki / fluentbit)
Backend storage management and scaling
Disaster Recovery and High Availability strategy
Script / tool authoring
Knowledge of Globality product stack and service interoperations
Contributing to code in Globality

Execution:

Team organization and planning
Issue, Epic, OKR/KPI leadership and completion

Collaboration and Communication:

Creating blog posts / confluence articles
Completing Root Cause Analysis (RCA) investigations
Contributions to handbook, runbooks, general documentation
Leading and contributing to designs for issues, epics, KPIs
Improving team practices in handoffs of work and incidents

Influence and Maturity

Involvement in the hiring process – developing/reviewing questionnaires, participating in interviews, qualifying candidates
Knowledge sharing, mentoring
Accountability, self-awareness, handling conflict in the team and receiving feedback
Maintaining good relationships with other engineering teams in Globality that help improve the product

Levels for Site Reliability Engineer

Senior Site Reliability Engineer I/II

Technical:

Deep knowledge in 2+ areas of expertise and general knowledge of all areas of expertise. Capable of mentoring SRE-Is in all areas and other SREs in their area of deep knowledge.
Are able to design and build tools to improve the management of the production environment and/or infrastructure
Are able to contribute small improvement PRs to the Globality codebase to resolve issues

Execution:

Identifies significant projects that result in substantial cost savings or revenue
Identifies changes for the product architecture from the reliability, performance, and availability perspective with a data-driven approach.
Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage, making Globality cheaper to run.
Identify parts of the system that do not scale, provide immediate palliative measures, and drive long-term resolution of the resulting incidents.
Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication:

Know a domain really well and radiate that knowledge through recorded demos, discussions in ProdEng design meetings, or Incident/Root-Cause Reviews
Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.

Influence and Maturity:

Set an example for the team of SREs with positive and inclusive leadership and discussion of work.
Contributes to the hiring process by being part of the interview team to qualify SRE candidates
Show ownership of a major part of the infrastructure.
Trusted to de-escalate conflicts inside/with the team

Staff Site Reliability Engineer

Are Senior SREs who meet the following criteria:

Technical:

Able to conceptualize, design, and create innovative solutions that push Globality's technical abilities ahead of the curve
Deep knowledge of Globality and of four areas of expertise. Enough knowledge of each area of expertise to mentor and guide other team members in those areas.
Contributes to Globality codebase to resolve issues and add new functionality
Significant modification to open source or major from-scratch tooling to deliver best-of-breed implementation of our production ecosystem.

Execution:

Strives for automation either by coding it or by leading and influencing developers to build systems that are easy to run in production.
Measure the risk of introduced features to plan ahead and improve the infrastructure.
Proposes and drives architectural changes that affect the whole company to solve scaling and performance problems
Leads significant project work for KPI level goals for the team

Communication and Collaboration:

Works with engineers across the whole company, influencing design to create features that will work well in multi-region/multi-cloud, massive-scale implementations
Runs RCAs and epic level planning meetings to get meaningful work scheduled into the plan

Influence and Maturity:

Writes in-depth documentation that shares knowledge and radiates Globality technical strengths
Has a high level of self-awareness
Trusted to de-escalate conflicts inside and outside the team
Routinely has an impact on the broader Engineering organization
Helps to develop other team members into more senior levels and leaders in the team

Senior Staff Site Reliability Engineer

Are Staff SREs who meet the following criteria:

Technical:

Able to lay out vision-level direction of tooling and solutions that push Globality's technical abilities ahead of the curve
Deep knowledge of the Globality product stack and 90%+ of areas of expertise.
Enough knowledge of each area of expertise to mentor and guide other team members in all areas.
Contributes to Globality codebase to resolve issues and add new functionality
Mentorship and coordination of major modification to open source solutions/tools or major from-scratch tooling to deliver best-of-breed implementation of our production ecosystem.

Execution:

Strives for automation either by coding it or by leading and influencing developers to build systems that are easy to run in production.
Measure the risk of introduced features to plan ahead and improve the infrastructure.
Proposes and drives architectural changes that affect the whole company to solve scaling and performance problems
Leads significant project work for KPI level goals for the team

Communication and Collaboration:

Works with engineers across the whole company, influencing design to create features that will work well in multi-region/multi-cloud, massive-scale implementations
Runs RCAs and epic level planning meetings to get meaningful work scheduled into the plan

Influence and Maturity:

Writes in-depth documentation that shares knowledge and radiates Globality technical strengths
Has a high level of self-awareness
Trusted to de-escalate conflicts inside and outside the team
Routinely has an impact on the broader Engineering organization
Helps to develop other team members into more senior levels and leaders in the team

The anticipated annual pay scale for this position is $140,000-$260,000. Actual salaries will vary depending on factors including but not limited to location, experience, and performance. The range listed is just one component of Globality's total compensation package for employees. This information is provided per the California Equal Pay Act. We are an equal opportunity employer and a participant in the E-Verify program. We believe diversity makes teams better and that discrimination based on race, gender, or anything else is self-defeating.

Date Posted

08/05/2023
