Staff Software Engineer - Grafana Databases, Managed Services

Jobgether · UK

Company: Jobgether

Location: UK

Type: Full Time

Job Description

Team: IT

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Software Engineer – Grafana Databases, Managed Services in the United Kingdom.

In this role, you will operate at the intersection of large-scale distributed systems, streaming infrastructure, and cloud database platforms, helping power mission-critical observability services used globally. You will be responsible for the reliability, scalability, and performance of multi-cloud infrastructure that underpins high-throughput metrics, logs, and traces systems. Working in a deeply technical, remote-first engineering environment, you will influence architecture decisions while remaining hands-on in production systems. Your work will directly impact the stability and efficiency of large-scale data pipelines operating across hundreds of clusters. This is a high-autonomy role where you will partner with platform and database teams to solve complex distributed systems challenges. You will also play a key role in shaping operational excellence, reliability practices, and long-term system evolution across global infrastructure.

Accountabilities

In this role, you will take ownership of large-scale streaming and database infrastructure, ensuring reliability, scalability, and performance across hundreds of production clusters while driving architectural improvements and operational excellence.

  • Operate and evolve large-scale multi-cloud streaming and database infrastructure across production environments
  • Diagnose and resolve complex cross-layer failures involving storage, compute, networking, and control-plane systems
  • Design and implement safe rollout, upgrade, and migration strategies across distributed systems at scale
  • Improve observability, automation, and operational tooling to reduce system toil and increase reliability
  • Define and evolve SLOs, error budgets, and reliability standards for shared infrastructure systems
  • Partner with engineering teams to optimize query performance, data partitioning, and system scalability
  • Serve as a primary escalation point for high-severity incidents and lead deep root cause analysis efforts
  • Drive long-term architectural improvements to reduce systemic risks across multi-cluster environments
  • Mentor engineers and contribute to best practices in distributed systems engineering and operational excellence

Requirements

You bring deep expertise in distributed systems, infrastructure engineering, or platform engineering, with strong experience operating high-scale production systems in cloud environments. You are highly technical, autonomous, and comfortable leading complex initiatives across global teams.

  • 8+ years of software engineering experience in SRE, platform engineering, infrastructure, or distributed systems roles
  • Strong experience with large-scale streaming or database systems (e.g., Kafka, Redpanda, ClickHouse, Cassandra, or similar)
  • Hands-on expertise with Kubernetes in AWS, GCP, or Azure environments
  • Proficiency in infrastructure-as-code tools such as Terraform, Helm, or similar
  • Strong programming skills in systems-oriented languages (Go preferred)
  • Deep understanding of distributed systems behavior, failure modes, and performance trade-offs
  • Experience with observability, incident response, and writing post-incident reviews
  • Strong knowledge of Linux internals, networking, storage systems, and cloud architecture
  • Proven ability to lead technical initiatives and influence architectural decisions without formal authority
  • Excellent communication skills with the ability to work effectively in remote, cross-functional teams

Benefits

  • Competitive compensation package including base salary, bonus (where applicable), and equity (RSUs)
  • Fully remote-first working model with global collaboration across distributed teams
  • 30 days annual leave, including designated shutdown days for full disconnection
  • Equity ownership in the company’s long-term success through RSU participation
  • Access to modern AI development tools with company-supported usage budgets
  • Strong emphasis on autonomy, trust, and outcome-driven engineering culture
  • Career growth opportunities in a fast-scaling global infrastructure organization
  • Exposure to cutting-edge distributed systems and large-scale observability platforms
  • Inclusive, transparent, and highly collaborative engineering environment

Date Posted: 04/15/2026

