Job

Senior Site Reliability Engineer (SRE)

Job type: Contract
Town/City: London
County: London
Salary/Rate: Inside IR35
Business Sector: IT
Job ref: CDI - 154112
Post Date: March 17, 2026

London/Hybrid

12-month contract (high chance of extension)

Job description:

Join a global pioneer in the video game industry, shaping the future of digital entertainment for millions of players worldwide. As a Senior Site Reliability Engineer, you'll sit at the heart of a high-impact Technical Operations team, driving reliability, scalability, and performance across revenue-critical commerce platforms powering subscriptions and personalised experiences.

You'll collaborate closely with product and engineering teams, influencing architecture, improving deployment safety, and elevating observability, ensuring seamless experiences for a global gaming community.

This is a role where your decisions directly impact live services at massive scale.

Responsibilities:

Reliability & Engineering Excellence

  • Identify and eliminate systemic reliability risks
  • Define and evolve SLIs, SLOs, and error budgets aligned to user and business outcomes
  • Lead major incident response, post-mortems, and long-term remediation

Architecture & Scalability

  • Influence system design for high availability and resilience
  • Drive strategies for multi-region failover and disaster recovery
  • Balance performance, cost, and operational risk

CI/CD & Deployment Safety

  • Enhance pipelines to enable faster, safer releases
  • Implement modern deployment strategies (canary, blue/green, progressive delivery)
  • Build robust rollback and recovery mechanisms

Observability & Performance

  • Develop advanced monitoring across metrics, logs, and tracing
  • Improve signal quality and reduce alert fatigue
  • Troubleshoot and resolve performance bottlenecks

Infrastructure & Automation

  • Operate large-scale cloud-native, containerised systems
  • Build Infrastructure as Code solutions for resilient environments
  • Automate away toil and improve operational efficiency

Leadership & Collaboration

  • Mentor engineers and champion SRE best practices
  • Partner cross-functionally with engineering, product, and security teams
  • Drive a culture of reliability across the organisation

Experience:

  • 7+ years in Site Reliability Engineering, Production Engineering, or Systems Engineering
  • Strong expertise in distributed systems, including failure modes and fault tolerance
  • Proven experience operating cloud platforms (AWS, GCP, or Azure) in multi-region environments
  • Deep knowledge of Kubernetes and container orchestration at scale
  • Strong programming skills (Go, Python, Java) with a focus on automation and tooling
  • Hands-on experience building and managing CI/CD pipelines with safety guardrails
  • Demonstrated success leading high-severity incidents and driving systemic improvements
  • Excellent stakeholder management and ability to influence technical decisions

Preferred experience:

  • Multi-cloud or advanced resilience architecture experience
  • Familiarity with tools like Prometheus, Grafana, or Datadog
  • Experience with Terraform, CloudFormation, or similar IaC tools
  • Exposure to AI-assisted tooling for operations or observability

If you are interested in this role, please feel free to submit your CV!