Site Reliability Engineer

  • Partly remote (2 days per week in the office) but can be flexible
  • Preferably within an hour's commute of Leeds, Crawley, or Worthing
  • Permanent, Full-time

Main Purpose of Job

The client is forming a brand-new site reliability team from the ground up. The team will be responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning, and will make existing sites more reliable, efficient, and scalable.

This is an exciting opportunity to join and shape a new team specialized in systems, whether your interest is networking, reverse engineering Windows DLLs, debugging storage latency, the Linux kernel, or something more specific in scaling, algorithms, or distributed systems, using cutting-edge observability tools.

There are several new roles available, at different levels of experience.

Core Duties/Responsibilities

The successful candidate will share several responsibilities within the team, with a strong focus on getting to the root of production issues and ensuring they don't happen again. You will:

  • Work in close collaboration with software development teams to shape the future roadmap and establish strong operational readiness across multiple departments and applications.
  • Proactively identify systems that lack appropriate scaling, high availability, and stability, provide immediate corrective action for incidents, and recommend long-term resolutions.
  • Identify Service Level Indicators (SLIs) that will align the team to meet availability and latency objectives.
  • Be prepared for emergency response, either by being on call or by reacting to symptoms surfaced by monitoring, escalating when needed.
  • Develop our metrics and improve observability, resulting in fewer outages and improved response to customer-impacting incidents.
  • Troubleshoot production issues across varying services and levels of the stack, be it network, storage, operating system, or application.
  • Complete root cause analysis (RCA) investigations and take ownership of issues utilizing end-to-end problem management methods.
  • Improve documentation all around, either in application documentation or in runbooks - explaining the why, not stopping with the what.

There will also be travel to regional sites or offices as required. The successful candidate will need to be available out of hours for critical tasks and events, and to accept rota-based on-call duty.

Skills, Knowledge & Experience

Minimum Requirements:

  • Strong operational background.
  • Must have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
  • Debugging experience across different stacks - kernel, application, and network with tracing tools.
  • Work self-sufficiently but with an urge to collaborate and communicate asynchronously across teams.
  • Experience with .NET and Java application platforms.
  • Think about systems - edge cases, failure modes, behaviours, specific implementations.
  • Data-informed mindset - you use data alongside your experience for problem analysis and resolution.
  • Strong experience in log analytics and observability platforms such as ELK, New Relic, and Grafana, and the ability to construct complex queries.
  • Experience in Windows, Linux, and network administration.
  • Low-level knowledge of kernel, network, storage, compute, TCP/IP, authentication, and encryption.
  • Experience in mentoring and supporting colleagues.
  • Good hands-on knowledge and experience of information security.

Desirable requirements:

  • Proficient in at least one object-oriented programming language.
  • Troubleshooting skills in Docker, Kubernetes, and service mesh such as Istio.
  • Ability to articulate complex issues to business stakeholders in general terms.

Additional Information

The following list of specific technologies and concepts will help you understand the primary technical focus of your position and the platforms deployed across public and private clouds and traditional infrastructure. A balance of your overall skills, knowledge and experience in any of the specific technologies listed will be used to assess your suitability as a candidate.

  • Operating Systems
    • Kernel Architecture
    • Kernel and I/O schedulers
  • Virtualisation
    • Hyper-V, VMware, Xen
    • Software-defined networks
    • Kernel-based filtering systems
  • Application Delivery Controllers ('ADC')
    • F5 BIG-IP (LTM, GTM, ASM, AFM & APM)
    • NGINX
    • Virtual ADCs
  • Networking
    • VXLAN
    • TCP
    • IPsec
  • Hardware
    • x86 platform
    • SAN, vSAN
  • Software
    • Microsoft SQL Server
    • Oracle DB / MySQL
  • Storage
    • Block storage
    • Object storage
  • Automation
    • Understanding of configuration management tools such as Puppet/Chef
    • Understanding of CI/CD tools
    • Understanding of container and virtual machine orchestration platforms
