The ROAD to SRE

Photo by Jaromír Kavan on Unsplash

There are many ways to introduce Site Reliability Engineering practices to your organisation, but it can be confusing to know where to start. Do you start by introducing Service Level Objectives (SLOs) so that you can measure what matters most to your customers? But you need observability (o11y) for that, right? Observability gives you visibility into your code, highlighting bottlenecks, errors and bad designs, and allows you to build the Service Level Indicators (SLIs) behind your SLOs. But what happens if you do have a customer-facing incident? Surely you need a tight incident response process so that you can recover or remediate quickly? Or do you ensure you can deploy both code and infrastructure repeatedly and rapidly? The answer is all of these! But what do you work on first? And how do you know which areas need to level up?

I work in a well-established SRE team at a cloud-native company, where our goal is to help improve product reliability and stability. We achieve this by providing coaching and guidance to our engineering team to improve service observability, infrastructure automation and architecture. This year we began to plan out what the landscape of the SRE practice could look like in the future, and it soon became clear that we had difficulty articulating what the guiding principles of Site Reliability Engineering were and what they meant for our team. Without a clear, tangible definition of Site Reliability Engineering, we were in danger of turning into another ops or platform team.

What we needed was a set of clearly defined principles that would guide us in defining key competencies. Now, there is no shortage of resources out there: you only have to google SRE and you get 43,000,000 results. Yikes! In the past we leveraged the many videos and blog posts from Google. However, we quickly found that not everything could be applied the same way. We were not at Google’s scale, nor did we have that kind of workforce available, as we run one (sometimes two) SREs per Business Unit. So we set out to define what “SRE means to us” and the core principles of our SRE practice. In this article I will discuss the 4 key principles that guide what we define as SRE in our organisation and provide clarity on the areas we need to focus on. In subsequent articles we will break down each of our principles into competencies that form our “SRE skill tree”.

Defining SRE

First we defined Site Reliability Engineering as “a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems to create scalable and highly reliable software systems”. This sounds great, but it doesn’t actually describe what the SRE team does. To address this gap, we defined 4 key principles that would form the focus of how we build out our SRE practice: Response, Observability, Availability and Delivery. Stacked together, they form the ROAD to SRE (pun intended).

Principle 1: Response — The effective response to failure events through incident processes, runbooks and disaster recovery events.

Why is this important? Services and infrastructure can fail at any time, often during the early hours of the morning, so we need a robust response process for handling incidents. And when incidents do occur, we should have an effective remediation process so that we keep learning and avoid repeating the same mistakes. This principle is broken down into the following 3 subsections:

  • Measure: Tracking key metrics like MTTD (mean time to detection), MTTA (mean time to acknowledge), MTTR (mean time to resolution) and MTBF (mean time between failures) to identify improvement opportunities in our process.
  • Remediate: Improving how quickly we can fix issues when they arise, with clear runbooks or automation, or better yet, preventing them in the first place. This also includes blameless postmortems on incidents so that we apply what we learn.
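To make the Measure item concrete, here is a minimal sketch of how these four metrics could be computed from incident timestamps. The `Incident` record and its field names are hypothetical, and the definitions are assumptions, since teams differ on where each clock starts: here MTTD runs from failure start to detection, MTTA from detection to acknowledgement, MTTR from failure start to resolution, and MTBF is the healthy time between one resolution and the next failure.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started: datetime       # when the failure began
    detected: datetime      # when monitoring or a person noticed it
    acknowledged: datetime  # when a responder picked it up
    resolved: datetime      # when service was restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def response_metrics(incidents):
    """Return MTTD, MTTA, MTTR and MTBF in minutes for a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    ordered = sorted(incidents, key=lambda i: i.started)
    # Healthy time between one incident being resolved and the next starting.
    gaps = [nxt.started - prev.resolved
            for prev, nxt in zip(ordered, ordered[1:])]
    return {
        "mttd": _mean_minutes([i.detected - i.started for i in ordered]),
        "mtta": _mean_minutes([i.acknowledged - i.detected for i in ordered]),
        "mttr": _mean_minutes([i.resolved - i.started for i in ordered]),
        "mtbf": _mean_minutes(gaps) if gaps else None,  # needs 2+ incidents
    }
```

Feeding this a quarter's worth of incidents and comparing the numbers over time is one way to see whether detection, acknowledgement or resolution is the slowest part of the process.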

Principle 2: Observability — The ability to understand the health of services and, through telemetry, surface events that impact the availability and performance of product services.

I probably do not need to preach about why good observability (o11y) is important. Without it you are flying blind during an incident, and at best you can only make educated guesses based on inferred knowledge and domain experience. Who wants that stress at 3 am?

Instrumenting observability into your services is critical: it gives you the ability to ask arbitrary questions of your services and pinpoint performance problems much more quickly. Being able to quickly find the root cause of an error makes for a much happier development and SRE team. This principle is broken down into the following 3 subsections:

  • Monitor: Ensuring that we monitor not only what we instrument but also run synthetic checks, ideally ones that can test our SLIs and SLOs (more on this later).
  • Alert: Ensure that all alerts are actionable and ideally linked to an SLO.
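One way to link an alert to an SLO is to alert on how fast the error budget is being spent (the burn rate) rather than on raw error counts. The sketch below is illustrative only: the function names are hypothetical, and the two-window check follows the multiwindow burn-rate idea described in the Google SRE Workbook, where a threshold such as 14.4 over a one-hour window is a commonly cited starting point for a 30-day SLO.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_events == 0:
        return 0.0  # no traffic, so no budget was consumed
    error_budget = 1.0 - slo_target              # allowed failure ratio
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / error_budget


def should_page(long_window, short_window, slo_target, threshold):
    """Page only when both windows burn budget faster than the threshold.

    Each window is a (bad_events, total_events) tuple. The long window
    shows the burn is significant; the short window shows it is still
    happening, so the alert stops firing soon after recovery.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

For example, with a 99.9% SLO, 20 failures in 1,000 requests is a burn rate of about 20, which would page at a threshold of 14.4 provided the short window is also burning. An alert built this way is actionable by construction, because it only fires when the customer-facing objective is genuinely at risk.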

Principle 3: Availability & Reliability — Product services are both available (able to fulfil their intended function) and reliable (able to serve requests with confidence), as governed by the Service Level Objective (SLO) and measured by the Service Level Indicator (SLI).

Understanding how your services handle failure (gracefully or not) and understanding their dependencies helps inform the availability of your services. This can guide how many nines your SLO can promise to customers, or whether you need to halt feature work and focus on making your services more resilient. This principle is broken down into the following 3 subsections:

  • Performant: Understanding our system bottlenecks and capacity to better support our Service Level Objectives.
  • Failure Management: Identifying and testing the failure scenarios of our workloads to improve resiliency.
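To illustrate how dependencies limit the number of nines you can promise, here is a small sketch with two hypothetical helpers. It assumes a 30-day window by default, and it assumes that every dependency is a hard one (the service fails if any dependency fails) and that dependencies fail independently. Real systems rarely satisfy both assumptions exactly, so treat the result as a rough bound rather than a prediction.

```python
import math


def allowed_downtime_minutes(slo_target, window_days=30):
    """Minutes of full downtime an availability SLO tolerates per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def serial_availability(dependency_availabilities):
    """Best-case availability of a service whose hard dependencies must
    all be up at once, assuming they fail independently."""
    return math.prod(dependency_availabilities)
```

A 99.9% SLO leaves roughly 43 minutes of downtime in 30 days. A service that sits on top of three hard dependencies at 99.9% each can expect at best about 99.7% under these assumptions, so promising 99.9% to customers would require either more resilient dependencies or graceful degradation when one of them fails.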

Principle 4: Delivery — Consistent provisioning and deployment of product services and operational dependencies, using practices such as Infrastructure as Code (IaC), Continuous Integration & Delivery (CI/CD) and Requests for Change (RFCs).

The build, provisioning and deployment of our services should be repeatable, as the outcome of an automated build pipeline, and also reproducible, meaning that we can verify that the path from code to binary is free of malicious code. We also want to enable the Product teams to deliver faster by progressively delivering features to production using techniques such as blue/green and canary deployments. This principle is concerned with providing the capability to go fast and deploy rapidly while ensuring that delivery is repeatable and secure. Again, this principle is broken down into the following 3 subsections:

  • Provision: All infrastructure and configuration is defined as code for consistent and repeatable provisioning of cloud infrastructure.
  • Deploy: All service deployments are automated and use progressive deployment strategies such as canary and blue/green deployments.
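As an illustration of the Deploy item, a canary rollout needs an automated gate that decides whether to keep shifting traffic to the new version. The sketch below is one simple way to express such a gate, not a description of any particular deployment tool. The function name, the thresholds and the three outcomes are all assumptions chosen for the example.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=1.5, min_requests=100, error_floor=0.001):
    """Compare the canary's error rate with the stable baseline.

    Returns "wait" until the canary has served enough traffic to judge,
    "promote" if its error rate is within max_ratio of the baseline,
    and "rollback" otherwise. error_floor keeps a near-zero baseline
    from failing the canary on a single stray error.
    """
    if canary_requests < min_requests:
        return "wait"
    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)
    canary_rate = canary_errors / canary_requests
    allowed_rate = max(baseline_rate * max_ratio, error_floor)
    return "promote" if canary_rate <= allowed_rate else "rollback"
```

A pipeline could call this after each traffic step (for example 5%, 25%, 50%, 100%) and stop at the first "rollback". In practice the comparison would usually cover latency as well as errors, and ideally the same SLIs that back the service's SLO.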

Conclusion

By establishing a set of core principles (Response, Observability, Availability and Delivery), aka our “ROAD to SRE”, we now have clarity on the areas our SRE team should focus on, and we can avoid the common pitfall of becoming another platform or ops team. We also now have a yardstick for measuring progress, highlighting new capabilities and identifying the areas where we are lacking. In the upcoming articles we will flesh out how we break down each subcategory of our principles into competencies, in the form of a skill tree, and map them to the 5 phases of organisational reliability of Reactive, Proactive, Strategic and Visionary.

Acknowledgements

Co-created with Marc Armstrong and reviewed by Lee Campbell and Rhys Campbell.