I saw two emotionally charged opinions on Twitter this week about SRE and Operations. They really made me think.

The first tweet I saw was from @tmclaughbos, who makes the impassioned case that doing a Kubernetes (EKS) migration is not an SRE activity.

The thing that really made me think was Tom saying “That sounds like just an operations team.”

I am a Site Reliability Engineer at Google. I have run some of the world’s largest applications at scale since 2011, and Site Reliability Engineering does Operations.

Later that week I saw Mar⚡️us write about the dev/ops split in SRE going badly. Mar⚡️us is referring to the principle that an SRE team should spend less than 50% of its time doing ‘Toil’ or repetitive work (SRE Book Chapter 5: Eliminating Toil, “Why Less Toil Is…


I put out a call on Twitter asking what I should write about. Two of the responses resonated with me.

These are things I’ve been thinking about a lot because I’m just about to launch into a new role once I get back to London after my vacation, where I will join the new Google Customer Reliability Engineering team. A large part of that team’s job is taking SRE principles and convincing those who have never experienced them before to apply them.

The opinions stated here are my own, not those of my company.

Getting an organisation to accept SRE principles starts with measuring your customer’s experience. How well is your critical application performing right now? Is it broken? Is it up? How often is it down? …


How do you release software in a safe way, with reliability in mind? How do you bring together your development process with SRE practices for getting that software out to your customers, without introducing unneeded complexity or fragile systems?

Scene from Google Cloud Next London 2017

I have been giving commentary on the SRE book, which can be found online. Release Engineering is Chapter 8. At the request of @asatarin, one of my Twitter followers, I am skipping ahead to it.

A lot of the content in this chapter is not a manual for how to do good releases, but rather a description of what Google does. …


Service Level Objectives, or SLOs, are the fundamental basis of all Site Reliability Engineering. Without them you can’t have error budgets, prioritize development work, or do timely and effective incident management.
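
To make the link between an SLO and an error budget concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO; the function name and the example numbers are hypothetical and chosen only for illustration.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo    -- target fraction of successful requests, e.g. 0.999
    total  -- requests served in the measurement window
    failed -- requests that failed in the same window
    """
    allowed_failures = (1.0 - slo) * total
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% (or an empty window) leaves no error budget")
    return 1.0 - failed / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
# The SLO allows about 1,000 failures, so roughly three quarters of the budget is left.
remaining = error_budget_remaining(slo=0.999, total=1_000_000, failed=250)
print(f"{remaining:.0%} of the error budget remains")
```

Without the SLO there is no `allowed_failures` figure, and so nothing to weigh the 250 failures against.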

Objectives in Practice

This is from Chapter 4 of the SRE book: Service Level Objectives.

Start by thinking about (or finding out!) what your users care about, not what you can measure. Often, what your users care about is difficult or impossible to measure, so you’ll end up approximating users’ needs in some way. However, if you simply start with what’s easy to measure, you’ll end up with less useful SLOs. …


How well is your system working, right now?

Do you even know how well your system is working? How many errors are you serving? Would you even know if 1% of your requests time out?

Service level indicators are literally the most important piece you need in order to apply SRE principles. Even if you think you have them, they might not be high quality enough for you to accurately gauge your customer’s experience, and they will mislead you.
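
As a rough illustration of what an availability indicator looks like, the sketch below computes the fraction of requests that succeeded. The function name and the request counts are assumptions made for the example, not figures from any real system.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("need at least one request to compute an SLI")
    return (total_requests - failed_requests) / total_requests


# Hypothetical hour of traffic in which 1% of requests timed out.
sli = availability_sli(total_requests=200_000, failed_requests=2_000)
print(f"availability: {sli:.2%}")
```

If timeouts are not counted as failures where the measurement is taken, this number looks healthier than what customers actually experience, which is how a low quality indicator misleads you.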

Indicators in Practice

This is from Chapter 4 of the SRE book: Service Level Objectives.

Given that we’ve made the case for why choosing appropriate metrics to measure your service is important, how do you go about identifying what metrics are meaningful to your service or system? …


Planned outages can make systems at Google more reliable.

In Embracing Risk, I wrote about how making a system unavailable on a regular basis means you can be sure that people know how to cope when it’s down. The context there was a system that management judged deserved only 99% reliability.

A system that needs a lock service.

Chubby, Google’s lock service, sits at the other end of the spectrum. An open source example of a lock server is Apache ZooKeeper.

Imagine that Chubby runs with an SLO of 99.99%, which allows almost 13 minutes of downtime per quarter.
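
The 13 minute figure can be checked with a little arithmetic. This is a small sketch that assumes a quarter is 90 days long; the function name is made up for the example.

```python
def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    """Minutes of complete downtime an availability SLO permits over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


QUARTER_DAYS = 90  # assumption: a quarter is treated as 90 days

# 0.01% of 129,600 minutes is just under 13 minutes.
print(round(allowed_downtime_minutes(0.9999, QUARTER_DAYS), 2))
```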

Fortunately, this system is actually so stable and well behaved that it is often much more reliable than that. …


Definitions of what an SLI and an SLO are, and how to define one.

I am a Site Reliability Engineer at Google, annotating the SRE book on Medium. The opinions stated here are my own, not those of my company.

This is from Chapter 4: Service Level Objectives.

Service Level Objectives

Written by Chris Jones, John Wilkes, and Niall Murphy with Cody Smith
Edited by Betsy Beyer

I worked with Chris for 3 years.

It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. …


Error budgets represent the amount of failure we expect to actually have.

I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.

This is commentary on the last section of Chapter 3: Embracing Risk.

Motivation for Error Budgets

Written by Mark Roth
Edited by Carmela Quinito

Other chapters in this book discuss how tensions can arise between product development teams and SRE teams, given that they are generally evaluated on different metrics. Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change. Information asymmetry between the two teams further amplifies this inherent tension. …


How to decide how fault tolerant you really want to be and defining the value of reliability.

I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.

This is commentary on the second part of Chapter 3: Embracing Risk. Written by Marc Alvidrez, edited by Kavita Guliani.

Risk Tolerance of Services

What does it mean to identify the risk tolerance of a service? In a formal environment or in the case of safety-critical systems, the risk tolerance of services is typically built directly into the basic product or service definition. At Google, services’ risk tolerance tends to be less clearly defined.

Inside Google, identifying risk tolerance is getting better. A centralized dashboard showing system reliability thresholds has been very useful. This has the added benefit that it’s a “living document” — it shows the reliability goals of systems, as well as their current behavior! …


In-order index of all my published articles on the SRE book.

I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.

Chapter 1: Introduction

Chapter 2: The Production Environment at Google, from the Viewpoint of an SRE

Chapter 3: Embracing Risk

Chapter 4: Service Level Objectives

Chapter 8: Release Engineering

About

Stephen Thorne

Stephen is a Staff Site Reliability Engineer at Google, where he works on the Google Cloud Platform.
