Building Blocks for Site Reliability Engineering

Sebastian Kirsch
15 min read · Aug 29, 2018


[This is a transcript of a talk that I gave in 2016. I was asked to give an introduction to site reliability engineering for a technical audience unfamiliar with the concept of SRE.]

Before we can talk about the building blocks of site reliability engineering, we have to talk about what site reliability engineering is.

I was at SRECon in Dublin in 2016, which tells you one thing about site reliability engineering: It’s big enough to have its own USENIX conference. But also at SRECon there was a panel discussion entitled “What is SRE?” — and there was one question that was not answered during that entire discussion, and that question is: “What is SRE?” So in that sense the discussion failed its goal.

What is SRE?

If you look at Google’s own material on the topic, you get various sound bites about “What is SRE?”

There’s one from Ben Treynor, Google’s VP of Operations: “SRE is what you get when you treat operations as a software engineering problem.” And there is one from Andrew Widdowson, who runs our education programs for SRE; he says: “Our work is like being part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100 miles per hour.” There’s another one that says “SREs engineer services instead of binaries.”

A couple of co-workers and I have been trying to come up with a brand statement for Site Reliability Engineering, and the best we could come up with is: “Site Reliability Engineering is a specialised job function that focuses on the reliability and maintainability of large systems.”

So SRE is a job function that is specifically geared towards reliability. Google pretty much invented the term, which means that we also get to define it. But talking with people at other companies, it turns out that how companies implement site reliability engineering differs wildly. There are various engagement models for how you can implement this job function: I’ve seen companies with an embedded model, where one or two site reliability engineers are embedded into product engineering teams. There are companies that don’t really have SRE at all: they have something like production engineering that does not engage with the product at all, but rather builds infrastructure and platforms for achieving reliable systems.

What is SRE … at Google?

I’m just going to talk about how Google implements SRE. The way Google does it is the following:

SRE at Google is its own department, part of Google’s technical infrastructure organisation. SRE makes up less than 10% of Google’s engineering staff; historically it has been between five and ten percent. Individual SRE teams partner with product engineering teams in a co-ownership model of the services. Usually it’s a one-to-many mapping, so one SRE team is responsible for working with a number of services.

The size of the SRE teams compared to the size of the product engineering teams also varies widely. The best ratio I’ve ever experienced was one to five between SREs and product engineers. That was incredible — you basically knew everybody by name. But I’ve also been in teams with a ratio of 1 to 50 between site reliability engineers and product engineers.

One of the things that makes SRE hard to implement in the Google model is that SRE teams come with a minimum size. If you have an SRE team in a single location, you need at least eight people. If you have an SRE team that is split across two locations, then we try to have at least six-plus-six: six people in each location.

The reason for that is that SRE does emergency response for the services supported by a team, and we try not to burn our people out on emergency response. If you want a sustainable rotation for emergency response, you need a certain number of people: you don’t want people to be on call all the time, or even one week out of every two or three. So we aim for at least the eight-person or six-plus-six model. (With eight people running both a primary and a secondary rotation, for example, each engineer is on call roughly one week in four.)

This also means that SRE teams are very often not co-located with the product engineering teams. In a single-site model you can still achieve that, but as soon as you go to multiple sites, with product engineering in a single site and site reliability in at least two, then you’re not going to have colocation. For a lot of product engineering teams this is a big change in approach: they’re used to having everybody they interact with sit down the corridor from them, and suddenly they need to interact with people in a different time zone.

SRE recruits from both software engineering and systems engineering backgrounds, and there are no organisational barriers between product engineering and SRE for software engineers. If you get hired by Google as a software engineer, you can freely move between product engineering and site reliability engineering.

The other background that SRE recruits from is systems engineering or systems administration. We basically recruit on the entire spectrum between software engineering and systems engineering, so we also recruit a lot of people with a mixed background: People that maybe don’t quite meet the software engineering bar at Google, but make up for this by being really good systems engineers.

The reason why we recruit for this background is that there’s a mix of operational and engineering work in SRE. There is a cap on operational work for individuals and SRE teams—the official cap is 50 percent operational load. This is where we pull the emergency brake and offload operational work to the product engineering teams. Healthy teams have a much lower operational load than 50%: usually we aim for 10 to 15 percent operational load.

What about DevOps?

What about DevOps? That’s a term that very often gets thrown into the mix once you start talking about site reliability engineering. I’ve noticed that Google has this institutional blindness when it comes to the term DevOps. Because we look at the DevOps principles, and we look at the DevOps toolchain, and we look at the DevOps lifecycle, and we tilt our head and go: “Wait, isn’t that how you do things?”

Because Google has no traditional IT, Google does not have Ops. All of Dev at Google is DevOps. We simply do not operate in a model where we have traditional IT, so pretty much all of the DevOps toolchain is stuff that we do anyway. A large part of the DevOps toolchain is covered by site reliability engineering—especially when we’re talking about automation, monitoring, or releases.

How do you get a reliable service?

That being said, let’s look at the building blocks: let’s look at how you get a reliable service.

I said that site reliability engineering is a specialised job function focused on reliability. That means that SRE is likely to engage in fields that make the biggest difference to the reliability of a system. And these are basically the building blocks for SRE: What do you need to do if you want to get a reliable service?

Monitoring and Alerting

First of all: if you want to have a reliable service, you need to know how reliable it actually is. That means you need monitoring: you need something that tells you “How reliable is my service at this point in time?” The same goes for improving reliability: unless you know how reliable it is, you have nothing to improve. And unless you have alerting, you can’t actually step in when things go wrong—unless people call you on the phone and tell you that your site is broken. But you don’t really want to get to that point—you want to know that your site is broken before your customers notice.

All of these things make monitoring and alerting a very attractive target for SRE, so SRE typically writes the monitoring for their own services, writes the instrumentation, does blackbox monitoring for the services, reviews and tunes alerts and so on—SRE covers the whole telemetry and instrumentation part.
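
To make that concrete, here is a minimal sketch of what a blackbox probe can look like, in Python. The URL, probe interval, and alert threshold are made up for illustration; a real monitoring stack is of course far more involved than this loop.

    import time
    import urllib.request
    from collections import deque

    # Hypothetical endpoint and thresholds, for illustration only.
    PROBE_URL = "https://example.com/healthz"
    WINDOW = 60                # number of recent probes to look at
    SUCCESS_THRESHOLD = 0.99   # alert when probe success drops below 99%

    results = deque(maxlen=WINDOW)

    def probe():
        """Return True if the endpoint answers with HTTP 200 within two seconds."""
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=2) as resp:
                return resp.status == 200
        except Exception:
            return False

    while True:
        results.append(probe())
        success_ratio = sum(results) / len(results)
        if len(results) == WINDOW and success_ratio < SUCCESS_THRESHOLD:
            print(f"ALERT: probe success {success_ratio:.2%} is below threshold")
        time.sleep(10)         # probe every ten seconds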

Service Level Objectives

What else do you need? You need service level objectives (SLOs). Monitoring tells you how reliable your service is; service level objectives tell you how reliable you want it to be. Service level objectives serve as the goal for the SRE team to strive for, and also as a reference point for how much we actually want to invest into reliability.

A couple of words on service level objectives: typically, SLOs should be based on customer expectations. One trap that you very often fall into when you try to set an SLO is that you look at how reliable the system is at that point in time, and you say, “Well, that sounds like a number we can reach”—and then you set that as your SLO. And two years later you have forgotten where that number came from, and you’ll be running a four-nines availability service—and maybe your reliability is degrading and you’re starting to invest lots of effort into improving it.

And then you look at your customers, and you realize: “Oh wait, they actually won’t notice if we drop a nine or two”, and all this work that we’re doing is really for nothing. So you need to set them based on customer requirements, and you need to document where they’re coming from. Otherwise they turn into these magic numbers that nobody is willing to change.

You also need a certain amount of buffer in the service level objectives. Say you committed in your service level objectives to running a service at 200 milliseconds average latency. If you consistently run it at 190 milliseconds, then you have no room for error—and no room for error means there’s no room for change. You can’t change the system, because that might endanger you meeting your service level objectives.
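
As a back-of-the-envelope illustration of that buffer (and of the related idea of an error budget), here is a small Python sketch. It uses the 200 millisecond example above plus a hypothetical 99.9% availability target; all numbers are purely illustrative.

    # Latency example from the talk: committing to 200 ms but running at 190 ms
    # leaves only 5% headroom, which means very little room for risky changes.
    slo_latency_ms = 200.0
    measured_latency_ms = 190.0
    headroom = (slo_latency_ms - measured_latency_ms) / slo_latency_ms
    print(f"Latency headroom: {headroom:.0%}")                             # 5%

    # Availability example (hypothetical target): a 99.9% monthly SLO leaves
    # roughly 43 minutes of error budget per 30-day month to spend on outages.
    availability_slo = 0.999
    error_budget_minutes = (1 - availability_slo) * 30 * 24 * 60
    print(f"Error budget: {error_budget_minutes:.1f} minutes per month")   # 43.2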

Automation

The other thing that we need is automation. When we are talking about site reliability engineering, automation has two facets. One of them is that at the scale we’re talking about, automation is a sheer necessity: we cannot run our systems manually. We need automation—it doesn’t work otherwise. Automation is what allows an operations team to scale. But the other point is: how does automation interact with reliability?

The way automation interacts with reliability is that automation takes the human error out of performing processes. It also introduces the possibility of computer error, and it turns out computers are far more efficient at making mistakes than humans are, and they can make mistakes at a far larger scale and also faster—but with a crucial difference: Automation is testable, processes executed by a human are not testable.

Automation means that you can address the reliability of your automation the same way you address any other software problem: by testing. Google had a phase in its SRE development where we focused a lot on automating existing processes. Basically, you started with a manual failover procedure that somebody wrote down and that you followed whenever you needed to fail over to a different datacenter. You went from that to a script, and then to a fancier script, and a fancier script, and frameworks around the scripts, and more and more tests, and so on.
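
The difference that testability makes is easiest to see in code. The sketch below is not Google's failover tooling (the function and the datacenter names are hypothetical), but it shows how a step that used to be a line in a runbook becomes something you can put a unit test around:

    def choose_failover_target(datacenter_health, primary):
        """Pick a healthy datacenter to fail over to, excluding the current primary."""
        candidates = [dc for dc, healthy in datacenter_health.items()
                      if healthy and dc != primary]
        if not candidates:
            raise RuntimeError("no healthy failover target available")
        return sorted(candidates)[0]   # deterministic choice keeps the test simple

    def test_choose_failover_target():
        health = {"dc-a": False, "dc-b": True, "dc-c": True}
        assert choose_failover_target(health, primary="dc-a") == "dc-b"

    test_choose_failover_target()   # a runbook step executed by a human has no equivalent check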

At some point we realized that automating processes designed for a human to execute does not really scale, and these days we’re investing a lot more effort into building systems that don’t need that: Systems that are either designed with automation in mind from the beginning, or that don’t need automation at all.

To pick the example of the manual failover: we can design a system for manual failover, and we can automate the steps needed for that failover. But we can also design a system that runs in a hot-hot configuration, where all instances of the system take part of the load. If you spread the load around enough, then one of the instances going down is not a failover anymore. It’s just a loss of capacity. We do capacity adjustments all the time anyway, and the system is just going to deal with that in the course of its normal operations. So at that point we are ending up with a system that does not need automation — because what we used to automate is part of the normal operations of that system.
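
A toy sketch of that difference, with made-up instance names and traffic numbers: in a hot-hot design, the failover step disappears, because losing an instance just changes each remaining instance's share of the load.

    def load_shares(healthy_instances, total_qps):
        """Spread total traffic evenly over whatever instances are currently healthy."""
        per_instance = total_qps / len(healthy_instances)
        return {name: per_instance for name in healthy_instances}

    print(load_shares(["us", "eu", "asia"], total_qps=9000))   # 3000 QPS each
    print(load_shares(["us", "asia"], total_qps=9000))         # "eu" is down: 4500 QPS each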

Releases and Config Management

If releases and config changes are the thing that causes most of your outages, then they are an attractive target for site reliability engineering. Early on in Google SRE, site reliability engineers would often act as gatekeepers for changes to the production systems. We would get new binaries to be deployed from the product engineering teams, we would get configuration changes from the product engineering teams, and site reliability engineers would review these, test them, roll them out carefully, monitor them, and make sure that everything went well.

If you’re in an early stage of the life cycle of an application, this is an incredibly effective way of improving reliability. It also doesn’t scale. You burn a lot of manpower just on playing the gatekeeper function for the production systems. So in recent years what site reliability has focused on has been mostly building infrastructure that allows you to make these kinds of changes safely and reliably. Once we have the infrastructure up to a point where the changes are either safe, or they will get rolled back automatically — then we can just give the responsibility for those changes back to the product engineering teams. Which makes our site reliability engineers happy because they don’t have to burn their own manpower on making these changes, and it makes the product engineering teams happy because they don’t have to talk with site reliability anytime they want to make a change. And everybody is happy.
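
One common shape for that kind of infrastructure is a canary pipeline. The sketch below is a hypothetical skeleton rather than Google's release tooling; deploy, rollback, and get_error_rate stand in for whatever hooks your release system provides.

    CANARY_ERROR_BUDGET = 0.001   # tolerate at most 0.1% errors on the canary slice

    def release(deploy, rollback, get_error_rate):
        """Push a change to a small canary first; promote it only if the canary stays healthy."""
        deploy("canary")                                    # new binary on a small slice of traffic
        if get_error_rate("canary") <= CANARY_ERROR_BUDGET:
            deploy("production")                            # looks safe, roll it out everywhere
        else:
            rollback("canary")                              # automatic rollback, no human gatekeeper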

Oncall and Emergency Response

We talked about monitoring and alerting — the thing that allows you to detect when something is going wrong in your systems. Once you’ve detected it, you need to fix the problem. For the services supported by SRE at Google, it’s usually some site reliability engineers that get paged and then do emergency response. So basically any time something is wrong with Google web search, Gmail, whatever, probably some SRE’s pager is going off somewhere.

There are also some organisations that practice a shared on-call model, where you have members of site reliability engineering and product engineering sharing responsibility for emergency response. It turns out that because of the size mismatch between site reliability teams and product engineering teams, this is surprisingly hard to implement: If you have a small site reliability team — six to twelve people — then everybody is on-call reasonably frequently, at least once every two months, and people are very well trained and very well versed in emergency response. If you try to scale this to a 600 person product engineering team, this is not going to work: people are never going to gain the expertise they need for emergency response, because they’re never on call frequently enough. So this mostly works in very small product engineering teams.

What a lot of product engineering organisations do is that they employ a secondary on-call rotation as an escalation point for site reliability engineering. If there is a problem that site reliability can’t solve, then we can escalate to the product engineering team.

Site reliability only tends to engage with services that need an on-call response time of 30 minutes or less. Anything with a looser response time than that is usually not worth staffing a site reliability team for — because of the minimum team size that I mentioned. SREs are incentivised to keep the number of incidents per shift below two. Healthy teams, again, have a much lower number of incidents per shift — usually around 0.1 to 0.2.
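
Whether a team is inside those limits is a simple calculation over the pager log. Here is a sketch with made-up shift data:

    # (shift, number of incidents paged during that shift); illustrative numbers
    shifts = [("week-1", 0), ("week-2", 1), ("week-3", 3), ("week-4", 0)]

    average = sum(count for _, count in shifts) / len(shifts)
    print(f"Average incidents per shift: {average:.1f}")   # 1.0 here, under the cap of 2

    if average >= 2:
        print("Over the cap: hand operational work back to product engineering")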

One of the advantages of having on-call and emergency response in site reliability engineering I’ve already alluded to: you get a small team that is very well trained and very well practiced in this kind of emergency response. There is another, more subtle advantage of having emergency response in a different team: it forces the product engineering teams to get their services into a shape where they can reasonably be taken care of by somebody who is not a core product developer. Usually when SRE starts engaging with a new service, that service goes through an intense period of hardening and of improving the processes around running it, so that a different team can meaningfully do on-call response for it.

Capacity Planning

How does capacity planning interact with reliability? If you run out of capacity and overload your service, you are going to lose reliability really quickly. Capacity planning traditionally is also site reliability engineering territory. Site reliability engineering does the demand assessment and the forecasting, assesses capacity by load testing, does the provisioning, and also maintains all the infrastructure needed to do this automatically.
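
The core provisioning arithmetic is simple, even if the forecasting behind it is not. Here is a sketch with invented numbers and an assumed N+2 redundancy policy:

    import math

    forecast_peak_qps = 120_000   # from demand forecasting
    qps_per_instance = 2_500      # measured by load testing
    redundancy_spares = 2         # assumed N+2 policy: survive one outage plus one upgrade

    instances_needed = math.ceil(forecast_peak_qps / qps_per_instance) + redundancy_spares
    print(f"Provision {instances_needed} instances")   # ceil(48) + 2 = 50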

Data Integrity

Data integrity is possibly one that doesn’t quite seem to fit in here, because — if the serving systems are up and running, and if they’re producing valid responses, and if they have capacity, etc. — then what does the data have to do with that?

From the user’s perspective, data integrity problems and service reliability problems are indistinguishable from each other. The user doesn’t really care whether they can’t get their email because your front-end is broken or because there was data corruption in your database. From the user’s perspective, the result is the same — they can’t get their email.

That’s the reason why site reliability also engages with data integrity. Site reliability does things like making sure that all the data stores of a service are backed up, that we have restore procedures, and that the restore procedures are tested and executed regularly. Some services have automated continuous restore pipelines, where we are basically restoring a small part of the data all the time and making sure that what we get back from the backup system is what we actually expect. SRE also often maintains the pipelines that periodically verify the data integrity of the data stores.
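
A continuous restore pipeline boils down to something like the sketch below. The restore_record and expected_checksum functions are hypothetical hooks into your backup and serving systems; the point is that the comparison runs all the time, not only during a disaster.

    import hashlib
    import random

    def checksum(blob):
        return hashlib.sha256(blob).hexdigest()

    def verify_restore_sample(record_ids, restore_record, expected_checksum, sample_size=100):
        """Restore a random sample of records from backup; return the IDs that don't match."""
        sample = random.sample(list(record_ids), min(sample_size, len(record_ids)))
        return [rid for rid in sample
                if checksum(restore_record(rid)) != expected_checksum(rid)]

    # An empty result means the backups we would rely on in an emergency actually
    # contain the data we think they do.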

What’s next for SRE?

That’s a small sample of building blocks for SRE — some of the parts that SRE engages in. Let’s talk about what’s next for SRE. What are the big challenges for SRE?

One of the problems for Google SRE right now is that — we’ve solved most of the problems. Well, that is not the problem. The problem for Google SRE is that most of the problems we’ve already solved several times in different parts of the company. So the big challenge is — how do we take the accumulated knowledge of more than 10 years of SRE and package it into something that scales across the entire organisation? So that we don’t need to solve the same problems over and over again?

I spent part of my career at Google in a department called Launch Coordination Engineering. Launch Coordination Engineering is a part of site reliability that engages with new products and new services at launch time — basically making sure that they’re reliable from the start, so that we don’t launch train wrecks from a reliability perspective. This was four to five years ago — and one part of that job was looking at a product engineering team’s requirements, listening to what they wanted and what problems they were trying to solve, and then saying: “Oh yes, this engineering team in this other part of the company has already solved that. Go talk with them! Oh, this problem — yes, that engineering team has solved that one. And this one, this other engineering team has solved. And for the other four, we have standard libraries you can just use.”

This is the mode of operation that we want to get out of. We don’t want to have this hands-on approach — this hands-on attention to every service that we run — because the number of services at Google is growing a lot faster than the number of engineers. And we want to accelerate that growth even further: we want to move to more microservices and automate more of the service management and the service deployment. We cannot afford to give this sort of individualised attention to every service.

So how do we make that scale?

Two ways. One of them is baking more than ten years of SRE experience into Google’s frameworks and into Google’s libraries, so that the engineer doesn’t have to look for solutions or write them themselves: we can just turn them on, because they’re in the framework anyway. Or, in the best case, they don’t have to do anything, because these solutions are on by default everywhere and nobody has to think about them. So one approach is baking it into the basic libraries and frameworks; the other approach is standardising, standardising, standardising.

We’re at a point in SRE where it doesn’t make sense for us to solve problems for a single team anymore. It doesn’t even make sense for us to solve problems for a single product area. If we want to solve a problem, it has to be Google-sized. It has to be able to scale to the entire organisation, to the entire company — and one way of doing that is standardising: providing a production platform for the product engineering teams to run their services on, where SRE provides support for the platform and the product engineering teams run their products on that platform.

I’ve mentioned a couple of times in this talk that site reliability used to do things a certain way, and now we do things differently. We have plans for the future; we have plans for how to scale our work. In reality, “What is site reliability engineering?” has no fixed answer, because the targets that are most valuable to site reliability engineering change over the years. We’re trying to adapt what site reliability engineering does to the changing environments that we operate in.

Sebastian Kirsch

I'm a Site Reliability Engineer for Google. I speak at SRECon and other venues. Opinions are my own, not my employer's.