What actually is the philosophy of Site Reliability Engineering, and how does it contrast.
I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.
Continuing with: Google’s Approach to Service Management: Site Reliability Engineering.
While the nuances of workflows, priorities, and day-to-day operations vary from SRE team to SRE team, all share a set of basic responsibilities for the service(s) they support, and adhere to the same core tenets. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). We have codified rules of engagement and principles for how SRE teams interact with their environment — not only the production environment, but also the product development teams, the testing teams, the users, and so on. Those rules and work practices help us to maintain our focus on engineering work, as opposed to operations work.
Wow, that’s a list. The gem here is “an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
These items are all intricately linked. How can you be responsible for the latency of a service if you’re just deploying someone else’s code? Well: you have capacity planning and monitoring and change management to help you.
Oftentimes our job is to be ‘responsible’ for the fact that things are slower or less available than they should be, and to engage with our dev teams to get things fixed in the long term while applying short-term fixes: perhaps we decrease efficiency by just adding more servers, or decrease reliability to run faster.
The following section discusses each of the core tenets of Google SRE.
Ensuring a Durable Focus on Engineering
As already discussed, Google caps operational work for SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work. In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on. The redirection ends when the operational load drops back to 50% or lower. This also provides an effective feedback mechanism, guiding developers to build systems that don’t need manual intervention. This approach works well when the entire organization — SRE and development alike — understands why the safety valve mechanism exists, and supports the goal of having no overflow events because the product doesn’t generate enough operational load to require it.
I mentioned this in my previous piece: the idea that we have management support to ‘hand back’ products to our developers is a great lever we have.
The 50% cap mentioned here is not accurately measured in any team I have been in at Google. In practice when there is too much operational work to do for a team, we measure concrete things like number of incidents and the number of tickets filed.
The next paragraph starts to tell us how this works.
When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem. If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events. A scenario of pager fatigue also won’t improve with scale. Conversely, if on-call SREs consistently receive fewer than one event per shift, keeping them on point is a waste of their time.
At Google there are no 8-hour on-call shifts in SRE. We work on a 12/12 rotation or a 24-hour rotation. The metric of ‘2 incidents per shift’ applies whether the shift is 12 or 24 hours long. This makes sure that engineers on 24-hour rotations also have enough time to handle their incident management and follow-up.
I don’t like treating this metric as an average. Average pages per shift is not something I track here: if average pages per shift is 0.4, but you have 4–5 days a month with 3+ incidents, I would treat that as an unhealthy pager load and try to address it.
The idea is to make every single day of oncall a reasonable load where every single incident can be handled professionally, not to have any days where you have to prioritize one situation over another.
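To see why an average can hide an unhealthy pager load, here is a minimal sketch. The month of data is made up, and the limit of two events per shift is taken from the guideline quoted above; the function name is hypothetical.

```python
from statistics import mean

# Limit taken from the "two events per shift" guideline quoted above.
MAX_EVENTS_PER_SHIFT = 2


def overloaded_shifts(events_per_shift, limit=MAX_EVENTS_PER_SHIFT):
    """Return the indices of shifts that exceeded the per-shift limit."""
    return [i for i, n in enumerate(events_per_shift) if n > limit]


# Made-up month of 30 shifts: mostly quiet, with four bad days.
month = [0] * 26 + [3, 3, 3, 3]

average = mean(month)            # 12 pages over 30 shifts
bad = overloaded_shifts(month)   # the shifts that broke the limit

print(f"average pages per shift: {average:.1f}")
print(f"shifts over the limit: {len(bad)}")
```

The average looks comfortable, yet four shifts still broke the limit, which is the case worth chasing.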
Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
Postmortems are covered in detail in chapter 15 of the SRE book.
Something that a colleague said to me this week was: “After the incident today, I was starting on the postmortem with the developer, and they added a paragraph to it explaining how they committed a change that broke production, and I had to talk to them about removing it. Because: If you are focusing on finding someone or something to blame, then you are limiting yourself to just one cause. In a postmortem you should list every single thing that went wrong leading up to the failure, not treating it as a place where we admit what we did wrong.”
Pursuing Maximum Change Velocity Without Violating a Service’s SLO
Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. The structural conflict is between pace of innovation and product stability, and as described earlier, this conflict often is expressed indirectly. In SRE we bring this conflict to the fore, and then resolve it with the introduction of an error budget.
My budget: 21 minutes.
The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.
While I was in Ads SRE, an SRE would look at Google’s ads earnings (based on the publicly released earnings data, not on anything confidential), and calculate how much money per year each extra ‘9’ of availability was worth. It demonstrated that sometimes you do care about the 0.001%!
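The book's argument about "noise" from other systems in the path can be checked with quick arithmetic. This is only a sketch: the per-component availabilities below are invented for illustration, and it assumes the components fail independently and that every one of them must work for the user to be served.

```python
import math

# Assumed, illustrative availabilities for everything between user and service.
path = {"laptop": 0.999, "home_wifi": 0.995, "isp": 0.999, "power_grid": 0.9995}


def end_to_end(service_availability, dependencies):
    """Availability seen by the user when every component must work (series)."""
    return service_availability * math.prod(dependencies.values())


five_nines = end_to_end(0.99999, path)
perfect = end_to_end(1.0, path)

print(f"service at 99.999%: user sees {five_nines:.5%}")
print(f"service at 100%:    user sees {perfect:.5%}")
print(f"difference:         {perfect - five_nines:.5%}")
```

With these made-up numbers the user-visible availability is dominated by the path, not by the last 0.001% of the service. Whether that last 0.001% is worth money is a separate question, as the ads example shows.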
If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? This actually isn’t a technical question at all — it’s a product question, which should take the following considerations into account:
What level of availability will the users be happy with, given how they use the product?
What alternatives are available to users who are dissatisfied with the product’s availability?
What happens to users’ usage of the product at different availability levels?
Testing what happens to users’ usage when the product is slightly less reliable is quite easy to do:
if (user->in_experiment_set() && random()<0.0005), it’s just not the kind of test that’s easy to get your manager to approve.
The business or the product must establish the system’s availability target. Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.
Try not to spend it all in one place.
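As a worked example of "one minus the availability target", the sketch below converts a target into minutes of allowed downtime. It assumes a time-based budget over a 30-day window; a real service might instead count failed requests, and the function name is made up.

```python
def error_budget_minutes(availability_target, days=30):
    """Allowed downtime, in minutes, for a window of `days` days."""
    budget_fraction = 1.0 - availability_target  # one minus the availability target
    return budget_fraction * days * 24 * 60


for target in (0.999, 0.9995, 0.9999, 0.99999):
    minutes = error_budget_minutes(target)
    print(f"{target:.3%} target -> {minutes:.2f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of target is a product decision and not a technical one.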
So how do we want to spend the error budget? The development team wants to launch features and attract new users. Ideally, we would spend all of our error budget taking risks with things we launch in order to launch them quickly. This basic premise describes the whole model of error budgets. As soon as SRE activities are conceptualized in this framework, freeing up the error budget through tactics such as phased rollouts and 1% experiments can optimize for quicker launches.
This introduces the idea of phased rollouts: If you roll out features a little bit at a time, and the new feature is 20% unreliable, but only accessible to 5% of your users, that’s only a 1% outage, and only for the length of time it takes to notice.
Having rollouts that are slow means you can roll out potentially unreliable software, but you will only be able to keep inside your error budget if you notice when you’re rolling out a dud.
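The phased rollout arithmetic above can be written down directly. This is a rough sketch: the 30 minutes to notice and roll back is a hypothetical figure, and both function names are invented.

```python
def outage_fraction(failure_rate, exposed_fraction):
    """Fraction of all requests that fail during a partial rollout."""
    return failure_rate * exposed_fraction


def budget_spent_minutes(failure_rate, exposed_fraction, minutes_until_rollback):
    """Equivalent minutes of full outage consumed before the rollback finishes."""
    return outage_fraction(failure_rate, exposed_fraction) * minutes_until_rollback


# The example from the text: 20% unreliable, visible to 5% of users.
print(f"{outage_fraction(0.20, 0.05):.0%} of requests fail")

# Hypothetical: it takes 30 minutes to notice the dud and roll it back.
print(f"{budget_spent_minutes(0.20, 0.05, 30):.2f} full-outage minutes spent")
```

The second number only stays small if the time to notice stays small, which is the point of the paragraph above.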
The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference. An outage is no longer a “bad” thing — it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.
I have legitimately had conversations with my developers along the lines of: “You’re being too careful. Stop triple checking everything and rolling out so slowly.”
Monitoring
Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, monitoring strategy should be constructed thoughtfully. A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
I have a rule: There’s no such thing as an alert that has no action required. If you do encounter one that you think has no action required, your action is automatically to fix that alert.
Various things I count as appropriate actions:
- Delegating responsibility for fixing and followup to another team.
- Filing a bug to follow up on why this happened.
- Educating someone on better ways to do things that don’t cause me to be paged.
- Updating the existing bug with more debugging information.
- Tuning the alert thresholds because the system paged when users were not experiencing pain.
- Deleting the alert because it’s useless and irrelevant.
It is never okay to do nothing.
There are three kinds of valid monitoring output:
Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
This makes every device I own make a great deal of noise. If I don’t respond within 5 minutes, it will fall through to my secondary, whose job it is to take action instead.
Our team has a rule called the “Flowers” rule: if a page falls through to the secondary, something so tragic must have happened to the primary that you will need to send them flowers and a card.
Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
When I’m oncall, I’m lucky if I look at my ticket queue 4 times a day. They’re important: just not time critical. Often they’re also super time consuming because they need a lot of analysis to understand.
“Why the heck are we doing 25% more disk I/O today than a week ago, and hitting throughput limits?”
Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
I’m going to point at this next time someone tries to implement more ‘severity: email’ alerts. All the ones I receive are useless noise.
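The three kinds of valid monitoring output (an alert that pages, a ticket, or a log entry) amount to a small decision rule. Here is a minimal sketch of it. The enum, the function, and the two boolean inputs are all hypothetical; real monitoring systems express this in their own rule languages.

```python
from enum import Enum


class Output(Enum):
    ALERT = "page the on-call now"
    TICKET = "file in the ticket queue"
    LOG = "record only"


def classify(needs_human, urgent):
    """Map a monitoring signal to one of the three valid outputs."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    # Nothing for a human to do, so nobody is notified.
    return Output.LOG


for needs_human, urgent in [(True, True), (True, False), (False, False)]:
    print(needs_human, urgent, "->", classify(needs_human, urgent).value)
```

Note that there is no branch that sends an email for a human to interpret: every signal either demands action or is only recorded.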
Emergency Response
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health (that is, the MTTR).
Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention. When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.” The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better. While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page. Thus, Google SRE relies on on-call playbooks, in addition to exercises such as the “Wheel of Misfortune,” to prepare engineers to react to on-call events.
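To put numbers on the MTTF and MTTR relationship, here is a sketch using the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The failure interval and repair times are invented; only the "roughly 3x" ratio comes from the text.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service that fails about once every 30 days (720 hours).
MTTF = 720.0

winging_it = availability(MTTF, mttr_hours=1.5)     # assumed 90 minutes to repair
with_playbook = availability(MTTF, mttr_hours=0.5)  # the "roughly 3x" improvement

print(f"winging it:    {winging_it:.4%}")
print(f"with playbook: {with_playbook:.4%}")
```

Nothing about the failure rate changed between the two lines. Only the repair time did, and that alone moves the availability figure.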
The quality variation in playbooks can be amazing. Anywhere from “this is so out of date literally nothing here is true”, to “this playbook would take me 45 minutes to read, and the important information was buried in a link in the end of a paragraph near the end”, to “this playbook told me exactly how to diagnose and fix the problem in the first sentence”.
We actually have lots of contrary opinions on different teams about playbooks. One is that you should already know all the operational actions you could possibly take, so you don’t need a playbook; what you need is an accurate monitoring console so you know what action to take.
Not having a playbook is a luxury that only teams who have a very small number of things they support have. Those of us who support much more variety can’t hold it all in our head.
In contrast, my personal opinion is that if the playbook isn’t useful, neither is the alert, and we should either rebuild it or delete both the alert and the playbook, because having alerts without sufficiently useful background information available is so dangerous.
Change Management
SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following:
Implementing progressive rollouts
Quickly and accurately detecting problems
Rolling back changes safely when problems arise
This trio of practices effectively minimizes the aggregate number of users and operations exposed to bad changes. By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.
This section is waaaaaaay too small. I’m sure other chapters will touch on this, but I’m going to add some emphasis here:
Let me repeat this, in bold: 70% of outages are due to changes in a live system. 70% of outages could be avoided by doing nothing! We are our own enemy.
So the basic tool to reduce your error budget spending by up to 70% is to notice when your rollouts are bad and roll back quickly. It’s such a simple thing! How long does it take you to notice it’s a bad release? How long does it take to roll it back? How many people did it affect before the rollback was complete?
These are questions you should absolutely know the answers to.
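The trio of progressive rollout, detection, and rollback can be sketched as a loop. This is an illustration only: the stage fractions, the 1% error threshold, and the two stand-in functions that pretend to be monitoring are all assumptions, not how any particular release tool works.

```python
def progressive_rollout(stages, error_rate_at, max_error_rate=0.01):
    """Widen exposure stage by stage; roll back at the first unhealthy stage.

    stages: increasing fractions of traffic, e.g. [0.01, 0.05, 0.25, 1.0]
    error_rate_at: callable returning the observed error rate at a given exposure
    """
    for fraction in stages:
        observed = error_rate_at(fraction)
        if observed > max_error_rate:
            # Stop widening and report how far the bad release got.
            return {"status": "rolled_back", "max_exposure": fraction}
    return {"status": "complete", "max_exposure": stages[-1]}


# Hypothetical monitoring hooks standing in for real measurements.
def healthy_release(fraction):
    return 0.001


def bad_release(fraction):
    return 0.20


STAGES = [0.01, 0.05, 0.25, 1.0]
print(progressive_rollout(STAGES, healthy_release))
print(progressive_rollout(STAGES, bad_release))
```

The `max_exposure` value in the result answers the last of the questions above: how many people the release reached before the rollback.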
Demand Forecasting and Capacity Planning
Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability. There’s nothing particularly special about these concepts, except that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity is in place by the time it is needed. Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account.
Several steps are mandatory in capacity planning:
An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
This is sometimes as simple as drawing a graph of growth, and using a ruler and squinting.
No joke. It’s sometimes much more accurate than some of the fancy mathematical models I’ve seen applied!
An accurate incorporation of inorganic demand sources into the demand forecast
The biggest time of the year for advertisers is the week before Thanksgiving: every advertising company putting in extra hours to make sure their ad campaigns are spot on for attracting business on Black Friday.
It’s a great example of ‘inorganic demand’.
Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity
Load testing doesn’t have to be done in a synthetic manner: you can look at the behavior of a running system under load to understand its performance, and you often get better numbers.
When you reach boundary conditions (such as CPU starvation or memory limits), things can go catastrophically wrong, so sometimes it’s important to know where those limits are.
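The "ruler and squinting" forecast from the first step is, in effect, a straight-line fit extended past the lead time. Here is a minimal sketch using ordinary least squares. The traffic numbers and the six-month lead time are made up, and organic growth is not always linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up peak queries per second for the last six months.
months = [0, 1, 2, 3, 4, 5]
peak_qps = [1000, 1100, 1190, 1310, 1400, 1500]

slope, intercept = fit_line(months, peak_qps)

lead_time_months = 6  # assumed time needed to acquire capacity
target_month = months[-1] + lead_time_months
forecast = slope * target_month + intercept

print(f"growth: about {slope:.0f} QPS per month")
print(f"forecast for month {target_month}: about {forecast:.0f} QPS")
```

Inorganic demand, such as the Black Friday example, would be added on top of this line as a separate term.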
Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning.
I fundamentally disagree with the words “in charge” here. SRE is at every point fully entitled to hand off any piece of work to another team. The key is that we are responsible for making sure that the capacity planning is done and meets our requirements, but it can be done by anyone.
I have in the past worked with product teams that have done an entirely reasonable job of capacity planning their service, and left it to them. We did however check their work.
It is true that capacity is critical to availability. The most important tickets I receive are capacity tickets. It feels like a lot of the incidents I work on have their root cause in running out of capacity in some subsystem.
Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive. This exercise must also be done correctly or capacity doesn’t work when needed. Adding new capacity often involves spinning up a new instance or location, making significant modification to existing systems (configuration files, load balancers, networking), and validating that the new capacity performs and delivers correct results. Thus, it is a riskier operation than load shifting, which is often done multiple times per hour, and must be treated with a corresponding degree of extra caution.
At Google, provisioning (or ‘turn-ups’ in Google-speak) can be as simple as changing a number and hitting a button, or it could be an expensive and risky multi-day process.
One of the focuses I’ve had in my time here has been to make provisioning (and re-provisioning, if we need to move resources around) as risk-free and painless as possible.
This has never been risk-free or painless; the best I’ve ever managed is “only a little risky” and “not unbearably painful”. This is what happens when you have such a high degree of risk, and services that humans can interact with but that don’t have good programmatic APIs.
One lesson I’ve learned here is: don’t ever try to make a computer do provisioning like a human would. Copying the human’s steps, such as “edit a file, send it for review, run a script, check the console”, is hopeless. Instead, do it in a liquid and constant way: “when utilization > x%, add 1 more server”, “when servers appear in a new location, add them to the load balancer’s backend set”.
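The two quoted rules can be written as small functions that compare observed state with desired state and are meant to be re-evaluated continuously. This is a sketch only: the 75% threshold, the server names, and both function names are hypothetical.

```python
def desired_servers(current, utilization, threshold=0.75):
    """When utilization is above the threshold, add 1 more server."""
    return current + 1 if utilization > threshold else current


def missing_backends(discovered, configured):
    """Servers that exist but are not yet in the load balancer's backend set."""
    return sorted(set(discovered) - set(configured))


# Each call is one pass of a loop that would run all the time.
print(desired_servers(current=10, utilization=0.80))
print(desired_servers(current=10, utilization=0.50))
print(missing_backends(["a1", "a2", "b1"], ["a1", "a2"]))
```

Neither function remembers a sequence of steps. Each one only looks at the current state, so it gives the same answer no matter how the system got there.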
Efficiency and Performance
Efficient use of resources is important any time a service cares about money. Because SRE ultimately controls provisioning, it must also be involved in any work on utilization, as utilization is a function of how a given service works and how it is provisioned. It follows that paying close attention to the provisioning strategy for a service, and therefore its utilization, provides a very, very big lever on the service’s total costs.
Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service’s efficiency.
Software systems become slower as load is added to them. A slowdown in a service equates to a loss of capacity. At some point, a slowing system stops serving, which corresponds to infinite slowness. SREs provision to meet a capacity target at a specific response speed, and thus are keenly interested in a service’s performance. SREs and product developers will (and should) monitor and modify a service to improve its performance, thus adding capacity and improving efficiency.
Before reading this book, and subsequently dissecting it for this blog, I would have answered “no” to the question “Are you responsible for the performance of your binaries?” But I have totally changed my mind now.
In exactly the same way that the reliability of a system is the responsibility of the SREs, with an expectation that they will prevent a new release causing user-visible errors, it is also the responsibility of the SRE team to detect, mitigate and prevent performance regressions.
I do not think my team does a good job of monitoring and preventing performance regressions in new software releases, but we do put a lot of care into analyzing system architecture and making sure, as we go through the Production Readiness Review of new launches, that the system is fast enough to meet our expectations.
The End of the Beginning
Site Reliability Engineering represents a significant break from existing industry best practices for managing large, complicated services. Motivated originally by familiarity — “as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks” — it has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline. The rest of the book explores the SRE Way in detail.
Closing thoughts: This is the end of Chapter 1 of the SRE book. I’ve given my perspective, and we have only scratched the surface of what SRE is about.
I feel like so far this has been pretty much representative of what the job is about, and I’m looking forward to the next chapter: The Production Environment at Google, from the Viewpoint of an SRE.