Motivation for Error Budgets
Error budgets represent the amount of failure we expect to actually have.
I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.
This is commentary on the last section of Chapter 3: Embracing Risk.
Motivation for Error Budgets
Written by Mark Roth
Edited by Carmela Quinito
Other chapters in this book discuss how tensions can arise between product development teams and SRE teams, given that they are generally evaluated on different metrics. Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change. Information asymmetry between the two teams further amplifies this inherent tension. The product developers have more visibility into the time and effort involved in writing and releasing their code, while the SREs have more visibility into the service’s reliability (and the state of production in general).
My performance has never been linked to the reliability of a service, only to my response to the service’s unreliability. This is an important distinction in light of the passage above: “SRE performance is … evaluated based upon reliability of a service”. The environment at Google is really infectious: individuals are not blamed for failures, especially in SRE.
For instance: if someone breaks a system by running a command and takes it down for an hour, we treat that as a systemic failure, and the team as a whole takes responsibility for making sure it never happens again. The team is accountable for making that class of failure impossible in the future, not for having broken it accidentally in the past.
These tensions often reflect themselves in different opinions about the level of effort that should be put into engineering practices. The following list presents some typical tensions:
Circling back: in previous chapters we’ve talked about the need of product teams to ship new features, and the desire of systems administrators (or similar) for nothing to ever change.
Software fault tolerance
How hardened do we make the software to unexpected events? Too little, and we have a brittle, unusable product. Too much, and we have a product no one wants to use (but that runs very stably).
I had to think of an example of a stable product that no one would want to use because it’s too fault tolerant, so I came up with two:
- A system that serves a website, but the website cannot be updated quickly, because all the content must first be distributed globally to a network of independent web servers, which only serve static content.
- A database system that guarantees extremely fast, stable write semantics, but that can’t be queried until the nightly maintenance window, when all the data is transcribed from journals to the serving index.
These would be extremely stable, highly reliable systems.
Testing
Again, not enough testing and you have embarrassing outages, privacy data leaks, or a number of other press-worthy events. Too much testing, and you might lose your market.
I think this means “Testing takes time.” I don’t think there’s an actual causal relationship between good testing and losing your market. :)
You can use your error budget in lieu of exhaustive testing.
Push frequency
Every push is risky. How much should we work on reducing that risk, versus doing other work?
Pushes carry two risks: downtime from the push itself (i.e. restarting servers, dropping queries while the server is offline), and the chance that the new release contains bad code.
It might also be a long time before a code bug from a push is found, which makes pushes even riskier.
Canary duration and size
It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?
Canarying is the practice of running the old version and the new version of the software side by side, so that you can be sure the new version is operating correctly. Often you can tell this quite quickly (it’s crashing), or by analysing metrics (which is much harder to do programmatically: Student’s t-tests are beyond my skills in mathematics).
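For the curious, the metric comparison can be sketched without much mathematics. This is a minimal illustration using only Python’s standard library, not how any Google system does it: it computes Welch’s t statistic (a variant of Student’s t-test that does not assume equal variances) on latency samples from the old and new versions. The sample data, the function names and the cutoff of 3.0 are all assumptions made up for this example; a real analysis would turn the statistic into a p-value and look at more than one metric.

```python
from math import sqrt
from statistics import mean, variance

T_CUTOFF = 3.0  # assumed, arbitrary cutoff for "the canary looks different"


def welch_t(baseline: list[float], canary: list[float]) -> float:
    """Welch's t statistic for the difference in means (canary minus baseline)."""
    standard_error = sqrt(
        variance(baseline) / len(baseline) + variance(canary) / len(canary)
    )
    return (mean(canary) - mean(baseline)) / standard_error


def looks_different(baseline: list[float], canary: list[float]) -> bool:
    """Flag the canary when its mean differs strongly from the baseline."""
    return abs(welch_t(baseline, canary)) > T_CUTOFF


# Made-up latency samples in milliseconds.
baseline = [50, 52, 49, 51, 50, 48]
healthy_canary = [50, 51, 49, 52, 50, 49]
slow_canary = [70, 75, 68, 72, 74, 71]

print(looks_different(baseline, healthy_canary))  # False
print(looks_different(baseline, slow_canary))     # True
```

With so few samples the statistic is noisy, which is one reason the canary duration and size matter.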
Usually, preexisting teams have worked out some kind of informal balance between them as to where the risk/effort boundary lies. Unfortunately, one can rarely prove that this balance is optimal, rather than just a function of the negotiating skills of the engineers involved. Nor should such decisions be driven by politics, fear, or hope. (Indeed, Google SRE’s unofficial motto is “Hope is not a strategy.”) Instead, our goal is to define an objective metric, agreed upon by both sides, that can be used to guide the negotiations in a reproducible way. The more data-based the decision can be, the better.
Canarying is of course the simplest case of a risk-averse rollout procedure. You might canary in just one stage and then roll out everything else, or perhaps roll out in successive stages over an entire week.
When you have a very tight error budget, rolling out to small fractions of your users first makes rollouts feel dreadfully slow, but it helps immensely with keeping failure contained.
For more information about what a canary is, and how they might benefit you: Adrian Hilton wrote a recent article on the Google Cloud Platform blog “How release canaries can save your bacon — CRE life lessons”.
Forming Your Error Budget
In order to base these decisions on objective data, the two teams jointly define a quarterly error budget based on the service’s service level objective, or SLO (see Service Level Objectives). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
By “removes the politics” we really mean that all high-level stakeholders agree ahead of time on what’s important, so that negotiations are around pre-existing value judgements: “Abbreviating QA is too risky, we’re out of error budget” or, “We’ve got lots of room in the error budget, let’s go from weekly to daily releases.”
Our practice is then as follows:
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
- As long as the uptime measured is above the SLO — in other words, as long as there is error budget remaining — new releases can be pushed.
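The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Google’s tooling: the function names and the sample figures are assumptions made up for the example, and the measured uptime would come from your monitoring system.

```python
def remaining_budget(slo: float, measured_uptime: float) -> float:
    """Return the unspent error budget as a fraction of total time.

    Both arguments are fractions in [0, 1], e.g. 0.999 for 99.9%.
    A negative result means the SLO has been violated.
    """
    return measured_uptime - slo


def can_release(slo: float, measured_uptime: float) -> bool:
    """Releases may proceed only while measured uptime is above the SLO."""
    return remaining_budget(slo, measured_uptime) > 0


# Hypothetical figures: a 99.9% SLO, checked against two measurements.
print(can_release(slo=0.999, measured_uptime=0.9995))  # True
print(can_release(slo=0.999, measured_uptime=0.9985))  # False
```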
My personal view of this list is that it’s out of order. The correct thing to do is to measure first (obtain an accurate SLI, or Service Level Indicator), and only then define an appropriate SLO. Don’t start by defining the SLO and then find out later what you can measure!
The SLI should be measured, well understood and vetted to make sure that it’s an appropriate analog of the user experience. Then, ideally, an SLO should be picked at a point between the historical behavior of the system when it is functioning well and the point where the customer will notice there’s something wrong. For example:
- Measure the latency of a system.
- Historically, it responds to all queries in under 50ms, but sometimes they can be as slow as 90ms.
- Users notice when it gets as slow as 200ms.
- Define your SLO at 100ms, because that’s slower than the system normally runs, but faster than the point where users notice. Any production issue causing extra latency will then spend your budget, but you won’t consume any of it in normal operation.
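As a rough sketch of how that threshold behaves, the snippet below counts the fraction of requests slower than 100ms. The sample latencies and the function name are invented for illustration; only the 100ms threshold comes from the example above.

```python
SLO_THRESHOLD_MS = 100  # above the usual 90 ms worst case, below the 200 ms users notice


def slow_fraction(latencies_ms: list[float], threshold_ms: float = SLO_THRESHOLD_MS) -> float:
    """Fraction of requests slower than the threshold, i.e. the ones spending budget."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for latency in latencies_ms if latency > threshold_ms)
    return slow / len(latencies_ms)


normal_day = [42, 48, 50, 61, 90]       # made-up samples, all under 100 ms
incident = [45, 50, 120, 250, 95, 180]  # made-up samples, three over 100 ms

print(slow_fraction(normal_day))  # 0.0
print(slow_fraction(incident))    # 0.5
```

In normal operation nothing is counted against the budget; during the incident half of the requests are.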
The next chapter of the book is all about Service Level Objectives. We will likely discuss this much more there.
For example, imagine that a service’s SLO is to successfully serve 99.999% of all queries per quarter. This means that the service’s error budget is a failure rate of 0.001% for a given quarter. If a problem causes us to fail 0.0002% of the expected queries for the quarter, the problem spends 20% of the service’s quarterly error budget.
And present these numbers to everyone. You absolutely must present to all stakeholders in a clear way how much budget is remaining. Without this, you can’t have reasonable discussions about reliability trade-offs.
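The arithmetic in the book’s example is simple enough to check by hand, but here it is as a small sketch. The function names and the wording of the summary line are my own assumptions; the 99.999% and 0.0002% figures are the ones from the passage above.

```python
def budget_spent(slo: float, failed_fraction: float) -> float:
    """Share of the error budget consumed by failing `failed_fraction` of queries."""
    error_budget = 1.0 - slo  # e.g. 0.00001 for a 99.999% SLO
    return failed_fraction / error_budget


def summary(slo: float, failed_fraction: float) -> str:
    """One-line status suitable for a dashboard or a status email."""
    spent = budget_spent(slo, failed_fraction)
    return f"Error budget: {spent:.0%} spent, {1 - spent:.0%} remaining"


# SLO of 99.999%, and a problem that failed 0.0002% of the quarter's queries.
print(summary(slo=0.99999, failed_fraction=0.000002))
# Error budget: 20% spent, 80% remaining
```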
The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.
Many products use this control loop to manage release velocity: as long as the system’s SLOs are met, releases can continue. If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on. More subtle and effective approaches are available than this simple on/off technique: for instance, slowing down releases or rolling them back when the SLO-violation error budget is close to being used up.
In my experience it’s as simple as “We’re over our error budget, all non-critical releases are frozen until we recover.”
This only works with a rolling 30-day window. Otherwise, you can imagine a huge outage causing you to halt all feature work for a year!
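To see why the window matters, here is a minimal sketch of a rolling budget. Everything in it is an assumption for illustration: the class name, the daily granularity, the 99.9% SLO and the traffic numbers. The point is only that an outage ages out of the window, so a freeze is bounded by the window length.

```python
from collections import deque


class RollingBudget:
    """Tracks error budget over the most recent `window_days` of daily counts."""

    def __init__(self, slo: float, window_days: int = 30):
        self.slo = slo
        self.days = deque(maxlen=window_days)  # the oldest day falls off automatically

    def record_day(self, total: int, failed: int) -> None:
        self.days.append((total, failed))

    def budget_remaining(self) -> float:
        """Fraction of the window's budget still unspent (negative if overspent)."""
        total = sum(t for t, _ in self.days)
        failed = sum(f for _, f in self.days)
        allowed = (1.0 - self.slo) * total
        if allowed == 0:
            return 1.0
        return 1.0 - failed / allowed


budget = RollingBudget(slo=0.999)
budget.record_day(total=1_000_000, failed=50_000)  # day 1: a big outage
for _ in range(29):                                # days 2 to 30: no failures
    budget.record_day(total=1_000_000, failed=0)
print(f"{budget.budget_remaining():.0%}")  # -67%, releases frozen

budget.record_day(total=1_000_000, failed=0)       # day 31: the outage ages out
print(f"{budget.budget_remaining():.0%}")  # 100%
```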
For example, if product development wants to skimp on testing or increase push velocity and SRE is resistant, the error budget guides the decision. When the budget is large, the product developers can take more risks. When the budget is nearly drained, the product developers themselves will push for more testing or slower push velocity, as they don’t want to risk using up the budget and stall their launch. In effect, the product development team becomes self-policing. They know the budget and can manage their own risk. (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken.)
At Google we are very fortunate to have Ben Treynor, who is a very reasonable man and cares deeply about reliability. He is the VP in charge of all of SRE at Google, and is more than willing to defend SRE’s authority to protect the reliability of a service.
Having an error budget console and a pre-existing written commitment to an SLO from product management will very quickly resolve these issues, and very rarely will an escalation all the way To The Top be necessary.
What happens if a network outage or datacenter failure reduces the measured SLO? Such events also eat into the error budget. As a result, the number of new pushes may be reduced for the remainder of the quarter. The entire team supports this reduction because everyone shares the responsibility for uptime.
Important to mention here: a push freeze isn’t a punishment for product, it’s just an acknowledgement that we want the service to be reliable, and if we don’t achieve that, then measures will be taken.
We don’t make different decisions depending on whether the error budget was consumed by a bad push or by an exploding datacenter transformer. We make our decisions based on our customers’ experience.
The budget also helps to highlight some of the costs of overly high reliability targets, in terms of both inflexibility and slow innovation. If the team is having trouble launching new features, they may elect to loosen the SLO (thus increasing the error budget) in order to increase innovation.
Perfectly valid. I sometimes have to convince teams to be less reliable! It’s sometimes a hard sell, because they’re so used to the idea of having tight SLOs.
For instance, a team whose system has historically been rock solid at 99.99% reliability, but is too hard to develop new features for, might go to daily releases and downgrade to just 99.95% or 99.9% reliability so they can move faster and capture more opportunities.
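To make those targets concrete, here is a quick sketch of how much downtime each one allows. It assumes the target is time-based availability measured over a 30-day window, which is my assumption for the example rather than something the book specifies.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window, 43,200 minutes


def allowed_downtime_minutes(slo_percent: float) -> float:
    """Minutes of downtime the window tolerates at the given availability target."""
    return WINDOW_MINUTES * (100.0 - slo_percent) / 100.0


for slo in (99.99, 99.95, 99.9):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes")
# 99.99% -> 4.3 minutes
# 99.95% -> 21.6 minutes
# 99.9% -> 43.2 minutes
```

Going from 99.99% to 99.9% gives the team ten times as much room for risky changes.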
Managing service reliability is largely about managing risk, and managing risk can be costly.
100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.
Graph and present your error budget. Make it visible! Put it on a wall display!
This concludes chapter 3 of The SRE Book, and I will be continuing with Chapter 4: Service Level Objectives.