In this decade we entered a new era of IT: rapid release as a standard. The practice has grown from a niche into one so common that many open source projects exist to support your rapid release pipeline, and it’s one of the reasons our industry standardized so quickly on Kubernetes. But many enterprises seem to forget that there’s an associated practice you need to adopt to make rapid releasing a success: the Error Budget, which specifies how much time it is acceptable for your system or component to have reduced availability. Without an Error Budget, Dev and Ops are in tension rather than in the balance that DevOps is meant to be.
If you are already applying Error Budgets and the associated practices to your releases, well done, you have a nicely mature rapid release process in place. If not, you really need to read on…
Dev vs Ops
There’s always been a little tension between what Dev needs to do and what Ops needs to do. Dev needs to push out changes when they’re ready, at whatever velocity they’re managing; Ops needs to keep the system stable, running and available. Those goals don’t conflict in themselves, but the majority of downtime in systems has historically been due to changes pushed to production, and that’s where the tension comes from. Historically, Ops enforced change windows or regular release cadences of months to limit the speed of changes and so restrain downtime.
That old-style enforcement is gone, but the downtime caused by changes has not, and rapid release has made downtime more likely because changes happen more frequently. Fortunately, Google came up with a solution to this tension that makes both Dev and Ops happy: the Error Budget. Error Budgets are how you eliminate the tension between development teams wanting to push new features as fast as possible and operations wanting a stable system.
Error Budgets Reduce Downtime While Maintaining a High Release Rate
There’s a lovely description in Chapter 3, “Embracing Risk”, of Google’s Site Reliability Engineering book, which I will quote here:
For example, if product development wants to skimp on testing or increase push velocity and SRE is resistant, the error budget guides the decision. When the budget is large, the product developers can take more risks. When the budget is nearly drained, the product developers themselves will push for more testing or slower push velocity, as they don’t want to risk using up the budget and stall their launch. In effect, the product development team becomes self-policing. They know the budget and can manage their own risk. (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken.)
Without Error Budgets, production incidents typically increase as the frequency of releases increases. If you want to increase innovation by taking more risks (which is what most enterprises are now targeting), Error Budgets are essential; otherwise there is no limit on the risks being taken.
Many products use this control loop to manage release velocity: as long as the system’s SLOs are met, releases can continue. If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on. More subtle and effective approaches are available than this simple on/off technique: for instance, slowing down releases or rolling them back when the SLO-violation error budget is close to being used up.
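The control loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular team implements it: the `ReleaseDecision` names, the `release_decision` function and the 80% caution threshold are all assumptions made for the example.

```python
from enum import Enum


class ReleaseDecision(Enum):
    PROCEED = "proceed"
    SLOW_DOWN = "slow down"
    HALT = "halt"


def release_decision(budget_used: float, caution_at: float = 0.8) -> ReleaseDecision:
    """Map the fraction of error budget already spent to a release decision.

    budget_used: 0.0 means the budget is untouched, 1.0 means fully spent.
    caution_at:  hypothetical threshold at which releases are slowed down.
    """
    if budget_used >= 1.0:
        # Budget exhausted: halt releases and invest in reliability work.
        return ReleaseDecision.HALT
    if budget_used >= caution_at:
        # Budget nearly drained: the subtler option of slowing down.
        return ReleaseDecision.SLOW_DOWN
    return ReleaseDecision.PROCEED


if __name__ == "__main__":
    for used in (0.25, 0.85, 1.10):
        print(f"{used:.0%} of budget used -> {release_decision(used).value}")
```

The middle state is what turns the simple on/off switch into the more gradual approach: teams get a warning period in which to slow down before releases are stopped outright.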
So how do you define an error budget? Well, for a start, it is set by the business/product management, as it is not a technical choice but a business one. Downtime has costs but excessive reliability also has costs, so you’re trying to find a balance where neither costs too much.
You should do a cost-benefit analysis of how much downtime you will accept for the system by considering how much it costs to make the system that reliable (100% reliability is very very expensive), and how much you lose while the system is down (and don’t forget the associated future loss from business reputation failure that often costs much more than just the immediate loss).
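As a rough sketch of that comparison, the Python below weighs the loss avoided by a reliability improvement against what the improvement costs. Everything here is an assumption for illustration: the function names, the simplification that revenue is lost in proportion to downtime, the `reputation_multiplier` used to inflate the immediate loss, and all of the figures.

```python
def downtime_loss(availability: float, revenue: float,
                  reputation_multiplier: float = 1.0) -> float:
    """Estimated loss over a period, assuming revenue is lost in proportion
    to unavailability. reputation_multiplier (hypothetical) scales the
    immediate loss to account for future reputational damage."""
    return (1.0 - availability) * revenue * reputation_multiplier


def worth_improving(current: float, target: float, revenue: float,
                    improvement_cost: float,
                    reputation_multiplier: float = 1.0) -> bool:
    """True if the loss avoided by moving from current to target
    availability exceeds the cost of making that improvement."""
    saved = (downtime_loss(current, revenue, reputation_multiplier)
             - downtime_loss(target, revenue, reputation_multiplier))
    return saved > improvement_cost


if __name__ == "__main__":
    # Made-up figures: 1,000,000 revenue per period, and 10,000 to move
    # from 99.9% to 99.99% availability.
    print(worth_improving(0.999, 0.9999, 1_000_000, 10_000))
    # Same figures, but assuming reputational damage costs 20x the
    # immediate loss.
    print(worth_improving(0.999, 0.9999, 1_000_000, 10_000,
                          reputation_multiplier=20.0))
```

With these made-up numbers the extra nine only pays for itself once reputational loss is factored in, which is why that term should not be left out of the analysis.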
Then you apply the Error Budget using SLOs that show when availability is below acceptable levels, using SLIs to measure that availability. Error Budgets are commonly set on a quarterly basis. For example, an error budget of 0.1% (i.e. 99.9%, or three 9s, reliability) allows the system to be down for just over 2 hours per quarter.
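The arithmetic behind that figure is simple enough to check in Python. This is a minimal sketch: the `allowed_downtime` helper is a name made up for the example, and a quarter is assumed to be 91 days.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target permits over a period.

    slo_target is a fraction, e.g. 0.999 for three 9s.
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    error_budget = 1.0 - slo_target  # e.g. 0.001, i.e. 0.1%
    return period * error_budget


if __name__ == "__main__":
    quarter = timedelta(days=91)  # assumption: a 91-day quarter
    budget = allowed_downtime(0.999, quarter)
    print(f"Allowed downtime per quarter: {budget}")
```

For a 99.9% target over 91 days this comes to roughly 2 hours 11 minutes, which matches the "just over 2 hours" figure above.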
It may seem a little daunting to do a cost-benefit analysis like that. You could instead use historical knowledge of how much unavailability has been acceptable for your system or component before it became an escalated issue. Or you can start with a guess: it doesn’t have to be right first time, and you can adjust the Error Budget over time to whatever works best for your system or component.
SLIs & SLOs
Typically, you use a monitoring tool to generate SLIs, then define SLOs on those SLIs for alerting (e.g. an Apdex score below 70% or availability below 80% is a failure that needs alerting and mitigating). The amount of time the SLOs are violated shows how much of your Error Budget has been used.
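To make the relationship between SLIs, SLOs and the Error Budget concrete, here is a minimal Python sketch. It assumes one availability measurement per minute; the `SliSample` type, the function names and the sample values are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SliSample:
    minute: int          # minute offset within the budget period
    availability: float  # fraction of successful requests in that minute


def violating_minutes(samples: Iterable[SliSample], slo_threshold: float) -> int:
    """Count the minutes in which the SLI fell below the SLO threshold."""
    return sum(1 for s in samples if s.availability < slo_threshold)


def fraction_of_budget_used(samples: Iterable[SliSample], slo_threshold: float,
                            budget_minutes: float) -> float:
    """Share of the Error Budget consumed by SLO-violating minutes."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return violating_minutes(samples, slo_threshold) / budget_minutes


if __name__ == "__main__":
    # Made-up measurements; two of the four minutes fall below 80%.
    samples = [SliSample(0, 0.99), SliSample(1, 0.75),
               SliSample(2, 0.60), SliSample(3, 0.95)]
    # 131 minutes is roughly 0.1% of a 91-day quarter.
    used = fraction_of_budget_used(samples, slo_threshold=0.80, budget_minutes=131)
    print(f"Error Budget used: {used:.1%}")
```

In practice the samples would come from whatever generates your SLIs, and the resulting fraction is the number you would alert on and feed into release decisions.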
At Hotels.com (part of Expedia Group), within the Platform team, we’ve defined SLOs for our platform components so that we have quality measures of their availability. Degradation is quickly alerted on, the appropriate team is notified, and uptime is maximized — we’ve seen a noticeable improvement in responding to service degradation since adding SLI monitoring and alerts to our platform components. Not all components have Error Budgets yet, but they will and we expect even better reliability in the future.
What You Need to Do
If you are releasing frequently, the important thing is to have an Error Budget in place, even if it’s the wrong budget at the beginning. You can easily review and adapt the budget as you learn what works best!