Data-driven negotiation with SLIs, SLOs and Error Budgets (2/2)

Published in Jumpstart, Aug 23, 2021

Photo by Christina @ wocintechchat.com on Unsplash

This article was originally published in CTO Craft.

In the previous post, I wrote about capturing user happiness in metric form with SLIs and SLOs, with the end goal of providing the optimal level of software reliability that maximizes user happiness.

In this post, I describe how Error Budgets can be used to negotiate, in a data-driven way, the trade-offs between innovation and reliability and between risk and stability.

Why systems are unreliable

Systems can become unreliable (and unavailable) because of Business-As-Usual (BAU). Just as humans generate dust simply by living, engineers generate bugs simply by coding. Other BAU events include hardware, power, or network failures caused by cute, furry, or feathered friends, as well as failures in third-party dependencies.

That’s what Error Budgets are for: to measure the acceptable level of system unreliability. There is such a thing as ‘too much reliability’, and it can be bad business. As a rule of thumb, each additional nine of availability costs roughly ten times more.

The need for an Error Budget

Everything is a trade-off. Product performance is evaluated using velocity while platform performance is evaluated using reliability.

“The structural conflict is between pace of innovation and product stability. The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? This actually isn’t a technical question at all — it’s a product question, which should take the following considerations into account: What level of availability will the users be happy with, given how they use the product? What alternatives are available to users who are dissatisfied with the product’s availability? What happens to users’ usage of the product at different availability levels?” — Google SRE book

Product/Engineering and Business have to constantly negotiate the balance between the value added by new features and the value lost through bugs, outages, tech debt, etc.

An Error Budget is a data-driven way to convince the leadership to invest in development velocity for the long run, meaning:

When to prioritize bugs and Post-Mortem actions in the next planning cycle.
When to implement automation, monitoring, observability.

Just like a household budget: if we have the money, we can spend it on new features; if not, we have to reduce our innovation expenses.

Tracking the Error Budget exhaustion rate is as useful as keeping an eye on overspending in a household budget.

If the rate > 1, we’re consuming the budget faster than we should and we’ll get into debt.
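As a rough sketch of how that rate could be computed, the snippet below divides the share of the budget already spent by the share of the SLO window that has elapsed. The function name, the 30-day window and the numbers are illustrative assumptions, not taken from any specific tool.

```python
def burn_rate(budget_spent_fraction: float, window_elapsed_fraction: float) -> float:
    """Return how fast the Error Budget is being consumed relative to plan.

    1.0 means the budget runs out exactly at the end of the window;
    anything above 1.0 means it runs out early.
    """
    if not 0 < window_elapsed_fraction <= 1:
        raise ValueError("window_elapsed_fraction must be in (0, 1]")
    return budget_spent_fraction / window_elapsed_fraction


# Hypothetical numbers: 10 days into a 30-day window, half the budget is gone.
rate = burn_rate(budget_spent_fraction=0.5, window_elapsed_fraction=10 / 30)
print(f"burn rate: {rate:.2f}")
if rate > 1:
    print("consuming the budget faster than planned")
```

With these numbers the budget is being spent about one and a half times faster than the window allows, so it would be exhausted well before the window ends.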

We can also have special types of Error Budgets, but those are usually a bad sign and should warrant a Post-Mortem as to why we had to use them.

A Rainy Day Fund for unexpected events.
Silver Bullets for “critical” new features.

The Error Budget equation

Time-To-Detect (TTD)

The time it takes from the moment a user is impacted by an issue until someone is informed of it.

Time-To-Resolution (TTR)

The time it takes from someone being informed of an issue until the issue is resolved.

Time-To-Failure/Time-Between-Failures (TTF)

The time between occurrences of a particular failure, i.e. how frequently it happens.

When they come with an ‘M’ in front, aka ‘Mean Time to X’, they are averaged.

Error Budget = 1 - Availability Target
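To make the budget tangible, it can be converted into allowed downtime. The sketch below assumes a time-based availability SLO measured over a 30-day window; both the function name and the window length are assumptions chosen for illustration.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based availability SLO."""
    if not 0 <= availability_target <= 1:
        raise ValueError("availability_target must be between 0 and 1")
    error_budget = 1 - availability_target  # Error Budget = 1 - Availability Target
    return error_budget * window_days * 24 * 60


for target in (0.99, 0.999, 0.9999):
    minutes = error_budget_minutes(target)
    print(f"{target:.2%} target -> {minutes:.1f} minutes per 30 days")
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of acceptable unavailability, and each additional nine shrinks that allowance tenfold.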

The terms above tell us how we can decrease unavailability, and consequently increase availability: detect faster, resolve faster, limit the impact, and fail less often.

1. Decrease Time-To-Detect

Monitoring and alerting catch outages faster.

2. Decrease Time-To-Resolution

Make it quicker to troubleshoot with good developer Runbooks.
Improve logs for firefighting.
Add traces.
Automate fail-overs like redirecting traffic or backups.

3. Decrease the impact

Limit the number of users affected with a gradual roll-out.
Increase reversibility with feature flags.
Implement graceful degradation, e.g. Circuit Breaker Pattern, throttle requests, limit retry calls with exponential backoff, set client timeouts, limit queues.

4. Increase Time-To-Failure

Analyze and understand the cause of failure.
Do proactive maintenance work.
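The four levers above can be sketched numerically. The relationship used here, unavailability ≈ (TTD + TTR) × impact / TTF, is a commonly cited approximation and an assumption on my part, since the text does not spell the formula out. All numbers are hypothetical.

```python
def expected_unavailability(ttd_min: float, ttr_min: float,
                            ttf_min: float, impact: float) -> float:
    """Approximate fraction of user-facing time lost to one recurring failure.

    Assumed model: (TTD + TTR) * impact / TTF, where impact is the share
    of users affected (0..1) and all times are in minutes.
    """
    if ttf_min <= 0:
        raise ValueError("ttf_min must be positive")
    if not 0 <= impact <= 1:
        raise ValueError("impact must be between 0 and 1")
    return (ttd_min + ttr_min) * impact / ttf_min


MONTH_MIN = 30 * 24 * 60   # one failure per 30 days (hypothetical)
BUDGET = 1 - 0.999         # Error Budget for a 99.9% target

scenarios = {
    "baseline":              expected_unavailability(15, 45, MONTH_MIN, 1.0),
    "faster detection":      expected_unavailability(5, 45, MONTH_MIN, 1.0),
    "faster resolution":     expected_unavailability(15, 20, MONTH_MIN, 1.0),
    "gradual roll-out":      expected_unavailability(15, 45, MONTH_MIN, 0.2),
    "half as many failures": expected_unavailability(15, 45, 2 * MONTH_MIN, 1.0),
}

for name, value in scenarios.items():
    verdict = "within budget" if value <= BUDGET else "over budget"
    print(f"{name:<22} {value:.5f} ({verdict})")
```

Under this simple model every lever reduces unavailability on its own, which is why the list treats them as independent investments.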

The properties of a good Error Budget policy

Since missed SLOs indicate that users are unhappy, it’s in the interest of the business to have a mechanism that enforces investment in engineering work to improve reliability.

Such a mechanism is provided by an Error Budget Policy, which outlines the trade-offs between reliability and feature work. Implementing and following an Error Budget Policy not only results in increased reliability and customer happiness but also in decreased firefighting and finger-pointing within teams. It’s a win-win situation.

An Error Budget Policy is likely to apply to multiple services and teams across the organization. It’s best kept and maintained in a highly visible place and stored as metadata next to the SLO definition (for example as a link).

A good Error Budget Policy has seven properties:

If the Error Budget is exhausted or threatened, the policy should be able to redirect engineering effort toward work that improves reliability.
The policy should clarify when this reprioritization takes effect, for example when the budget is close to exhaustion.
It describes how teams will prioritize reliability work. For example, if the budget is threatened but not exhausted, one or two developers are allocated to fix all priority issues from the relevant Post-Mortems. On the other hand, if the budget has been exhausted for months in a row, maybe the entire dev team should focus solely on reliability work until the budget is replenished to a comfortable level.
To be enforceable, a policy has to spell out the consequences and risks if the reliability work doesn’t happen. At the end of the day, this work is needed to keep customers happy. If it doesn’t happen, the business is ultimately failing at one of its core values.
The policy should be consistently applied across teams and throughout the year. There might be one or two exceptions, like Silver Bullets or pushing an urgent feature because of a potential breach of contract. However, Silver Bullets should be treated as extraordinary circumstances and potentially followed by a Post-Mortem explaining how they can be avoided in the future.
The policy needs a final owner and decision-maker because disagreements between parties (e.g. Product and Engineering or different dev teams) can happen.
It’s difficult for people to adhere to a policy they dislike. However, once all the parties involved (product managers, developers, SREs, executives, etc.) have provided feedback and that feedback has been analyzed and incorporated into the policy, everyone should commit to following it so that actual results can show.
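The prioritization rules described in the third property could be encoded roughly as follows. This is only a sketch: the thresholds (25% of the budget remaining, two consecutive exhausted windows) and all names are hypothetical, and a real policy would set them through the negotiation described above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FEATURE_WORK = "continue feature work as planned"
    PARTIAL_FOCUS = "allocate one or two developers to priority Post-Mortem issues"
    FULL_FOCUS = "whole team works on reliability until the budget recovers"


@dataclass(frozen=True)
class BudgetStatus:
    remaining_fraction: float  # share of the budget still unspent; <= 0 means exhausted
    windows_exhausted: int     # consecutive SLO windows in which the budget ran out


def policy_action(status: BudgetStatus,
                  threatened_below: float = 0.25,   # hypothetical threshold
                  sustained_windows: int = 2) -> Action:  # hypothetical threshold
    """Map the current budget status to the reliability work the policy requires."""
    exhausted = status.remaining_fraction <= 0
    if exhausted and status.windows_exhausted >= sustained_windows:
        return Action.FULL_FOCUS
    if status.remaining_fraction < threatened_below:
        return Action.PARTIAL_FOCUS
    return Action.FEATURE_WORK


for status in (BudgetStatus(0.60, 0), BudgetStatus(0.10, 0), BudgetStatus(0.0, 3)):
    print(status, "->", policy_action(status).value)
```

Writing the thresholds down in one place, next to the SLO definition, makes it easier to apply the policy consistently and to see who needs to sign off when they change.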

Example Error Budget Policy

Example of Error Budget Policy scenarios and escalation

Google’s “CRE Life Lessons: Applying the Escalation Policy” walks through four scenarios illustrating how to apply the policy thresholds for a service that targets “three nines” availability but burns half of its error budget on background errors.

Example of CRE Risk Analysis template

[TODO]

Example CRE Risk Analysis Template