Error Budget Policy Adoption at Expedia Group
Using data to set the standard for reliability
As a travel platform, our goal at Expedia Group™ is to provide reliable experiences for prospective travelers. Although 100% reliability in a system is not practical, we want to be reliable enough to keep our customers happy. To achieve this, teams across Expedia Group are increasingly adopting service level objectives (SLOs) and error budgets. But what should a team do when it is missing its SLOs and running out of budget? Maintaining the intended reliability level is not feasible unless the team adopts error budget policies, and we propose a data-driven approach to defining them here. We hope teams will adopt well-defined policies and be prepared to handle these situations.
The main goal is to eliminate bad customer experiences due to repeated SLO misses. Teams will be able to deliver new features while keeping up with the reliability expectations. “Eliminate Preventable Bad Experiences” is one of our objectives and key results (OKRs), and we are going to achieve it by operationalizing error budget policies. As a side effect, this will help stakeholders agree upon the product priorities without contention.
Well-defined error budget policies will improve day-to-day reliability provided by teams. Critical problems that are immediately clear can be prevented (and fixed) by short change freezes. In contrast, error budgets can help prevent gradual degradation. Error budget policies give teams autonomy to manage their reliability while being able to communicate to the rest of the organization how reliable they have been.
Out of scope
These error budget policies are not meant to be punishments for SLO violations. At Expedia Group, we treat each and every failure as an opportunity to learn! These error budget policies encourage teams to prioritize reliability over product features when needed and take tolerable risks when there is enough budget left.
Definitions
- Service Level Indicator (SLI) — A quantifiable measure of service reliability that tells you if things are working.
- Service Level Objective (SLO) — A reliability target for an SLI that tells you if the users are happy or sad.
- Service Level Agreement (SLA) — A contract that the service provider promises users on service reliability. Agreed reliability targets should be more relaxed than the internal SLOs to give enough room for unexpected outages.
- Error Budget — The number of failed requests (out of the total requests) or the amount of time, over a given period, that a system can afford to be unreliable before users become unhappy.
- Error Budget Policy — Defines what you will do in case of SLO violations.
- Burn Rate — How fast, relative to the SLO, the service consumes the error budget.
- System Boundary — A point at which one or more components (a system) exposes functionality to its users.
- Capabilities — A set of functionalities provided by a system at the system boundary.
- User — The customer of a given system. This can be a human customer (travelers or internal users) or a system (machines).
- Site Reliability Engineering (SRE) — Engineering practices adopted by an organization in order to maintain the expected reliability of software systems.
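To make the Error Budget definition above concrete, here is a minimal Python sketch that derives both forms of the budget (request-based and time-based) from an SLO. The function names and the sample figures are assumptions chosen for illustration, not values used at Expedia Group.

```python
def error_budget_requests(slo: float, total_requests: int) -> int:
    """Requests that may fail in the period without missing the SLO.

    round() is used only to absorb floating-point noise in (1 - slo).
    """
    return round((1.0 - slo) * total_requests)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unreliability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


# A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failed requests,
# and over 30 days it tolerates roughly 43.2 minutes of unreliability.
requests_budget = error_budget_requests(0.999, 1_000_000)
minutes_budget = error_budget_minutes(0.999, window_days=30)
```

Whichever form a team picks, the budget is simply the complement of the SLO applied to the period's total requests or total time.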
Who should read this document?
You are expected to have a clear understanding of what capability, service level indicator (SLI), service level objective (SLO) and error budget mean. If you have error budgets set up for your system and want to implement error budget policies, or if you want to practice operationalizing error budgets, you should follow this document.
This illustration shows how all these concepts are tied together. To emphasize the goal:
We are going to keep the customers happy by maintaining the expected system reliability.
Prerequisites for effective monitoring
For a given system, before adopting the proposed error budget policies, the following prerequisites must be met:
- System boundaries, capabilities & users are defined.
- System is set up to send proper metrics to the monitoring system (in our case it is Datadog).
- SLIs & SLOs are defined and measured in dashboards.
- Error budgets are set.
If these requirements are not met, a team should focus on implementing them before attempting to define the error budget policies.
Types of errors
Before laying out the error budget policies, a team must document the answer to this question:
What are the errors that will consume the error budget, and which are tolerated?
The following cases are examples of acceptable errors that will not consume the error budget when they occur:
- Company wide network outages.
- Cloud vendor outages.
- Known downtimes for maintenance.
- Miscategorized errors when no real users were impacted.
- Errors experienced by users that are out of scope for the SLOs (e.g., load tests, integration tests).
- Outages caused by a service maintained by another team (internal dependencies).
- Outages caused by a service maintained by a 3rd party vendor (external dependencies).
Any other errors experienced by the users are taken out of your application’s error budget.
This is general guidance. Individual teams have the flexibility to modify this list over time.
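One way to document the answer is to encode the exclusion list explicitly, so that only the remaining errors are counted against the budget. The sketch below assumes each error event has already been tagged with a cause. The cause labels, the `ErrorEvent` type, and the function name are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass

# Hypothetical cause labels mirroring the list of tolerated errors above.
EXCLUDED_CAUSES = frozenset({
    "company_network_outage",
    "cloud_vendor_outage",
    "planned_maintenance",
    "miscategorized_no_user_impact",
    "out_of_scope_user",           # e.g. load tests, integration tests
    "internal_dependency_outage",
    "external_dependency_outage",
})


@dataclass(frozen=True)
class ErrorEvent:
    cause: str
    failed_requests: int


def budget_consuming_failures(events: list[ErrorEvent]) -> int:
    """Sum failed requests, skipping causes the team has agreed to tolerate."""
    return sum(e.failed_requests for e in events
               if e.cause not in EXCLUDED_CAUSES)


events = [
    ErrorEvent("out_of_scope_user", 500),    # load test traffic: tolerated
    ErrorEvent("application_bug", 120),      # counts against the budget
    ErrorEvent("cloud_vendor_outage", 300),  # tolerated
]
consumed = budget_consuming_failures(events)  # only the 120 bug-related failures
```

Keeping the exclusions in one place makes it easy for a team to review and adjust the list over time.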
Error budget window
We want to encourage flexibility in choosing a time window for the error budget. While each team could create its own budget window, collaboration is easier if all teams use a common one. As a rule of thumb, we propose a rolling 30-day window. Fixed time windows, such as calendar months, are discouraged because they reset the remaining budget to 100% at the beginning of each window, whereas customer happiness does not reset that way.
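The following Python sketch illustrates the rolling-window behavior: a bad day keeps weighing on the budget until it is 30 days old, rather than vanishing at a calendar boundary. It assumes per-day request counts are available. The function name and the sample traffic are made up for illustration.

```python
from collections import deque


def rolling_budget_remaining(daily_counts, slo, window_days=30):
    """Yield the fraction of error budget remaining after each day.

    daily_counts: iterable of (total_requests, failed_requests), oldest first.
    Until window_days of data exist, the window covers the days seen so far.
    """
    window = deque(maxlen=window_days)  # old days fall out automatically
    for total, failed in daily_counts:
        window.append((total, failed))
        total_sum = sum(t for t, _ in window)
        failed_sum = sum(f for _, f in window)
        allowed = (1.0 - slo) * total_sum
        yield 1.0 - failed_sum / allowed if allowed else 1.0


# 1,000 requests per day at a 99% SLO; day 0 has 150 failures, the rest none.
days = [(1000, 150)] + [(1000, 0)] * 34
remaining = list(rolling_budget_remaining(days, slo=0.99))
# remaining[29] is about 0.5: the bad day is still inside the 30-day window.
# remaining[30] is 1.0: the bad day has just rolled out of the window.
```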
Monitoring and alerting
Monitoring and alerting should be in place to notify teams when their SLOs are not met, or when they are on track to exhaust the error budget soon. It is recommended to have alerts based on both the error budget consumption level and elevated burn rates.
Alerting on error budget consumption
Alerting should be set up based on total error budget consumed during a 30-day budget window.
Alerting on elevated burn rate
Burn rate is how fast the service consumes the error budget for a given SLO. It is calculated as the ratio between your observed error rate and the maximum tolerable error rate (i.e., error rate for the system to consume the entire budget during the budget window) during an alert window.
Given an error budget window of 30 days and an SLO of 99%, you will have an error budget of 1% for that period. Let’s say an error rate of 2% was observed; then your burn rate is 2, and your application will consume the entire error budget in 15 days (half of the budget window). If you observed an error rate of 10%, your burn rate is 10 and the error budget will be consumed in 3 days!
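The arithmetic in this example can be sketched in a few lines of Python. The function names are illustrative, and the calculation assumes the observed error rate stays constant for the whole period.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Observed error rate divided by the maximum tolerable error rate."""
    return observed_error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the budget is gone if the burn rate stays constant."""
    return window_days / rate


slow_burn = burn_rate(0.02, slo=0.99)   # about 2
fast_burn = burn_rate(0.10, slo=0.99)   # about 10
slow_days = days_to_exhaustion(slow_burn)  # about 15 days
fast_days = days_to_exhaustion(fast_burn)  # about 3 days
```

A burn rate of exactly 1 means the budget runs out precisely at the end of the budget window.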
For alerting based on the burn rate, the following three factors should be considered:
- Budget Window — The length of the error budget window. We recommend 30 days (720 hours).
- Alert Window — The length of the alert calculations window, such as 1 hour.
- Budget Consumption — The percentage of the error budget consumed during the alert calculation window.
Then the observed burn rate is calculated as:
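With budget consumption expressed as a fraction of the total budget and both windows in the same unit, a formulation consistent with the three factors above is the following (a reconstruction from those definitions, assuming roughly constant traffic):

```latex
\[
\text{burn rate} \;=\;
\frac{\text{budget consumption} \times \text{budget window}}{\text{alert window}}
\]
```

For example, consuming 2% of the budget within a 1-hour alert window of a 720-hour budget window corresponds to a burn rate of 0.02 × 720 / 1 = 14.4.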
Recommendation: For alerting based on burn rate, for a 30-day budget window, the following thresholds are proposed.
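As a minimal sketch of how a burn-rate threshold follows from an alert window and a budget consumption target, the Python below derives thresholds for a 720-hour budget window and checks which alerts would fire. The (alert window, budget consumption) pairs are illustrative values in the style of Google's SRE Workbook. They are an assumption and not necessarily the thresholds proposed here, and all names are hypothetical.

```python
BUDGET_WINDOW_HOURS = 720  # 30 days

# Assumed (alert window in hours, fraction of budget consumed) pairs.
ALERT_POLICIES = [(1, 0.02), (6, 0.05), (72, 0.10)]


def burn_rate_threshold(alert_window_hours: float,
                        budget_consumption: float,
                        budget_window_hours: float = BUDGET_WINDOW_HOURS) -> float:
    """Burn rate that consumes `budget_consumption` within the alert window."""
    return budget_consumption * budget_window_hours / alert_window_hours


def firing_alerts(observed_burn_rates: dict) -> list:
    """Return the alert windows whose observed burn rate meets the threshold.

    observed_burn_rates maps alert window (hours) to the measured burn rate.
    """
    return [window for window, consumption in ALERT_POLICIES
            if observed_burn_rates.get(window, 0.0)
            >= burn_rate_threshold(window, consumption)]


thresholds = {w: burn_rate_threshold(w, c) for w, c in ALERT_POLICIES}
# Roughly 14.4 for the 1-hour window, 6 for 6 hours, and 1 for 72 hours.
firing = firing_alerts({1: 20.0, 6: 3.0, 72: 0.5})  # only the 1-hour alert
```

Short windows with high thresholds catch fast burns quickly, while long windows with low thresholds catch slow, sustained degradation.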
The proposed framework in summary
The proposed approach for applying necessary error budget policies is:
- Monitor the error budget related to an SLO.
- Set up alerts based on budget consumption level and elevated burn rates.
- Classify the reliability state of the system based on alerts and error budget level.
- Follow the policies defined for that particular reliability state.
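The last two steps can be sketched as a small classification function. The state names, the 75% warning level, and the policy texts below are hypothetical placeholders chosen for illustration. A team would substitute its own states and agreed policies.

```python
from enum import Enum


class ReliabilityState(Enum):  # hypothetical state names
    HEALTHY = "healthy"
    AT_RISK = "at risk"
    EXHAUSTED = "budget exhausted"


def classify(budget_consumed: float,
             burn_rate_alert_firing: bool,
             warn_at: float = 0.75) -> ReliabilityState:
    """Map budget consumption (fraction) and burn-rate alerts to a state."""
    if budget_consumed >= 1.0:
        return ReliabilityState.EXHAUSTED
    if burn_rate_alert_firing or budget_consumed >= warn_at:
        return ReliabilityState.AT_RISK
    return ReliabilityState.HEALTHY


# Placeholder policies, one per state.
POLICIES = {
    ReliabilityState.HEALTHY: "ship features and take tolerable risks",
    ReliabilityState.AT_RISK: "investigate the burn and slow risky changes",
    ReliabilityState.EXHAUSTED: "prioritize reliability work over new features",
}

state = classify(budget_consumed=0.4, burn_rate_alert_firing=True)
policy = POLICIES[state]  # the AT_RISK policy, because a burn-rate alert fires
```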