Alerting On Error Budget (SLO Series Part 2)
This is Part 2 of the HappyFresh SLO series. Read Part 1 for an overview of how HappyFresh implements Service Level Objectives (SLOs).
What Is an Error Budget?
It’s nearly impossible to set a 100% SLO target. Not only does it require enormous effort and cost, it also leaves no room for changes or improvements. In practice, we need to perform system maintenance, application deployments, experiments, A/B tests, and so on. Sometimes these activities cause errors and/or system downtime.
On the other hand, those activities should not sacrifice customer satisfaction. Hence, we need a measure of how often a service is allowed to fail or be unavailable to users. This measure is popularly called an error budget.
An error budget is the total number of failures tolerated by the SLO.
For example, our Add To Cart feature has a 99.9% availability SLO and receives 1 million requests per week on average. The error budget is therefore 0.1% of 1 million = 1,000 failed requests per week.
Error Budget (#) = (100% - Availability SLO (%)) * Number of Requests (#)
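To make the formula concrete, here is a minimal Python sketch (the function name is ours, purely illustrative) that reproduces the Add To Cart example:

```python
def error_budget(availability_slo: float, num_requests: int) -> float:
    """Error Budget (#) = (100% - Availability SLO (%)) * Number of Requests (#)."""
    return (1.0 - availability_slo) * num_requests

# Add To Cart example: 99.9% availability SLO, 1 million requests per week.
print(error_budget(0.999, 1_000_000))  # ~1000 failed requests per week
```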
Alerting on Error Budget
As mentioned in Part 1, we already have a dashboard to monitor the error budget. Unfortunately, we have to check the dashboard regularly to see the remaining error budget for a particular service, and we are not notified when the budget is exhausted, by which point the SLO has already been breached. By alerting on the error budget, we can respond to problems before we consume too much of it.
We want alerts with good detection time and high precision. There are several approaches to alerting on the error budget; the one we adopted is the burn rate alert. Burn rate is how fast, relative to the SLO, the service consumes its error budget.
An alert is triggered when the error budget consumed within a time window exceeds a specific threshold.
Alert Rules
The Google SRE team recommends a burn rate of 14.4 over one hour and a burn rate of 6 over six hours as reasonable starting points for alerting. In our case, with a 7-day rolling window SLO, here are the corresponding burn rates, time windows, and percentages of error budget consumed.
| Burn Rate | Budget Consumed | Time Window | Time to Exhaustion |
|-----------|-----------------|-------------|--------------------|
| 14.4 | 8.5% | 1 hour | 12 hours |
| 6 | 21.5% | 6 hours | 28 hours |
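The table values follow directly from the 7-day (168-hour) rolling window. Here is a small sketch of the arithmetic (ours, for illustration only):

```python
SLO_WINDOW_HOURS = 7 * 24  # 7-day rolling window = 168 hours

def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    # Fraction of the total error budget burned during the alert window
    # if errors arrive at `burn_rate` times the sustainable rate.
    return burn_rate * alert_window_hours / SLO_WINDOW_HOURS

def time_to_exhaustion(burn_rate: float) -> float:
    # Hours until the whole budget is gone at a constant burn rate.
    return SLO_WINDOW_HOURS / burn_rate

print(budget_consumed(14.4, 1))   # 0.0857... -> the ~8.5% row in the table
print(time_to_exhaustion(14.4))   # 11.66...  -> ~12 hours
print(budget_consumed(6, 6))      # 0.2142... -> the ~21.5% row in the table
print(time_to_exhaustion(6))      # 28.0      -> 28 hours
```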
With a 99.9% availability SLO (an error budget of 0.1% = 0.001), the alert rules are as follows, in Scalyr alert syntax (a rough Python equivalent follows the rules):
- Trigger an alert if the count of failed requests in one hour is more than 8.5% (0.085) of the error budget.
count:1h("failed_requests") > count:1w("all_requests") * 0.001 * 0.085
- Trigger an alert if the count of failed requests in six hours is more than 21.5% (0.215) of the error budget.
count:6h("failed_requests") > count:1w("all_requests") * 0.001 * 0.215
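For illustration, here is a rough Python equivalent of the same two conditions. The function and its arguments are hypothetical; in practice the request counts come from Scalyr queries like the ones above:

```python
ERROR_BUDGET = 0.001  # 99.9% availability SLO

def should_alert(failed_1h: int, failed_6h: int, all_requests_1w: int) -> bool:
    budget = all_requests_1w * ERROR_BUDGET
    fast_burn = failed_1h > budget * 0.085   # ~14.4x burn rate over 1 hour
    slow_burn = failed_6h > budget * 0.215   # ~6x burn rate over 6 hours
    return fast_burn or slow_burn

# Example: 1 million requests/week -> budget of 1,000 failed requests,
# so 120 failures in the last hour trips the fast-burn condition (> 85).
print(should_alert(failed_1h=120, failed_6h=150, all_requests_1w=1_000_000))  # True
```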
All Hail Slack
We have a dedicated Slack workspace to centralize production alerts, including the error budget alerts. We set up channels for each service we run, and every team member can join the channels of the services they own or contribute to. By sending error budget alerts to this workspace, our team acknowledges them faster and gains better insight into what is happening in our services, since alerts from our observability tools are sent there too.
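As a rough illustration of what such a notification could look like if you wired it up yourself, here is a minimal sketch that posts a message to a Slack incoming webhook. The webhook URL, service name, and message format are placeholders; in our setup, delivery is handled by the alerting tool's own Slack integration rather than custom code:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_slack(service: str, budget_consumed_pct: float) -> None:
    # Post a simple text message to the service's alert channel via webhook.
    message = {
        "text": f":rotating_light: {service}: {budget_consumed_pct:.1f}% of the "
                f"weekly error budget consumed in the last hour."
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# notify_slack("add-to-cart", 8.5)
```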
This is Part 2 of the SLO Implementation Series at HappyFresh. Leave a 👏 if you enjoyed reading it. We’re also hiring engineers to join us in helping households around Southeast Asia get their groceries easily. If you want to know more, visit HappyFresh Tech Career.