Alerting On Error Budget (SLO Series Part 2)

Riky Lutfi Hamzah
Published in HappyTech · Sep 29, 2020

This is Part 2 of the HappyFresh SLO series. Read Part 1 for an overview of how HappyFresh implements Service Level Objectives (SLOs).

What Is an Error Budget?

It is nearly impossible to set a 100% target for an SLO. Not only does it require enormous effort and cost, it also leaves no room for changes or improvements. In the real world, we need to do system maintenance, application deployments, experiments, A/B tests, and so on. Sometimes these activities cause errors and/or system downtime.

Error 500 page in HappyFresh Web

On the other hand, those activities should not come at the cost of customer satisfaction. Hence, we need a measurement of how often a service is allowed to fail or be unavailable to users. That measurement is popularly called an error budget.

Error budget is the total number of failures tolerated by the SLO

For example, our Add To Cart feature has a 99.9% availability SLO with 1 million requests per week on average. So the error budget is 0.1% of 1 million = 1,000 failed requests per week.

Error budget (#) = (100% − Availability SLO (%)) × Number of requests (#)
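To make the arithmetic concrete, here is a minimal Python sketch of the formula above, using the Add To Cart numbers from this section (not production code):

```python
def error_budget(availability_slo: float, total_requests: int) -> float:
    """Number of failed requests tolerated by the SLO."""
    return (1.0 - availability_slo) * total_requests

# Example from above: 99.9% availability SLO, 1 million requests per week
print(error_budget(0.999, 1_000_000))  # 1000.0 failed requests per week
```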

Alerting on Error Budget

As mentioned in Part 1, we already have a dashboard to monitor the error budget. Unfortunately, we have to check the dashboard regularly to see the remaining error budget for a particular service. Moreover, we are not notified when the budget is exhausted, by which point the SLO is already breached. By setting up alerts on the error budget, we can respond to problems before we consume too much of it.

Error budget dashboard on Grafana

We want to create an alert with good detection time and high precision. There are several approaches to alerting on the error budget; the one we adopted is the burn rate alert. Burn rate is how fast, relative to the SLO, the service consumes the error budget.

The alert is triggered when the error budget consumed within a time window is greater than a specific threshold.
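As a rough illustration (not our production code), burn rate can be expressed as the observed error rate divided by the error rate the SLO allows; a burn rate of 1 consumes the budget exactly at the end of the SLO window.

```python
def burn_rate(failed: int, total: int, availability_slo: float) -> float:
    """How fast the error budget is consumed, relative to the SLO.

    A burn rate of 1 exhausts the budget exactly at the end of the SLO
    window; a burn rate of 14.4 exhausts it 14.4 times faster.
    """
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - availability_slo
    return observed_error_rate / allowed_error_rate

# e.g. 1.44% of requests failing against a 99.9% SLO burns budget at ~14.4x
print(burn_rate(failed=144, total=10_000, availability_slo=0.999))  # ≈ 14.4
```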

Alert Rules

The Google SRE team recommends a burn rate of 14.4 over one hour and a burn rate of 6 over six hours as reasonable starting numbers for alerting. In our case, with a 7-day rolling-window SLO, here are the corresponding burn rates, time windows, and percentages of error budget consumed.

| Burn Rate | Budget Consumed | Time Window | Time to Exhaustion |
|-----------|-----------------|-------------|--------------------|
| 14.4 | 8.5% | 1 hour | 12 hours |
| 6 | 21.5% | 6 hours | 28 hours |
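These values follow directly from the 7-day (168-hour) rolling window: budget consumed = burn rate × alert window ÷ SLO window, and time to exhaustion = SLO window ÷ burn rate. A small sketch to reproduce them:

```python
SLO_WINDOW_HOURS = 7 * 24  # 7-day rolling window = 168 hours

for rate, alert_window_hours in [(14.4, 1), (6, 6)]:
    budget_consumed = rate * alert_window_hours / SLO_WINDOW_HOURS
    time_to_exhaustion = SLO_WINDOW_HOURS / rate
    print(f"burn rate {rate}: consumes {budget_consumed:.2%} of the budget "
          f"in {alert_window_hours}h, exhausts it in ~{time_to_exhaustion:.0f}h")

# burn rate 14.4: consumes 8.57% of the budget in 1h, exhausts it in ~12h
# burn rate 6: consumes 21.43% of the budget in 6h, exhausts it in ~28h
# (the table above rounds these to 8.5% and 21.5%)
```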

With a 99.9% availability SLO (an error budget of 0.1% = 0.001), the alert rules are as follows in Scalyr alert syntax:

  • Trigger an alert if the count of failed requests in one hour is more than 8.5% (0.085) of the error budget.
count:1h("failed_requests") > count:1w("all_requests") * 0.001 * 0.085
  • Trigger an alert if the count of failed requests in six hours is more than 21.5% (0.215) of the error budget.
count:6h("failed_requests") > count:1w("all_requests") * 0.001 * 0.215

All Hail Slack

We have a dedicated Slack workspace to centralize production alerts, including the error budget alerts. We set up a channel for every service we have, and every team member can join the channels for the services they own or contribute to. By sending error budget alerts to this workspace, our team acknowledges them faster and gets better insight into what is happening in our services, since alerts from our observability tools land there too.

Error budget alert on Slack
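In our setup, the alerting tool posts to Slack through its built-in integration. Purely as an illustration of what that delivery looks like, here is a minimal sketch of forwarding an alert message to a channel via a Slack incoming webhook; the webhook URL and message text below are hypothetical:

```python
import json
import urllib.request

# Hypothetical incoming-webhook URL for a per-service alert channel
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def post_alert(text: str) -> None:
    """Send a plain-text alert message to the channel behind the webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)

post_alert(":rotating_light: order-service burned more than 8.5% of its "
           "weekly error budget in the last hour")
```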

This is Part 2 of the SLO Implementation Series at HappyFresh. Leave a 👏 if you enjoyed reading it. We are also hiring engineers to join us in helping households around Southeast Asia get their groceries easily. If you want to know more, visit HappyFresh Tech Career.

Riky Lutfi Hamzah

Engineering Manager — Reliability & Security at HappyFresh. Writing some thoughts at rilutham.com.