Alerting on High and Low Traffic Using SLO

Petr Hájek
Published in Omio Engineering
Nov 4, 2021

We Like a Challenge, but…

At Omio we have a ton of booking integrations with partners, and those integrations vary in traffic quite a bit. Some generate hundreds of bookings per hour while others generate only a handful of bookings per day. Add to this the differences in peak (daytime) vs off-peak (night) traffic, and monitoring errors can get messy, fast. We used to have tons of alerts for each integration, but it proved unwieldy, producing large numbers of false alerts and often missing major incidents that required our attention.

So, we sought to create an alerting setup that would meet the following requirements:

  • Be actionable (no false alerts)
  • Trigger on short-term & long-term issues
  • Work with low-traffic & high-traffic integrations
  • Work during peak & off-peak hours

Working from The Good Book.

We decided to use Google's SRE methodology from the SRE Workbook as a foundation. This solved many of our requirements out of the box. Here's what that looked like.

The first order of business, here, is establishing what constitutes an actionable alert. Should such an alert trigger in the case of a 30% error rate across 5 minutes? Or 80%? That depends. Without some sort of long-term expectation, the numbers themselves don't mean a whole lot. So we used the SRE Workbook concept of SLO (Service Level Objective), which is the success rate (1 minus "error rate") we want to have at the end of the month (we work monthly, but one could slot in any unit of time). With SLO comes the concept of Error Budget: the threshold of monthly errors we have to stay within to meet the SLO (e.g. in the case of 99% SLO and 100k bookings/month, the Error Budget is 1k).
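
As a quick sanity check, the Error Budget is just the failure count the SLO allows. A throwaway Python snippet (not part of our actual stack) makes the arithmetic explicit:

```python
def error_budget(slo: float, monthly_events: int) -> float:
    """Number of events allowed to fail per month while still meeting the SLO."""
    return (1 - slo) * monthly_events

# 99% SLO with 100k bookings/month leaves room for roughly 1,000 failed bookings.
print(round(error_budget(slo=0.99, monthly_events=100_000)))  # 1000
```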

Within this framework, an actionable alert is one that triggers if SLO is endangered.

Triggering on short & long term issues

First of all, why trigger on long term issues? Aren’t large outages across shorter intervals the bigger problem? Not really, no. Let’s say we have an integration with 100k events/month (~140 events/hour), and 99% SLO (1k Error Budget). A 30 minute full outage with a 100% error rate would cause 70 events to fail, eating 7% of our Error Budget. An error rate of 5% across 6 days causes ~1k events to fail. And like that, the whole Error Budget is toast. So both matter. Each consumes the Error Budget at a different pace. That pace is represented by a constant called Burn Rate.
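
Here is the same arithmetic as a small Python sketch (the 100k/month, 99% SLO integration from above); it is purely illustrative, not code we run:

```python
MONTHLY_EVENTS = 100_000                      # ~140 events/hour on average
SLO = 0.99
ERROR_BUDGET = (1 - SLO) * MONTHLY_EVENTS     # ~1,000 failed events/month
EVENTS_PER_HOUR = MONTHLY_EVENTS / (30 * 24)  # ~139

# Short and severe: a 30-minute full outage (100% error rate).
short_failures = EVENTS_PER_HOUR * 0.5 * 1.0
print(f"30 min outage: {short_failures:.0f} failures, "
      f"{short_failures / ERROR_BUDGET:.0%} of the Error Budget")

# Long and mild: a 5% error rate sustained for 6 days.
long_failures = EVENTS_PER_HOUR * 24 * 6 * 0.05
print(f"5% for 6 days: {long_failures:.0f} failures, "
      f"{long_failures / ERROR_BUDGET:.0%} of the Error Budget")
```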

Burn Rates in Multiple Time-Windows

When it comes to capturing short and long term issues, the SRE Workbook suggests using 3 time-windows. Each of them applies a different Burn Rate. The configuration might look like this for 99% SLO:

  • 2 hr window: 7.2% Error Rate (= 7.2 * (1 - SLO))
  • 12 hr window: 3% Error Rate (= 3 * (1 - SLO))
  • 3 day window: 1% Error Rate (= 1 * (1 - SLO))

The formula for the error rate threshold is Burn Rate * (1 - SLO), e.g. 7.2 * (1 - SLO) = 7.2% for a 99% SLO.

Since the Burn Rate is different for each window, we use different Alerting Channels based on urgency:

  • 2 hr window: On-call duty
  • 12 hr window: On-call duty
  • 3 day window: Slack notification/Jira ticket
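
Putting the formula and the channel mapping together as a small Python sketch (illustrative only, not our production configuration):

```python
SLO = 0.99

# (window, Burn Rate, alerting channel) as described above
WINDOWS = [
    ("2 hr",  7.2, "On-call duty"),
    ("12 hr", 3.0, "On-call duty"),
    ("3 day", 1.0, "Slack notification/Jira ticket"),
]

for window, burn_rate, channel in WINDOWS:
    threshold = burn_rate * (1 - SLO)  # Error Rate = Burn Rate * (1 - SLO)
    print(f"{window} window -> {channel} if the error rate exceeds {threshold:.1%}")
# 2 hr: 7.2%, 12 hr: 3.0%, 3 day: 1.0%
```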

Determining Appropriate Alerting Speed

Is this sort of alerting fast enough? That can be calculated pretty simply, using formulas from the SRE Workbook. For our model (99% SLO, 100k bookings/month), here is how each window behaves in the case of a full outage (100% error rate).
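
For a constant error rate, a window of length W with threshold T crosses that threshold after W * T / error rate. A rough Python sketch (ours, not the SRE Workbook's exact table):

```python
SLO = 0.99

# window length in minutes and its Burn Rate
WINDOWS = {
    "2 hr":  (120,  7.2),
    "12 hr": (720,  3.0),
    "3 day": (4320, 1.0),
}

def minutes_to_alert(window_minutes: float, burn_rate: float,
                     error_rate: float = 1.0) -> float:
    """Minutes until the rolling error rate crosses the alert threshold,
    assuming a constant error rate (1.0 = full outage)."""
    threshold = burn_rate * (1 - SLO)
    return window_minutes * threshold / error_rate

for name, (minutes, burn_rate) in WINDOWS.items():
    print(f"{name}: alert after ~{minutes_to_alert(minutes, burn_rate):.1f} min of full outage")
# ~8.6 min, ~21.6 min, ~43.2 min
```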

The calculation above indicates that the first 2 hr window would trigger an alert after roughly 8 and a half minutes. The 12 hr window would trigger about 13 minutes later, and the 3 day window at just north of 43 minutes. So, let's take a look at the behaviour of the whole solution (working from just the first alert case).

This framework does a great job capturing issues that affect SLO, regardless of their size.

Speed of Returning to the OK State

The aforementioned alerts pose a small problem, in that they remain in an alerting state for longer than is necessary or even practical (e.g. a whole 3 days, even for an incident that lasted 5 hours). To compensate for this, the SRE Workbook suggests using sub-windows: each window is broken out into a main (e.g. 12h) and a secondary (e.g. 1h) window, and to alert, the Error Rate has to exceed the limit in both.
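
In sketch form, the condition looks like this (plain Python; the 1-hour secondary window for a 12-hour main window follows the 1/12 ratio suggested by the SRE Workbook):

```python
SLO = 0.99

def should_alert(main_window_error_rate: float,
                 secondary_window_error_rate: float,
                 burn_rate: float) -> bool:
    """Fire only while BOTH the main window (e.g. 12h) and the secondary
    window (e.g. 1h) exceed the threshold, so the alert clears soon after
    the incident actually ends."""
    threshold = burn_rate * (1 - SLO)
    return (main_window_error_rate > threshold and
            secondary_window_error_rate > threshold)

# During an incident both windows are hot, so the alert fires.
print(should_alert(0.05, 0.20, burn_rate=3.0))  # True
# Hours after recovery the 12h window is still polluted, but the fresh 1h window is clean.
print(should_alert(0.05, 0.00, burn_rate=3.0))  # False
```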

Unfortunately, this is where detailed advice from the SRE Workbook ends, so we had to devise our own solutions specific to our use case.

Solving the Quandary of Low & High-Traffic Integrations

For high-traffic services (1000 events/hour or more) the aforementioned solution works right out of the box. The SRE Workbook even suggests working with shorter windows (1hr, 6hr, 3d). Our booking integrations can vary dramatically, though; some generate relatively low traffic, anywhere from a few hundred bookings per hour down to just a handful per day. This poses a problem. To calculate a reasonable error rate, you need a certain volume of bookings. You could have a 50% error rate due to 5 failures out of 10, but that could simply be one persistent user retrying over and over.

What we arrived at was having different time-window combinations based on the traffic of specific integrations. The largest integrations have 3 windows (2hrs/6hrs/3d) while the smallest integrations have 2 windows (10d/30d). Rounding things out, we have a few combinations for mid-size integrations.
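
Expressed as a lookup (the tier boundaries and the mid-size combinations below are illustrative; only the largest and smallest tiers match the exact windows mentioned above):

```python
def windows_for_traffic(bookings_per_day: float) -> list[str]:
    """Pick a time-window combination based on an integration's traffic.
    The boundaries are illustrative, not our exact production values."""
    if bookings_per_day >= 5_000:    # largest integrations
        return ["2h", "6h", "3d"]
    if bookings_per_day >= 500:      # mid-size (one of a few combinations)
        return ["12h", "3d", "10d"]
    if bookings_per_day >= 50:       # smaller mid-size
        return ["3d", "10d"]
    return ["10d", "30d"]            # smallest integrations

print(windows_for_traffic(20_000))  # ['2h', '6h', '3d']
print(windows_for_traffic(10))      # ['10d', '30d']
```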

Limiting SLO Based on Traffic

Obviously, the name of the game is limiting the maximum possible SLO based on traffic. So, while our largest integrations might have a 99%+ SLO, smaller ones (10 bookings/day) have a limit of 80%. This is important for a few reasons. First, if there aren't enough bookings in the window, you can't alert on a high SLO. Imagine an SLO of 99%, a 30d window and 10 bookings/day (300 bookings per month). In that scenario, 3 failures would trigger an alert. But again, that could be just one user retrying. Second, high-traffic integrations have a much higher user impact. So the goal is to have as high an SLO as possible there, and to get alerted on small integrations only if there is customer impact comparable to our larger integrations.
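
One way to derive such a cap (a hedged sketch; this is the idea, not necessarily our exact production rule) is to demand that the Error Budget correspond to some minimum number of genuinely failed bookings:

```python
def max_slo(monthly_bookings: int, min_failures: int = 50) -> float:
    """Cap the SLO so that at least `min_failures` bookings must fail in a
    month before the Error Budget is gone. The value 50 is illustrative,
    echoing the minimum-volume rule described further below."""
    return max(0.0, 1 - min_failures / monthly_bookings)

print(f"{max_slo(100_000):.2%}")  # large integration: 99.95%
print(f"{max_slo(300):.2%}")      # ~10 bookings/day: 83.33%, in the same ballpark as the 80% limit above
```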

Small Integrations Don’t Hit the Pager

Given the SLO framework, we've put on-call duty on the middle-to-big integrations, while the small integrations only get Slack message/Jira ticket alerts. This comes down to the priority of the action required.

Peak & Off-Peak Considerations

At Omio the majority of booking integrations show big differences between peak and off-peak hours. For example, an integration could have 100 bookings per hour during the day, but fewer than 10 bookings per hour at night.

Disabling alerting at night isn't an option, because an unspotted outage running for several hours at night causes major customer impact and might cause the SLO to fail. On the other hand, a full outage at night (10 bookings per hour) has a much lower impact than during the daytime (100 bookings/hour). So, we disabled alerting on the shortest window at night.

We calculated the impact of a night-time outage on the Error Budget. For our example (99% SLO, 100k bookings/month, i.e. a 1k Error Budget), a full outage lasting 1 hour costs roughly:

  • Daytime (~100 bookings/hour): ~100 failed bookings, or ~10% of the Error Budget
  • Night (~10 bookings/hour): ~10 failed bookings, or ~1% of the Error Budget

It was clear that the shortest window (2h) should be disabled at night, but the 12h window should still trigger On-call duty. That leaves 2 alerts that can trigger On-call duty: the 2h window during the day and the 12h window around the clock.
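
A sketch of that gating (the night-hour range and the function are ours, purely for illustration):

```python
from datetime import datetime

NIGHT_HOURS = range(0, 6)  # illustrative off-peak hours, local time

def short_window_can_page(now: datetime) -> bool:
    """The shortest (2h) window is muted during off-peak night hours;
    the 12h and 3d windows stay active around the clock."""
    return now.hour not in NIGHT_HOURS

print(short_window_can_page(datetime(2021, 11, 4, 3, 0)))   # False: only the 12h window can page at 03:00
print(short_window_can_page(datetime(2021, 11, 4, 14, 0)))  # True: all windows active at 14:00
```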

Too Few Bookings? Ignore the Window.

For the sake of simplicity, we require at least 50 bookings in a time window; otherwise the window is ignored.
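
In sketch form (again plain Python rather than the actual Graphite query):

```python
MIN_BOOKINGS_PER_WINDOW = 50

def window_is_evaluated(bookings_in_window: int) -> bool:
    """Skip a window entirely if it saw fewer than 50 bookings; an error
    rate computed over a handful of events is mostly noise."""
    return bookings_in_window >= MIN_BOOKINGS_PER_WINDOW

print(window_is_evaluated(8))    # False: too few bookings, window ignored
print(window_is_evaluated(300))  # True
```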

Technical solution

A key aspect of this solution is not building anything custom (like a custom alerting service), but reusing existing alerting tools already on the market. We implemented the aforementioned alerting using Graphite as the metrics source and Grafana for dashboards. Such complex Graphite queries call for a templating platform, so we use Terraform. For storing SLOs and other configuration we use the Directus CMS.

In Summary.

Using SLO-based alerting, we can be confident that when an alert triggers, it's not a false alarm; it marks something that actually needs attention. We've also started to notice issues we had previously overlooked. The key step forward for us was adopting the SLO approach (it actually offers much more than alerting, but that's another article entirely). Once we fine-tuned it to crack the issue of low- and high-traffic integrations, we arrived at a reliable tool that increases our efficiency. We also realised that having reliable and precise alerting allows us to make greater use of alerts, which opens up new terrain for higher production quality.
