Our journey towards SLO-based alerting: Implementing SRE workbook alerting with Prometheus only

Seznam.cz DevOps
Sep 10, 2020


by František Řezníček, David Vávra, Martin Chodúr, Rudolf Thomas and Lukáš Svoboda

In a series of blog posts, we aim to present our ongoing path of introducing Service Level Objective (SLO) based alerting at our company, Seznam.cz, particularly on its advertising platform Sklik. As of September 2020, it has been two years since we started introducing SLOs and SLO-based alerting. We believe that at least some (if not most) of our experience may help you if you are considering introducing SLOs into your organization right now, or if you have just started and are not sure which approach to choose.

The SRE workbook describes how Google implements SLO alerting and was a huge inspiration for our implementation. Since we share its terminology, make sure to read at least its chapters on SLOs to better understand the following text.

In this blog post, we are going to focus on what led us to SLOs, what the prerequisites for the initial implementation were and what challenges we faced. The following blog posts will focus on the evolution of our SLO computation in more detail.

Motivation for SLO implementation

Before we dive any further, let's spend some time explaining our historical context. We manage approximately 50 services in several Kubernetes clusters, serving about 600 endpoints in total. Since we already had a Prometheus (Thanos) monitoring stack with defined standards for application metrics, basing Prometheus alerts on the rate of each application's errors was a logical choice. However, it turned out to be a rather naive one: we were experiencing a lot of false positives. A single failure of the infrastructure, a database or a central component usually led to a myriad of alerts, one for every affected component. Dealing with those problems eventually led us to introducing SLOs, which would bring us the benefits of:

  • having a widely understandable measure of system reliability through time (error budget), together with defined procedures in case the quality suffers (error budget policy document)
  • observing the system condition at the edge, as close to the user as possible

Alerting before SLO introduction

Theory and terminology

We have divided the whole system into several SLO domains, each representing a separate product or a standalone part of a single (big) product (e.g. userportal, userportal-api, userportal-import, userportal-export). An SLO domain effectively limits the scope of our error budget policy.

For every domain, we identify Service Level Indicators (SLIs) which are the most important for our users. Availability and latency are the most common SLI types, but we also make use of freshness, quality and coverage.

As the last level of partitioning, several SLO classes are defined for each SLO domain and SLI type — in order to group events sharing the same quality requirements together.

These abstract definitions may sound confusing, so let's follow with an example. In Sklik, users can interact with the system either via the website or via the API. Thus we have defined two SLO domains: userportal and userportal-api. For both of them, we decided to track the availability and latency SLI types, and defined four SLO classes following the SRE workbook: critical, high_fast, high_slow and low.

The error budget (and the associated error budget policy) refers to a 4-week floating window. For alerting (described as Multiwindow, Multi-Burn-Rate Alerts in the SRE workbook), we use various combinations of the following time ranges: 5m, 30m, 1h, 2h, 6h and 3d.
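To make this concrete, here is a minimal sketch of one such multiwindow, multi-burn-rate alert, following the SRE workbook's recommended pairings (1h with 5m, 6h with 30m) for an illustrative 99.9 % availability target. The recording rule names are made up, and our production rules use more window combinations and per-class thresholds:

```
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Illustrative sketch for a 99.9 % availability target:
      # page when the error budget is burnt 14.4x faster than sustainable
      # (1h long window, 5m short window) or 6x faster (6h / 30m).
      - alert: ErrorBudgetBurnRateTooHigh
        expr: |
          (
              slo:error_ratio:rate1h{slo_domain="userportal", slo_class="critical"} > (14.4 * 0.001)
            and
              slo:error_ratio:rate5m{slo_domain="userportal", slo_class="critical"} > (14.4 * 0.001)
          )
          or
          (
              slo:error_ratio:rate6h{slo_domain="userportal", slo_class="critical"} > (6 * 0.001)
            and
              slo:error_ratio:rate30m{slo_domain="userportal", slo_class="critical"} > (6 * 0.001)
          )
        labels:
          severity: page
```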

Initial implementation

Now that you know how we decided to structure our SLO domains, let's look at the fun part: how we filled them with data.

The already mentioned SRE workbook provides ready-to-use Prometheus rules and a description of the whole problem domain. It became our guide, and since we were already using Prometheus (Thanos), getting started should have been easy.

We had the option to use standardized application metrics, which were already implemented by all our applications (either natively, or using mtail and a dedicated Lua module for Nginx). The most important one of them for our cause is a histogram of request durations, labelled with the endpoint and the HTTP status code.
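Its exposition looks roughly like this; the metric and label names below are illustrative, as the real metric is part of our internal standard:

```
# Illustrative exposition of such a metric for one endpoint:
http_request_duration_seconds_bucket{endpoint="listCampaigns", status_code="200", le="0.5"} 1320
http_request_duration_seconds_bucket{endpoint="listCampaigns", status_code="200", le="8"} 1401
http_request_duration_seconds_bucket{endpoint="listCampaigns", status_code="200", le="+Inf"} 1402
http_request_duration_seconds_count{endpoint="listCampaigns", status_code="200"} 1402
http_request_duration_seconds_sum{endpoint="listCampaigns", status_code="200"} 512.3
```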

Computing SLO (success ratio) for availability and latency using this metric would be as simple as:

  • Availability: events with 5xx HTTP statuses are considered failed
  • Latency: events with a processing duration ≥ 8s are considered failed
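With the illustrative metric above, the two success ratios could be expressed roughly as follows (note that the 8 s threshold has to be one of the histogram bucket bounds):

```
# Availability: share of events that did not end with a 5xx status
  sum(rate(http_request_duration_seconds_count{status_code!~"5.."}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))

# Latency: share of events that finished within the 8 s threshold
  sum(rate(http_request_duration_seconds_bucket{le="8"}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
```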

It is generally recommended to measure SLOs as close to the users as possible. With the aforementioned application metrics, we could not rely solely on our edge Nginx proxies (the called endpoint is not known there in cases such as XML-RPC, gRPC or GraphQL), so we had to use a combination of the edge applications' metrics and the Nginx proxy metrics as a fallback (for cases in which the traffic does not reach the downstream component).
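To give a rough idea of that combination (all metric names here are illustrative, not our real ones), the failures seen by the edge application itself can be complemented with proxy-generated errors for requests that never reached it:

```
# Failed events for an illustrative edge application:
#   5xx reported by the application itself
# + 502/503/504 generated by the Nginx proxy in front of it
#   (i.e. requests that never reached the application)
  (
    sum(rate(http_request_duration_seconds_count{app="userportal-frontend", status_code=~"5.."}[5m]))
    or vector(0)
  )
+
  (
    sum(rate(nginx_http_response_count_total{upstream="userportal-frontend", status=~"502|503|504"}[5m]))
    or vector(0)
  )
```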

Initial implementation of SLO alerting: using application metrics

Cardinality issues

Due to the rather large number of time series (in particular because of the labels 'endpoint', 'status_code', 'instance' and 'le' for every latency histogram bucket), it was impossible for us to compute the increase over the whole SLO period (4w): our monitoring cluster was simply not able to finish the rule evaluation. Because of that, we had to create multiple layers of Prometheus recording rules.

Basically, we had one rule computing a 5m rate for every SLO type and SLO class, and then another set of rules computing sum_over_time for all the SLO time ranges we needed: 5m, 30m, 1h, 2h, 6h and 3d (for the SLO burn rate alerts) and 4w (primarily for the error budget).
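A sketch of the second layer, with illustrative rule names. The first layer is the set of per-class classification rules described in the next section; it is shown here recording a 5m increase() rather than a rate(), so that the longer windows can simply be summed, assuming the first layer is evaluated every 5 minutes and the windows therefore tile without overlap:

```
groups:
  - name: slo-windows
    rules:
      # The layer-1 series slo:events_total:increase5m already carries only
      # the labels slo_domain, slo_type, slo_class and result, so these
      # longer-window rules stay cheap to evaluate.
      - record: slo:events_total:increase30m
        expr: sum_over_time(slo:events_total:increase5m[30m])
      - record: slo:events_total:increase1h
        expr: sum_over_time(slo:events_total:increase5m[1h])
      - record: slo:events_total:increase2h
        expr: sum_over_time(slo:events_total:increase5m[2h])
      - record: slo:events_total:increase6h
        expr: sum_over_time(slo:events_total:increase5m[6h])
      - record: slo:events_total:increase3d
        expr: sum_over_time(slo:events_total:increase5m[3d])
      # 4w is used primarily for the error budget itself
      - record: slo:events_total:increase4w
        expr: sum_over_time(slo:events_total:increase5m[4w])
```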

Event classification

The next thing we had to solve was the classification of SLO events: for every SLO domain, SLO class and app, we had to create a single Prometheus recording rule matching all the endpoints with that classification. Each such rule used a regular expression label selector.

Prometheus recording rule for app 'web', SLO class 'low', with a regexp matching all endpoints with such classification
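A hedged reconstruction of such a rule; the metric name, the endpoint names and the regular expression are illustrative:

```
# Hedged reconstruction; metric, endpoints and regexp are illustrative:
- record: slo:events_total:increase5m
  labels:
    app: web
    slo_domain: userportal
    slo_type: availability
    slo_class: low
    result: fail
  expr: |
    sum(
      increase(http_request_duration_seconds_count{
        app="web",
        status_code=~"5..",
        endpoint=~"exportStats|listArchivedCampaigns|.*Report"
      }[5m])
    )
```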

Who would want to maintain that? We wouldn't, but unfortunately we had to. Therefore we developed a Jsonnet library dedicated to generating all the necessary recording rules for a given SLO domain (defined in a neat and simple configuration file).

Observing non-discrete SLIs

To observe non-discrete SLIs (such as latency), we use the Prometheus histogram metric type. To get exact results, the chosen SLO thresholds have to align with the bounds of the metric's buckets. Changing the bucket bounds requires modifying the application itself; on the other hand, having many bucket bounds "just in case" tends to increase the metrics' cardinality.

We also ran into an issue where Prometheus client libraries in different languages used different formats for the histogram bound label (le). This incompatibility can generally be solved by relabeling, but it is something that needs to be taken into account.
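For illustration, such a normalization could be done at scrape time roughly like this (rewriting le="8.0" to le="8"; whether to normalize towards the integer or the float form is an arbitrary choice):

```
# In the scrape config of the affected jobs:
# rewrite bucket bounds such as le="8.0" to le="8"
metric_relabel_configs:
  - source_labels: [le]
    regex: '([0-9]+)\.0'
    replacement: '$1'
    target_label: le
```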

Missing data if no errors happened

Another difficulty showed up when our services performed flawlessly. With no failed events, the PromQL expression used would return no data, which would then show up as N/A instead of 100 % on the SLO dashboards. To avoid this, we used the Jsonnet library to generate Prometheus recording rules with zero values for each combination of SLO domain, type and class.
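In our case these zero-valued series were generated by the Jsonnet library as separate recording rules; the effect is roughly the same as appending "or vector(0)" to the classification rule sketched earlier:

```
- record: slo:events_total:increase5m
  labels:
    app: web
    slo_domain: userportal
    slo_type: availability
    slo_class: low
    result: fail
  expr: |
    sum(
      increase(http_request_duration_seconds_count{
        app="web",
        status_code=~"5..",
        endpoint=~"exportStats|listArchivedCampaigns|.*Report"
      }[5m])
    )
    # fall back to an explicit zero when no failure was recorded at all
    or vector(0)
```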

Versioning of SLO

During the whole SLO adoption, we often changed the computation, normalization, classification and thresholds. This led to significant differences between the old and the new results. To compare them and verify the correctness of the new computation, we added a new dimension to the aggregation: the SLO version. Versioning allowed us to transition to a new computation smoothly: we would run both concurrently and flip to the new one after it had been running for the full 4 weeks.
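In practice this meant one more label (called slo_version here for illustration) stamped on all the SLO recording rules, which makes a side-by-side comparison a single aggregation:

```
# Error ratio over the last 4 weeks, broken down by SLO version
# (assuming a sibling classification rule records all events with result="total"):
  sum by (slo_version) (
    slo:events_total:increase4w{slo_domain="userportal", slo_type="availability", result="fail"}
  )
/
  sum by (slo_version) (
    slo:events_total:increase4w{slo_domain="userportal", slo_type="availability", result="total"}
  )
```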

Normalization of traffic irregularities

Our traffic pattern

Since we are a regional service, our traffic suffers from regular troughs during nights and weekends. On top of the SRE workbook examples, we have therefore enhanced our burn rate alerts with a dynamic multiplier (referred to as 'events_rate_coefficient' in the examples) to adjust the alerting threshold based on the current traffic rate.
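Our production formula is more involved, but the idea can be sketched roughly as follows: the coefficient compares the long-term average event rate with the current one, so during low-traffic periods the effective alerting threshold is raised. The rule below is an illustration under these assumptions, not our actual implementation:

```
# Illustrative coefficient: long-term average event rate divided by the
# current one, clamped so that a traffic outage cannot inflate it endlessly.
# Burn rate thresholds in the alert expressions are then multiplied by it,
# making the alerts less sensitive when traffic is low.
- record: slo:events_rate_coefficient
  expr: |
    clamp_max(
        avg_over_time(slo:events_total:increase5m{result="total"}[4w])
      /
        clamp_min(slo:events_total:increase5m{result="total"}, 1)
    , 10)
```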

Result

Eventually we managed to get SLO alerting working for the availability and latency SLI types on the Seznam.cz advertising platform Sklik. We kept all the “old” component alerts along with the new SLO-based alerts, but lowered their severity so that we do not get paged by them. We were happy to get fewer pages and to reduce the number of false positives and page storms.

Finally, we were able to see how much our users were affected over the last 4 weeks. This significantly improved the visibility of product stability over time for all involved parties: developers, operations (SRE) and, most importantly, Product Owners, who were now in charge of the desired level of quality of the product they are responsible for.

In general this was a success, but new types of problems and challenges showed up. The following list might serve as a heads-up for anyone else trying to implement SLO-based alerting using just application metrics as a data source.

Remaining challenges

  • Despite optimizations, we were still facing performance issues. In the worst case, SLO recording rules were failing to complete, resulting in gaps in graphs.
  • The classification of events was centralized in the label selectors of the Prometheus rules, meaning that developers adding new events to the system had to classify them in a separate place. This often led to unclassified endpoints, which we had to watch for with custom alerts.
  • Sometimes, mainly in the early days of the SLO adoption, invalid data got included in the SLO computation. This could be caused by invalid events (e.g. traffic from internal users), misclassified events or mistakes in the computation. In those cases, we needed to somehow “repair” the history of the error budget. Since Prometheus is an append-only database, there is no easy way to do this. It is possible to introduce an additional recording rule which adds a negative value to compensate for the invalid data (and is valid only for a certain period of time), but this is cumbersome and very, very error prone (see the sketch after this list).
  • While using standardized application metrics made it quite simple to start implementing the SLO according to the SRE workbook, it eventually turned out to be quite a burden. Any change in those metrics takes quite some time to agree upon and then to implement. Also, we are forced (by Prometheus’ TSDB data model) to keep the metric label cardinality to the necessary minimum, which means it may be impossible to take, for example, the user's User-Agent into account in the SLI evaluation.
  • Observing non-binary SLI types (e.g. latency) requires histogram-type metrics, which leads to even higher cardinality and still has the limitation that the thresholds must be a subset of the histogram bucket bounds.
  • Requests which, for some reason, never reach the destination service cannot be classified. Since we use the services' application metrics, we are not able to classify endpoints based only on the metrics from the gateway/proxy.
  • Should an SLO burn rate alert fire, it is not very straightforward for the on-call person to perform a per-app and per-endpoint drill-down to find the actual cause of the burnt error budget, because for performance reasons we omit the endpoint label from the SLO recording rule evaluation.
  • Grouping events into SLO classes may cause problems with less frequently called endpoints: they are not “significant enough” from the whole group's perspective.
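For completeness, here is a hedged sketch of the history-repair rule mentioned above (the amount, the labels and the timestamps are made up); it emits a negative value only while the correction window is active, so the 4-week sums net out the failures that should not have been counted:

```
# Hedged sketch only: emits a negative value for one hour so that the
# 4-week sums net out failures that should not have been counted
# (roughly 12 evaluations x -100 = about -1200 compensated failures).
- record: slo:events_total:increase5m
  labels:
    slo_domain: userportal
    slo_type: availability
    slo_class: critical
    result: fail
    correction: invalid-internal-traffic
  expr: |
    vector(-100)
      and
    (vector(time()) > 1599696000 < 1599699600)
```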

In the next article, we will describe in detail how we approached all of these problems using a tool which we have developed — slo-exporter.
