Our journey towards SLO based alerting: Advanced SLO infrastructure based on slo-exporter

Seznam.cz DevOps
Sep 22, 2020


by František Řezníček, David Vávra, Martin Chodúr, Rudolf Thomas and Lukáš Svoboda

As our first blog post explained, we succeeded in implementing SLOs for our product, but at the same time we struggled with multiple urgent issues.

This blog post continues where the first ended and describes SLO implementation enhancements, introduces the open-source slo-exporter application and finally discusses the results.

SLO infrastructure evolution

Our SLO implementation changes were directly driven by the issues mentioned in the previous article and resulted in an optimized architecture. Let’s discuss the most important changes one by one.

Brand new system event processing

We revisited the initial decision to implement SLOs using application metrics in Prometheus, which had limited us with cardinality issues and computation complexity.

We decided to measure traffic closer to our end users (on the system’s edge) by processing individual events using slo-exporter. This infrastructure change brought several improvements.

Dedicated slo-exporter instances now perform all the needed stream processing, resulting in unified SLI metrics. Simply put, the most complex and cumbersome regular expressions used in SLO classification moved from Prometheus recording rules into the slo-exporter configuration.
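For illustration, the old approach needed regex-heavy recording rules roughly along these lines; this is a simplified sketch with made-up metric, label and class names, not our actual rules:

```yaml
groups:
  - name: old-slo-classification-sketch
    rules:
      # Every endpoint family had to be classified into an SLO class with regular
      # expressions written directly in PromQL, over a high-cardinality label of
      # an application metric.
      - record: slo_class:http_requests:rate5m
        expr: |
          sum by (slo_class) (
            label_replace(
              rate(http_requests_total{path=~"/api/v1/(users|orders)/.*"}[5m]),
              "slo_class", "critical", "path", ".*"
            )
          )
```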

Real-time SLO incident root cause analysis

Soon after the introduction of SLO-based alerting we realized the need to quickly find which particular events (applications and endpoints) were causing the SLO error budget depletion. This was not achievable with the initial Prometheus-only SLO implementation. As the months with SLOs passed, the need for real-time SLO root cause analysis only grew.

Slo-exporter provides SLI metrics with just enough labels to identify the applications and endpoints behind the SLO error budget depletion. Thanks to the lower cardinality, we can easily compute ad hoc error budget burn rates broken down by application endpoint. Real-time root cause analysis dramatically reduces SLO incident resolution time.
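For example, an ad hoc query of roughly the following shape (the label names and values are illustrative and depend on the actual slo-exporter configuration) immediately shows which application endpoints currently produce the most failed events within a domain:

```promql
# Failed-event rate per application and endpoint within a single SLO domain.
sum by (app, event_key) (
  rate(slo_domain_slo_class_slo_app_event_key:slo_events_total{
    slo_domain="userportal", result="fail"
  }[5m])
)
```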

SLO configuration maintenance and ownership

Centralized SLO configuration for all SLO domains did not work well in our case. Developers frequently forgot to update a “foreign” repository and did not pay enough attention to details.

As a reaction to the described ownership issues, we decentralized the SLO configuration into multiple SLO domain configuration repositories. This change helped to establish clear SLO ownership:

  • SLO evaluation and alerting is owned by the operations team,
  • SLO (domain) configuration is owned by the dedicated product manager and the development teams.

These ownership changes distributed the configuration among the key people, but they did not reduce the overall maintenance effort, such as keeping the SLO configuration up to date with evolving applications.

Therefore we allow applications to perform the SLO classification themselves. The result of the classification is attached to the event, for example as a dedicated HTTP response header. Slo-exporter can then be configured to trust the application-provided classification over the static one. This way developers control the importance and quality demands of specific functionality directly in the application codebase.
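A minimal sketch of this contract could look as follows; the header names and the configuration keys are our own illustration of the idea (using the metadata classifier mentioned below), not an exact slo-exporter schema:

```yaml
# 1. The application marks its response, for example with HTTP response headers:
#      Slo-Domain: userportal
#      Slo-Class: critical
# 2. The edge proxy writes these headers into its access log and the tailer parses
#    them into the event metadata.
# 3. A metadata-based classifier then prefers these values over the static tables:
metadataClassifier:
  sloDomainMetadataKey: "Slo-Domain"
  sloClassMetadataKey: "Slo-Class"
```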

Transparent SLO corrections

Previously, SLO corrections were not easy to make because of the small, overlapping pre-aggregated rate/increase windows. The new SLO architecture allows us to precisely adjust the absolute event counts that fall under a specific SLI. SLO corrections then take place at the lower recording-rule layer, which makes them much more straightforward, as shown in the following example.
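This is a simplified sketch of such a correction: the correction series name, the label names (such as result) and their maintenance are our own convention, and the fallback for domains without any correction (an `or` clause) is omitted for brevity:

```yaml
groups:
  - name: slo-correction-sketch
    rules:
      # Absolute event counts over the 4-week rolling SLO window, split by result.
      - record: slo_domain_slo_class:slo_events:increase4w
        expr: sum by (slo_domain, slo_class, result) (increase(slo_domain_slo_class:slo_events_total[4w]))

      # Violation ratio with a manually maintained correction subtracted from the
      # failed-event count. "slo_correction:failed_events" is a hand-edited series
      # holding the absolute number of failed events to forgive per domain and class.
      - record: slo_domain_slo_class:violation_ratio4w
        expr: |
          (
              sum by (slo_domain, slo_class) (slo_domain_slo_class:slo_events:increase4w{result="fail"})
            -
              sum by (slo_domain, slo_class) (slo_correction:failed_events)
          )
          /
          sum by (slo_domain, slo_class) (slo_domain_slo_class:slo_events:increase4w)
```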

Introducing slo-exporter

Slo-exporter is an open-source, modular stream-processing application written in Go that helps measure SLOs for a product.

The architectural view

End-user requests and responses flow through the system edge components. Slo-exporter instances observe and stream-process every system event according to their configuration (shown in the following picture as the dark blue dashed rectangle). Each slo-exporter instance then exposes SLI metrics. The monitoring infrastructure (Prometheus) scrapes the SLIs and evaluates SLO error budgets and error budget burn rates using multiple lightweight recording rules. The SLO metrics are then used for visualization (Grafana) and alerting (Alertmanager, OpsGenie).

SLO framework overview

Slo-exporter basics

Slo-exporter is a modular stream-processing application computing and exposing SLIs. System events are obtained by an ingester module (tailer, prometheus ingester) which produces internal slo-exporter events. The internal events are usually filtered and normalized (relabel module), classified into SLO domains and SLO classes (multiple classifier modules), evaluated as successful or failed (SLO event producer module) and finally exported (Prometheus exporter module) for further SLO evaluation.

The stream-processing pipeline is configured in the main configuration file.
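A minimal sketch of such a main configuration file is shown below. The module order follows the pipeline described above; the exact key names may differ from the current slo-exporter schema, so treat it as an illustration rather than a copy-paste configuration:

```yaml
# The order of the modules defines the order of the stream processing.
pipeline:
  - tailer              # read system events from an access log
  - relabel             # filter and normalize the events
  - eventKeyGenerator   # attach the normalized event name (event-key)
  - dynamicClassifier   # assign SLO domain, SLO class and application
  - sloEventProducer    # evaluate failure conditions, emit SLI increments
  - prometheusExporter  # expose the resulting SLI metrics

modules:
  tailer:
    # ... per-module configuration described in the following sections
  relabel:
    # ...
```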

Let’s explore the slo-exporter pipeline tasks in more detail, following the processing path shown in the diagram.

Obtain input system events

Slo-exporter is currently able to obtain system events from one of two sources:

  1. an arbitrary log file, using the tailer module,
  2. Prometheus queries, using the prometheus ingester module.

Additional event sources such as Kafka and Envoy gRPC access logging are being evaluated.

The tailer module follows a log file and parses every record (line) with a configurable regular expression in order to extract the needed response metadata such as duration, status code, IP address, HTTP headers and so on. The logged response is parsed into an internal hash-like object (from now on referred to as an slo-exporter event), which is then sent to the rest of the slo-exporter pipeline.
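A sketch of a tailer configuration could look like the following. The regular expression with named capture groups is the important part; the surrounding key names are illustrative and may not match the exact slo-exporter schema:

```yaml
tailer:
  tailedFile: "/var/log/nginx/access.log"
  # Named capture groups become the metadata of the resulting slo-exporter event,
  # here parsed from a simplified access-log line such as:
  #   10.0.0.1 "GET /api/v1/users/42 HTTP/1.1" 200 0.042
  loglineParseRegexp: '^(?P<ip>\S+) "(?P<method>\S+) (?P<path>\S+) \S+" (?P<statusCode>\d+) (?P<requestDuration>[0-9.]+)$'
```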

There are situations, such as computing non-request-driven SLIs, where system events are not present in the logs but are available as application metrics. In such cases we can use the alternative prometheus ingester module to receive system events from applications via Prometheus.

Receiving input system events (from logs and metrics)

Filter and normalize system events

At this point we have our system events transformed into slo-exporter events by an ingester module. Next, we want to filter and normalize the slo-exporter events using the relabel module. Usual filtering and normalization use cases include dropping events which should not count towards the SLO (for example monitoring or health-check traffic) and normalizing dynamic parts of event identifiers (such as numeric IDs in URIs) to keep the event cardinality low.

At the end of this phase, we attach a normalized event name (the so-called event-key) to every slo-exporter event using the event key generator module.
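A sketch of these two steps follows. The relabel module uses Prometheus-style relabeling semantics; the wrapper keys and metadata field names are illustrative:

```yaml
relabel:
  eventRelabelConfigs:
    # Drop monitoring and health-check traffic so it never counts towards the SLO.
    - source_labels: ["path"]
      regex: "/healthz|/metrics"
      action: drop
    # Normalize dynamic path segments (numeric IDs) to keep event cardinality low.
    - source_labels: ["path"]
      regex: "(.*)/[0-9]+(/.*)?"
      target_label: "path"
      replacement: "$1/:id$2"
      action: replace

eventKeyGenerator:
  # Build the event-key from selected metadata fields, e.g. "GET:/api/v1/users/:id".
  fieldSeparator: ":"
  metadataKeys: ["method", "path"]
```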

Classify system events

We should classify valid and normalized system events into the following categories:

  • SLO domain
  • SLO class
  • application the event belongs to

Slo-exporter supports several system event classification approaches:

  • Trust the SLO classification sent along with the system event by the application. This is done via the metadata classifier module which simply reuses the existing system event metadata.
  • Feed CSV configuration file(s) to the dynamic classifier module. The classification may be provided as a set of exact and/or regex matches of the normalized event name and ties together the event-key, SLO domain, SLO class and the application (see the sketch after this list).
  • The statistical classifier observes the distribution of already classified events. Any unclassified events are then classified with a weighted guess based on the observed distribution. This method is intended as a fallback for the approaches above.
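As an illustration of the second approach, the dynamic classifier could be fed CSV files similar to the ones below; the file layout and column order are our own example, so check the slo-exporter documentation for the real format:

```yaml
dynamicClassifier:
  exactMatchesCsvFiles:
    - "classification/exact.csv"
  regexpMatchesCsvFiles:
    - "classification/regexp.csv"

# classification/regexp.csv
# (illustrative columns: slo_domain, slo_class, app, event_key regexp)
#
#   userportal,critical,frontend-api,"GET:/api/v1/users/.*"
#   userportal,low_priority,frontend-api,"GET:/api/v1/avatars/.*"
```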

Evaluate system event quality

Valid, normalized and classified system events are transformed into SLI increments by the SLO event producer module. Since a single system event may contribute to multiple defined SLIs, the SLO event producer module may also generate multiple SLI increments from one system event.

The transformation uses SLI failure conditions to split system events into failed and successful ones, so that the SLI (the ratio of failed to total events) can be computed later.

The SLO event producer module reads a dedicated configuration file which defines the SLI failure conditions as well as additional SLI increment metadata (a sketch follows the list):

  • SLO version
  • SLO type (availability, latency, quality…)
  • event result (indicates whether the newly created event is to be considered successful)
  • normalized slo-exporter event name (event-key)
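A sketch of such a configuration follows. It mirrors the concepts above (failure conditions plus additional metadata), but the exact key and operator names may differ from the real slo-exporter schema:

```yaml
rules:
  # Availability SLI: the event fails when the upstream answered with a 5xx status code.
  - slo_matcher:
      domain: userportal
    failure_conditions:
      - key: statusCode
        operator: numberIsHigherThan
        value: 499
    additional_metadata:
      slo_type: availability
      slo_version: 1

  # Latency SLI: an event of the "critical" class fails when it took longer than 500 ms.
  - slo_matcher:
      domain: userportal
      class: critical
    failure_conditions:
      - key: requestDuration
        operator: durationIsHigherThan
        value: 500ms
    additional_metadata:
      slo_type: latency
      slo_version: 1
```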

Export SLI metrics

The final stream-processing stage accumulates SLI increments and exports SLIs for further SLO visualization and alerting. Slo-exporter currently uses the Prometheus exporter module to export SLIs as Prometheus runtime metrics. The most important SLI runtime metrics are:

  • slo_domain_slo_class:slo_events_total, the SLI metric used for SLO, SLO error budget and SLO error budget burn-rate computations and alerting,
  • slo_domain_slo_class_slo_app_event_key:slo_events_total, an SLI metric with higher cardinality, providing a detailed view down to the application’s normalized event name. It is used by the SLO root cause analysis functionality.

The following picture shows three system events observed in a proxy access log which are subsequently transformed into SLI metrics using a simple HTTP status-code SLI failure condition.

Alternative SLI exporting approaches (TimescaleDB, ElasticSearch, …) are currently being evaluated.

SLO evaluation and alerting

Although the slo-exporter SLI metrics are enough to implement SLO visualization and alerting, we suggest using recording rules to speed up the SLO visualization and to support SLO corrections.

The remaining computation in Prometheus performs the following steps (sketched below):

  1. Aggregate events over the SLO rolling window (4 weeks)
  2. Calculate ratio of failed events (violation-ratio) including existing SLO corrections
  3. Calculate SLO error budget
  4. Calculate multi-window SLO error budget burn rates
  5. Alert on SLO error budgets and their burn rates
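For illustration, steps 2 to 5 boil down to rules of roughly the following shape. This is a simplified sketch: the 99.9 % target, the burn-rate thresholds and the result label are examples, the 4-week aggregation and the corrections from the previous section are omitted, and the short-window rules (5m, 30m, 6h, …) are analogous to the 1h one:

```yaml
groups:
  - name: slo-evaluation-sketch
    rules:
      # 2. Ratio of failed events (violation ratio) over a given window.
      - record: slo_domain_slo_class:violation_ratio1h
        expr: |
          sum by (slo_domain, slo_class) (rate(slo_domain_slo_class:slo_events_total{result="fail"}[1h]))
            /
          sum by (slo_domain, slo_class) (rate(slo_domain_slo_class:slo_events_total[1h]))

      # 3./4. Error budget burn rate = violation ratio divided by the allowed error
      # rate (a 99.9 % SLO target allows 0.001 of all events to fail).
      - record: slo_domain_slo_class:error_budget_burn_rate1h
        expr: slo_domain_slo_class:violation_ratio1h / 0.001

      # 5. Multi-window alert: page when the error budget burns roughly 14x faster
      # than allowed over both the long and the short window.
      - alert: SloErrorBudgetBurnRateTooHigh
        expr: |
          slo_domain_slo_class:error_budget_burn_rate1h > 14.4
            and
          slo_domain_slo_class:error_budget_burn_rate5m > 14.4
        labels:
          severity: critical
```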

Key takeaways

Decentralization of the SLO configuration clarified ownership and streamlined maintenance across the product, development and operations teams. Easier SLO incident root cause analysis (down to the application endpoint) reduced the average incident resolution time and improved the overall SLO framework user experience.

The SLO architecture improvements driven by the new slo-exporter event processing not only addressed the pending SLO issues but also made the SLO framework more robust and flexible. Last but not least, we believe that our effort in making slo-exporter open-source will help you implement SLOs for your product.
