Navigating Service Level Objectives Series — Third Part: The SLO Toolkit

Third Part: The SLO Toolkit

tb.lx insider · 20 min read · May 13, 2024

Streamlined SLO Setup & Alerting with Pyrra

In our previous articles, we embarked on an insightful journey into the world of Service Level Objectives.

Navigating Service Level Objectives laid the foundation, introducing the basics and significance of SLOs in modern software operations.

Then, The SLOs Playbook took a practical turn, applying these concepts to a simplified fleet management system. We delved into setting up SLIs, SLOs, and error budgets, using a REST API as our learning model. Our hands-on approach provided a solid understanding of the fundamentals of SLOs in a real-world context.

In this third installment of our series, we will pivot towards a more advanced and streamlined approach to SLOs. We’ll explore how Pyrra, a robust tool in the Prometheus ecosystem, revolutionizes SLO setup and alerting. This exploration will not only deepen our understanding of SLO intricacies but also demonstrate practical applications, ensuring our SLO strategies are both efficient and effective.

SLOs: The Complexity Within

However, with advancement come new challenges. The cognitive load around SLO concepts is already significant, and it is compounded when teams have to implement the metrics that monitor SLOs by hand: understanding and writing the required queries can be intimidating, and the queries themselves are error-prone and become a maintenance headache over time.

Moreover, the limits of Prometheus’ data retention complicate long-term SLO tracking, and heavy queries put additional strain on resources, increasing the risk of errors.

These obstacles can impede the effective adoption, implementation, and management of SLOs, which are crucial for maintaining high service reliability and customer satisfaction.

Conquering SLO Complexity: The SRE Quest for Efficiency

In this third article of our series, we tackle these challenges head-on, embracing the SRE ethos of reducing toil and automating manual tasks. Our goal is to streamline and simplify the SLO process.

Enter Pyrra, a tool specifically crafted to alleviate the challenges of SLO creation and management. By integrating Pyrra into the SLO Laboratory setup we introduced in our previous article, we’ll demonstrate its ability to simplify SLO configuration, overcome Prometheus retention challenges, and enhance query efficiency.

This article will then guide you in embedding SLOs into your incident management process, focusing on efficient alerting strategies. Gear up for a hands-on journey towards mastering SLOs more easily and effectively.

Pyrra: Revolutionizing our approach to SLOs

Our path to adopting Pyrra

Pyrra is an open-source project, aimed at making Service Level Objectives (SLOs) with Prometheus more manageable, accessible, and user-friendly. It has been a game-changer in our journey with Service Level Objectives.

We first heard about it on Grafana’s Big Tent Podcast, in the episode Why SLOs are MVPs in Observability, where one of Pyrra’s creators, Matthias Loibl, introduced it.

At that time, one of our teams was delving deep into SLOs. We described it in the first episode of this series:

Our journey to adopt SLOs started modestly by merely tracking the daily availability of our REST APIs.[…]

However, the real turning point came when one of our product teams was tasked with overhauling a legacy system that had become notorious for its instability. […]

For the team, the mission was two-fold. First, to understand the performance of different aspects of the legacy system […]. Second, to bootstrap a brand-new platform that would not only match the feature set of the existing system but would also prioritize reliability from the get-go.

Implementing a comprehensive framework for SLOs became an essential part of this mission. […]

Our initial steps in SLO implementation involved manual, complex, and sometimes error-prone queries. While these provided valuable insights, they were cumbersome and slow. Pyrra entered the scene as a catalyst, accelerating our SLO adoption process. It simplified creating and managing SLOs, reduced the load on Prometheus, and introduced high-quality alerting. More importantly, Pyrra facilitated a cultural shift within our organization, emphasizing the significance of SLOs.

In the following parts of this section, we will explore Pyrra in depth; its feature set, how it simplifies SLO management, and its out-of-the-box alerting capabilities. This will lead us into the next section, which focuses exclusively on alerting based on SLOs.

Feature set of Pyrra

At its core, Pyrra is centered around Custom Resource Definitions (CRDs) representing Service Level Objectives. This means that each SLO is essentially a (simple) configuration file. Once set up, these configurations are processed by the Pyrra backend (installed either as a Kubernetes operator or as a filesystem runtime), which seamlessly generates configurations for Prometheus and takes care of the heavy lifting involved in setting up SLOs, error budgets, alerting, and so on. The beauty of Pyrra lies in its integration into your existing Prometheus workflow, enhancing its functionality without disrupting established processes.

Pyrra supports various SLO types, each tailored to specific monitoring needs:

  1. Ratio SLOs
    These focus on the success rate of requests, derived from the proportion of error responses to total requests. They are particularly useful in applications where success can be distinctly measured, such as API response rates.
  2. Latency SLOs
    These are essential for services where response time is critical. Using Prometheus histograms, latency SLOs help in tracking the percentage of requests that are served within a specified time frame.
  3. Bool Gauge SLOs
    Ideal for simpler, binary metrics like uptime, these SLOs are based on gauge metrics that typically have values of 0 or 1. They provide a straightforward way to monitor the operational status of a service.
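
The first two types get full treatment later in this article. For the third, here is a minimal sketch of what a bool gauge SLO could look like; the metric, names, and exact field layout are illustrative and worth double-checking against Pyrra’s own examples:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: rest-api-uptime   # illustrative name
  namespace: rest-api
spec:
  target: '99.0'
  window: 28d
  indicator:
    bool_gauge:
      # a gauge that reports 1 while the service is up and 0 otherwise
      metric: up{job="rest-api"}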

Pyrra’s workflow: Synergy between Prometheus and Pyrra

The synergy between Pyrra and Prometheus streamlines SLO management. Pyrra serves as the bridge, translating SLO configurations into a format that Prometheus can process efficiently.

The diagram below shows the lifecycle of a ServiceLevelObjective custom resource in a Kubernetes workspace. These resources are installed in a Kubernetes cluster; the Pyrra operator then reconciles the ServiceLevelObjectives and translates them into corresponding PrometheusRule custom resources (containing recording and alerting rules), which are in turn reconciled by the Prometheus Operator (see Kube Prometheus).

This integration allows for a smooth transition of SLO data from definition to actionable metrics within Prometheus.

Lifecycle of a Service Level Objective in Kubernetes

Prometheus Recording Rules: A brief overview

Prometheus recording rules play a pivotal role in efficient SLO management with Pyrra. These rules do more than just periodically process heavy queries; they offer a strategic solution to Prometheus’ data retention limitations.

  1. Periodic Query Processing
    By evaluating complex queries at regular intervals, recording rules reduce the computational load. They pre-compute and store results as time series data, making them readily available for analysis.
  2. Overcoming Retention Limits
    A significant advantage of this approach is its ability to circumvent Prometheus’ retention time constraints. When recording rules evaluate metrics, they can work on the historical data available then, creating new data points. These points reflect aggregated or computed values over potentially vast time spans.
  3. Facilitating Larger Windows for SLOs
    This feature is crucial for SLOs that require monitoring over extended periods. By storing the results of these computations in new time series, Pyrra enables us to define SLOs over windows larger than what Prometheus’ live query capabilities can handle. In essence, it’s like extending the memory of our monitoring system, allowing us to make more informed decisions based on a broader range of historical data.

In summary, Prometheus recording rules, when used with Pyrra, not only streamline the query process but also expand the potential for long-term data analysis, which is vital for effective SLO management.
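
To make this concrete, here is a minimal, hand-written recording rule of the kind Pyrra generates for us; the rule group and metric names are illustrative:

groups:
  - name: slo-precompute-example   # illustrative group name
    interval: 2m
    rules:
      # Evaluate the heavy 4-week error ratio every 2 minutes and store the
      # result as a new, cheap-to-query time series.
      - record: http_server_requests:error_ratio_rate4w
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[4w]))
          /
          sum(rate(http_server_requests_seconds_count[4w]))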

As we’ve explored so far, Pyrra stands as a powerful ally in our SLO journey, offering a refined way to define and manage Service Level Objectives within the Prometheus ecosystem. With its ability to handle different types of SLOs and integrate seamlessly into our existing monitoring setup, Pyrra has shown us the theoretical potential to revolutionize our approach to SLOs.

However, the true test of any tool lies in its application. It’s one thing to understand the capabilities of Pyrra in theory but quite another to see it in action, transforming our SLO strategies from concepts into tangible results. This is where we transition from the ‘what’ and ‘why’ of Pyrra, to the ‘how’.

Implementing SLOs: A Practical Guide with Pyrra and Prometheus

In this section, we revisit the simple fleet management system introduced in our previous article, The SLO Playbook. This real-world example serves as our laboratory for implementing and understanding SLOs in a practical context.

Just like before, our setup is simple and requires Docker and Docker Compose. It includes a monitoring stack with Prometheus and Grafana, a Spring Boot-based REST API backed by PostgreSQL, and k6 for load testing. This environment provides us with the necessary tools and metrics to effectively implement and monitor our SLOs.

In this exercise, we’re going to apply what we’ve learned so far but with a twist; we’ll be using Pyrra in conjunction with Prometheus to create and manage our SLOs. Our focus will be on setting up two specific types of SLOs using Pyrra:

  1. Ratio SLO: We aim for 99% of requests to the REST API to be successful (status code not in the 5xx range) over a period of 4 weeks. This SLO will help us ensure the high availability and reliability of our API.
  2. Latency SLO: Our goal is for 99% of successful requests (status code in the 2xx range) to the REST API to be processed in under 50ms, also over a 4-week window. This SLO is critical for maintaining a responsive, fast user experience.
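
Before wiring anything up, it is worth translating these targets into error budgets. A 99% target over a 28-day window leaves a 1% budget:

1% of 28 days = 0.01 × 28 × 24 h ≈ 6.7 hours of complete failure for the availability SLO, or 1 request in 100 allowed to miss its target over the window.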

By the end of this section, you’ll see how Pyrra simplifies the process of defining, managing, and visualizing these SLOs, providing a more streamlined, efficient approach compared to using Prometheus directly.

To get started, a few simple commands are all it takes:

# Kickstart the monitoring stack
docker compose up -d prometheus
docker compose up -d grafana
docker compose up -d pyrra-api
docker compose up -d pyrra-filesystem

# Fire up the REST API
docker compose up -d rest-api --build

# Launch load tests against the REST API
docker compose run --rm k6

Building an Availability SLO

To establish a Ratio SLO in Pyrra, we specify two key Prometheus queries: the error rate and the total request rate. Reflecting on our previous article, we crafted the following queries for this purpose:

Error rate (daily)

sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/operators.*",
      status=~"5.."
    }[1d]
  )
) or vector(0)

Total Request Rate (daily)

sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/operators.*"
    }[1d]
  )
) or vector(0)
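
For reference, combining the two by hand yields the daily availability; this is the kind of expression that Pyrra will generate and maintain for us:

# Daily availability: 1 minus the ratio of 5xx requests to all requests
1 -
(
  sum(rate(http_server_requests_seconds_count{uri=~"/operators.*", status=~"5.."}[1d])) or vector(0)
)
/
sum(rate(http_server_requests_seconds_count{uri=~"/operators.*"}[1d]))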

Utilizing these, we can define a Ratio SLO custom resource in Pyrra as follows:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: rest-api-availability
  namespace: rest-api
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99.0'
  window: 28d
  indicator:
    ratio:
      errors:
        metric: http_server_requests_seconds_count{uri=~"/operators.*", status=~"5.."}
      total:
        metric: http_server_requests_seconds_count{uri=~"/operators.*"}

It’s important to note that Pyrra simplifies the configuration by not requiring the time window or operators like rate or sum in the queries. Pyrra intelligently processes these raw counters into practical queries for SLO monitoring.

The resulting recording rules from this SLO definition are the following (the output has been simplified):

groups:
  - interval: 2m30s
    name: rest-api-availability-increase
    rules:
      - expr: sum by (status, uri) (increase(http_server_requests_seconds_count{uri=~"/operators.*"}[4w]))
        labels:
          slo: rest-api-availability
        record: http_server_requests_seconds:increase4w

  - interval: 30s
    name: rest-api-availability-generic
    rules:
      - expr: "0.99"
        labels:
          slo: rest-api-availability
        record: pyrra_objective
      - expr: 2419200
        labels:
          slo: rest-api-availability
        record: pyrra_window
      - expr: 1 - sum(http_server_requests_seconds:increase4w{slo="rest-api-availability",status=~"5..",uri=~"/operators.*"}
          or vector(0)) / sum(http_server_requests_seconds:increase4w{slo="rest-api-availability",uri=~"/operators.*"})
        labels:
          slo: rest-api-availability
        record: pyrra_availability

The meaning of these rules is the following:

  • http_server_requests_seconds:increase4w: the increase of the request counter over the 4-week window, broken down by status and uri; the pre-computed raw material for the SLO.
  • pyrra_objective: the SLO target expressed as a ratio (0.99).
  • pyrra_window: the SLO window in seconds (2419200 s = 28 days).
  • pyrra_availability: the measured availability over the window, i.e. 1 minus the ratio of 5xx increases to total increases.

The error budget can then be expressed as:

(pyrra_availability - on (slo) pyrra_objective)
/ on (slo)
(1 - pyrra_objective)
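
For instance, with pyrra_availability at 0.995 against a 0.99 objective, the remaining error budget is (0.995 - 0.99) / (1 - 0.99) = 0.5: half of the budget is still available.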

These recording rules offer significant advantages:

  • Pre-computing potentially heavy queries over a 4-week period.
  • Simplifying and automating the expression of availability, which can often be complex and error-prone in manual setups.

Building a Latency SLO

After successfully establishing an Availability Ratio SLO, we now turn our attention to a different but equally crucial type of Service Level Objective: the Latency SLO. This SLO type is a bit more nuanced due to its close association with histograms in Prometheus, offering a unique approach to monitoring service responsiveness.

In contrast to ratio-based SLOs, Latency SLOs are deeply intertwined with the concept of histograms. Histograms in Prometheus are powerful tools that group data into buckets, each representing a range of response times. For our Latency SLO, we focus on two key aspects:

  1. Success Counter: This represents the count of requests that are faster than a specific threshold. The threshold is determined by the le (less than or equal to) label in the histogram counters. Essentially, this metric helps us track the proportion of requests meeting our defined performance standard.
  2. Total Counter: Like our previous SLO, this tracks the total number of requests. It provides the denominator for our SLO calculation, allowing us to determine what percentage of total requests meets our latency criteria.

By defining these two queries, we can effectively measure the latency performance of our service. We aim to ensure that a high percentage of requests are processed within our desired time frame, reflecting a responsive and reliable system.

Reflecting back on our previous article, we crafted the following queries for this purpose:

“Fast” Request rate (daily)

This metric counts requests processed faster than our set threshold (in this case, 50ms), which is crucial for evaluating how many requests meet our performance standards.

sum(
  rate(
    http_server_requests_seconds_bucket{
      uri=~"/operators.*",
      status=~"2..",
      le="0.05"
    }[1d]
  )
) or vector(0)

Total Request Rate (daily)

This provides the total number of successful requests, serving as the denominator when determining the percentage of "fast" requests.

sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/operators.*",
      status=~"2.."
    }[1d]
  )
) or vector(0)

Using these queries, we define our Latency SLO in Pyrra:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: rest-api-latency
  namespace: rest-api
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99.0'
  window: 28d
  indicator:
    latency:
      success:
        metric: http_server_requests_seconds_bucket{uri=~"/operators.*", status=~"2..", le="0.05"}
      total:
        metric: http_server_requests_seconds_count{uri=~"/operators.*", status=~"2.."}

The resulting recording rules are similar to the Availability SLO ones:

groups:
  - interval: 2m30s
    name: rest-api-latency-increase
    rules:
      - expr: sum by (status, uri) (increase(http_server_requests_seconds_count{status=~"2..",uri=~"/operators.*"}[4w]))
        labels:
          slo: rest-api-latency
        record: http_server_requests_seconds:increase4w
      - expr: sum by (status, uri) (increase(http_server_requests_seconds_bucket{le="0.05",status=~"2..",uri=~"/operators.*"}[4w]))
        labels:
          le: "0.05"
          slo: rest-api-latency
        record: http_server_requests_seconds:increase4w

  - interval: 30s
    name: rest-api-latency-generic
    rules:
      - expr: "0.99"
        labels:
          slo: rest-api-latency
        record: pyrra_objective
      - expr: 2419200
        labels:
          slo: rest-api-latency
        record: pyrra_window
      - expr: sum(http_server_requests_seconds:increase4w{le="0.05",slo="rest-api-latency",status=~"2..",uri=~"/operators.*"}
          or vector(0)) / sum(http_server_requests_seconds:increase4w{le="",slo="rest-api-latency",status=~"2..",uri=~"/operators.*"})
        labels:
          slo: rest-api-latency
        record: pyrra_availability

In this setup, we pre-compute two metrics over a 4-week period: the total requests and the “fast” requests. This distinction is key in Latency SLOs, allowing us to monitor not just the volume of requests but their speed, ensuring that our system’s responsiveness aligns with our defined targets.

Visualizing Success: Monitoring SLOs with Grafana and Pyrra

Having successfully configured our Availability and Latency SLOs with Pyrra, we now move to a critical aspect of SLO management: effective monitoring. It’s worth noting that setting up SLOs in Pyrra is notably simpler and more streamlined than manual configurations directly in Grafana and Prometheus. This ease of use is a significant advantage, allowing us to focus more on analysis and less on setup complexities.

The following section will delve into the specifics of monitoring SLOs using Grafana. By integrating Pyrra’s SLO configurations into Grafana dashboards, we can transform complex data into meaningful, actionable insights. This integration not only simplifies the monitoring process, but also ensures that our teams can easily track, understand, and react to our service’s adherence to the set objectives.

Integration with Grafana

Pyrra not only simplifies the creation of SLOs but also their monitoring, leveraging Grafana’s robust visualization capabilities. The generic metrics it records make it possible to build generic dashboards that automatically pick up every SLO as it is created and processed by Pyrra.

Such dashboards can be found in the SLO Laboratory repository (see here). They are variants of the base dashboards provided by Pyrra (see here).

List of SLOs

This dashboard is your operations command center, presenting a comprehensive list of all your configured SLOs. At a glance, you can discern the current status of each SLO, including objectives, time windows, availability, error budgets, and active alerts. It’s designed to provide immediate insights into the health of your entire system.

SLO Laboratory — List of SLOs Dashboard
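
Under the hood, such a list view can be built from Pyrra’s generic recording rules alone. A minimal sketch of the queries a list panel could use (the panel layout itself is omitted):

# Objective and window of every SLO
pyrra_objective
pyrra_window

# Current availability of every SLO
pyrra_availability

# Remaining error budget of every SLO
(pyrra_availability - on (slo) pyrra_objective) / on (slo) (1 - pyrra_objective)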

SLO Details

Delve into the specifics with the detail dashboard. Here, you’ll find a deep dive into individual SLOs, showcasing the objective, window, real-time availability, and error budget. Historical graphs plot the trajectory of the error budget, request rate, and error rate over time, offering valuable insights into trends and enabling informed decision-making about balancing reliability with feature development.

SLO Laboratory — Details of the Latency SLO
SLO Laboratory — Details of the Availability SLO

Error Budgets Overview

To capture the system’s long-term health, the overview dashboard visualizes the fluctuation of all SLO error budgets across a broad timeline. Ideal for periodic review meetings, it allows teams to assess the state of SLOs collectively and make strategic adjustments to ensure service reliability remains aligned with business goals.

SLO Laboratory — Error Budgets Overview

Wrapping up our discussion on monitoring, it’s clear that Pyrra and Grafana make a great team. Together, they make it easier to keep an eye on our SLOs and help us stay on top of our game. Now, we’re ready to tackle alerting — a key part of keeping services reliable. Pyrra gives us alerting out of the box, allowing us to respond quickly when things don’t go as planned.

Proactive Reliability: Mastering SLO Alerting with Pyrra

In this next section, we will delve into the art and science of alerting on SLOs. Based on the wisdom from Google’s SRE workbook, we’ll explain the intricacies of setting up effective alerting strategies.

We’ll then explore Pyrra’s native alerting capabilities and demonstrate how these can be replicated within Grafana.

This section will not only guide you through configuring alerts, but also illustrate how to maintain a balance between alert sensitivity and actionable responses, ensuring that your teams can maintain the highest standards of service reliability.

Alerting on SLOs according to Google SRE

Before we get into the specifics of setting up alerts based on SLOs, it’s a good idea to give a quick rundown of the core concepts as outlined by Google’s SRE team. For a deeper dive, their work on alerting on SLOs is a must-read.

Google’s SRE workbook outlines that alerting considerations should balance four key attributes:

  • Precision: the proportion of events detected that were significant (i.e. false positives should be minimized).
  • Recall: the proportion of significant events detected (i.e. false negatives should be minimized).
  • Detection time: how long it takes to send notifications.
  • Reset time: how long alerts keep firing after an issue is resolved.

An effective setup for SLO-based alerting aims to identify significant deviations in the error budget, which are captured by monitoring the burn rate: how fast, relative to the SLO, the service consumes its error budget. The burn rate is calculated as:

Burn Rate = Error Rate / (100% - SLO)

A burn rate of 1 means that the error budget is consumed exactly over the SLO’s time window. In other words, for a 99.9% SLO over a 30-day window, a constant error rate of 0.1% would consume the error budget in 30 days.

Another important aspect of the burn rate is that it lets us derive the time to exhaustion of the error budget, effectively answering the question: “How long would it take to breach my SLO if the burn rate stays constant?”

Time to Exhaustion = SLO Window / Burn Rate
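
For example, with a 30-day window and a constant burn rate of 10, the error budget is exhausted in 30 / 10 = 3 days.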

Keeping our example of a 99.9% SLO over a 30-day window, here are a few burn rates with their corresponding error rates and the time it takes to exhaust the SLO budget:

  • Burn rate 1: error rate 0.1%, budget exhausted in 30 days.
  • Burn rate 2: error rate 0.2%, budget exhausted in 15 days.
  • Burn rate 10: error rate 1%, budget exhausted in 3 days.
  • Burn rate 1000: error rate 100%, budget exhausted in about 43 minutes.

It is then possible to model significant events — events where the service’s error budget is consumed at an unexpectedly high rate, which could signal a degradation in service quality — by checking the burn rate over different time windows. The recommendation from Google SRE is to alert on these configurations for a 30-day SLO window:

  • High severity alert: 2% error budget consumption in 1 hour;
  • High severity alert: 5% error budget consumption in 6 hours;
  • Low severity alert: 10% error budget consumption in 3 days.

These configurations allow alerting on sudden, high consumption of the error budget, but also on slow burns of the error budget that are hard to spot over smaller time windows.

To match these configurations with their burn rates, the following calculation is used:

Burn Rate = Error Budget Burned / Proportion of SLO Window Elapsed

For example, for a 30-day SLO window, the burn rate associated with “2% error budget consumption in 1 hour” is:

Burn Rate = 2% / (1 hour / (30 days × 24 hours)) = 0.02 × 720 = 14.4

To avoid false positives, Google also introduces the concept of short and long windows: we only want to alert on the recommended configurations while we are still actively burning through the budget. A short window is added to the previous configurations for that purpose, with a suggested size of 1/12th of the long window. The final configurations for SLO-based alerting on 30-day SLOs are the following:

  • 2% budget consumed: long window 1 hour, short window 5 minutes, burn rate 14.4, high severity (page);
  • 5% budget consumed: long window 6 hours, short window 30 minutes, burn rate 6, high severity (page);
  • 10% budget consumed: long window 3 days, short window 6 hours, burn rate 1, low severity (ticket).

Alerts on these conditions can be expressed as:

burnRate[shortWindow] > threshold ∧ burnRate[longWindow] > threshold
⇔ errorRate[shortWindow] / (100% - SLO) > threshold ∧ errorRate[longWindow] / (100% - SLO) > threshold
⇔ errorRate[shortWindow] > threshold × (100% - SLO) ∧ errorRate[longWindow] > threshold × (100% - SLO)

Implementing these alerts using Prometheus Query Language would look like:

(
  errorRateMetric[shortWindow] > (burnRate * (1 - SLO))
and
  errorRateMetric[longWindow] > (burnRate * (1 - SLO))
)

Let’s implement such an alert (2% of the error budget burned in 1 hour) on our SLO Laboratory REST API, using our Availability SLO, which states that 99% of requests should succeed over a 4-week period.

The corresponding burn rate is:

Burn Rate = 2% / (1 hour / (28 days × 24 hours)) = 0.02 × 672 = 13.44

The error rate over a certain window can be calculated as:

sum(
  rate(
    http_server_requests_seconds_count{
      status=~"5..",
      uri=~"/operators.*"
    }[window]
  )
)
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/operators.*"
    }[window]
  )
)

The full alert would then look like:

(
  sum(
    rate(
      http_server_requests_seconds_count{
        status=~"5..",
        uri=~"/operators.*"
      }[5m]
    )
  )
  /
  sum(
    rate(
      http_server_requests_seconds_count{
        uri=~"/operators.*"
      }[5m]
    )
  )
) > (13.44 * (1 - 0.99))

and

(
  sum(
    rate(
      http_server_requests_seconds_count{
        status=~"5..",
        uri=~"/operators.*"
      }[1h]
    )
  )
  /
  sum(
    rate(
      http_server_requests_seconds_count{
        uri=~"/operators.*"
      }[1h]
    )
  )
) > (13.44 * (1 - 0.99))

We’ve now covered the essentials of alerting on SLOs as Google’s SRE team advises. These strategies hinge on the burn rate — a vital sign of our service’s health. While this methodology is powerful, it’s not without its challenges. Crafting the right queries to monitor SLO burn rates can, once again, be complex, error-prone, and potentially heavy on the monitoring infrastructure. But fear not, as Pyrra comes to our aid, simplifying these complexities and enabling us to focus on what really matters — maintaining the reliability of our services.

Streamlined Alerts: Leveraging Pyrra for Efficient SLO Monitoring

Pyrra takes the heavy lifting out of SLO alerting by integrating with Prometheus’ alerting rules. When setting up a new SLO, Pyrra not only generates recording rules to track error rates across multiple timeframes but also crafts multi-window, multi-burn rate alerts. This automated approach ensures that you’re alerted based on the most relevant data, without the hassle of manual configuration.

For instance, consider the Latency SLO we configured earlier:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: rest-api-latency
  namespace: rest-api
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99.0'
  window: 28d
  indicator:
    latency:
      success:
        metric: http_server_requests_seconds_bucket{uri=~"/operators.*", status=~"2..", le="0.05"}
      total:
        metric: http_server_requests_seconds_count{uri=~"/operators.*", status=~"2.."}

We omitted part of the recording rules in the previous section to avoid introducing too much complexity at once. Pyrra sets up a series of recording rules that keep an eye on error rates across different time windows, such as:

- interval: 30s
  name: rest-api-latency
  rules:
    - expr: (sum(rate(http_server_requests_seconds_count{status=~"2..",uri=~"/operators.*"}[5m]))
        - sum(rate(http_server_requests_seconds_bucket{le="0.05",status=~"2..",uri=~"/operators.*"}[5m])))
        / sum(rate(http_server_requests_seconds_count{status=~"2..",uri=~"/operators.*"}[5m]))
      labels:
        slo: rest-api-latency
      record: http_server_requests_seconds:burnrate5m

This recording rule is evaluated every 30 seconds and computes the error rate of our Latency SLO (the ratio between slow requests and total requests) over 5-minute windows, then saves it in a new time series called http_server_requests_seconds:burnrate5m.

For a 4-week window SLO, Pyrra configures 7 recording rules to pre-compute the error rate over the following time windows:

  • 5 minutes
  • 30 minutes
  • 1 hour
  • 2 hours
  • 6 hours
  • 1 day
  • 4 days
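
Assuming the naming pattern from the 5-minute rule above carries over, these pre-computed series look like:

http_server_requests_seconds:burnrate5m
http_server_requests_seconds:burnrate30m
http_server_requests_seconds:burnrate1h
http_server_requests_seconds:burnrate2h
http_server_requests_seconds:burnrate6h
http_server_requests_seconds:burnrate1d
http_server_requests_seconds:burnrate4d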

These new time series are then used in multi-window multi-burn-rate alerts configured using alerting rules such as the one below:

- alert: ErrorBudgetBurn
  expr: http_server_requests_seconds:burnrate5m{slo="rest-api-latency"} > (14 * (1-0.99))
    and http_server_requests_seconds:burnrate1h{slo="rest-api-latency"} > (14 * (1-0.99))
  for: 2m
  labels:
    exhaustion: 2d
    long: 1h
    severity: critical
    short: 5m
    slo: rest-api-latency

Pyrra sets up four such alerting rules out of the box on our 4-week Latency SLO, pairing the pre-computed windows listed above: 5m/1h and 30m/6h for fast burns (critical severity, with high burn-rate thresholds such as the factor 14 above), and 2h/1d and 6h/4d for slow burns (warning severity, with lower thresholds).

Pyrra’s out-of-the-box alerts are configured following best practices, as discussed earlier, incorporating short and long windows to detect both acute and chronic issues with your service’s SLOs. Labels added to these alerts provide critical context, enabling swift and informed responses to maintain service reliability.

This system ensures you’re not bogged down by complex queries or the resource drain of manual monitoring, allowing you to stay proactive with your SLOs. With Pyrra, you get a high-quality alerting framework that’s both easy to manage and aligned with industry standards.

Adapting to Grafana: Replicating Pyrra’s Alerts in Our Central Tool

While Pyrra and Prometheus Alert Manager work well together, our team prefers using Grafana for all our alerting needs. Thankfully, we’ve found a (albeit slightly hacky) workaround to integrate Pyrra’s alerting capabilities into Grafana. It might be a bit unorthodox, but it keeps Grafana as our central hub for alerts.

Prometheus exposes a useful synthetic metric called ALERTS for the alerting rules it evaluates, and we can query it from Grafana. Monitoring all the alerts set up by Pyrra is as easy as running this query:

ALERTS{
  alertname="ErrorBudgetBurn",
  alertstate="firing"
}

A Grafana-managed alert based on this query can be evaluated every 30 seconds, mirroring Pyrra’s settings. You don’t need a ‘pending period’ longer than 30 seconds in Grafana, as Prometheus already applies the for: duration of the underlying alerting rules before they appear as firing.

The ALERTS metric contains all labels from Pyrra’s alerting rules, including:

  • exhaustion: time to exhaustion of the error budget at the burn rate threshold used in the alert;
  • long: long window;
  • short: short window;
  • severity: critical for fast burns, warning for slow burns;
  • slo: name of the SLO;
  • and any labels you added to your SLOs.

This feature allows for detailed alert management based on the severity label. We typically set up two different alerts in Grafana, one for each severity level, to prioritize our response effectively.
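
As a sketch, that split could look like the queries below, each feeding its own Grafana alert and notification policy:

# Fast burn: page the on-call engineer
ALERTS{alertname="ErrorBudgetBurn", alertstate="firing", severity="critical"}

# Slow burn: notify the team channel or open a ticket
ALERTS{alertname="ErrorBudgetBurn", alertstate="firing", severity="warning"}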

By employing this method, we manage to keep Grafana at the forefront of our alerting strategy, while still benefiting from Pyrra’s sophisticated alert setup.

Reflecting on Pyrra’s Impact in SLO Management

As we wrap up this exploration into the world of SLOs with Pyrra, we’ve uncovered the pivotal role this tool plays in streamlining SLO setup and alerting. Pyrra stands out as a powerful ally in managing the complexities associated with SLOs in a Prometheus environment.

By automating and simplifying tasks that were once laborious, Pyrra enables teams to focus more on service improvement, and less on the intricacies of monitoring and alerting setups. Its integration with Grafana enhances our ability to visualize and respond to SLO performance, solidifying our stance in proactive service reliability management.

Next Steps: Expanding SLO Horizons Beyond REST APIs

Looking ahead, our journey into SLOs doesn’t end here. We’ll delve deeper into more hands-on applications of SLOs in diverse environments, like data processing pipelines, GraphQL APIs, or front ends using Next.js.

Beyond technical implementations, we’ll explore the organizational aspects of SLOs, including iterative refinement processes and strategic approaches to utilizing error budgets. These upcoming articles aim to provide a holistic view of SLOs, covering both their technical applications, and their impact on organizational processes and decision-making.

Stay tuned as we continue to navigate the multifaceted world of Service Level Objectives — uncovering new ways to enhance service reliability and operational efficiency.

This article was written by Adrien Bestel, Principal Ops Engineer @ tb.lx, the digital product studio for Daimler Truck 💚

Read the previous SLO series articles:

🚛🌿 If you’d like to know more about how we work at tb.lx, our company culture, work methodologies, tech stack, and products you can check our website and join our journey in creating the transportation solutions of tomorrow through our social media accounts: LinkedIn, Instagram, Youtube, Twitter/X, Facebook. 💻 🔋
