Second Part: The SLOs Playbook
From indicator selection to alert management
In the first installment of this series, Navigating Service Level Objectives, we explored the foundational aspects of SLOs, from their importance in software engineering to their pervasive impact on different roles within an organization. Now, it’s time to get our hands dirty and focus on implementation.
If you don’t remember what you read, or you’re just too impatient and want to jump right into implementing SLOs, here’s a quick recap of the key takeaways:
- What are SLOs?: Service Level Objectives act as performance benchmarks for services.
- The Need for SLOs: They align goals across various organizational roles, fostering better teamwork and business outcomes.
- Customer-Centric Approach: SLOs enable a focus on measurable user-experience metrics.
- Data-Driven Decision-Making: SLOs facilitate quantitative evaluation, helping prioritize tasks and optimize resources.
- Incident Management: SLOs act as the backbone for classifying and handling incidents, thus mitigating risks.
In this article, we’re diving deeper into both the “what” and the “how” of SLOs. The first part of this article is primarily aimed at Product Managers, while the second part is for the Engineers in the house. We think it’s important to get a full view of what SLOs mean from different perspectives, and we believe it’s crucial that no real silos exist between these roles.
Together, we’ll navigate everything from Service Level Indicator selection to SLO calibration and error budgets, getting hands-on with Grafana and Prometheus along the way, and setting the stage for the intricacies of Multi-Window Burn Rate Alerting later in the series. Let’s roll up our sleeves and get down to business.
Ok, We Promised Hands-On, So Let’s Get to It
We know you’re here for the actionable stuff, so let’s dive straight into a real-world example: a simplified fleet management system.
For this article, we’re keeping it simple with a basic REST API that we’ll use as our guinea pig for setting up SLIs, SLOs, and even error budgets. Why focus on a REST API, you ask? Because it’s one of the most commonly monitored applications and provides a solid foundation for learning SLOs.
You can clone this system from a public GitHub repository, which provides a local environment complete with the API and the essential monitoring tools.
But don’t worry, we won’t stop there. In our upcoming articles, we’ll evolve our little fleet management system into something more complex. We’ll introduce you to SLOs in different contexts like GraphQL, Data Processing Pipelines, Front End technologies, and more.
This basic REST API is our first step in a journey that will span multiple technologies and use-cases. Let’s get started!
Choosing the Right SLI
We’re not reinventing the wheel here; much of our insights come from the Art of SLOs workshop material made available by Google, which you can find here.
The selection of the right Service Level Indicator (SLI) is all about finding a balance between technical metrics and the user’s actual experience. In this section, we’ll guide you through the process of identifying the most relevant metrics, starting with a manageable scope, and iteratively refining your SLIs. Along the way, we’ll use resources and examples from the Art of SLOs to illustrate our points.
Mapping Metrics to User Satisfaction
In the digital experience, user satisfaction often hinges on the seamless functioning of the service they are accessing. However, since satisfaction is a subjective measure, we approximate it with tangible metrics.
To the end user, the intricacies of our service’s architecture are inconsequential. Whether they’re using a website or an API, their experience is shaped by the responsiveness and reliability of the service they interact with. A well-chosen SLI quantifies these user-facing aspects using precise monitoring data.
Consider the comparative illustration below, which depicts two potential metrics for SLI consideration:
On the left, we observe a metric characterized by high variability, lacking a clear correlation with service disruptions. Establishing a threshold for such a metric could inadvertently lead to frequent false positives, diluting the utility of our monitoring efforts.
In contrast, the metric on the right presents a distinct deviation that aligns with the service outage, providing a direct indication of an issue likely to impact user satisfaction. This correlation simplifies the process of threshold definition, enabling us to develop a more reliable and actionable SLI.
In essence, the task at hand is to identify metrics that not only signal an outage but also reflect the service’s performance from a user’s perspective. The right SLI acts as a barometer for user satisfaction, guiding us towards maintaining a high-quality service experience.
The SLI Journey: Beginning with the Basics
Users engage with your services to fulfill specific objectives, making it essential for your SLIs to accurately measure their interactions in pursuit of these goals. These sequences of interactions, commonly termed “user journeys,” should be the compass guiding your selection of SLIs.
To maintain clarity and focus, it’s vital to start with a small, carefully chosen set of SLIs that resonate most with the critical user journeys. An overly complex set from the outset can overwhelm and send mixed signals to the teams responsible for monitoring service health. A simple beginning allows for a sharp focus on key aspects of the user experience.
Crucially, the development of SLIs is not a one-off task but an iterative journey. Continuously harness data and feedback to fine-tune your SLIs, ensuring they stay in step with both the user experience and service performance. This ongoing process of refinement is indispensable in upholding a service that consistently aligns with user expectations and maintains high quality.
Case Study: Defining SLIs for Our REST API
To wrap up the discussion so far, consider these guiding questions to anchor your SLI definitions:
- What reliability does the user expect from this service?
- How can we measure the user’s experience against those expectations through our monitoring?
- In what ways does the user engage with the service?
Refer to the SLI menu illustrated below as a starting point for pinpointing the SLIs that best match a specific user journey:
Recall our earlier introduction to a basic REST API powering our Fleet Management System. Given its “request/response” nature, we’re in a good position to craft some fitting SLIs.
For availability, we can measure the REST API’s health by calculating the ratio of successful requests to the total number of requests — a straightforward count of non-5xx responses is what we’re after here. This is a standard metric typically provided out-of-the-box by web servers.
When it comes to latency, we’ll gauge the time from when a request is received to when a response is dispatched. It’s common practice to exclude the latency of erroneous responses — like those caused by client errors or server timeouts — since they often don’t reflect the system’s true responsiveness. Again, this is a metric web servers commonly track.
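Both SLIs boil down to the same “good events over valid events” ratio. As a minimal sketch before we wire them into Prometheus later in this article (the 250 ms threshold is purely illustrative, not a value prescribed by the system):
availability SLI = responses without a 5xx status / all responses
latency SLI = responses served within 250 ms / all successful responses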
Calibrating SLOs: Balancing Aspirations with Reality
With the right Service Level Indicators (SLIs) in place, we now turn our attention to the natural progression in our reliability framework: defining Service Level Objectives (SLOs).
SLOs translate our technical metrics into business meaning, setting the bar for service performance that aligns with both user expectations and business priorities. But setting SLOs isn’t a ‘set-and-forget’ task; it’s an ongoing cycle of calibration. As our services and user needs evolve, so too must our SLOs.
Here, we’ll guide you through the iterative process of aligning your aspirational targets with achievable performance, ensuring your SLOs are always tuned to the current realities of your service.
Setting Aspirational SLOs: Aligning with Business Objectives
Service reliability isn’t just a technical issue — it’s a business imperative. The performance levels we set as our objectives should, ideally, mirror what’s necessary for the business to thrive. Striking the right balance is essential: over-achieving reliability might mean we’re not innovating or shipping features as fast as we could, while falling short could risk losing our user base to competitors.
It’s about understanding the “just-right” level of service reliability — what we call “aspirational SLOs.” These targets are ambitious and may not be immediately attainable, but they’re what the business aims for in the long run. They’re shaped by careful consideration of user expectations, market competition, and our capability to innovate and grow.
Aspirational SLOs are, by their nature, a stretch. They’re goals that challenge our engineering, product, and operational teams to push the envelope on what our service can deliver. But they’re also grounded in the reality of our business strategy — they represent a level of service that we believe will keep customers satisfied and engaged over time.
In setting these SLOs, we’re not just aiming for numbers — we’re aligning our service’s trajectory with our business’s path to success. It’s a collaborative effort, requiring input from across the organization to ensure that the SLOs we aspire to truly reflect the performance our business needs.
Setting Realistic SLOs: Learning from the Past
Historical performance data shapes user expectations and provides a factual basis for “achievable SLOs.” These SLOs reflect what users have experienced and have grown to expect from our service. By aligning our objectives with this data, we set targets that are ambitious yet within the realm of what we know can be achieved.
However, we must be cautious. Past performance is a guide, not a guarantee. As user needs and service landscapes evolve, our SLOs must adapt. The difference between past achievements and future aspirations marks the territory for growth. If historical performance falls short of business aspirations, it’s a sign to enhance our service.
In setting SLOs, we use the past as a launchpad, not an anchor, allowing us to plot a course towards continued excellence that resonates with our users and supports our business goals.
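If your monitoring already exposes request metrics, a quick way to ground an achievable availability target is to measure what the service actually delivered over the last few weeks. Here’s a minimal sketch using the same Spring Boot metric and /operators endpoints from the hands-on section below; the 28-day lookback is just an example and assumes your Prometheus retention covers it:
sum(rate(http_server_requests_seconds_count{uri=~"/operators.*", status!~"5.."}[28d]))
/
sum(rate(http_server_requests_seconds_count{uri=~"/operators.*"}[28d]))
If this comes back at, say, 99.7%, committing to 99.99% straight away is wishful thinking; something like 99.5%, with a plan to tighten it as the service improves, is a more honest starting point.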
Iterating SLOs: A Cycle of Continuous Improvement
The journey of refining Service Level Objectives is never truly complete. As our service evolves and user expectations shift, our SLOs must keep pace, marking a cycle of continuous improvement. The objective isn’t to rewrite the targets with every minor change, but to ensure they represent a true north that guides decision-making and resource allocation.
When we talk about SLOs, we’re bridging the gap between two kinds of targets: the aspirational, where we aim to satisfy and delight users, and the achievable, based on what we’ve consistently delivered. But how do we know if we’re on the right track? The key lies in feedback loops. Regular reviews of SLOs, informed by user feedback mechanisms like support tickets, satisfaction surveys, and engagement metrics, provide tangible evidence of where our service stands in the eyes of the users.
These reviews shouldn’t be arbitrary; they should be systematic and scheduled, with the frequency adjusted according to how well-established the SLOs are and how rapidly the service is changing. For a new service or after a significant update, you might review monthly. For more mature services, a quarterly or even annual review could suffice. No matter the interval, the goal is to align SLOs more closely with the dual imperatives of business needs and user satisfaction.
This ongoing process ensures SLOs remain relevant and actionable. It’s about fine-tuning, sometimes tightening the targets to push for excellence, other times relaxing them to reflect a broader strategic shift. By making this a regular practice, you ensure that SLOs continue to serve their purpose as effective tools for maintaining and enhancing the quality and reliability of your service.
Implementing SLOs: A Practical Guide with Grafana and Prometheus
It’s now time to roll up our sleeves and implement our own Service Level Objectives. This section is where theory meets practice, where concepts turn into actionable steps. For that, we’ll use the simple fleet management system repository that we introduced earlier.
Setting the stage: the SLO laboratory
Before we dive in, let’s lay the groundwork. To bring our SLO journey to life, we need a real-world playground. Enter our simple fleet management system, a project designed not just to demonstrate but to demystify the practicalities of SLO implementation.
The project setup is straightforward, requiring nothing more than Docker and Docker Compose. Once up and running, you’ll have a fully functional environment comprising:
- A monitoring stack with Prometheus and Grafana, the go-to choice in a cloud-native ecosystem.
- A REST API, built on Spring Boot, backed by a PostgreSQL database, serving as the heart of our fleet management system.
- k6 for load testing, simulating traffic to ensure our system produces relevant metrics for us to build SLOs on.
To get started, a few simple commands are all it takes:
# Kickstart the monitoring stack
docker compose up -d prometheus
docker compose up -d grafana
# Fire up the REST API
docker compose up -d rest-api --build
# Launch load tests against the REST API
docker compose run --rm k6
And voilà! You’re all set. Now, let’s head over to http://localhost:3000/d/e515d16f-4025-4bb2-bcdb-7d4d5978d92b/rest-api-monitoring to see what’s cooking.
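Before building any SLOs, it’s worth a quick smoke test to confirm the API is answering and that metrics are flowing. The commands below assume the API is published on Spring Boot’s default port 8080 and exposes the standard Actuator Prometheus endpoint; check the repository’s docker-compose.yml if your setup differs:
# Hit one of the endpoints we'll build SLOs around (port is an assumption)
curl -s http://localhost:8080/operators
# Confirm the request metrics are exposed for Prometheus to scrape
curl -s http://localhost:8080/actuator/prometheus | grep http_server_requests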
Building an Availability SLO
Now, let’s get practical and create an Availability SLO for our REST API. Availability, simply put, is about how often our service is up and running smoothly. We’ll define it as the rate of successful responses (those without 5xx errors) compared to all requests.
Step 1: Error Rate
We use Prometheus for monitoring, and it’s already tracking every request our API handles. The key metric we’re interested in is http_server_requests_seconds_count (the name is specific to Spring Boot and Micrometer, but most web frameworks expose an equivalent metric out of the box). It tracks request counts, broken down by HTTP status code among other labels.
Here’s the Prometheus query to calculate the error rate (5xx responses):
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status=~"5.."
}[1d]
)
) or vector(0)
This query sums the per-second rate of 5xx responses across the endpoints starting with /operators (the endpoints that matter for our example). The or vector(0) fallback makes the query return 0 instead of an empty result when there are no errors in the window, which keeps ratios and dashboards well-behaved.
Step 2: Success Rate
To get the success rate, we’ll modify the query to count non-5xx responses:
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status!~"5.."
}[1d]
)
) or vector(0)
Step 3: Availability
Availability is simply the ratio of successful requests to all requests. Here’s how we calculate it:
(
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status!~"5.."
}[1d]
)
) or vector(0)
)
/
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*"
}[1d]
)
)
This gives us a percentage showing how often our API successfully handles requests: for example, 995 successful requests out of 1,000 total yields 99.5% availability.
Step 4: Error Budget
An important aspect of managing SLOs is understanding and tracking the error budget. The error budget is the margin of unreliability our SLO target allows: with a 99% availability target, up to 1% of requests may fail before the objective is breached.
What we really want to watch is how much of that budget is left. We get that by normalizing the gap between measured availability and the SLO target against the total budget. Here’s the formula in Prometheus query language, where ${slo} is the availability target as a percentage (e.g. 99), defined as a Grafana dashboard variable:
(
(
(
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status!~"5.."
}[1d]
)
) or vector(0)
)
/
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*"
}[1d]
)
) or vector(0)
) * 100 - ${slo}
)
/
(100 - ${slo})
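To make the formula less abstract, here’s a quick worked example with illustrative numbers. With an SLO target of 99% and a measured availability of 99.5% over the window:
(99.5 - 99) / (100 - 99) = 0.5, i.e. 50% of the error budget remains
At exactly 99% availability the result drops to 0 (budget exhausted), and below the target it turns negative, which makes an SLO breach easy to spot on the dashboard.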
Step 5: Visualize in Grafana
Finally, let’s bring our SLO data to life with a Grafana dashboard. By visualizing our availability metrics and error budget, we can easily monitor our SLOs and quickly identify when we’re at risk of breaching them.
Step 6: Simulating an Outage
Testing our SLOs under real-world scenarios is essential, and there’s no better way than simulating an actual outage. Let’s see what happens when we pull the plug on our database, imitating a service disruption.
When the database goes down, our dashboard springs to life, showcasing the SLO metrics in action. Here’s the breakdown:
- Success vs. Error Rate: The graphs illustrate an instant surge in error rates, depicted by the spike. This correlates with a corresponding plunge in the success rate, visually validating our outage simulation.
- Availability Drop: As expected, the availability metric takes a dive, slipping below our SLO target line. The speed and depth of the drop-off give us insights into the severity of the outage.
- Error Budget Consumption: The error budget graph shows a sharp decline, indicating the consumption of our SLO’s error budget. This visual cue is critical for understanding the impact of the outage on our SLOs.
The impact of this outage is depicted over three different time windows:
- Instantaneous Impact: The left column of our dashboard offers an ‘instant’ snapshot of our service’s health. As the database halts, we see an immediate spike in error rates, and our success rate tumbles, indicating a clear and present outage.
- Short-Term Observation: The middle column extends our view to a 1-hour window. The spike is less pronounced but still clearly visible, showing how a short-lived outage affects service perception over the last hour.
- Long-Term Resilience: The right column looks even further, over a full day. This perspective smooths out the short-term blip, providing context for the outage’s impact on a day’s worth of traffic.
SLOs are often set over more extended periods, such as 4 weeks, to ensure that brief outages don’t disproportionately consume the error budget. This approach helps maintain a balance, preventing a single incident from triggering an undue SLO breach.
Building a Latency SLO
After ensuring our service remains accessible, we now focus on its performance under load with a Latency SLO. Latency SLOs measure the time between a client’s request and the server’s response, ensuring that most requests are fast — a crucial aspect of user experience. Unlike availability, which is binary, latency is a spectrum. Our goal is to keep most requests under a certain threshold, defining the line between fast and slow responses.
Step 0: Generating Metrics from Spring Boot
Spring Boot does not expose latency metrics as histograms by default, but it can be easily enabled in the configuration:
management:
  metrics:
    distribution:
      percentiles-histogram[http.server.requests]: true
      slo[http.server.requests]: 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s
The key metric we’re interested in is http_server_requests_seconds_bucket.
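To get a feel for what this histogram looks like, here’s roughly what the exposed samples could look like once the configuration above is in place (the label values and counts are illustrative):
http_server_requests_seconds_bucket{method="GET",uri="/operators",status="200",le="0.1"} 1423
http_server_requests_seconds_bucket{method="GET",uri="/operators",status="200",le="0.25"} 1918
http_server_requests_seconds_bucket{method="GET",uri="/operators",status="200",le="+Inf"} 1967
http_server_requests_seconds_count{method="GET",uri="/operators",status="200"} 1967
Because the buckets are cumulative, the series with le="0.25" counts every request that completed in 250 ms or less, which is exactly the property the queries below rely on.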
Step 1: “Fast” Request Rates
With Prometheus, we define “fast” as any request completed within our target latency. We capture these with a query that counts requests falling below our SLO’s threshold, using the histogram’s cumulative le label (${latency_slo_sec} is the latency threshold in seconds, defined as a dashboard variable, and it must match one of the bucket boundaries configured above):
sum(
rate(
http_server_requests_seconds_bucket{
uri=~"/operators.*",
status=~"2..",
le="${latency_slo_sec}"
}[1d]
)
) or vector(0)
Step 2: “Total” Request Rates
To build the denominator of our ratio, we count all successful (2xx) requests, regardless of how long they took:
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status=~"2.."
}[1d]
)
) or vector(0)
Step 3: Availability
We define latency availability as the ratio of fast requests to total successful requests, giving us a percentage that reflects our adherence to the Latency SLO.
(
sum(
rate(
http_server_requests_seconds_bucket{
uri=~"/operators.*",
status=~"2..",
le="${latency_slo_sec}"
}[1d]
)
) or vector(0)
)
/
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status=~"2.."
}[1d]
)
) or vector(0)
Step 4: Error Budget
The error budget for latency tells us how many slow requests we can still tolerate before breaching the SLO. It’s a crucial buffer, allowing for occasional latency spikes without immediately violating our objective. The query mirrors the availability error budget, with ${latency_slo_percent} holding the target percentage of fast requests:
(
(
(
sum(
rate(
http_server_requests_seconds_bucket{
uri=~"/operators.*",
status=~"2..",
le="${latency_slo_sec}"
}[$__rate_interval]
)
) or vector(0)
)
/
sum(
rate(
http_server_requests_seconds_count{
uri=~"/operators.*",
status=~"2.."
}[$__rate_interval]
)
) or vector(0)
) * 100 - ${latency_slo_percent}
)
/
(100 - ${latency_slo_percent})
Step 5: Simulating a Slowdown
To test our Latency SLO, we simulate an intentional slowdown by inserting a significant amount of data into our database. This provides a realistic scenario to assess how our service copes with increased load and how it affects user experience.
The screenshot shows our Grafana dashboard during the slowdown. The three columns represent different time frames: instant, 1 hour, and 1 day, just like our Availability SLO. We can observe:
- Instantaneous Latency Spike: A sharp increase in latency is immediately visible, indicating the impact of our simulated database load.
- 1-Hour Trend: Over an hour, the latency graph smooths but remains elevated, providing a clear indication of the system’s performance under stress.
- 1-Day Overview: On a day-long scale, the impact of the slowdown is less pronounced, demonstrating the importance of considering appropriate SLO time windows to avoid overreacting to temporary issues.
This latency test, shown through different time windows, emphasizes the need for SLOs that reflect real user experiences and manage expectations over reasonable periods. It ensures we’re alerted to significant trends rather than temporary peaks, avoiding alarm fatigue and keeping the focus on sustained performance.
The journey so far
As we reach the conclusion of this article, let’s reflect on the path we’ve traversed. We’ve demystified the core concepts of SLOs, taking them from abstract principles to concrete, actionable metrics. We’ve seen how SLIs form the bedrock of our SLOs, providing us with quantifiable measures of user satisfaction and system reliability.
We’ve gone hands-on, using real-world tools to set up and observe our service’s availability and latency, crafting SLOs that are not just theoretical ideals but practical standards to gauge our performance. We’ve touched on the importance of error budgets, a critical tool in our reliability arsenal, which gives us the leeway to innovate and improve without the constant fear of breaching our SLOs.
Through the use of Prometheus and Grafana, we’ve shown that managing SLOs can be a clear and systematic process, providing us with the visibility and insights we need to make informed decisions about our service. But as with any journey of improvement, there’s more ground to cover.
As we delve deeper, several questions arise that we need to tackle:
- Alerting on SLOs: Knowing when your service is breaching its SLOs is critical. How do we set up effective alerting that notifies us before our error budget is exhausted, allowing us to act preemptively rather than reactively?
- Optimizing Queries: Our current Prometheus queries serve their purpose but at the cost of complexity and repetition. Is there a way to streamline these to make them more efficient and maintainable?
- Handling Large Time Windows: When dealing with SLOs over extended periods, such as 4 weeks, we hit a snag if our Prometheus instance only retains data for 30 days. How do we manage and visualize SLOs over large time windows without losing historical context?
These challenges pave the way for our next article. We’ll delve into advanced strategies for alerting on SLOs, including multi-window burn rate alerting, which offers a nuanced view of service health over different timeframes.
Additionally, we’ll introduce Pyrra, a tool designed to simplify the SLO journey. Pyrra helps streamline the creation and management of SLOs, making it easier to handle complex queries and maintain long-term SLO tracking, even with limited data retention.
Stay tuned as we continue to navigate the nuances of SLOs, equipping you with knowledge and tools to ensure your services remain reliably high performing, no matter what challenges lie ahead.
This article was written by Adrien Bestel, Principal Ops Engineer @ tb.lx, the digital product studio for Daimler Truck 💚
Read the other SLO series articles:
- First Part: “Navigating Service Level Objectives Series: A practical guide to reliability in tb.lx’s transportation world”
- Third Part: “The SLO Toolkit”
🚛🌿 If you’d like to know more about how we work at tb.lx, our company culture, work methodologies, tech stack, and products you can check our website and join our journey in creating the transportation solutions of tomorrow through our social media accounts: LinkedIn, Instagram, Youtube, Twitter/X, Facebook. 💻 🔋