Don’t Get Lost in the Metrics Maze: A Practical Guide to SLOs, SLIs, Error Budgets, and Toil

Lokesh Aggarwal
4 min read · May 14, 2024


In software development and operations, keeping systems reliable and performant is crucial, but navigating the sea of metrics and practices can feel overwhelming. This blog post cuts through the complexity and offers a practical guide to four key concepts: SLOs (Service Level Objectives), SLIs (Service Level Indicators), Error Budgets, and Toil.

Setting Expectations: SLOs and SLIs

Consider a neighborhood pizza restaurant.

The SLO is their promise: “We’ll deliver your pizza within 30 minutes of ordering, or else it’s free.”

The SLI, in this case, is the measured delivery time for each order. By monitoring the SLI, the restaurant can see whether it is consistently meeting its SLO or giving pizzas away for free.

In the context of Site Reliability Engineering (SRE), a Service Level Objective (SLO) is a specific, measurable target that defines the desired performance of a service. It essentially sets the expectations for how well a service should function from a user’s perspective. It could be the uptime percentage, latency (response time), or error rate.

An SLI is the measurable metric that shows how the service is actually performing against that SLO. Common SLIs include request processing time, error counts, and resource utilization.
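To make the relationship concrete, here is a minimal sketch of an availability SLI computed from request outcomes and compared against an SLO target. The `Request` type, the function names, and the sample data are all hypothetical and chosen only for illustration; a real service would pull these numbers from its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    ok: bool


def availability_sli(requests):
    """Return the fraction of requests that succeeded (0.0 to 1.0)."""
    if not requests:
        # Assumption: a window with no traffic counts as meeting the target.
        return 1.0
    good = sum(1 for r in requests if r.ok)
    return good / len(requests)


def meets_slo(sli, slo_target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Made-up sample: 998 successful requests and 2 failures.
requests = [Request(80.0, True)] * 998 + [Request(450.0, False)] * 2
sli = availability_sli(requests)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

With this sample the measured SLI is 99.8%, which falls just short of a 99.9% target but would satisfy a 99% one.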

Balancing Risk and Innovation: Error Budgets

Think of an error budget as a “spending limit” derived from your SLO. It is the amount of unreliability the SLO tolerates: a 99.9% target, for example, leaves a 0.1% budget for failures. A larger error budget (a looser SLO) allows for more frequent releases but carries a greater risk of user dissatisfaction. Conversely, a smaller error budget prioritizes stability but may hinder development velocity.
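As a rough sketch of the arithmetic, an error budget can be expressed as the number of failed events an SLO tolerates over a window, along with how much of that allowance is still unspent. The function names and the traffic figures below are assumptions made for this example.

```python
def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target leaves no budget to spend at all.
        return 0.0
    return (budget - bad_events) / budget


# Made-up window: one million requests against a 99.9% SLO.
budget = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, bad_events=250)
print(f"Budget: about {budget:.0f} failed requests, {remaining:.0%} remaining")
```

A team might treat a healthy remaining fraction as a signal that it can ship riskier changes, and a negative one as a signal to pause releases and focus on stability.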

Error Budget Examples:

1. E-commerce Platform:

  • SLO: 99.9% uptime during peak shopping seasons (e.g., Black Friday, Cyber Monday).
  • Error Budget: The 0.1% downtime allowance translates to approximately 43 minutes over a 30-day peak season. This leaves room for brief maintenance or unexpected issues without significantly impacting sales.
  • Benefits: The platform can prioritize new features and functionalities during off-peak periods while maintaining stability during crucial shopping events.

2. Social Media App:

  • SLO: Average latency (response time) under 100 milliseconds for core functionalities (e.g., newsfeed updates, messaging).
  • Error Budget: Allows for occasional latency spikes exceeding 100ms for non-critical features (e.g., personalized recommendations) without significantly impacting user experience.
  • Benefits: The app can introduce new features and functionalities that might initially cause minor latency increases while ensuring a smooth experience for core functionalities.

3. Content Delivery Network (CDN):

  • SLO: 99.95% availability of critical content across all geographical regions.
  • Error Budget: Allows for short regional outages affecting a small percentage of users without compromising overall service availability.
  • Benefits: The CDN can implement performance optimizations or introduce new features in specific regions while maintaining high availability for critical content globally.
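For the availability-style SLOs in these examples (99.9% and 99.95%), the downtime allowance depends on the length of the window the SLO is measured over. The sketch below assumes a 30-day window, which is an assumption of this example since the scenarios above do not fix one; `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(slo_target, window):
    """Downtime the availability SLO tolerates over the given window."""
    return timedelta(seconds=(1.0 - slo_target) * window.total_seconds())


# Assumption: the SLO is evaluated over a rolling 30-day window.
window = timedelta(days=30)
for target in (0.999, 0.9995):
    minutes = allowed_downtime(target, window).total_seconds() / 60
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under that assumption, 99.9% works out to roughly 43.2 minutes and 99.95% to roughly 21.6 minutes. Changing `window` shows how quickly the allowance grows or shrinks with the measurement period.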

Finding the Efficiency Sweet Spot: Toil

Toil refers to the manual, repetitive, and time-consuming tasks that often bog down SREs (Site Reliability Engineers) and developers. It can include manual configuration, server maintenance, or troubleshooting recurring issues. Minimizing toil through automation, infrastructure as code, self-healing systems, and well-defined processes frees up valuable time for innovation and proactive problem-solving.


The Golden Signals and Putting it All Together

The Golden Signals (latency, traffic, saturation, and errors) are a set of high-level metrics that provide valuable insights into a service’s overall health. By monitoring these signals, SREs can identify potential issues before they significantly impact users. Golden Signals can be charted on dashboards fed by monitoring systems such as Prometheus.
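As an illustration of what the four signals look like as numbers, the following sketch derives them from a batch of request samples. Everything here is hypothetical: the `Sample` type, the dictionary keys, the nearest-rank percentile helper, and the choice to pass saturation in as a utilization figure (for example, CPU) measured elsewhere.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(samples, window_seconds, utilization):
    """Summarize latency, traffic, errors, and saturation for one window."""
    return {
        "latency_p95_ms": percentile([s.latency_ms for s in samples], 95),
        "traffic_rps": len(samples) / window_seconds,
        "error_ratio": sum(s.is_error for s in samples) / len(samples),
        # Saturation is supplied by the caller, e.g. CPU utilization (0 to 1).
        "saturation": utilization,
    }


# Made-up data: 100 requests in a 60-second window, every 25th one failing.
samples = [Sample(latency_ms=10.0 * i, is_error=(i % 25 == 0)) for i in range(1, 101)]
signals = golden_signals(samples, window_seconds=60, utilization=0.70)
print(signals)
```

A percentile is used for latency rather than a mean because a handful of very slow requests can hide behind a healthy-looking average.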

Platform: Prometheus (open-source platform)

  • Data Collection: Instrument applications so Prometheus can scrape their metrics, and scrape resource utilization data from servers using exporters.
  • Visualization: Use a dashboard tool such as Grafana, either with pre-built Golden Signals dashboards or with custom dashboards driven by PromQL queries, to visualize specific metrics.
  • Alerting: Define alerts based on PromQL expressions that trigger notifications when Golden Signal metrics exceed set thresholds.
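The alerting idea above boils down to comparing each signal with a limit. In Prometheus this would be written as alerting rules with PromQL expressions; the plain-Python stand-in below only illustrates the threshold logic, and the signal names, values, and limits are all made up for the example.

```python
def firing_alerts(signals, thresholds):
    """Return the names of signals whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if signals.get(name, 0.0) > limit
    )


# Hypothetical current readings for one service.
signals = {
    "latency_p95_ms": 950.0,
    "traffic_rps": 1.67,
    "error_ratio": 0.04,
    "saturation": 0.70,
}

# Hypothetical limits; traffic has no threshold in this sketch.
thresholds = {
    "latency_p95_ms": 500.0,
    "error_ratio": 0.01,
    "saturation": 0.80,
}

for name in firing_alerts(signals, thresholds):
    print(f"ALERT: {name} = {signals[name]} exceeds {thresholds[name]}")
```

Here latency and error ratio would fire while saturation stays quiet. In practice the limits are best derived from the SLO rather than picked arbitrarily, so that an alert means the error budget is actually at risk.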

Here’s how these concepts work together:

  1. Define your SLOs: Clearly outline the performance expectations for your service.
  2. Choose relevant SLIs: Select measurable metrics that track progress towards your SLOs.
  3. Set your error budget: Determine the acceptable deviation from your SLOs.
  4. Minimize toil: Automate tasks and streamline processes to improve efficiency.
  5. Monitor Golden Signals: Proactively identify and address potential issues.

Infrastructure and Observability Metrics Table:

By implementing these practices, you can move beyond a metrics maze and gain a clear understanding of your system’s health. This data-driven approach allows you to make informed decisions for optimizing performance, minimizing downtime, and delivering a consistently reliable experience for your users.

Remember: Don’t get lost in the details. Start by focusing on a few key metrics and gradually refine your approach as you gain experience. By understanding SLOs, SLIs, error budgets, and toil, you can empower your team to build and maintain high-performing, user-centric systems.


Lokesh is a technology blogger with a passion for Artificial Intelligence (AI), Applications, and Service and Program Management in the enterprise world.