Service Level Indicators in Practice

How well is your system working, right now?

Stephen Thorne
8 min read · May 11, 2017

Do you even know how well your system is working? How many errors are you serving? Would you even know if 1% of your requests time out?

Service level indicators are the single most important piece you need in order to apply SRE principles. Even if you think you have them, they may not be of high enough quality to accurately gauge your customers’ experience, and if they aren’t, they will mislead you.

Indicators in Practice

This is from Chapter 4 of the SRE book: Service Level Objectives.

Given that we’ve made the case for why choosing appropriate metrics to measure your service is important, how do you go about identifying what metrics are meaningful to your service or system?

In my series of posts about the SRE book, I have already written commentary on the need for error budgets and on the risk tolerance of services. Now we come to something extremely important: defining exactly how we measure the system’s performance.

What Do You and Your Users Care About?

You shouldn’t use every metric you can track in your monitoring system as an SLI; an understanding of what your users want from the system will inform the judicious selection of a few indicators. Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter, while choosing too few may leave significant behaviors of your system unexamined. We typically find that a handful of representative indicators are enough to evaluate and reason about a system’s health.

Once you start monitoring your first few metrics, it is very easy to end up measuring too many: more than you will ever look at. I have joined ‘mature’ teams with many, many dashboards, all showing a variety of increasingly complicated metrics, yet been unable to answer the question “Is my service healthy right now?” with any accuracy.

Services tend to fall into a few broad categories in terms of the SLIs they find relevant:

  • User-facing serving systems, such as the Shakespeare search frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?

The simpler a user-facing system is, the easier this is. Often you need to split some data out and account for it differently, such as logged-in vs. unauthenticated requests, and to treat read-only and data-mutating requests separately. (A small sketch of these user-facing SLIs follows this list of categories.)

Often you have very different latency requirements depending on the request: downloading an image should always be fast, but it’s okay for your PDF report generator to take a minute.

  • Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it? See Data Integrity: What You Read Is What You Wrote for an extended discussion of these issues.

Durability is the big new concept here. This is one of those places in large-scale serving where we really do think about ‘extra nines’. For instance, it’s okay for your service to be down an hour a month, but losing a few gigabytes of your users’ data is unthinkable.

  • Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion? (Some pipelines may also have targets for latency on individual processing stages.)

Pipelines are a great way of doing batch processing, and measuring how well your batch processing is going can’t be based on “Did this request work?”: provided the pipeline is running correctly, all work will eventually be done, and even if it breaks we have time to fix it without user pain. So we have to think about it in a different way.

  • All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done? Correctness is important to track as an indicator of system health, even though it’s often a property of the data in the system rather than the infrastructure per se, and so usually not an SRE responsibility to meet.

It’s certainly the SRE’s responsibility to roll back the release if there’s a correctness problem. But then we hand it back to the developers and ask them to make sure it’s correct in the future.
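To make the user-facing case concrete, here’s a minimal sketch of how availability, latency, and throughput might be computed from a window of request records. The record fields (ok, latency_ms, authenticated) and the 99th-percentile target are invented assumptions for illustration, not anything the book prescribes.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool            # did we respond successfully to the request?
    latency_ms: float   # how long it took to respond
    authenticated: bool # logged-in vs. unauthenticated traffic

def user_facing_slis(requests: list[Request], window_s: float) -> dict:
    """Availability, p99 latency, and throughput over one measurement window."""
    if not requests:
        return {"availability": None, "p99_latency_ms": None, "throughput_rps": 0.0}
    availability = sum(r.ok for r in requests) / len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return {
        "availability": availability,
        "p99_latency_ms": p99,
        "throughput_rps": len(requests) / window_s,
    }

# The same function can be run over slices of traffic (e.g. only
# authenticated requests, or only data-mutating requests) to report
# those populations separately, as suggested above.
```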

Collecting Indicators

Many indicator metrics are most naturally gathered on the server side, using a monitoring system such as Borgmon (see Practical Alerting from Time-Series Data) or Prometheus, or with periodic log analysis — for instance, HTTP 500 responses as a fraction of all requests. However, some systems should be instrumented with client-side collection, because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics. For example, concentrating on the response latency of the Shakespeare search backend might miss poor user latency due to problems with the page’s JavaScript: in this case, measuring how long it takes for a page to become usable in the browser is a better proxy for what the user actually experiences.

For most web serving systems, the load balancer in front of your web servers is an excellent place to collect data. The load balancer is technically the ‘client’ of your web servers, so you can do ‘client-side’ monitoring there.
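As a concrete illustration of the log-analysis option mentioned above, here’s a small sketch that computes HTTP 5xx responses as a fraction of all requests from load-balancer access logs. The log format (a common-log-style line with the status code right after the quoted request line) is an assumption, not any particular product’s format.

```python
import re

# Assumed format: ... "GET /path HTTP/1.1" 502 1234 ...
# i.e. the three-digit status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

def error_ratio(log_lines) -> float:
    """Fraction of requests that returned a 5xx status."""
    total = errors = 0
    for line in log_lines:
        m = STATUS_RE.search(line)
        if not m:
            continue  # skip lines we can't parse
        total += 1
        if m.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0
```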

You can measure the availability of your load balancer with a cheap, high-frequency prober in order to detect whole-site downtime or elevated latency.
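Such a prober can be very simple. The sketch below issues an HTTP GET, records success and latency, and sleeps; the probe URL, timeout, and interval are placeholders, and a real prober would export these measurements to your monitoring system rather than print them.

```python
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"  # placeholder target
TIMEOUT_S = 2.0
INTERVAL_S = 10.0

def probe_once(url: str) -> tuple[bool, float]:
    """Return (success, latency_seconds) for a single probe."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        ok = False  # timeouts and HTTP errors both land here
    return ok, time.monotonic() - start

if __name__ == "__main__":
    while True:
        ok, latency = probe_once(PROBE_URL)
        # In practice, export these as metrics; printing is just for the sketch.
        print(f"probe ok={ok} latency_ms={latency * 1000:.1f}")
        time.sleep(INTERVAL_S)
```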

Aggregation

For simplicity and usability, we often aggregate raw measurements. This needs to be done carefully.

Some metrics are seemingly straightforward, like the number of requests per second served, but even this apparently straightforward measurement implicitly aggregates data over the measurement window. Is the measurement obtained once a second, or by averaging requests over a minute? The latter may hide much higher instantaneous request rates in bursts that last for only a few seconds. Consider a system that serves 200 requests/s in even-numbered seconds, and 0 in the others. It has the same average load as one that serves a constant 100 requests/s, but has an instantaneous load that is twice as large as the average one. Similarly, averaging request latencies may seem attractive, but obscures an important detail: it’s entirely possible for most of the requests to be fast, but for a long tail of requests to be much, much slower.

The problem of requests coming in bursts much shorter than your sampling interval is a real one, but not one you should worry about monitoring continuously. It’s fine to measure only every 30 seconds to every few minutes, provided you can turn on some kind of logging later to check for this kind of dysfunction.

Just make sure your measuring interval and your error budget are compatible. You shouldn’t measure every 5 minutes if your error budget is less than 30 seconds a month!
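To put rough numbers on that compatibility check, here’s a back-of-the-envelope sketch; the SLO targets are illustrative, not recommendations.

```python
# Downtime allowed per 30-day month for a given availability SLO,
# compared with how coarse your measurement interval can afford to be.
MONTH_S = 30 * 24 * 3600

for slo in (0.999, 0.9999, 0.99999):
    budget_s = MONTH_S * (1 - slo)
    print(f"{slo:.3%} SLO -> {budget_s:,.0f}s of error budget per month")

# 99.999% leaves roughly 26 seconds per month: a 5-minute (300s)
# measurement interval could miss an outage ten times larger than
# the entire budget.
```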

Most metrics are better thought of as distributions rather than averages. For example, for a latency SLI, some requests will be serviced quickly, while others will invariably take longer — sometimes much longer. A simple average can obscure these tail latencies, as well as changes in them. Figure 4–1 provides an example: although a typical request is served in about 50 ms, 5% of requests are 20 times slower! Monitoring and alerting based only on the average latency would show no change in behavior over the course of the day, when there are in fact significant changes in the tail latency (the topmost line).

Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes: a high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value, while using the 50th percentile (also known as the median) emphasizes the typical case. The higher the variance in response times, the more the typical user experience is affected by long-tail behavior, an effect exacerbated at high load by queuing effects. User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values, on the grounds that if the 99.9th percentile behavior is good, then the typical experience is certainly going to be.

Critically: a high 99.9th-percentile latency might translate directly into 0.1% of queries simply exceeding their deadline and failing. So be careful about what you’re measuring!
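The gap between the mean and the tail is easy to demonstrate with synthetic data. The numbers below are invented, loosely echoing the “typical request around 50 ms, a small slice much slower” shape described in the quote; the only point is that the average hides the slow slice.

```python
import random
from statistics import mean, quantiles

random.seed(1)
# Synthetic latencies: roughly 95% fast (~50 ms) and 5% slow (~1000 ms).
latencies = [
    random.gauss(50, 5) if random.random() < 0.95 else random.gauss(1000, 100)
    for _ in range(100_000)
]

cuts = quantiles(latencies, n=1000)          # 999 cut points
p50, p99, p999 = cuts[499], cuts[989], cuts[998]
print(f"mean={mean(latencies):.0f}ms  p50={p50:.0f}ms  "
      f"p99={p99:.0f}ms  p99.9={p999:.0f}ms")
# The mean (~98 ms) looks unremarkable; only the high percentiles
# reveal that a slice of users is waiting around a full second.
```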

A Note on Statistical Fallacies

We generally prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. Doing so makes it possible to consider the long tail of data points, which often have significantly different (and more interesting) characteristics than the average. Because of the artificial nature of computing systems, data points are often skewed — for instance, no request can have a response in less than 0 ms, and a timeout at 1,000 ms means that there can be no successful responses with values greater than the timeout. As a result, we cannot assume that the mean and the median are the same — or even close to each other!

We try not to assume that our data is normally distributed without verifying it first, in case some standard intuitions and approximations don’t hold. For example, if the distribution is not what’s expected, a process that takes action when it sees outliers (e.g., restarting a server with high request latencies) may do this too often, or not often enough.

Something we see when we start measuring latencies is a bimodal or trimodal distribution. This is often caused by picking a poor SLI. Perhaps you have a service that does two things: one is fast (50 ms), one is slow (300 ms), and the slow one is used 5% of the time. This means that your median and 90th-percentile latencies will hover around 50 ms. But as soon as someone does a batch of ‘slow’ operations, you’ll see your 90th-percentile latency skyrocket to 300 ms, six times slower!

But in fact your SLI is then really measuring whether the fraction of ‘slow’ operations your customers are doing stays below roughly one in ten; it isn’t measuring customer pain at all!
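A quick simulation with the made-up numbers from that example shows the effect: the 90th percentile sits on the fast mode until slow operations exceed roughly one request in ten, then jumps straight to the slow mode.

```python
from statistics import quantiles

def p90(fast_ms: float, slow_ms: float, slow_fraction: float, n: int = 10_000) -> float:
    """90th percentile of a synthetic two-mode latency distribution."""
    slow_count = int(n * slow_fraction)
    samples = [slow_ms] * slow_count + [fast_ms] * (n - slow_count)
    return quantiles(samples, n=10)[8]  # 9 cut points; index 8 is the 90th

for frac in (0.05, 0.09, 0.11, 0.20):
    print(f"slow fraction {frac:.0%}: p90 = {p90(50, 300, frac):.0f} ms")
# p90 stays at 50 ms until slow operations pass ~10% of traffic, then
# jumps straight to 300 ms: a 6x change with no change in either
# operation's actual latency.
```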

Standardize Indicators

We recommend that you standardize on common definitions for SLIs so that you don’t have to reason about them from first principles each time. Any feature that conforms to the standard definition templates can be omitted from the specification of an individual SLI, e.g.:

  • Aggregation intervals: “Averaged over 1 minute”
  • Aggregation regions: “All the tasks in a cluster”
  • How frequently measurements are made: “Every 10 seconds”
  • Which requests are included: “HTTP GETs from black-box monitoring jobs”
  • How the data is acquired: “Through our monitoring, measured at the server”
  • Data-access latency: “Time to last byte”

To save effort, build a set of reusable SLI templates for each common metric; these also make it simpler for everyone to understand what a specific SLI means.

Also: display your SLIs in a centralized place. This allows for excellent communication between stakeholders.
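One lightweight way to apply this is to capture the template fields in a shared default that individual SLIs only override where they differ. The structure below is a hypothetical sketch, not an established format or tool.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SLITemplate:
    aggregation_interval: str = "Averaged over 1 minute"
    aggregation_region: str = "All the tasks in a cluster"
    measurement_frequency: str = "Every 10 seconds"
    included_requests: str = "HTTP GETs from black-box monitoring jobs"
    acquisition: str = "Through our monitoring, measured at the server"
    data_access_latency: str = "Time to last byte"

# An individual SLI only states what differs from the standard template.
DEFAULT = SLITemplate()
checkout_latency_sli = replace(DEFAULT, included_requests="POSTs to /checkout")
```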

I am a Site Reliability Engineer at Google, annotating the SRE book on medium. The opinions stated here are my own, not those of my company.


Stephen Thorne

Stephen is a Staff Site Reliability Engineer at Google, where he works on the Google Cloud Platform.