Service Level Objectives
Written by Chris Jones, John Wilkes, and Niall Murphy with Cody Smith
Edited by Betsy Beyer
I worked with Chris for 3 years.
It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product.
The words “level of service” seemed odd to me, so I went to find out where else in English this phrasing is used: it turns out it’s used to quantify traffic on highways.
We are talking about network and Internet traffic, not highway traffic.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.
When dealing with a system in crisis, Chris interviewed every engineer on the team and asked, “How do you know, right now, that the system is healthy?” Then he analyzed the answers.
We quickly discovered that we had chosen all the wrong metrics in the beginning, and ended up doing a lot of re-engineering so that we could measure more appropriate ones.
A lot of the crisis our system was in stemmed from the fact that we were never confident the service was healthy! Getting that right was the first thing we fixed, and it led to a good outcome.
This chapter describes the framework we use to wrestle with the problems of metric modeling, metric selection, and metric analysis. Much of this explanation would be quite abstract without an example, so we’ll use the Shakespeare service outlined in Shakespeare: A Sample Service to illustrate our main points.
Our favorite over-engineered service. My commentary is in the appropriately titled The Production Environment at Google (Part 2).
Service Level Terminology
Many readers are likely familiar with the concept of an SLA, but the terms SLI and SLO are also worth careful definition, because in common use, the term SLA is overloaded and has taken on a number of meanings depending on context. We prefer to separate those meanings for clarity.
I fight the battle against conflation of SLO and SLA every day, because it means a lot to my team having external customers. Many teams with only internal customers call everything an SLA, and I think that’s okay: no need to be strict in terminology if you’re not going to be misunderstood.
Here we will discuss just these three terms.
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.
Important: Your SLI should be some measure of customer satisfaction or discomfort.
- Bad SLI: System CPU Load.
- Good SLI: 99th percentile HTTP Response Time.
Most services consider request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile.
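That aggregation step can be sketched as follows; the measurement window contents, the choice of average, and the nearest-rank 99th percentile are illustrative choices, not prescribed by the text:

```python
def latency_slis(samples_ms):
    """Aggregate raw per-request latencies (in ms) from one
    measurement window into a few candidate SLI values."""
    ordered = sorted(samples_ms)
    n = len(ordered)
    return {
        "average_ms": sum(ordered) / n,
        # Nearest-rank 99th percentile: the value at or below
        # which 99% of requests in the window completed.
        "p99_ms": ordered[min(n - 1, int(n * 0.99))],
    }
```

Note how much the two values can disagree: a handful of very slow requests barely moves the average but dominates the 99th percentile, which is one reason percentiles are the more common latency SLI.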
I hope you have a system that can achieve this, and if you don’t, you should invest in implementing one.
Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret. For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
One such proxy would be using a prober to test the availability of your system. At Google we take for granted that every piece of software we have publishes metrics to a centralized monitoring platform, but if you cannot hook into your software to record all the information you want and publish it, then a good prober that checks the performance of your system is a great way of getting an SLI.
Probers are best when the probe is very representative of your customer traffic, and when there’s little to no state involved. For instance, probing an HTTP load balancer is a great way to make sure it’s functioning properly and fast, but probing a highly stateful website may never find out that half the customers were deleted from the database.
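A minimal prober along these lines might look like the sketch below, assuming a stateless HTTP endpoint; the URL and timeout are placeholders, and each probe result would become one availability/latency SLI sample:

```python
import time
import urllib.request

def probe(url, timeout_s=5.0):
    """One black-box probe: did the endpoint answer successfully,
    and how long did it take? Returns (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        # Connection refused, DNS failure, timeout, etc. all
        # count as a failed probe.
        ok = False
    return ok, time.monotonic() - start
```

Run on a schedule from outside your serving infrastructure, the fraction of probes with `ok == True` approximates availability, and the elapsed times feed a latency SLI.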
Another kind of SLI important to SREs is availability, or the fraction of the time that a service is usable. It is often defined in terms of the fraction of well-formed requests that succeed, sometimes called yield. (Durability — the likelihood that data will be retained over a long period of time — is equally important for data storage systems.) Although 100% availability is impossible, near-100% availability is often readily achievable, and the industry commonly expresses high-availability values in terms of the number of “nines” in the availability percentage. For example, availabilities of 99% and 99.999% can be referred to as “2 nines” and “5 nines” availability, respectively, and the current published target for Google Compute Engine availability is “three and a half nines” — 99.95% availability.
I’ve talked about these numbers in previous articles. I typically work on systems that are ‘Three and a half nines’ because it’s the point at which humans can still fix problems if they’re well trained and attentive, but work can be done to make the systems auto-healing and safer. It’s the right point on the cost-benefit curve.
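The arithmetic behind “nines” is simple; a small sketch, with the one-year period as an assumed reporting window:

```python
def allowed_downtime(availability, period_s=365 * 24 * 3600):
    """Unavailability budget, in seconds, implied by an
    availability target over a period (default: one year)."""
    return period_s * (1 - availability)

# 99.95% ("three and a half nines") allows roughly 4.38 hours
# of unavailability per year.
hours_per_year = allowed_downtime(0.9995) / 3600
```

Each extra nine cuts the budget by a factor of ten, which is why the cost-benefit point the note above describes matters: at “five nines” a human can no longer respond within the budget.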
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds.
By definition, every moment that our customers experience latency above 100 milliseconds eats into our error budget.
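As a sketch of that error-budget accounting, assuming a hypothetical request-based SLO (e.g., 99.9% of requests must meet the latency target); the numbers are illustrative:

```python
def error_budget_remaining(total_requests, violations, slo=0.999):
    """Fraction of the error budget left under a request-based SLO:
    at least `slo` of requests must meet the target.
    1.0 means untouched; 0 or below means exhausted."""
    allowed = total_requests * (1 - slo)  # requests permitted to miss
    return 1 - violations / allowed
```

For example, with one million requests and a 99.9% SLO, 1,000 requests are allowed to miss the target; 250 violations would leave 75% of the budget.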
Choosing an appropriate SLO is complex. To begin with, you don’t always get to choose its value! For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is essentially determined by the desires of your users, and you can’t really set an SLO for that.
A perfectly reasonable requests-per-second SLI could apply to a pipeline system: because a pipeline backs up if it’s running too slowly, you might require that its rate of processing be fast enough most of the time. For instance: at least 90% of the time, the pipeline is processing 100k records per second.
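That pipeline SLO could be checked with a sketch like this, using the illustrative 100k-records/second threshold and 90% fraction from above; the per-window rates are assumed to come from your monitoring system:

```python
def throughput_slo_met(window_rates, min_rate=100_000, fraction=0.90):
    """SLO check for a pipeline: in at least `fraction` of
    measurement windows, the processing rate must reach
    `min_rate` records per second."""
    good = sum(1 for rate in window_rates if rate >= min_rate)
    return good / len(window_rates) >= fraction
```

One slow window out of ten still meets the objective; two do not.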
On the other hand, you can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment. (100 milliseconds is obviously an arbitrary value, but in general lower latency numbers are good. There are excellent reasons to believe that fast is better than slow, and that user-experienced latency above certain values actually drives people away — see “Speed Matters” [Bru09] for more details.)
Make sure that your SLO matches your customer expectations! You want your SLO to be tripped if customers might possibly notice something going wrong.
This is intricately linked to your SLI: if your SLI tracks a metric that your customers care about (such as service latency), and you define an SLO that will be tripped at or before your customers will notice, then you have a good start on your SLOs.
On the other hand, you never want to be in the position where a customer contacts you to say, “I’m upset with how slow your service is,” and you meet their accurate problem report with “But the system is in SLO,” because your SLO was not in line with customer expectations.
Your Objective is to deliver a system that most people will be happy or very happy with.
Again, this is more subtle than it might at first appear, in that those two SLIs — QPS and latency — might be connected behind the scenes: higher QPS often leads to larger latencies, and it’s common for services to have a performance cliff beyond some load threshold.
You have two error budgets; they’re separate, and the same incident can eat into both your availability and latency error budgets. That’s expected.
Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is (as happened with Chubby: see The Global Chubby Planned Outage), and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.
Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial — a rebate or a penalty — but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.
Because an SLO is much tighter than an SLA, it is very normal for your SRE team to care only about SLOs: if an incident is so bad that it’s violating the SLA, they’re already working as hard as possible to resolve it.
The only time I ever care about an SLA in my day-to-day is when writing a postmortem (or post-incident report) and working out the cost impact of the outage.
SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs. They can also help to define the SLIs: there obviously needs to be an objective way to measure the SLOs in the agreement, or disagreements will arise.
SRE should definitely help measure SLAs, because accurate measurement is best done by the team that does the monitoring and reporting of metrics, in partnership with product management and sales.
It’s always good to vet your external SLAs to make sure they’re achievable. I can imagine the damage a poorly thought out SLA can do!
Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the whole world. Even so, there are still consequences if Search isn’t available — unavailability results in a hit to our reputation, as well as a drop in advertising revenue. Many other Google services, such as Google for Work, do have explicit SLAs with their users. Whether or not a particular service has an SLA, it’s valuable to define SLIs and SLOs and use them to manage the service.
See the story of the Global Chubby Planned Outage, where we discuss using planned outages to make sure that fault-tolerant systems remain so.
You can see all my posts in order here.