Service Level Objectives in Practice
Service Level Objectives, or SLOs, are the fundamental basis of all Site Reliability Engineering. Without them you can’t have error budgets, prioritize development work, or do timely and effective incident management.
Objectives in Practice
This is from Chapter 4 of the SRE book: Service Level Objectives.
Start by thinking about (or finding out!) what your users care about, not what you can measure. Often, what your users care about is difficult or impossible to measure, so you’ll end up approximating users’ needs in some way. However, if you simply start with what’s easy to measure, you’ll end up with less useful SLOs. As a result, we’ve sometimes found that working from desired objectives backward to specific indicators works better than choosing indicators and then coming up with targets.
I think this is saying: Come up with non-specific points of view of how reliable your system should be first, and then find out exact ways to measure those things.
Putting it another way: Don’t get tunnel vision looking at the things you measure today. Think first about the experience your customers should have, define that as your SLO, then go and invent or discover the best metrics (Service Level Indicators) that allow you to measure that SLO.
For maximum clarity, SLOs should specify how they’re measured and the conditions under which they’re valid. For instance, we might say the following (the second line is the same as the first, but relies on the SLI defaults of the previous section to remove redundancy):
See Service Level Indicators for the previous section.
- 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
- 99% of Get RPC calls will complete in less than 100 ms.
You’ll notice maximum time isn’t mentioned here. Generally, when setting SLOs we set two: error rate and latency. We treat very long response times (deadline exceeded, or where the client has likely never even received the message) as errors, and count them against our error SLO. Our error rate SLO is not listed here.
Additionally, we would typically only require that successful responses are fast. If we include errors in the latency measurement, we are likely to skew the figures: for example, if errors are extremely fast but the service is extremely slow when it works, or if the service is fast but takes a long time in the error case.
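To make the split between the two SLOs concrete, here is a minimal sketch of how I might compute them from raw request records. The `Request` type, its field names, and the sample data are all hypothetical; the only ideas taken from the text are that timeouts count as errors and that latency is judged on successful responses only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # False for any error, including deadline-exceeded responses


def latency_sli(requests, threshold_ms):
    """Fraction of successful requests that finished under threshold_ms."""
    successes = [r for r in requests if r.ok]
    if not successes:
        return None  # no successful traffic, so the latency SLI is undefined
    fast = sum(1 for r in successes if r.latency_ms < threshold_ms)
    return fast / len(successes)


def error_rate(requests):
    """Fraction of all requests that failed (timeouts count as failures)."""
    if not requests:
        return None
    return sum(1 for r in requests if not r.ok) / len(requests)


# Hypothetical sample: 98 fast successes, 1 slow success, 1 very fast error.
sample = [Request(20, True)] * 98 + [Request(150, True)] + [Request(5, False)]
print("latency SLI:", latency_sli(sample, threshold_ms=100))
print("error rate:", error_rate(sample))
```

Note that the fast error does not flatter the latency figure, because it never enters the latency calculation at all.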
If the shape of the performance curves are important, then you can specify multiple SLO targets:
- 90% of Get RPC calls will complete in less than 1 ms.
- 99% of Get RPC calls will complete in less than 10 ms.
- 99.9% of Get RPC calls will complete in less than 100 ms.
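A set of targets like this can be checked with a few lines of code. The following is only a sketch: `TARGETS` mirrors the three thresholds quoted above, while the function names and the sample latencies are made up for illustration.

```python
# (required fraction, latency threshold in ms), taken from the three targets above.
TARGETS = [(0.90, 1.0), (0.99, 10.0), (0.999, 100.0)]


def fraction_under(latencies_ms, threshold_ms):
    """Fraction of observed latencies strictly below the threshold."""
    return sum(1 for l in latencies_ms if l < threshold_ms) / len(latencies_ms)


def check_targets(latencies_ms, targets=TARGETS):
    """Return one (required, threshold_ms, achieved, met) tuple per target."""
    results = []
    for required, threshold_ms in targets:
        achieved = fraction_under(latencies_ms, threshold_ms)
        results.append((required, threshold_ms, achieved, achieved >= required))
    return results


# Hypothetical window of 1000 requests with a slightly heavy tail.
latencies = [0.5] * 950 + [5.0] * 45 + [50.0] * 3 + [500.0] * 2
for required, threshold_ms, achieved, met in check_targets(latencies):
    print(f"{required:.1%} under {threshold_ms} ms: achieved {achieved:.2%}, met={met}")
```

In this made-up window the first two targets are met and the 99.9% target is missed, which is exactly the kind of tail problem a single-threshold SLO would hide.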
Remember to measure this from the right place. Typically, latency should be measured from a client that is a known distance from the server. Ideally we measure every single request at the client and report the statistics back to a monitoring system.
Two places that are bad to measure from are:
- The server itself, because the server might have measurement glitches due to pauses in its own processing, such as a delay accepting the TCP connection or a delay writing data to the network.
- A client on the internet, because your latency numbers will vary wildly depending on the client’s Internet connection. You don’t want to wake someone up in the middle of the night because your app is suddenly popular in Australia!
If you have users with heterogeneous workloads such as a bulk processing pipeline that cares about throughput and an interactive client that cares about latency, it may be appropriate to define separate objectives for each class of workload:
- 95% of throughput clients’ Set RPC calls will complete in < 1 s.
- 99% of latency clients’ Set RPC calls with payloads < 1 kB will complete in < 10 ms.
Be careful with the ‘payload’ option here: this is going to cause your monitoring to be quite complex. “Oh, we didn’t know about that customer having a bad experience, because their requests were all slightly larger than what we measure.”
If you can’t cope with your queries being a variety of sizes because you have a very stateful system, you could combine two kinds of SLO: a strict one based on probers (well-behaved queries should always be fast) and a looser one based on the whole customer base. For example:
- 99.99% of Set RPC calls sent by our prober will complete in < 10 ms.
- 99% of Set RPC calls sent by customers will complete in < 10 ms.
Because our probes are all ‘well behaved’ queries, they should always be fast, and any slowness means something has gone wrong that we should investigate. And because our customers send expensive, very large ‘Set’ operations less than 1% of the time, we can make sure that the majority of our customers (99% of them!) have a fast service.
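Here is a rough sketch of evaluating the prober and customer SLOs side by side. It assumes each request record is tagged with its traffic source; the `SLOS` table copies the two targets above, and everything else (names, record shape, sample data) is hypothetical.

```python
from collections import defaultdict

# (required fraction, latency threshold in ms) per traffic source, from the two SLOs above.
SLOS = {"prober": (0.9999, 10.0), "customer": (0.99, 10.0)}


def evaluate_by_source(records, slos=SLOS):
    """records: iterable of (source, latency_ms). Returns {source: (achieved, met)}."""
    latencies = defaultdict(list)
    for source, latency_ms in records:
        latencies[source].append(latency_ms)

    report = {}
    for source, (required, threshold_ms) in slos.items():
        observed = latencies.get(source, [])
        if not observed:
            report[source] = (None, None)  # no traffic seen for this source
            continue
        achieved = sum(1 for l in observed if l < threshold_ms) / len(observed)
        report[source] = (achieved, achieved >= required)
    return report


# Hypothetical window: one slow probe, and a handful of large customer Sets.
records = (
    [("prober", 2.0)] * 999 + [("prober", 50.0)]
    + [("customer", 3.0)] * 995 + [("customer", 40.0)] * 5
)
print(evaluate_by_source(records))
```

With this made-up data the customer SLO is comfortably met, but a single slow probe in a thousand already breaks the prober SLO, which is the signal to go and investigate.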
It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment, require expensive, overly conservative solutions, or both. Instead, it is better to allow an error budget — a rate at which the SLOs can be missed — and track that on a daily or weekly basis. Upper management will probably want a monthly or quarterly assessment, too. (An error budget is just an SLO for meeting other SLOs!)
I love being able to go and see my partner team’s SLOs and error budgets. It lets me know exactly how reliable my system can be if I depend on theirs, and informs architectural decisions.
The SLO violation rate can be compared against the error budget (see Motivation for Error Budgets), with the gap used as an input to the process that decides when to roll out new releases.
See my commentary on Motivation for Error Budgets if you want to read more about what an error budget is.
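The arithmetic behind comparing the violation rate with the budget is simple enough to sketch. The function names and the freeze policy below are my own assumptions for illustration, not something the book prescribes.

```python
def error_budget_status(slo_target, total_events, bad_events):
    """Summarize error budget use for one tracking window.

    slo_target is a fraction such as 0.999; the budget is the remaining 1 - slo_target.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    consumed = bad_events / allowed_bad if allowed_bad > 0 else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "remaining": allowed_bad - bad_events,
        "consumed_fraction": consumed,
    }


def release_allowed(status, freeze_at=1.0):
    # Hypothetical policy: stop rolling out once the budget is fully spent.
    return status["consumed_fraction"] < freeze_at


# A 99.9% SLO over one million requests allows roughly 1000 bad requests.
status = error_budget_status(0.999, 1_000_000, 400)
print(status, release_allowed(status))
```

A real release policy would likely be less binary than this, for example slowing rollouts as the budget runs low rather than freezing only at zero.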
Choosing targets (SLOs) is not a purely technical activity because of the product and business implications, which should be reflected in both the SLIs and SLOs (and maybe SLAs) that are selected. Similarly, it may be necessary to trade off certain product attributes against others within the constraints posed by staffing, time to market, hardware availability, and funding. While SRE should be part of this conversation, and advise on the risks and viability of different options, we’ve learned a few lessons that can help make this a more productive discussion:
Don’t pick a target based on current performance
While understanding the merits and limits of a system is essential, adopting values without reflection may lock you into supporting a system that requires heroic efforts to meet its targets, and that cannot be improved without significant redesign.
An example: consider a task processing system whose SLOs say that 99.9% of tasks will be executed in the order they’re inserted, and that latency from insertion to execution will be less than 60 seconds.
Imagine that inserting and retrieving work items from the queue is a bottleneck, so you want to shard it up.
Sharding the task work queues means making a system where the ‘in order’ SLO can’t be kept without huge heroics, so you can’t improve your throughput!
Keep it simple
Complicated aggregations in SLIs can obscure changes to system performance, and are also harder to reason about.
An example of a complex SLO: All queries that are Get queries with a query payload under 100kb with a response under 500kb will succeed 99.99% of the time, and at under 100ms latency, and Set queries that are setting a value under 500kb will succeed 99.95% of the time, and under 200ms latency.
Then put that on a graph and alert your SREs when it’s exceeded…
Keep your SLOs simple. For me. Please.
While it’s tempting to ask for a system that can scale its load “infinitely” without any latency increase and that is “always” available, this requirement is unrealistic. Even a system that approaches such ideals will probably take a long time to design and build, and will be expensive to operate — and probably turn out to be unnecessarily better than what users would be happy (or even delighted) to have.
Always never say always.
Have as few SLOs as possible
Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO. However, not all product attributes are amenable to SLOs: it’s hard to specify “user delight” with an SLO.
Yes, a thousand times yes. Have a super clear indication of how your system is going via an SLO, and provide drill down on specifics.
Imagine having 100 different URLs on your website, and having a different SLO for each of them. Every time you do a website push you have to update your SLO charts and reports!
Big buckets for error rate and latency are about the granularity you should get down to. For example:
- Static asset serving.
- Simple Get operations.
- Complex Get operations (searches?).
- Set operations.
Perfection can wait
You can always refine SLO definitions and targets over time as you learn about a system’s behavior. It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
SLOs can — and should — be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about. A good SLO is a helpful, legitimate forcing function for a development team. But a poorly thought-out SLO can result in wasted work if a team uses heroic efforts to meet an overly aggressive SLO, or a bad product if the SLO is too lax. SLOs are a massive lever: use them wisely.
A good measure is: after you run out of your error budget, and after you’ve written your postmortem and are considering doing engineering to fix your system’s architecture, go and have a frank conversation with your support staff and Product Management. You’re out of SLO, it was a big incident, you’re considering a release freeze for a few weeks: Did customers notice? How bad was it? Are these SLOs actually measuring worthwhile things?
The SLO is supposed to be a representation of how your customers perceive your systems. If you’re out of SLO: Go find out how bad it was! Failure has provided you a valuable moment: don’t waste it.
SLIs and SLOs are crucial elements in the control loops used to manage systems:
1. Monitor and measure the system’s SLIs.
2. Compare the SLIs to the SLOs, and decide whether or not action is needed.
3. If action is needed, figure out what needs to happen in order to meet the target.
4. Take that action.
For example, if step 2 shows that request latency is increasing, and will miss the SLO in a few hours unless something is done, step 3 might include testing the hypothesis that the servers are CPU-bound, and deciding to add more of them to spread the load. Without the SLO, you wouldn’t know whether (or when) to take action.
I call this control loop my monitoring system, which is linked to my pager. The single most important reason to page someone to do incident management is that an SLO is in danger.
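One pass of that loop can be sketched in code. This is a deliberately naive version that assumes a straight-line trend between the first and last samples; the function names, the `"act"`/`"ok"` results, and the sample numbers are all hypothetical.

```python
def hours_until_breach(samples, slo_ms):
    """Linearly extrapolate (hour, latency_ms) samples to estimate time to SLO breach.

    Returns 0.0 if already in breach, or None if latency is not rising.
    """
    (t0, first), (t1, last) = samples[0], samples[-1]
    if last >= slo_ms:
        return 0.0
    if t1 <= t0:
        return None  # not enough history to estimate a trend
    slope = (last - first) / (t1 - t0)
    if slope <= 0:
        return None
    return (slo_ms - last) / slope


def control_step(samples, slo_ms, horizon_hours):
    """Compare the SLI trend to the SLO and decide whether action is needed."""
    eta = hours_until_breach(samples, slo_ms)
    if eta is not None and eta <= horizon_hours:
        return "act"  # e.g. page someone, or test the CPU-bound hypothesis
    return "ok"


# Latency climbing 10 ms per hour toward a 100 ms SLO.
rising = [(0, 60.0), (1, 70.0), (2, 80.0)]
print(hours_until_breach(rising, slo_ms=100.0), control_step(rising, 100.0, horizon_hours=4))
```

A real monitoring system would use something sturdier than two-point extrapolation, but the shape is the same: measure, compare against the SLO, and act only when the SLO is in danger.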
SLOs Set Expectations
Publishing SLOs sets expectations for system behavior. Users (and potential users) often want to know what they can expect from a service in order to understand whether it’s appropriate for their use case. For instance, a team wanting to build a photo-sharing website might want to avoid using a service that promises very strong durability and low cost in exchange for slightly lower availability, though the same service might be a perfect fit for an archival records management system.
I can imagine a good concrete example here would be choosing between a relational database and a NAS to store your data: go to your Database and Storage teams, establish what their SLOs are, and compare them against your needs.
In order to set realistic expectations for your users, you might consider using one or both of the following tactics:
Keep a safety margin
Using a tighter internal SLO than the SLO advertised to users gives you room to respond to chronic problems before they become visible externally. An SLO buffer also makes it possible to accommodate reimplementations that trade performance for other attributes, such as cost or ease of maintenance, without having to disappoint users.
This sounds like a communication problem: If your users need to be kept away from what your real SLO is, then that’s not okay by me. I believe there should be one set of SLOs, not an ‘internal’ and ‘external’ SLO.
Especially as SLOs are supposed to be a measurement of customer experience!
Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn’t faster under light loads.
Understanding how well a system is meeting its expectations helps decide whether to invest in making the system faster, more available, and more resilient. Alternatively, if the service is doing fine, perhaps staff time should be spent on other priorities, such as paying off technical debt, adding new features, or introducing other products.
Example of overachieving: you set an SLO of 95% success rate but achieve 99.99%, and your customers will be very inconvenienced if it goes as low as 99%. You can’t just turn around to your customers and say you’re not going to do anything about it because you’re still achieving your SLO.
In other words, always be sure that your SLO accurately marks the point beyond which your customers will have a bad time.
I am a Site Reliability Engineer at Google, annotating the SRE book on Medium. The opinions stated here are my own, not those of my company.