You know that SLOs are important, but how do you create them? Picking the right metrics can make the difference between a bad user experience and a good one.
This article shows you how to find the SLOs that will make your users happy.
Start with the user
“Focus on the user and all else will follow”. It sounds simple, but I have seen this regularly forgotten or ignored. Finding the right SLO is a lot of work, and the streetlight effect plagues us with constant temptation to pick a timeseries that we already have, hang a target on it, and call the job done.
The first problem is to understand who your users are, and how they use your service. Talk to them if you can: write down how you think your service is used, share it with your users, and ask them if you’ve got it right. If that’s not feasible, then find some way to measure what they are doing, either through instrumentation and logging, or through user studies.
Expect to be surprised. There will be a class of users that you didn’t know existed, using your service in a way you hadn’t anticipated. Be wary of reactions like “we don’t support this use case”: somebody will always say this, and sometimes it will be correct, but you need to consult with other stakeholders before writing off a class of users. Product design, development teams, and SRE need to be in alignment about which users you support.
Even if you decide that a class of users will not be supported in your SLOs, you still need to know about them, because you will have to filter them out of your metrics to avoid noise.
Identify your boundaries
No SRE team is responsible for the entire world. If your SLOs are too broad, all they will tell you is that there are problems in places that are beyond your control. So the second problem is to understand what things are your responsibility, and where the boundaries lie.
If you are running an internally-facing service and your users are other engineering teams within your organisation, jointly define the boundaries with them, so that you have agreement on who is responsible for each piece. If your service faces external users, work with your product design team to identify what is and is not part of your product. In either case, expect these boundaries to move fairly rapidly as business needs change, so stay in regular contact and keep asking yourself if your SLO needs to be updated.
Combine this with what you know about your users, and think about how to measure the things which your users want, along the boundary of your service. It is really easy to forget one or the other: measuring convenient boundaries that your users don’t care about, or measuring things your users want but your product design team aren’t interested in providing. At every point in the process of creating your SLO, check that you are not making either mistake.
Much has been said about black-box versus white-box monitoring. In practice, these terms tend to be used to refer to metrics collected from the serving infrastructure (“white-box”) and metrics collected from anything else (“black-box”), which is a simplified view of the world.
For the sake of illustration, we shall assume that we run an RPC server. Our users run clients, which send us RPCs, and we run the servers that receive them and send responses. This might not be precisely how your team’s service works, but it’s a model which most SREs find easy to understand.
There are five notable approaches we can take to collect metrics about this service:
- The serving binary collects metrics about its own behaviour
- The serving binary logs requests, which we later analyse
- Synthetic traffic is sent to the server, and the responses checked for correctness
- The client is instrumented to collect metrics about its interactions with the server
- The client logs its activity, which we later analyse
When people talk about “white-box” monitoring, they are often referring to #1, and “black-box” monitoring often refers to #3. This is not because these two options are inherently better, but because they’re the ones which can most easily be implemented by off-the-shelf tools, which is an instance of the streetlight effect.
The biggest danger of relying on the first three approaches is that you won’t know anything about queries which were sent by clients but never made it as far as being reported by your servers. Interesting ways in which this can happen include: unusual user queries which cause the server to crash before it has a chance to report data, problems in your front-end load balancing which cause queries to be lost, and infrastructure problems like power or networking which create complex partial failures.
Good scenarios to think about here include “all our servers are shut down” and “a single country is unable to send us any network queries”. You are likely to want to use measurement techniques which can detect these cases.
Collecting data at the client is more work, since it involves changes beyond the servers that your team looks after, and it may require additional privacy or legal review. It has a notable advantage, though: because you are measuring the queries your users actually send, you can capture transient failures where a query never reached your servers.
This approach only works if you don’t have correlated failure modes between the client sending requests and reporting its metrics. If the client simply follows every query with a fire-and-forget report, then those reports are unlikely to be delivered in the scenarios that we are interested in. You will want the client-side reporting to store data and upload it at the next available opportunity, so that data isn’t lost due to temporary failures.
Approaches #2 and #5 involve logs analysis. The interesting property of this approach is that you can ask questions which you hadn’t thought of at the time when you collected the data. You are likely to need to do this a lot when defining your SLO, even if it’s not the approach you select for the SLO itself.
You might already have some of these approaches in place. It is worth spending time to explore building data collection pipelines for the ones you are missing. If you don’t have some of these, you’re going to force the streetlight effect on yourself just because you lack the tools to get the rest of the data.
Collecting logs data is something you have to be careful with, as it will often contain significant amounts of private data about your users. Consult with your organisation’s privacy team (or create one if you have to), and take care to be a good steward of user data. You don’t need very much history for analysis to create SLOs, so don’t be tempted to keep it longer than you need.
Compute the aggregate metrics that you need for your SLOs and store those long-term; delete the raw logs data after a week or two. Tightly control who has access to the raw data. If you later discover some of the data is not useful, stop collecting it. Nothing is safer than data you never collected.
If you have both client and server logs, and record a unique identifier in each RPC, you can join these logs together to extract more metrics. An interesting case is the number of queries sent by clients which were never received by servers, which can detect problems in your infrastructure. A significant amount of noise will need to be filtered out in order to make this useful. If your client is a mobile device, you’ll also need to log the network state, as the bulk of the lost queries will be from devices that didn’t have a network connection at that point in time.
You can also run the join in the other direction to validate your client reporting: if queries are received by servers which were not sent by clients, then you know those clients are not reporting metrics to you.
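A minimal sketch of this two-way join, assuming each log entry carries a unique `rpc_id` and client entries record whether the device believed it had a network connection at the time (both field names are assumptions for illustration):

```python
def join_logs(client_log, server_log):
    """Join client and server logs on rpc_id, in both directions.

    Returns (infra_lost, unreported):
    - infra_lost: client entries never seen by a server, excluding the
      noise from devices that knew they were offline at the time
    - unreported: rpc_ids the servers received but no client reported,
      which indicates gaps in client-side metric reporting
    """
    received = {entry["rpc_id"] for entry in server_log}
    lost = [e for e in client_log if e["rpc_id"] not in received]
    infra_lost = [e for e in lost if e.get("network_up", True)]
    unreported = received - {e["rpc_id"] for e in client_log}
    return infra_lost, unreported
```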
Analysing the data
Once you have collected all the data that might be useful, you will have to analyse it to find out which parts of it are useful. This process is going to involve many iterations and a lot of exploratory work, so don’t try to plan out too far ahead. Pick an idea that you’ve had for something that might work well as a metric, and start digging.
Create a timeseries graph of the metric that you’re interested in. Look for interesting outages that you are already aware of, and see if they show up in the graph. Look for the biggest peaks and craters in the graph which you can’t explain, and debug one that looks interesting: figure out what happened to your system at that point in time, to make the graph do this.
When you learn what is happening, there are a few directions that you might take. You might discover a real problem with your system, and fix it. You might discover a reason why this effect is noise, and think of a better metric that won’t include this noise. One of the more subtle cases is when you discover that your system has “normal noise” in it: when functioning as designed, the behaviour is inconsistent over time, creating noise in your metrics. In this case you want to think carefully about whether you should look for a metric which doesn’t behave this way, or change the design of your system to be less noisy.
As you repeat this process, you will learn new and interesting things about your system. Eventually you will arrive at an understanding of which metrics work well for your system, and you are ready to construct an SLO.
Selecting an SLO
When you have an idea about what you want your SLO to be, the next step is to try taking an adversarial approach: what is the worst system you could build that satisfies this metric? Think about why this system is bad, and what you could do instead to exclude this attack.
To use a common mistake as an example, let’s think about an SLO based on counting how many RPCs return a successful result. If we say that 99% of RPCs must be successful, we likely expect this to mean that the client has to retry 1% of the time. A bad system that satisfies this metric is one that returns correct results to 99% of users while the remaining 1% receive errors in response to every RPC. This is probably not what we wanted.
That system may at first glance sound unlikely, but this kind of problem is relatively common: these are the corrupted user records in your database, the people with odd hardware that you don’t support properly, the people in a country where the internet is weird in a way that breaks your service, and the people who have Unicode characters in their name which crash your server. You probably didn’t intend to design your SLO to permit this kind of problem.
A better way to design this SLO would be to aggregate queries by user, and let your SLI be “how many users have an acceptably low rate of errors”. This still leaves similar issues when the failures are associated with “this user on a particular device”. We’ll refer to this class of problem as “correlated errors”. To handle them all, we’ll need to take a step back.
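A sketch of this per-user aggregation, assuming RPC outcomes arrive as `(user_id, ok)` pairs and a 1% per-user error budget (both assumptions for illustration):

```python
from collections import defaultdict

def happy_user_fraction(rpcs, max_error_rate=0.01):
    """Fraction of users whose personal error rate is acceptably low.

    `rpcs` is an iterable of (user_id, ok) pairs. Aggregating per user
    keeps "1% of users failing 100% of the time" from hiding inside a
    healthy-looking 99% per-RPC success rate.
    """
    totals = defaultdict(lambda: [0, 0])  # user -> [errors, total]
    for user, ok in rpcs:
        totals[user][1] += 1
        if not ok:
            totals[user][0] += 1
    happy = sum(1 for errs, total in totals.values()
                if errs / total <= max_error_rate)
    return happy / len(totals)
```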
We can introduce a concept of “user sessions”, which is a sequence of related interactions that your user has with your service, however that makes sense for you. If you have a sophisticated client, that may already have a suitable mechanism. Otherwise, an approach that often works is just to say that a sequence of RPCs from one user within a short span of time forms a session. Whatever approach you take, the idea is that the session includes whole user journeys rather than single RPCs.
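A minimal illustration of the time-gap heuristic for one user's events; the 300-second gap is an assumption you would tune against real usage patterns:

```python
def sessionise(event_times, gap=300):
    """Group one user's event timestamps into sessions.

    A new session starts whenever more than `gap` seconds pass between
    consecutive events. Timestamps are seconds; the gap value is an
    illustrative assumption, not a recommendation.
    """
    sessions = []
    for ts in sorted(event_times):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```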
The user cares whether the thing they were trying to do worked or not. If they needed all of the RPCs to succeed, and the last one failed, they will be unhappy. Conversely, if they didn’t care whether that last RPC succeeded, you probably don’t need an SLO for it at all. If the client could retry the failed RPC and still give a response to the user in reasonable time, maybe it doesn’t matter how many RPCs fail. A better metric in this scenario is how many RPCs eventually succeed after accounting for retries.
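Assuming retries share a logical request identifier (an assumption about your client's instrumentation), the "eventually succeeded" rate can be sketched as:

```python
def eventual_success_rate(attempts):
    """Success rate of logical requests rather than individual RPCs.

    `attempts` is an iterable of (request_id, ok) pairs covering retries.
    A request counts as successful if any of its attempts succeeded, so
    a failure followed by a successful retry does not count against us.
    """
    outcomes = {}
    for request_id, ok in attempts:
        outcomes[request_id] = outcomes.get(request_id, False) or ok
    return sum(outcomes.values()) / len(outcomes)
```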
Ideally, you will be able to extract some concept of “the user journey worked”, but if that is intractable or moves beyond your boundaries, an approximation that captures the spirit of your original idea is “how many user sessions have no errors in them”. By aggregating at the level of user activity, we remove correlated errors.
Lots of SRE material talks about reliability in terms of nines: “three nines” is 99.9%, “five nines” is 99.999%. This is useful for conversational purposes, but be wary of thinking about these numbers as absolute measures of reliability. Depending on how many queries there are in a user session and how evenly the errors are distributed, “three nines of user sessions having no errors” might be a higher level of reliability than “five nines of RPCs having no errors”.
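A back-of-envelope illustration, assuming (unrealistically) that errors hit RPCs independently; real correlated failures violate this, so treat it as a rough bound:

```python
def rpc_target_for_session_target(session_target, rpcs_per_session):
    """Per-RPC success rate needed to hit a session-level target.

    Under the independence assumption, a session of n RPCs is error-free
    with probability p**n, so we invert: p = session_target**(1/n).
    """
    return session_target ** (1 / rpcs_per_session)

# "Three nines" of error-free sessions, with 100 RPCs per session,
# demands roughly "five nines" per RPC under this assumption.
per_rpc = rpc_target_for_session_target(0.999, 100)
```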
Don’t be alarmed if you look at your current performance in aggregated metrics and it seems a lot worse than you thought. It might be that bad, or it might just be the result of how the numbers work in your aggregation. Investigate more closely, and find out what is happening.
The lesson to take away from this is that the way you aggregate your metrics is vital to creating an SLO that will result in happy users.
Long term targets
When you find an SLO that you are convinced is good, you will probably also find that your system’s current performance against this SLO is a lot worse than you would like it to be. This is a good thing: all the work you have done in looking for an SLO is now justified, because you have established the need for engineering work and an objective measure of when it is completed.
You are likely to encounter resistance if you try to apply the SLO that you would like to have immediately. People will tell you that you’re meeting your current SLOs, and it looks bad if you change the SLOs and now are not meeting them.
What you want to do here is distinguish between the current SLO and the target SLO: the current SLO is the one that the SRE team is working to for now, and the target SLO is the one which engineering projects will advance you towards. As those engineering projects get done, you increase the current SLO to match the reliability improvements that you have delivered.
Note that as part of establishing the new SLO, you will need to secure agreement that this engineering work is worthwhile and important, and then drive the projects through to completion. It doesn’t help anybody if you set a target SLO and nobody is willing to do the work to improve your systems.
You will probably find that you have multiple important metrics. You could simply define an SLO for each one, but if you have a lot of them then this can become unwieldy: your SLO dashboard has too many entries, half of them are red, and there’s no clear priority about what needs doing. SLOs are supposed to give us clear priorities, and if they don’t then something is wrong.
To improve this situation, you will need to aggregate your SLOs together into simple, meaningful metrics that express your true priorities.
A popular approach here is to sum the underlying metrics into SLO buckets. You might sum “all the queries relating to this set of features” and create an SLO about how many of those succeed. This approach is popular because it is easy and gives results that seem superficially reasonable. The problem with this approach is base rate bias.
Let’s suppose that our RPC service has three types of RPC, which we’ll call Lookup, IterateRange, and Write. It receives 10 Write queries, 1000 IterateRange queries, and 10000 Lookup queries in every second, and we’ve set our SLO to be 99% of queries succeeding. We’ve bucketed all the RPCs together into the “Storage” SLO.
What happens when a new release causes 1% of IterateRange queries to fail? 1% of 1000 is 10, so we now have 10qps of errors in our Storage bucket, compared to 11010qps total, so 0.09% of Storage queries are failing, and we report that we are still within SLO. That probably wasn’t what we wanted to happen.
Worse, let’s suppose all of our disks have become read-only and all writes are failing. That’s still only 10qps of errors. We can fail every Write operation and still claim to be within SLO.
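The arithmetic from these two failure scenarios can be checked with a short sketch; the qps figures are from the example above, and combining both failures in one bucket makes the bias even more visible:

```python
def bucket_error_rate(buckets):
    """Error rate of a summed SLO bucket.

    `buckets` maps RPC type -> (qps, error_qps). Summing everything into
    one bucket lets high-volume types drown out low-volume ones: this is
    base rate bias.
    """
    total_qps = sum(qps for qps, _ in buckets.values())
    total_errors = sum(err for _, err in buckets.values())
    return total_errors / total_qps

# Both failures at once: 1% of IterateRange failing AND every Write failing.
storage = {
    "Lookup": (10000, 0),
    "IterateRange": (1000, 10),
    "Write": (10, 10),
}
```

Even with every Write failing, the bucket error rate is about 0.18%, comfortably inside a 99% target.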
Would a “happy sessions” SLO save us here? It turns out that our storage system is written to by one set of clients and read from by a different set of clients, and we have lots more read sessions than write sessions. That isn’t going to solve our problem.
We need to go back to the first rule: focus on the user. What do users want here?
Users don’t want storage systems where most of the parts work but some don’t. They expect all of the parts of their storage system to be working. If any one of them is broken, they consider the system to be broken.
We can express this by writing our SLO as “99% of Lookup queries succeed AND 99% of IterateRange queries succeed AND 99% of Write queries succeed”. Now if 1% of any type of queries is failing, we report the Storage SLO as failing.
We would like to report our SLO performance as numbers rather than “pass/fail”, so this approach often becomes a “bad minutes” SLO. The metric we use is “the number of minutes in the reporting period when all types of query were meeting their targets”. This gives us a metric expressing “how much of the time users had a happy experience”.
If we now consider what happens when the SLO is missed, we discover that we have solved our prioritisation problem: the priority is to improve whichever type of query is causing the most bad minutes. That makes sense, and is something we could expect the rest of our organisation to align with.
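A sketch of the bad-minutes computation, assuming per-minute success rates for each query type are already available (the input shape is an assumption for illustration):

```python
def bad_minutes(per_minute_rates, target=0.99):
    """Count the minutes in which any RPC type missed its target.

    `per_minute_rates` is a list of dicts mapping RPC type -> success
    rate for that minute. A minute is good only if every type passes:
    this is the AND across query types from the SLO above.
    """
    good = sum(1 for minute in per_minute_rates
               if all(rate >= target for rate in minute.values()))
    return len(per_minute_rates) - good
```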
Include your dependencies
While it is important to constrain your SLO to the set of things you can control, this is not a reason to exclude the systems that you depend on. Your SLO is supposed to represent user happiness, and your users will not be happy if your service is unavailable because you depend on some other service which is currently broken.
You will implicitly include your dependencies if you are measuring whether your service is doing what your users want. Eventually, one of those dependencies will have an outage which results in you having to report that you have missed your SLO.
When this happens, you are likely to find yourself in a discussion about whether to change your SLO so that it doesn’t include this outage, and somebody will suggest that you are not responsible for problems in your dependencies — after all, that is handled by a different SRE team.
The two points that you need to raise in this discussion are:
- Your service was broken and your users were unhappy. Your SLO report is true.
- Even though you don’t run the system that you depend on, you are still responsible for the design decision to use that system.
If the reliability of a dependency is causing problems, you can engage with the team that looks after it to see if it can be improved. If it cannot be improved to a level that will let you meet your SLO, you can redesign your own system to stop depending on it.
Another variation on this problem is when you are attempting to define your alerting based on your SLO. It is reasonable to create alerts when you have measured SLO impact, but you don’t want to page yourselves for problems in your dependencies which you can do nothing about, and which will page the correct team anyway. In this scenario, you want to adjust your alerting, not your SLO: suppress the alert when you can detect that the problem lies in a dependency.
Avoid saying “it’s not our problem”. Own the user experience, and look into what can be done.
Plan to iterate
This article has explained several flawed approaches to creating SLOs, and some alternatives that are likely to work better. Despite this, don’t let the search for perfection block progress. It is easy to keep analysing in search of a more comprehensive SLO, and forget that what you are currently using is worse. If you have a workable metric for one of the flawed approaches, and it’s clearly better than what you’re currently doing, then deploy it.
Changing your SLO implies changing your priorities, which is expected to result in adjustment of engineering direction. If this happens too frequently, it will become hard to deliver any engineering projects, since continually shifting goals will confuse and demotivate the people involved.
It may help you to have a schedule for SLO revision. If you have a regular cycle of updating your SLOs, then you will become used to accomplishing these changes, without changing them so often that you can’t make progress. Depending on how quickly engineering projects run in your organisation, anywhere from three to twelve months might be a sensible cycle time.
Having a regular schedule will also help you plan for engineering time to do more analysis work based on everything that you have learned. Look at how long it took to update your SLOs last time, and set aside the same amount of time for the next cycle.
Record your process
It is important to record the process you used to work out your current SLOs, so that people can check the assumptions you have used and identify when they stop being accurate.
Colab and Jupyter are great for this: you can write your calculations in notebook form, which can be executed again at a later date, along with explanations of what you’re doing and why. Try to construct your notebook from the start, as you begin exploring, and refine it as you learn more.
Write down all the assumptions that you’re using, especially ones about what your users are doing and where the boundaries of your responsibilities lie. Those are the things most likely to change over time.
The SRE Workbook, which I highly recommend, contains examples of how people have designed SLOs.
Ben Treynor Sloss has some excellent advice on which metrics matter, and how you’re going to feel when you measure them.
Gil Tene has a great talk on how not to measure latency. I have never managed to measure latency in a way that I was happy with, but at least I try not to make these mistakes.