Thinking about SLOs
SRE teams need to focus on their SLOs as the basis for the work that they do, but we often say this is important and rarely talk about how to make it happen. What can you do to develop the way SLOs are used in your organisation?
This article shows you what it’s like to work with your SLOs, and how you want your team to think about them.
What is an SLO?
An SLO is a metric that measures the performance of a service, combined with a target for that metric. The target is set to a level that we believe means the service is “working” when the target is achieved, and “broken” when the target is missed.
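As a minimal sketch of "a metric plus a target", the following pairs a fraction-of-good-events metric with a threshold. The class name Slo, its fields, and the choice of an availability-style metric are assumptions made for illustration; your own metric may be latency, freshness, or something else entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A metric paired with a target (hypothetical structure, for illustration)."""
    name: str
    target: float  # required fraction of good events, e.g. 0.999

    def measure(self, good_events: int, total_events: int) -> float:
        """Return the observed fraction of good events."""
        if total_events == 0:
            return 1.0  # assumption: no traffic counts as meeting the target
        return good_events / total_events

    def is_met(self, good_events: int, total_events: int) -> bool:
        """The service is "working" when the metric reaches the target."""
        return self.measure(good_events, total_events) >= self.target


availability = Slo(name="availability", target=0.999)
print(availability.is_met(good_events=99_950, total_events=100_000))  # True
print(availability.is_met(good_events=99_800, total_events=100_000))  # False
```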
Sometimes an SLO is part of an SLA, which is a contract about what will happen if the targets are not achieved. This is primarily a business concern which SREs do not typically engage with, because many services that SRE teams run do not have a formal contract, and because an SLA deals with the situation where the service has been so bad that we have to offer customers some form of compensation.
The key phrase in that sentence is “so bad that we have to”, which leads us to the conclusion that if we are just barely achieving the SLA, the service quality is still quite bad, and our users will be unhappy if the situation continues. As SREs, we would like to treat our SLOs as the basis for an error budget which we can spend, but this means we tend to run our services quite close to the SLO targets.
From this we can conclude that the numbers in the SLA are probably not the SLO that we should focus on in SRE. Instead, we think about the concept of a “happy SLO”: a target which will mean our users are happy with the service.
If we separate SLOs that an SRE team uses from SLOs that are used in SLAs, then this means we have an open question about what exactly SLOs are used for in SRE.
An SRE team should be using its SLOs to direct its major engineering decisions. Exactly what this means depends on the nature of the team and the decisions that it thinks of as important. There are stages of maturity in SRE teams, based on how much progress they have made at bringing their systems under control. We will refer to them as chaotic, reactive, proactive, and strategic. Each one represents a different perspective on how SLOs are applied.
A team which is primarily reactive in its approach will focus on using SLOs to decide when the state of the system has become so bad that it requires a reaction. Common examples of this approach are defining alerts that summon humans for assistance when the SLO target is being missed, and regular status meetings where the SLOs are examined to see which ones have been missed and work is assigned to rectify this.
The SLO in a reactive team is used to define the point at which people stop doing other work and start reacting to service problems. Postmortems resulting from these reactions are used to inform engineering projects.
The above paragraphs may feel surprising or uncomfortable to some readers. Teams which are chaotic are characterised by not having a clear way to decide when a reaction is needed, so people respond based on their perception of how important problems are, and may act inconsistently with each other. If you are used to an environment where you constantly have to make these judgement calls, you may feel like it is impossible to replace what you are doing with a metric. Without digressing too far from the topic of this article, what I would say to you is: it can be done, but it may require substantial engineering work to make this possible.
A team which is primarily proactive in its approach will focus on using SLOs to define error budgets, and on controlling the rate at which each budget is spent. An error budget implies that some routine level of errors is acceptable, and that this allowance can be explicitly allocated to doing useful work.
A common approach here is to measure the rate at which software releases cause the SLO to be impacted, and assign a portion of the error budget to releases, which determines the rate at which releases can be done. When the budget is constrained (typically by an unanticipated outage), the budget that can be assigned to releases is reduced, so releases need to become more reliable. If the budget has reached zero, then releases might be stopped altogether until it has recovered.
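The budgeting described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed method: the function names, the idea of expressing the budget as a count of failed events, and the parameters release_share and cost_per_release (the measured average number of failed events caused by one release) are all hypothetical.

```python
def error_budget(target: float, total_events: int) -> float:
    """Events allowed to fail in the window while still meeting the target."""
    return (1.0 - target) * total_events


def releases_affordable(target: float, total_events: int, failed_events: int,
                        release_share: float, cost_per_release: float) -> int:
    """How many releases the remaining budget can pay for.

    release_share is the fraction of the remaining budget assigned to
    releases; cost_per_release is the measured average SLO impact of one
    release. Both are assumptions of this sketch.
    """
    remaining = error_budget(target, total_events) - failed_events
    if remaining <= 0:
        return 0  # budget exhausted: releases stop until it recovers
    return int(remaining * release_share // cost_per_release)


# A healthy window: 200 of roughly 1000 allowed failures already spent.
print(releases_affordable(0.999, 1_000_000, 200, 0.5, 150))   # 2
# After an unanticipated outage has eaten most of the budget.
print(releases_affordable(0.999, 1_000_000, 900, 0.5, 150))   # 0
# Budget overspent.
print(releases_affordable(0.999, 1_000_000, 1_200, 0.5, 150))  # 0
```

In this sketch the only ways to afford more releases after an outage are to wait for the budget to recover or to lower cost_per_release, that is, to make releases more reliable.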
It is worthwhile to note that freezing or changing the rate of releases will not improve reliability, as the rate at which new bugs are created is unlikely to change. Skipping or delaying releases merely batches up the errors so they arrive in the next release. Instead, the feedback signal must be plumbed all the way through to the development organisation, which would have to act more cautiously, spend more time on pre-release testing, and reassign developer time to projects which are expected to improve reliability in the short term instead of adding new features.
SRE teams which are acting proactively are likely to be estimating the SLO impact of projects, and adjusting their prioritisation of projects based on the available error budget and whether projects are expected to improve reliability (giving the team more budget to spend) or reduce it (spending the error budget to ship important features sooner).
The SRE team will focus engineering effort on projects which allow changes to be made with lower error costs, such as improved canarying, designing resilience into the system, and automation to run controlled experiments on small numbers of users. They will also be looking forwards for predictable events like increased demand, organic growth, and changes to their dependencies which are likely to affect their SLOs, and making future error budget estimates to guide their engineering decisions.
A common form of this is “we need to ship this new feature to meet capacity demands or we will miss our SLO”, which can cause conflicts when the process of shipping the feature creates more unanticipated outages. A proactive team is characterised by approaching this as a budgeting problem, in the same way that they budget for available engineering time and the operating costs of their service.
SRE teams which are acting strategically have extended this idea to the point where all their major engineering decisions are made based on measured SLO effects. Large projects are defined in terms of the changes they intend to make to the SLO, either improving the targets that are met, or changing the nature of the SLO to be more useful, or accommodating anticipated events which would impact the SLO if nothing was done.
Meeting the SLO no longer requires frequent reactive work, because the budget management is handled by software which takes appropriate action and capacity impacts can be anticipated. The SRE team might choose to reduce or stand down their oncall schedule, because the system no longer needs frequent human intervention.
In some organisations, the SRE team might declare their mission complete and stand down the team, with its members going to work on less mature systems. A strategic team is characterised by its ability to plan how this noble goal might be accomplished, even if that plan is not followed. It is rare for an SRE team to achieve this, and celebrated when it occurs.
In all cases, the theme is that the SLO is the basis for the SRE team making its most important decisions.
If you have a decision which is not currently based on your SLO, and you want to change this, the most important thing to keep in mind is this: don’t be afraid. People struggle with making decisions in this manner when they are thinking about “losing control” over the decision. The way that you want to be thinking about this change is that instead of controlling the team by direct judgement calls, you will be controlling the team by setting SLOs and acting based on data. This is expected to result in the decision making process being better understood by the people it affects, and the outcomes being more closely aligned with the objectives that you have described when setting the SLOs.
After you have managed to get this working in your team, it will be tempting to attempt to apply it to everything you do. This is probably a mistake: it is relatively expensive to work out what your SLOs should be, and they are unlikely to cover every detail of the work that you need to do. If you try to follow this approach strictly with everything, you will spend a lot of time trying to get your SLOs to cover all the cases and relatively little of your time on engineering work that improves your service. Worse, most of the SLOs that you come up with will be rarely used, so the work is unlikely to pay for itself.
The scenario which you want to avoid at all costs is having projects which would be beneficial if they were done, but do not get done because the amount of work required to determine their interaction with your SLO is too high compared to the cost of doing the project. As with any process, it fails if it is causing simple things to become hard. You want to ensure that there is space for people to make simple, obviously beneficial changes without first having to prove that the change will be beneficial.
You want to have SLOs that cover the most important behaviour of your system, and the most important engineering work that you do. The first part can be managed by thinking about the question “If the system precisely achieved these metrics and no more, would our users be happy?”, and the second part can be managed by estimating how much of your engineering time is spent on projects which are directed by your SLO, however that is functioning in your team.
Picking a sensible target here is subtle, but if your organisation uses Google’s rule of “SREs should spend at least half their time doing meaningful development work”, then you can reasonably translate that into “we should spend at least half of our time doing engineering work directed by our SLOs”.
Defining your SLO
Your SLO is setting the engineering direction for the bulk of the work that the SRE team does. It should be treated with suitable care and attention, considering the vision and strategy for your team and organisation. It will also require substantial engineering work, which will need time allocating to it. Expect it to require several rounds of analysis, review, and experimentation before reaching a conclusion, but don’t be afraid to iterate: if you have something better than the status quo, you should probably implement that immediately while continuing the project to refine it further.
The questions which you are trying to answer are “What things do we need to measure in order to capture the idea that our users are happy with the state of our service?” and “What targets do we need to achieve in order for our users to be happy?”. It may be tempting to directly measure user happiness by asking users, but this is likely to prove challenging. If there are delays between the event in your system which makes users unhappy, and the user becoming aware of this and telling you, then it will be hard to relate user happiness to the state of your service. Instead, you will need to identify what things about your system make users happy or unhappy, and how best to measure those.
You need to find metrics which have a very high signal to noise ratio, which is where a lot of the analysis work happens. Any property that you try to measure is likely to have a lot of sources of noise, which will need to be tracked down and compensated for somehow. A reasonable way to tell when you’re finished is to test for both positive and negative results: outages that you know about should be clearly visible in the graph of the metric, and if you pick some craters that the metric shows which you can’t account for, you should be able to debug a problem in your system that it has identified.
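The positive and negative test described above might be sketched like this. It assumes the metric is a list of samples indexed by time step, that a "crater" is any sample below a chosen threshold, and that known outages are recorded as inclusive (start, end) index pairs; all of these names and representations are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) sample indices


def find_craters(series: Sequence[float], threshold: float) -> List[int]:
    """Indices of samples where the metric dips below the threshold."""
    return [i for i, value in enumerate(series) if value < threshold]


def validate_metric(series: Sequence[float], threshold: float,
                    known_outages: Sequence[Interval]):
    """Return (outages the metric failed to show, craters nobody can explain)."""
    craters = set(find_craters(series, threshold))
    invisible = [
        (start, end) for start, end in known_outages
        if not any(i in craters for i in range(start, end + 1))
    ]
    explained = {i for start, end in known_outages for i in range(start, end + 1)}
    unexplained = sorted(craters - explained)
    return invisible, unexplained


series = [1.0, 0.99, 0.4, 0.5, 1.0, 1.0, 0.7, 1.0, 1.0, 1.0]
invisible, unexplained = validate_metric(series, 0.9, [(2, 3), (8, 9)])
print(invisible)    # [(8, 9)]: a known outage the metric did not show
print(unexplained)  # [6]: a crater worth debugging
```

An outage in the first list suggests the metric is missing signal; an index in the second list is either noise to compensate for or a real problem that the metric has found.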
Experience has repeatedly shown that every time we introduce a new SLO to an existing system, we discover a new way in which the system is broken. The way I phrase this is “If you aren’t measuring it, then it probably doesn’t work”. Updating or expanding your SLO should be expected to reveal new problems and engineering work that needs doing. Conversely, if you do the work to make a change to your SLO and don’t find any new problems, you should suspect that you have done something wrong and try to figure out what it is.
When you have a proposed change to your SLO that has passed these tests, it is time to involve the other stakeholders. Changes to your SLO are in part a product decision, so you will need to engage with whoever handles product management for your system.
It is also important to think about which other teams in the organisation need to be aligned with the work that the SRE team is doing, and to involve them. If teams that work on the same systems do not have aligned goals, then they will fight each other. It is likely that if the change to your SLO is useful, then at least one other team in the organisation will need to adjust their goals to reflect it. If nobody seems to care, then you are probably doing something wrong.
It is always tempting to set conservative SLOs which you can easily achieve. This temptation is a trap. If your system routinely outperforms the SLOs that you define, then the people who use your system will ignore your SLOs and treat its observed behaviour as if that was the SLO. Once this has happened, it is very difficult to convince these users to accept that the performance of the system can be reduced to the level of your SLO. This regrettable scenario is known as “ratcheting expectations”.
A particularly thorny problem is how you deal with rare events. If your system has a failure mode that you understand, but which consistently happens about once every two years, and creates a substantial impact to your system for about a day, then setting your SLO to be “the system is functioning 99.5% of the time” will not make your users happy even though your SLO as written allows 1.8 days of downtime per year.
Your users will tell you that it’s been working fine for the past two years and they expect you to keep it that way, and no amount of telling them that the problem is hard is going to make them any happier. They have already designed their own systems on the assumption that this would not happen.
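The arithmetic behind this example can be checked with a short sketch. The function name and the 365-day year are assumptions; the 99.5% target and the day-long outage every two years come from the scenario above.

```python
DAYS_PER_YEAR = 365  # assumption: ignore leap years


def allowed_downtime_days(target: float, period_days: float = DAYS_PER_YEAR) -> float:
    """Downtime permitted by an availability target over the period."""
    return (1.0 - target) * period_days


budget_days = allowed_downtime_days(0.995)
average_outage_days_per_year = 1.0 / 2  # one day-long outage every two years

print(f"{budget_days:.1f}")                          # 1.8
print(average_outage_days_per_year <= budget_days)   # True
```

On paper the rare outage fits comfortably inside the budget. The difficulty is that in the years between outages the observed availability is 100%, and that observed figure is what users design against.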
If you have this class of problem in your system, then you should consider systematically lowering the reliability of the system to maintain reasonable user expectations. The canonical example of this is the global Chubby planned outage: it’s not technically possible to prevent global Chubby from having infrequent but catastrophic outages, so instead extra outages are created by turning the system off every quarter, so that users become accustomed to global Chubby having outages and design their systems to handle this failure mode.
The best way to ensure you are not caught out by this is to regularly run tests where the system is configured to “run at SLO” for a set time period, during which the system will precisely hit its SLO targets, failing at the rate required. A good model for running these tests is Google’s DiRT tests. Announce this widely in advance and see if anybody complains before or after the test. For everybody who does, spend some time figuring out why their expectations are not aligned with the reality of your system, and what you are going to do about this. The “If you aren’t measuring it, then it probably doesn’t work” rule applies here: if you aren’t running tests like this, then you probably have somebody ratcheting expectations.
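One way to picture a “run at SLO” test is a small injector that decides which requests to fail so that the overall failure rate matches what the target allows. This is only a sketch: the class RunAtSloInjector and its method are hypothetical, and it assumes the system has no failures of its own during the test. A real test would have to subtract naturally occurring failures from the injected ones.

```python
class RunAtSloInjector:
    """Fails just enough requests to sit exactly at the SLO target.

    Hypothetical helper for illustration; it assumes the system itself
    produces no failures while the test runs.
    """

    def __init__(self, target: float):
        self.allowed_failure_rate = 1.0 - target
        self.seen = 0
        self.failed = 0

    def should_fail(self) -> bool:
        """Call once per request; True means inject a failure."""
        self.seen += 1
        # Fail a request whenever we have fallen behind the permitted count.
        if self.failed < int(self.seen * self.allowed_failure_rate):
            self.failed += 1
            return True
        return False


injector = RunAtSloInjector(target=0.99)
failures = sum(injector.should_fail() for _ in range(1000))
print(failures)  # 10
```

Spreading the failures evenly, as this sketch does, is a design choice. A test that concentrates the same budget into a single outage would exercise a different set of user assumptions.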
SLOs need to be regularly amended, because both the systems that you work on and the desires of your users change over time. There are some warning signs which can alert you to the need for spending some time working on your SLOs.
Whichever way you use your SLOs to make decisions, have a dashboard which tracks it. If you are reactive, then track the rate of people dropping what they’re doing to respond to an incident. If you are proactive, then track how well your budget planning matched real world spending. (If you have reached the level of strategic, you know more about this than I do; please publish a paper on how you did it.)
Look at this dashboard regularly and check whether it seems sensible. For reactive teams, remember to look for problems in both directions: if you spend all your time reacting to problems then you already know you have a problem, but if you never spend any time reacting then your SLO has probably stopped measuring anything interesting and you should figure out why.
Track your postmortems against your SLOs. If you have an outage that you think is worth writing a postmortem for, but there is no measurable SLO impact, then something has gone wrong. If many of your postmortems have no measurable SLO impact, then that indicates the SLO is missing something important, and study of those postmortems should inform your response.
It may help to explicitly report on this alongside your SLO performance, in the form “we have 99.6% uptime in this period, and 45% of our postmortems did not show SLO impact”. This makes it clear how useful the SLO currently is.
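A report line of this form could be produced along the following lines. The Postmortem record, its slo_impacted flag, and the function name are assumptions for illustration; the wording of the output follows the example above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Postmortem:
    """Hypothetical record: one postmortem and whether the SLO registered it."""
    title: str
    slo_impacted: bool


def slo_report(uptime_fraction: float, postmortems: Sequence[Postmortem]) -> str:
    """Summarise SLO performance alongside how often the SLO missed an outage."""
    if not postmortems:
        return f"we have {uptime_fraction:.1%} uptime in this period, and no postmortems"
    missed = sum(1 for p in postmortems if not p.slo_impacted)
    missed_pct = 100 * missed / len(postmortems)
    return (f"we have {uptime_fraction:.1%} uptime in this period, and "
            f"{missed_pct:.0f}% of our postmortems did not show SLO impact")


# 20 postmortems, of which the first 9 showed no measurable SLO impact.
postmortems = [Postmortem(f"incident {i}", slo_impacted=(i >= 9)) for i in range(20)]
report = slo_report(0.996, postmortems)
print(report)
# we have 99.6% uptime in this period, and 45% of our postmortems did not show SLO impact
```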
Track your engineering projects against your SLOs. You can estimate how much engineering time has been spent on SLO-directed projects, and how many times each SLO is referenced when making engineering decisions. If the team is predominantly working on things which aren’t covered by your SLOs, then this is identifying a gap which you should allocate time to closing. If you have SLOs which aren’t being used to make decisions, then this is identifying an SLO which is no longer relevant to the work the team is doing, and you should look into updating or removing it.
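As a rough sketch of this tracking, the following estimates the share of engineering time spent on SLO-directed projects and lists SLOs that no project referenced. The Project record, its fields, and the example project and SLO names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Project:
    """Hypothetical record of one engineering project."""
    name: str
    engineer_weeks: float
    slo: Optional[str]  # name of the SLO that directed it, or None


def slo_directed_share(projects: Sequence[Project]) -> float:
    """Fraction of engineering time spent on projects directed by an SLO."""
    total = sum(p.engineer_weeks for p in projects)
    if total == 0:
        return 0.0
    directed = sum(p.engineer_weeks for p in projects if p.slo is not None)
    return directed / total


def unused_slos(slo_names: Sequence[str], projects: Sequence[Project]) -> List[str]:
    """SLOs that no project referenced: candidates for updating or removal."""
    used = {p.slo for p in projects if p.slo is not None}
    return sorted(set(slo_names) - used)


projects = [
    Project("canary automation", 6, "availability"),
    Project("capacity planning", 4, "latency"),
    Project("dashboard cleanup", 10, None),
]
print(slo_directed_share(projects))                                      # 0.5
print(unused_slos(["availability", "latency", "freshness"], projects))  # ['freshness']
```

A share near or above one half would match the “at least half of our time” reading of the rule discussed earlier; a much lower share points at a gap in what the SLOs cover.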
A red flag is any time people make excuses for the SLO. If you have observed an event where your SLO was impacted, and an SRE makes a statement of the form “this is not important because…”, this is a sign to stop and fix the SLO. In this case, not only is the SLO wrong, but it is understood to be wrong. Working out what is wrong with an SLO is hard, so if people can already explain it, the problem must have been going on for long enough for them to figure it out through experience.
If you do not take action to fix this, the team will stop taking the SLO seriously, and will fall back to acting like a chaotic team. It is easy to make the mistake of allowing this situation to continue, because it doesn’t seem urgent and many other things do, but your SLOs set the direction for the team. It doesn’t matter how fast you’re going if you’re heading in the wrong direction.
Maintain your culture
Whatever approach you are taking to working with your SLOs, write it down. Write down what decisions will be made based on your SLOs. Write down the process for changing your SLOs. Make this document brief, precise, and true.
Make this one of the small number of documents that everybody in the team reads. Direct every new member of the team to it, and reference it whenever it is relevant. Do what you can to support people who are trying to change your SLOs. If you think they’re wrong, support them anyway and trust the process to prove it. This will help maintain your SLOs as the focus of your team’s decision making culture.
SLOs only matter if SREs use them.