Measuring system reliability using «Service Level Objectives»

Torvald Lekvam
Oda Product & Tech · Mar 14, 2023
If it is hard to reach your audience through conventional means, try a mug! Thanks, Charity Majors, for the idea.

Target audience: Everybody in a tech company. Yes, non-technical people too. If you at any point get a bit confused about the terminology, there is a glossary table at the end of the article.

We humorously (but also with a sense of truth) call ourselves Digital Janitors. We operate in the discipline known in the industry as «Reliability Engineering». Most engineers with some interest in this field have read, or at least heard of, Google’s Site Reliability Engineering books. If not, I highly recommend taking a quick look. They're even free! Just skim over the table of contents of their first book; I’m sure you will find a couple of relevant and interesting topics.

Pretty early in their first book, there is a chapter about Service Level Objectives (SLOs). This is a rabbit hole, but one filled with epiphanies 💡, rainbows 🌈 and ponies 🐴. But not everyone in your organization needs to get to the bottom of it. On the surface, it is possible to understand what an SLO is and how it can help everybody make better software. Let me make an attempt. I even made some pretty sketches.

It is all about balance. Push too hard on feature velocity and software starts to become unreliable. If you focus too much on getting things to work super reliably, feature velocity goes down.

The SLO framework is a feedback mechanism.

The forces in this closed system are typically in favor of feature velocity. I’m sure you have seen it too. It is not that hard to understand why; getting an objective overview of overall system performance is much more opaque than hammering out features. What we all typically end up with is an endless stream of runtime errors and an ever-growing organic latency profile.

At the heart of the SLO framework lies the notion of user pain. If we measure and act on user pain, instead of root causes, we fix the issues that matter and ignore the ones that don’t. Maybe you find it very tempting to set up an alert that pages you if your database is at 90% utilization, but is it really a problem for your users?

SLOs rely on «Service Level Indicators» (SLIs). An SLI is just a metric of something (see the SRE workbook for ideas on what to pick). We calculate it over a time window, say 5 minutes. You count all the good events in your system, and then divide that by all the events in total. Subtract this from 1, and you get the error rate of your system.

A concrete example could be something like this: a web service running at a 4% error rate.

For a classical web page, a good event can be an HTTP request that takes less than, say, 400ms to respond and also returns a 200 status code.
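
To make the arithmetic concrete, here is a tiny Python sketch with a made-up list of requests; both the data and the 400ms/200 definition of a good event just follow the example above:

```python
# A handful of requests in one 5-minute window (made-up data).
# A "good" event is a 200 response in under 400 ms.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 380},
    {"status": 500, "latency_ms": 90},   # server error, counts as bad
    {"status": 200, "latency_ms": 650},  # too slow, counts as bad
]

good = sum(1 for r in requests if r["status"] == 200 and r["latency_ms"] < 400)
total = len(requests)

error_rate = 1 - good / total
print(f"good={good} total={total} error_rate={error_rate:.0%}")  # -> 50%
```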

<a grain of salt>Amazon found every 100ms of latency cost them 1% in sales!</a grain of salt>

The last ingredient you need to make an SLO is a numeric objective. Over time, what ratio of good to bad events do we find acceptable? We surely don’t aim for 100% uptime. Most users run around with poor mobile coverage or WiFi signal anyway; there is clearly some room for errors here and there. A typical number could be 0.1%. This means that your service level objective becomes 99.9%. We call this room (the greyed out area underneath) your error budget.
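
To get a feel for what such a budget amounts to, here is a rough back-of-the-envelope sketch; the 30-day window and the traffic number are just assumptions for illustration:

```python
objective = 0.999              # 99.9% of events should be good
error_budget = 1 - objective   # so 0.1% of events may be bad

window_days = 30               # assuming a 30-day SLO window
window_minutes = window_days * 24 * 60

# If the whole budget were spent as full downtime, that is roughly:
allowed_bad_minutes = window_minutes * error_budget
print(f"~{allowed_bad_minutes:.0f} minutes of full downtime per {window_days} days")  # ~43

# Or, counted in requests, with some hypothetical traffic:
requests_per_window = 10_000_000
allowed_bad_requests = requests_per_window * error_budget
print(f"~{allowed_bad_requests:.0f} bad requests per {window_days} days")  # ~10000
```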

The objective and the system performance thresholds are something that both developers and product managers can figure out together. They depend on risk appetite, the prior performance profile of your stack, and the nature of the service. Call your Digital Janitor if you need help adjusting the valves.

Over time, we can plot this. There will be small (or big) dips whenever the SLI captures any user pain. Each data point in the graph represents a 5-minute window where your error rate gets plotted.

Ideally, we want our service indicator to stay above the objective of 99.9%. Conventional thinking would argue that going below this line should be alerted on — but for SLOs, we do not. Not directly. Every time our indicator dips, we call this a burn.

So instead of directly alerting on our indicator declining here and there, we measure our burn rate. If we, over the course of multiple days or weeks, burn too much of our error budget, we alert, but not before that.
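
One common way to express this (the exact formulas differ between tools) is as a burn rate: the observed error rate divided by the error budget. A sketch, assuming the 99.9% objective from above:

```python
ERROR_BUDGET = 1 - 0.999  # a 99.9% objective leaves a 0.1% error budget

def burn_rate(bad_events: int, total_events: int) -> float:
    """How fast the budget is being spent: 1.0 means exactly the sustainable
    pace, 10.0 means the budget would last only a tenth of the window."""
    observed_error_rate = bad_events / total_events
    return observed_error_rate / ERROR_BUDGET

print(burn_rate(bad_events=50, total_events=10_000))  # 0.5% errors -> ~5.0
print(burn_rate(bad_events=1, total_events=10_000))   # 0.01% errors -> ~0.1
```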

Below is a burn chart. Every SLO has one, independent of your objective. The good side is your error budget. The bad side is where you end up if you overspend your budget, causing more user pain than you collectively agreed was acceptable.
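
The bookkeeping behind such a chart is essentially a running sum of those 5-minute burns measured against the total budget for the window. A toy sketch with made-up numbers, assuming every window carries roughly the same traffic:

```python
# Made-up error rates for a handful of 5-minute windows (a toy SLO window;
# a real 30-day window would have thousands of these).
window_error_rates = [0.0, 0.0, 0.002, 0.0, 0.004, 0.0, 0.0, 0.0]

error_budget = 1 - 0.999  # 0.1% of events may be bad

# With equal traffic per window, the total budget is simply the per-window
# budget times the number of windows.
budget = error_budget * len(window_error_rates)
spent = sum(window_error_rates)

print(f"budget spent: {spent / budget:.0%}")  # two dips -> 75% spent, still on the good side
```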

Each month, we host a service review, where we do just that: review our services. We mainly look at SLOs. Here is an example of a service not doing too badly. This service looks to have had two impactful incidents this month, but not enough to violate its promise to its users. Thumbs up, everybody is happy. The users did not have any lasting pain, the PMs got features deployed and the developers were allowed to take some risks.

Here is a rather bad one. Only the PM is happy, due to the number of features the team got to push this month!

Each hard burn into the error budget triggers an incident via a page, and these get handled accordingly with a proper response and usually their own follow-ups and postmortems. But looking back over a month, when a team and their service spend all their error budget, it becomes direct feedback for the team that it is time to put the brakes on features and start working on some reliability measures.

Equally, if a service is doing too well, it’s okay to ask if people are being too careful rolling out their features.

If you haven’t already, this is probably the point where you should get a little bit of an aha-moment. SLOs will not only increase your chances of building more reliable services, they can also push you towards higher feature velocity by letting you take more risk. It is a very well-balanced feedback system.

At Oda, we’ve used https://sloth.dev for this, an implementation that is meant to work with the Grafana stack. It also comes with built-in alerting, which is a semi-complicated ball of equations. But at the same time, you don’t really need to understand it all. The promise is: it alerts if you burn too much, while it tries to keep flapping to a minimum.
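
To give a flavour of what that ball of equations does, without claiming this is exactly what sloth generates: the usual trick (described in the SRE workbook) is to alert only when the burn rate is high over both a long and a short window at the same time. A hand-wavy sketch using the workbook's example thresholds:

```python
def should_page(burn_rate_1h: float, burn_rate_5m: float) -> bool:
    # 14.4x is a commonly used threshold: at that pace, a 30-day budget
    # would be gone in roughly two days.
    threshold = 14.4
    # Requiring both windows keeps short blips from paging, and the short
    # window stops the alert once the burn is actually over.
    return burn_rate_1h > threshold and burn_rate_5m > threshold

def should_ticket(burn_rate_3d: float, burn_rate_6h: float) -> bool:
    # A slow but steady burn (a pace of 1x means the budget lasts exactly
    # the window) becomes a lower-urgency ticket instead of a page.
    threshold = 1.0
    return burn_rate_3d > threshold and burn_rate_6h > threshold

print(should_page(burn_rate_1h=20.0, burn_rate_5m=18.0))  # True: sustained fast burn
print(should_page(burn_rate_1h=20.0, burn_rate_5m=0.2))   # False: the burn has stopped
```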

We’re still learning, but what we see so far is very promising. I’ll save the real-world examples for a follow-up post.

Glossary:

  • Service Level Objective (SLO): This framework; a general, objective measurement of your services.
  • Service Level Indicator (SLI): The query that defines good and bad events in your system.
  • Feature velocity: The rate of new features that enter your system, (too) often used as a proxy for measuring success.
  • User pain: Errors that the user sees, as opposed to errors in the system that the user does not see.
  • Error budget: The volume of bad events that you find acceptable to exist in your system.
  • Burn: When your SLI reports on user pain, and this eats into your error budget.
  • Burn rate: The rate of your burn. We alert on high rates. Slow rates are okay.
  • Burn chart: Typically shows you the last 30 days of burn into your error budget — whether you have over-spent, under-spent or if you hit your target.
