SLOs that lie

Piyush Verma
Published in last9
7 min read · Nov 3, 2020

SLO is an acronym for Service Level Objective. But before I explain SLOs, you need one more acronym: SLI (Service Level Indicator).

An SLI is a quantitative measurement of a (and not the) quality of a service. It may be unique to each use case, but there are certain standard qualities of services that practitioners tend to track.

  • Availability: the amount of time that a service was available to respond to a request. Referred to as Uptime.
  • Speed: how fast a service responds to a request. Referred to as Latency.
  • Correctness: a response alone isn’t good enough; it also matters whether it was the right one. Referred to as ErrorRatio.

SLOs are boundaries on these measurements beyond which you would begin to worry about your service. Example: if the speed of delivery at a restaurant falls below 5 dishes per hour, you know customers are going to have longer waiting periods. Having the right SLOs helps you make better decisions (more in this blog).

Uptime

One of the primary indicators and objectives is Uptime. If the site isn’t serving at all, speed and correctness don’t even come into the picture. That is also why uptime is usually desired to be as close to 100% as possible.

If you are a restaurant that sees footfall throughout the day, why shut and lose business?

How do you compute uptime? Fairly straightforward: the time that you were up divided by the total time.

Uptime = Up / ( Up + Down ) %

99% uptime means that, out of 365 days × 24 hours × 3600 seconds, you have a 1% allowance to not be available. Over a year that is 3.7 (~4) days.
But would you be OK remaining shut for four consecutive days, from 25th December to 28th December?

If the answer is NO, then this is not a ration that you can keep accumulating. It’s more like a periodic allowance that gets refilled; too bad if you couldn’t spend it.

Broken down per bucket, the downtime allowance may seem more palatable.

┌─────────┬────────────┬───────────┬───────────┬───────────┐
│ 9s      │ per Day    │ per Week  │ per Month │ per Year  │
├─────────┼────────────┼───────────┼───────────┼───────────┤
│ 99%     │ 14.4 Mins  │ 1.7 Hrs   │ 7.3 Hrs   │ 3.7 Days  │
│ 99.9%   │ 1.4 Mins   │ 10.1 Mins │ 43.8 Mins │ 8.7 Hrs   │
│ 99.99%  │ 8.6 Secs   │ 1 Min     │ 4.4 Mins  │ 52.6 Mins │
│ 99.999% │ 0.864 Secs │ 6.1 Secs  │ 26.3 Secs │ 5.3 Mins  │
└─────────┴────────────┴───────────┴───────────┴───────────┘
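
If you’d rather derive these numbers than memorize the table, the allowances fall straight out of the percentage. A minimal sketch in Python (the 30-day month and 365-day year are simplifying assumptions):

    # Downtime allowance implied by an SLO, for a few window lengths.
    # Assumptions: a 30-day month and a 365-day year.
    WINDOW_SECONDS = {
        "day": 24 * 3600,
        "week": 7 * 24 * 3600,
        "month": 30 * 24 * 3600,
        "year": 365 * 24 * 3600,
    }

    def downtime_allowance(slo_percent):
        """Seconds of allowed downtime per window for an SLO like 99.9."""
        error_budget = 1 - slo_percent / 100.0
        return {name: secs * error_budget for name, secs in WINDOW_SECONDS.items()}

    for slo in (99.0, 99.9, 99.99, 99.999):
        budgets = downtime_allowance(slo)
        print(slo, {name: round(secs, 1) for name, secs in budgets.items()})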

How do you tell that the service is up?

A request comes in, and it gets served: your service is up. There are plenty of tools you can use to track this, including Prometheus, Stackdriver, etc. These tools piggyback on the software components to emit a logline or a metric.

How do you tell that the service is down?

A request comes in, but it doesn’t get served. And for a request that wasn’t served, there won’t be a logline or a metric value emitted.

In a store that is supposed to be open 24x7, if the staff bunk off at random, how do you measure how long they were away for?

Option 1: SDK (Measure at each caller)

We could use an SDK that tracks every outbound request to our service. Depending on the design, one may be using https://envoy-mobile.github.io/ or segment.com to emit metrics continuously. But:

  • In a world where everything is becoming an API, the control you may exercise over the callers is diminishing.
  • There will be so many senders! Before concluding whether a pattern points to a problem with the sender or with one of your receivers, you would have to wait for the data to arrive and then wait for it to be collated.
  • What if the sender’s network is jittery? Say 1% of the senders had an ISP fault and their stats never make it to your aggregates.

By the time responses have been aggregated across 100% of the callers, each with its own delay, a 99.99% uptime SLO, with only about 4 minutes of downtime available per month, may already have been blown.

Clearly, this method weakens the definition of uptime, to the point where 99.99% feels like a joke.
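
To make the idea concrete, here is a rough sketch of what “measure at each caller” looks like: a wrapper around outbound calls that counts what was and wasn’t served. The counter names and the plain requests call are assumptions, not a real SDK:

    import requests  # assumption: plain HTTP calls; a real setup may use Envoy Mobile or Segment instead

    # Hypothetical in-process counters; a real SDK would ship these to an aggregator.
    counters = {"attempted": 0, "succeeded": 0, "failed": 0}

    def tracked_call(url, timeout=2.0):
        """Wrap every outbound request to the service and record whether it was served."""
        counters["attempted"] += 1
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.RequestException:
            # The request never got served; only the caller ever knows this happened.
            counters["failed"] += 1
            raise
        if resp.status_code < 500:
            counters["succeeded"] += 1
        else:
            counters["failed"] += 1
        return resp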

Firewall   -->   Load Balancer   -->   L7 Proxy   -->   Handler

In reality, a request never really reaches your server directly. There will be a Firewall upstreaming to a Load Balancer, upstreaming to an L7 Proxy, upstreaming to your servers.

An SLO is an aggregate of all the layers underneath

We should be talking about the SLO of each layer before the SLO of the service as a whole. A breach in the Uptime SLO of a backend service would impact NOT the uptime but the ErrorRatio of the calling L7 Proxy. Similarly, if the L7 Proxy is down, the Load Balancer’s ErrorRatio will increase, not its Uptime. And so on, all the way up until you reach the CDN, probably.

So, Uptime (as the customer experiences it) is best measured at the layer that is closest to the customer and farthest from your code.

At each such layer, we are probably monitoring the uptime of a component that is not our core business and is outside our control. Somewhere in the -> LB -> CDN -> … journey, we have probably lost the uptime essence of the actual code deployed.

It’s possible that the business calls the SLO Uptime, but what you are actually measuring is the ErrorRatio!
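
A toy calculation with made-up numbers shows how the same incident reads differently at two layers: the backend loses uptime, while the proxy in front of it loses only ErrorRatio:

    # Made-up numbers: the proxy stayed up for the whole hour,
    # but the backend behind it was down for 5 minutes.
    window_seconds = 3600
    backend_down_seconds = 300
    requests_per_second = 50

    total_requests = window_seconds * requests_per_second
    failed_requests = backend_down_seconds * requests_per_second  # the proxy answers these with 502s

    backend_uptime = 100 * (1 - backend_down_seconds / window_seconds)  # ~91.7%
    proxy_uptime = 100.0                                                # the proxy never stopped answering
    proxy_error_ratio = 100 * failed_requests / total_requests          # ~8.3% errors

    print(backend_uptime, proxy_uptime, proxy_error_ratio)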

Option 2: Uptime (actually Downtime) Monitors

This is where you introduce uptime monitors and downtime checkers: simple services that have existed forever, but extremely crucial ones. You outsource the trust of uptime to these services, where some poor bot(s) have been assigned the mundane work of periodically hitting your service endpoint.

Uptime = timeUp / ( timeUp + timeDown ) %
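
A minimal sketch of such a bot, assuming a hypothetical /healthz endpoint and an arbitrary probe interval. It applies the formula above, attributing each whole interval to whatever the single probe saw:

    import time
    import requests

    CHECK_URL = "https://example.com/healthz"  # hypothetical endpoint
    INTERVAL = 60                              # seconds between probes; an arbitrary choice

    time_up = 0.0
    time_down = 0.0

    while True:
        try:
            up = requests.get(CHECK_URL, timeout=5).status_code < 500
        except requests.RequestException:
            up = False
        # The whole interval is attributed to whatever this single probe saw.
        if up:
            time_up += INTERVAL
        else:
            time_down += INTERVAL
        uptime = 100 * time_up / (time_up + time_down)
        print(f"uptime so far: {uptime:.3f}%")
        time.sleep(INTERVAL)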

This introduces two compromises:

  • Uptime is now as the monitor sees it, not as the real customer sees it. Those 1% of customers behind a faulty ISP may still be affected.
  • What about the reliability of the uptime monitor itself? If we have trouble staying up 100% of the time, surely they cannot be up 100% of the time either.

Say the uptime monitor guarantees an uptime of 99.99%. What if its 4 minutes of downtime don’t overlap with yours? For the 4 minutes that the uptime monitor was down, your service may have been up or down; the monitor would not know.

So you don’t keep all your eggs in the same basket: you introduce a multi-geography downtime monitor. Say the downtime monitor probes from 4 geos (one way to aggregate those probes is sketched after the questions below).

  • Should a failure from 1 geo be called a downtime?
  • What if one of the geos is highlighting a real downtime for that geographical region?
  • Also, CAP and network failures.
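
One common, though still debatable, answer is a quorum rule: declare a downtime only when enough geos agree. A sketch, where the 2-out-of-4 threshold is an assumption:

    def is_downtime(results, quorum=2):
        """results: {"us-east": True, "eu-west": False, ...}, True meaning the probe succeeded.
        Declare a downtime only if at least `quorum` geos failed to get a response."""
        failures = sum(1 for ok in results.values() if not ok)
        return failures >= quorum

    # A single-geo failure is treated as a regional or network issue, not a downtime.
    print(is_downtime({"us-east": True, "eu-west": False, "ap-south": True, "sa-east": True}))   # False
    print(is_downtime({"us-east": False, "eu-west": False, "ap-south": True, "sa-east": True}))  # True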

Downtime monitors aren’t holier-than-thou. They go through glitches too, and they need to retry too. Say a request failed: wisdom says retry. Wisdom says Hystrix. But what about the failed attempt? Was it counted as a failure or as a success?

Before we proceed, there is frequency too. How often do you check? We cannot check every second; we cannot check only once a minute either. It’s a balance that you pick.

  • The faster you check, the shallower your health check has to be.
  • The slower you check, the deeper your health check can be.

The depth of a health check is basically a trade-off you make. A shallow check only looks for a static response. A deeper health check will exercise a DB operation in and out. The DB call will obviously take considerably longer and eat a transaction; you can’t have that coming in every 10 seconds.

The frequency vs depth is an argument that doesn’t have one right answer. And like other things, you may just need both.
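
To make the trade-off concrete, a shallow and a deep check might look like the sketch below. Flask and an in-memory SQLite handle are stand-ins here; any framework or database applies:

    import sqlite3
    from flask import Flask

    app = Flask(__name__)
    # Stand-in for a real database; assume a proper connection pool in practice.
    db = sqlite3.connect(":memory:", check_same_thread=False)

    @app.route("/healthz")
    def shallow_check():
        # Cheap enough to hit every few seconds: proves the process is alive, nothing more.
        return "ok", 200

    @app.route("/healthz/deep")
    def deep_check():
        # Exercises a round trip to storage, so it is slower and costlier; poll it sparingly.
        try:
            db.execute("SELECT 1")
            return "ok", 200
        except Exception:
            return "db unreachable", 503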

Option 3: State-based Monitors

Because the uptime monitor is not probing every millisecond, we count the time between observed states of the service.

We need states: OK, Unknown, and Error.

Error is a confirmed down. Unknown could be an emission that was interrupted, a check pending retry, or a window that fell inside maintenance.

  • Time spent between OK and Unknown is OK, since you cannot tell for sure.
  • Time spent between Unknown and Error is Error.
  • Time spent between Error and Unknown is Error, since you cannot tell whether the service is up or not.
  • Similarly, the time between Error and OK is considered down, since it was down.

This in itself is the first step of aggregation.

Let’s take these two situations:

  1. Counted as OK for 20 secs. Uptime = 100%

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Status │ OK       │ OK       │ Unknown  │ OK       │
├────────┼──────────┼──────────┼──────────┼──────────┤
│ Time   │ 10:00:01 │ 10:00:10 │ 10:00:11 │ 10:00:21 │
└────────┴──────────┴──────────┴──────────┴──────────┘

  2. This will be counted as OK for 9 secs and Down for 11 secs. Uptime = 45%

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Status │ OK       │ Unknown  │ Down     │ OK       │
├────────┼──────────┼──────────┼──────────┼──────────┤
│ Time   │ 10:00:01 │ 10:00:10 │ 10:00:20 │ 10:00:21 │
└────────┴──────────┴──────────┴──────────┴──────────┘

You can keep these rules configurable per service. But they are rules, and rules are subject to interpretation. The absolute 99.99% is long gone!
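
The attribution rules above reduce to a few lines of code. This sketch replays both situations from the tables, treating Down as Error; the dates are arbitrary since the tables only carry times:

    from datetime import datetime

    def uptime_from_samples(samples):
        """samples: list of (timestamp, status) tuples, status being OK / Unknown / Error."""
        up = down = 0.0
        for (t1, s1), (t2, s2) in zip(samples, samples[1:]):
            seconds = (datetime.fromisoformat(t2) - datetime.fromisoformat(t1)).total_seconds()
            # Any interval touching an Error is counted as down;
            # time between OK and Unknown gets the benefit of the doubt.
            if "Error" in (s1, s2):
                down += seconds
            else:
                up += seconds
        return 100 * up / (up + down)

    situation_1 = [("2020-11-03 10:00:01", "OK"), ("2020-11-03 10:00:10", "OK"),
                   ("2020-11-03 10:00:11", "Unknown"), ("2020-11-03 10:00:21", "OK")]
    situation_2 = [("2020-11-03 10:00:01", "OK"), ("2020-11-03 10:00:10", "Unknown"),
                   ("2020-11-03 10:00:20", "Error"), ("2020-11-03 10:00:21", "OK")]

    print(uptime_from_samples(situation_1))  # 100.0
    print(uptime_from_samples(situation_2))  # 45.0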

I have not even discussed the case where the downtime monitor’s sleep period overlaps with your actual downtime: a situation where every probe that lands catches only a fraction of the actual flapping status. The uptime figures that come out of that will be seriously skewed from reality.

Conclusion

  • There is not one single SLO. SLOs are formed in layers, and the uptime SLO of one layer could be the error SLO of another.
  • The uptime number is massively aggregated, and always approximate.
  • As your uptime reaches the higher 9s, the support structure and the mindset need to shift towards proactive efforts, since waiting for an outage and then reacting to bring the service back up will not always work.

Last9 is a Site Reliability Engineering (SRE) Platform that removes the guesswork from improving the reliability of your distributed systems.

Piyush Verma

CTO/Founder @last9inc | Startup magnate (2x fail, 1x exit) | English Breakfast Tea, Hot