The Second Law of Latency: Latency distributions are NEVER NORMAL

The default latency visualization provided by a big cloud provider for a service that I maintain.

In the First Law of Latency we established that, when discussing latency, we should always use the language of statistical distributions. Now we are going to dive deeper into which parts of that language we should actually be using.

It is commonplace to see “average latency” presented in various forms when discussing the latency of a system. This thinking is wrong and can lead to incorrect decision making, because Latency Distributions are NEVER NORMAL.

The Laws

  1. There is no THE LATENCY of the system
  2. Latency distributions are NEVER NORMAL
  3. DON’T LIE when measuring latency (most tools do… and that’s not ok)
  4. DON’T LIE when presenting latency (most presentations do… and that’s not ok)
  5. You can’t capacity plan without a LATENCY REQUIREMENT
  6. You’re probably not asking for ENOUGH 9s.

Averaging is always the wrong way to aggregate latencies, but everyone does it. Let’s start a grassroots movement to stop making this mistake.

“But, why is average unhelpful?”

Average is an example of a summary statistic. Summary statistics are lightweight tools that let us describe the gist of a dataset. Average, in particular, tells us where to find one center of a dataset. This particular center is most useful if the data happens to be normally distributed.

When data is normally distributed and you know average (center) and standard deviation (spread) then you know where to find all of the data in the distribution.
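To make that concrete, here’s a quick sketch (using NumPy, with made-up parameters) contrasting a normal sample, where mean and standard deviation really do locate essentially all of the data, with a heavy-tailed lognormal sample, a much more latency-like shape, where they don’t:

```python
import numpy as np

rng = np.random.default_rng(42)

# Normally distributed data: mean and stddev locate essentially all of it
# (the classic 68-95-99.7 rule).
normal = rng.normal(loc=100.0, scale=10.0, size=100_000)
within_3_sigma = np.mean(np.abs(normal - normal.mean()) < 3 * normal.std())
print(f"normal: {within_3_sigma:.1%} of samples within mean ± 3 stddev")

# Heavy-tailed data (lognormal, an illustrative latency-like shape):
# the same two numbers badly underestimate the tail.
heavy = rng.lognormal(mean=3.0, sigma=1.5, size=100_000)
mean, std = heavy.mean(), heavy.std()
p997 = np.percentile(heavy, 99.7)
print(f"heavy-tailed: mean + 3*stddev = {mean + 3 * std:.0f}, "
      f"but the 99.7th percentile is {p997:.0f}")
```

For the normal sample, mean ± 3 stddev captures ~99.7% of the data; for the heavy-tailed sample, the 99.7th percentile sits far beyond mean + 3 stddev, so the same two summary numbers no longer tell you where the data lives.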

The second law of latency is: Latency distributions are NEVER NORMAL.

“But the average still tells me something!”

Latency data from a real system I maintain superimposed with a normal distribution with the same mean and standard deviation as the real data. The green line is not a good summary statistic for the blue data because the normal distribution is not a good fit for the blue data.

Technically average does tell you something, but for non-normally distributed data, the average is just a random number somewhere between the minimum and the maximum observed values. We need to be on stronger footing than this to make decisions.

Let me illustrate: Let’s say we have a sensibly defined latency requirement like ≤100ms 99% of the time.

What if our average latency is 1 second: are we violating our requirement?

A system whose latencies are periodically 3 orders of magnitude higher than typical (think time based cache expiration). This system has an average latency of 1 second and a 99% latency of 100ms. This system is meeting our requirement.

The system yielding the latency measurements above (with an average latency of 1 second) is meeting our latency requirement (≤100ms 99% of the time.)
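A minimal sketch of such a spiky system (the numbers are invented to match the description, not taken from the figure): almost every request is fast, but a small fraction stalls for roughly three orders of magnitude longer.

```python
import numpy as np

# Sketch of the spiky system above: almost every request is fast, but a
# small fraction (think time-based cache expiration) stalls for roughly
# three orders of magnitude longer. All numbers are illustrative.
n = 100_000
latencies = np.full(n, 0.05)   # 99.5% of requests take 50 ms
latencies[::200] = 190.0       # every 200th request stalls for ~190 s

print(f"average latency: {latencies.mean():.2f} s")              # ≈ 1.00 s
print(f"99th percentile: {np.percentile(latencies, 99):.2f} s")  # 0.05 s
```

The average is completely dominated by the rare stalls, yet 99% of requests comfortably meet the ≤100ms requirement.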

What if our average latency is 10ms: are we meeting our requirement?

Typical noisy system. The average latency is 10ms but the 99% latency is 279ms.

The system yielding the latency measurements above (with an average latency of 10ms) is not meeting our latency requirement (≤100ms 99% of the time.)
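The opposite case is just as easy to reproduce with a heavy-tailed synthetic workload. Here’s a sketch using a lognormal distribution (a common latency-like shape; the parameters are invented, not fitted to the figure) whose average sits near 10ms while its 99th percentile lands well past 100ms:

```python
import numpy as np

rng = np.random.default_rng(7)

# Noisy system sketch: lognormal latencies with an average near 10 ms.
# Parameters are invented for illustration, not fitted to real data.
latencies = rng.lognormal(mean=-6.225, sigma=1.8, size=100_000)

avg = latencies.mean()
p99 = np.percentile(latencies, 99)
print(f"average latency: {avg * 1000:.1f} ms")   # ≈ 10 ms
print(f"99th percentile: {p99 * 1000:.0f} ms")   # well over 100 ms
```

Here a reassuringly low average hides the fact that 1% of requests blow past the 100ms budget.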

Thus, a system with a high average latency can be meeting our 99% requirement, and a system with a low average latency can be failing our 99% requirement. This demonstrates that average is useless for evaluating our 99% requirement.

“But what if my latency requirement was specified as an average?”

You need better requirements. We create requirements so that downstream systems can plan and so that we can guarantee a level of experience for our users. Creating an average latency requirement is useless because you have no idea how meeting that requirement relates to the experience of people who use your system.

When we guarantee the experience of 99% of requests we’re saying it’s okay for up to 1% of requests to be worse than that. If we guarantee an average latency then we have no idea what percent of requests will be worse than our guarantee (because latency is not normally distributed.) Thus, we have no idea how meeting our average latency guarantee impacts our users.

“But I know something is broken when average latency jumps up in a time series.”

Typical dashboard-style time-series aggregations. If all we know is mean latency and 99% latency over time, then there’s no way to know if we are meeting our 99% requirement of 100ms (we aren’t.)

No, you don’t. In fact, you tend to know even less when you smear (aggregate) your data out over time. This is even true if you smear the data with 99% quantiles. I will explain why in the Fourth Law of Latency: DON’T LIE when presenting latency.

Remember the second Law of Latency: Latency distributions are NEVER NORMAL. Because of this, averaging latencies is always the wrong choice.

The average latency is just a random number somewhere between the minimum and maximum latency and no sensible decisions can come from that information.

Next week we’ll talk about the third Law of Latency: DON’T LIE when measuring latency and we’ll unmask a real global conspiracy known as “coordinated omission”.

The visualizations in this article were created by this colab notebook.
