We can do better than percentile latencies

Yan Cui
theburningmonk.com
Oct 3, 2018

Years ago, I used average latency on every dashboard and every alarm. That is, until I woke up to the problems of average latencies along with everybody else in the industry:

  • When the dataset is small, it can be easily skewed by a small number of outliers.
  • When the dataset is large, it can hide important details, such as the fact that 10% of your users are experiencing slow responses!
  • It is just a statistical value; on its own it's almost meaningless. Until we plot the latency distribution, we won't actually understand how our users are experiencing our system.

Nowadays, the leading practice is to use 95th or 99th percentile latencies (often referred to as tail latency) instead. These percentile latencies tell us the worst response time 95% or 99% of users are getting. They generally align with our SLOs or SLAs, and give us a meaningful target to work towards — e.g. "99% of requests should complete in 1s or less".

Using percentile latencies is a big improvement on average latencies. But over the years I have experienced a number of pain points with them, and I think we can do better.

The problems with percentile latencies

The biggest problem with percentile latencies is not actually with the percentiles themselves, but with the way they are implemented by almost every single vendor out there.

Percentile latencies are “averaged”

Because it takes a lot of storage and data processing power to ingest all the raw data, most vendors generate the percentile latency values at the agent level. This means that by the time the latency data is ingested, it has lost all granularity and arrives as summaries — mean, min, max, and some predefined percentiles. To show you the final 99th percentile latency, the vendor would (by default) average the 99th percentile latencies that have been ingested.

You can't average percentiles; it doesn't make any sense! This whole practice gives you a meaningless statistical value that is in no way the true 99th percentile latency of your service. Averaging the percentiles inherits all the same problems with averages that percentile latencies were supposed to address!

I have seen 99th percentile latencies differ by an order of magnitude depending on how I choose to aggregate them. Seriously, how am I supposed to trust this number when choosing the max over the average can produce a 10x difference? You might as well stick a randomly generated number on the dashboard; it's almost as meaningful as "the average of 99th percentiles".
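To make this concrete, here is a minimal sketch with synthetic numbers (two hypothetical agents, not real traffic): the true p99 computed over the raw events, the average of the per-agent p99s, and the max of the per-agent p99s can land on wildly different values.

    import random

    # A hypothetical illustration with made-up numbers: two agents with very
    # different latency profiles show why "the average of per-agent p99s"
    # is not the true p99 of the service.
    random.seed(42)

    def p99(samples_ms):
        """Naive 99th percentile: the value 99% of samples fall at or below."""
        ordered = sorted(samples_ms)
        return ordered[max(int(len(ordered) * 0.99) - 1, 0)]

    agent_a = [random.uniform(50, 200) for _ in range(10_000)]  # fast, busy node
    agent_b = [random.uniform(800, 2000) for _ in range(100)]   # slow, quiet node

    true_p99 = p99(agent_a + agent_b)                # computed over raw events
    avg_of_p99s = (p99(agent_a) + p99(agent_b)) / 2  # what many tools report
    max_of_p99s = max(p99(agent_a), p99(agent_b))    # another common roll-up

    print(f"true p99:              {true_p99:7.0f} ms")
    print(f"average of agent p99s: {avg_of_p99s:7.0f} ms")
    print(f"max of agent p99s:     {max_of_p99s:7.0f} ms")

In this synthetic run the roll-ups come out several times higher than the true p99, and the average and max also disagree with each other; no aggregation of pre-computed percentiles recovers the real distribution.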

This practice is so widespread that almost every monitoring tool I have tried does this. Honeycomb is one of the few exceptions, because they actually ingest and process all the raw events.

Can’t tell how bad the bad times are

It's great that we can use percentiles to monitor our compliance with SLOs/SLAs. When things are going well, it gives us that warm and fuzzy feeling that all is well with the world.

But when they go wrong, and sometimes they go very wrong, we are left wondering just how bad things are. Are 10% of my users getting response times of 1s and above? Is it 20%? Could it be 50% of my users getting a bad experience? I just don't know! I can use various percentiles as gates, but that approach only goes so far before it overwhelms my dashboards.

Most data points are not actionable

As much as I love to stare at those green tiles and line graphs and know that:

  1. We have done a good job, go team!
  2. Everything's fine, there's no need to do anything

Indeed, most of the information I consume when I look at the dashboard is not immediately actionable.

To be clear, I'm not saying that percentile latencies are not useful and that you shouldn't show them on dashboards. But as the on-call engineer, my attention is heavily biased towards "what is wrong" rather than "what is right". I want dashboards that match my focus and don't force me to scan through tons of information and pay the cognitive price of separating the signal from the noise.

As an application developer, my definition of "what is wrong" is quite different: I'm looking for unexpected changes in application performance. If the latency profile of my service changes after a deployment, or another related event (e.g. a marketing campaign, or a new feature being toggled on), then I need to investigate.

This dichotomy in what's important for ops engineers and application developers means we should have separate dashboards for each. More on this later.

What can we do instead?

What could we use instead of percentiles as the primary metric to monitor our application's performance and alert us when it starts to deteriorate?

If you go back to your SLOs or SLAs, you probably have something along the lines of "99% of requests should complete in 1s or less". In other words, less than 1% of requests are allowed to take more than 1s to complete.

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Unlike percentiles, this percentage can be easily aggregated across multiple agents (see the sketch after this list):

  • Each agent submits its total request count and the number of requests over the threshold
  • Sum the two numbers across all agents
  • Divide the total number of requests over the threshold by the total request count and you have an accurate percentage
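Here is a minimal sketch of that aggregation, assuming a 1s threshold and the 1% budget from the SLO above; the AgentSummary shape and the sample counters are made up for illustration.

    from dataclasses import dataclass

    @dataclass
    class AgentSummary:
        total_requests: int
        requests_over_threshold: int  # e.g. requests that took longer than 1s

    def percent_over_threshold(summaries):
        """Aggregate per-agent counters into one service-wide percentage."""
        total = sum(s.total_requests for s in summaries)
        over = sum(s.requests_over_threshold for s in summaries)
        return (over / total) * 100 if total else 0.0

    # Three agents report their counters for the same time window.
    window = [
        AgentSummary(total_requests=12_000, requests_over_threshold=90),
        AgentSummary(total_requests=8_500, requests_over_threshold=40),
        AgentSummary(total_requests=500, requests_over_threshold=250),
    ]

    pct = percent_over_threshold(window)
    print(f"{pct:.2f}% of requests exceeded the threshold")

    # The SLO "99% of requests should complete in 1s or less" translates
    # into alerting whenever more than 1% of requests breach the threshold.
    if pct > 1.0:
        print("ALARM: more than 1% of requests took longer than 1s")

Because the two counters sum cleanly, the result stays accurate no matter how many agents report in, which is exactly what pre-computed per-agent percentiles cannot offer.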

During an outage, when our SLAs are impacted, this metric tells us what proportion of requests have been affected. Once we understand the blast radius of the outage, the percentile and max latencies then become useful metrics to gauge how badly user experience has been impacted.

Move aside, error count

We can apply the same approach to how we monitor errors. For any given system, you have a small and finite number of success cases. You also have a finite number of known failure cases, which you can actively monitor. But then there are the unknown unknowns — the failure cases that you hadn't even realised you have and wouldn't know to monitor!

So instead of putting all your effort into monitoring every single way your system can possibly fail, monitor for the absence of a success indicator. For APIs, this can be the percentage of requests that do not have a 2xx or 4xx response. For event processing systems, it might be the percentage of incoming events that do not have a corresponding outgoing event or observable side-effect.
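For the API case, a minimal sketch of this metric might look like the following; the status codes are made-up sample data. Note that 4xx responses count towards success because the API itself did its job, even though the client got an error back.

    # Count the absence of a success indicator rather than every known error.
    def percent_without_success(status_codes):
        total = len(status_codes)
        if total == 0:
            return 0.0
        ok = sum(1 for code in status_codes
                 if 200 <= code < 300 or 400 <= code < 500)
        return ((total - ok) / total) * 100

    # Sample window: mostly 2xx, some 4xx (fine), and ten 5xx responses
    # representing failures we may not have known to monitor for.
    window = [200] * 970 + [404] * 20 + [500] * 10
    print(f"{percent_without_success(window):.1f}% of requests did not yield a successful response")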

This tells you at a high level that "something is wrong", but not "what is wrong". To figure out the "what", you need to build observability into your system so you can ask arbitrary questions about its state and debug problems that you hadn't thought of ahead of time.

Different dashboards for different disciplines

As we discussed earlier, different disciplines require different views of the system. One of the most important design principles of a dashboard is that it must present information that is actionable. And since the action you will likely take depends on your role in the organization, you really need dashboards that show you information that is actionable for you!

Don't try to create the one dashboard to rule them all by cramming every metric onto it. You will just end up with something nobody actually wants! Instead, consider creating a few specialised dashboards, one for each discipline, for instance:

  • Ops/SRE engineers care about outages and incidents first and foremost. Actionable information helps them detect incidents quickly and assess their severity easily. For example, the percentage of requests that are over the threshold, or the percentage of requests that did not yield a successful response.
  • Developers care about application performance. Percentile latencies are very relevant here, as are resource metrics such as CPU and memory usage.
  • Product owners and business analysts might need their own dashboards too. They care about business metrics such as retention, conversion rate, or sales.

Summary

When you go to see a doctor, the doctor would try to ascertain (as part of the diagnosis):

  • What your symptoms are, and where.
  • The severity of your symptoms.
  • How long you have experienced these symptoms.
  • Any correlated events that could have triggered the symptoms.

The doctor uses this information to derive a treatment plan, or perhaps none at all, as the case may be. As the on-call engineer dealing with an incident, I go through the same process to figure out what went wrong and how I should respond.

In this post we discussed the shortcomings of percentile latencies, which make them a poor choice of metric in these scenarios:

  • They are usually calculated at the agent level and then averaged, which produces a nonsensical value that doesn't reflect the true percentile latency of the system.
  • They don't tell you the impact of an incident.

We proposed an alternative approach — to monitor service health by tracking the percentage of requests whose response time is over the threshold. Unlike percentiles, this metric aggregates well when summarising results from multiple agents, and gives us a clear picture of the impact of an outage.

We can apply the same approach to how we monitor errors. Instead of monitoring each and every error we know about, and missing all the errors we don't know about, we should monitor for the absence of success indicators.

Unfortunately, this is not how existing monitoring tools work… For this vision to come to pass, we need support from our vendors and a change in the way we handle monitoring data. The next time you meet with your vendor, let them know that you need something better than percentile latencies ;-) And if you know of any tools that let you implement the approach I outlined here, please let me know via the comments below!

Hi, my name is Yan Cui. I'm an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale on AWS for nearly 10 years and have been an architect or principal engineer in a variety of industries, ranging from banking, e-commerce, and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.

You can contact me via Email, Twitter and LinkedIn.

Check out my new course, Complete Guide to AWS Step Functions.

In this course, we'll cover everything you need to know to use the AWS Step Functions service effectively, including basic concepts, HTTP and event triggers, activities, design patterns, and best practices.

Get your copy here.

Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.

You can also get 40% off the face price with the code ytcui.

Get your copy here.
