How not to measure progress
aka ‘don’t use only P50/Avg’
Disclaimer: This article is about the measurements used in the software/hardware industry to quantify an issue or its impact.
Question: What is the difference between a quack and a doctor?
Answer: A quack’s snake oil works sometimes, and a doctor’s suggestion works more often.
The joke’s paunchline derives its spirit from the fact that even a stopped clock is right twice a day, and even the best people and systems can be wrong sometimes. It works as a joke because, of course, we trust doctors a lot more than quacks, as the doctors of this age tend to be far more reliable.
Henry VIII merged the barbers and the surgeons into a single company in 1540 because the two trades were much the same in that era, but I digress from the main topic.
The chief point I want to make here is that tracking only the average can do more harm to your customers than, surprise, not tracking anything at all.

To be clear, P50 is not the same as the average. P50 is the median: roughly half the elements have a higher value and half have a lower value. “Average” on its own is ambiguous, because it can refer to the ‘arithmetic mean’, ‘geometric mean’, ‘harmonic mean’, ‘truncated mean’ and so on. Almost always, when someone says average they mean the ‘arithmetic mean’, and you don’t have to correct them. The ‘arithmetic mean’ means (no pun intended) the number that, multiplied by the count of items in a set, equals the sum of the items in the set. For the purpose of this discussion I will treat P50 and average interchangeably, because both mislead in a similar way and have the same unintended consequences.
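To make the two definitions concrete, here is a minimal Python sketch using the standard `statistics` module. The latency values are made up for illustration; they are not measurements from any real system.

```python
import statistics

# Hypothetical request latencies in milliseconds (made-up numbers).
latencies_ms = [100, 110, 120, 130, 2000]

mean_ms = statistics.mean(latencies_ms)      # arithmetic mean
median_ms = statistics.median(latencies_ms)  # P50: the middle value once sorted

# Defining property of the arithmetic mean: mean * count == sum.
assert mean_ms * len(latencies_ms) == sum(latencies_ms)

print(f"mean={mean_ms} ms, median={median_ms} ms")
```

With these numbers the single slow request pulls the mean up to 492 ms while the median stays at 120 ms, and neither number tells you that one request took 2000 ms.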

A tidbit about the word ‘average’: it originally meant ‘damage’, and was used in English for calculating the damage to a fleet during a voyage for insurance purposes. It has lost its original negative connotation, but always remember that when you quote the P50 behavior of a product, about half of your customers are having a worse experience than that number states, and an arithmetic mean does not even tell you how many are.
Why you should shun P50/average only
Negativity bias tells us that a customer will most likely remember the one time you let them down, not the many times you stood by them. From an evolutionary point of view, avoiding a lake where you once saw a tiger is worth giving up all the bananas around that lake. Yet it is hard to internalize that a customer naturally remembers the rare bad experience, which can overshadow all the good things you have built.

The harm in using only P50/Avg comes from three more reasons:
- Resources are finite: When P50/Avg is your only measure of progress, you give your engineers and teams no incentive to work on the outliers, and the customers in that tail are the ones most likely to desert your product. From real-world experience: during my time at eBay, we found that the P99 latency of a service directly tracked the eventual timeouts in checkout, and hence was the source of disappointed and deserting customers. The company goals, however, were about cutting down the “average” time, which did not help the customers facing timeouts.
- False confidence: When you track P50/Avg, you think you are improving the system, whereas the impact on the worst-affected customers could be nil. In the Java world, you could add some servers to reduce the P50 and gain false confidence, without solving the OOM (out-of-memory) issues that affect the worst cases. When you are not tracking P50, at least everyone keeps their eyes open for issues that could affect the time taken in each case, instead of being content with the progress made on P50. Even worse, an out-of-touch executive can conclude ‘we are all good’ just from reading the P50/Avg numbers.
- The forgotten segment: For a company whose business in Canada is much smaller than in the USA, terrible latency for Canadian customers will barely move the overall average latency, and the whole Canadian segment (or whatever your small segment may be) could be lost to competitors. I'm sure you have seen this, and if you can think of similar examples, feel free to drop them in the comment section.
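The forgotten-segment effect is easy to reproduce with toy numbers. The sketch below assumes a hypothetical traffic split (98% USA, 2% Canada) and made-up latencies; the segment names and values are illustrative only.

```python
import statistics

# Hypothetical per-request latencies in ms, keyed by customer segment.
# The 98/2 traffic split and the latency values are assumptions.
latencies_by_segment = {
    "USA": [200] * 980,
    "Canada": [1000] * 20,
}

all_latencies = [
    ms for values in latencies_by_segment.values() for ms in values
]

overall_median = statistics.median(all_latencies)
overall_mean = statistics.mean(all_latencies)

# Breaking the same data down by segment exposes what the totals hide.
segment_medians = {
    segment: statistics.median(values)
    for segment, values in latencies_by_segment.items()
}

print(f"overall: median={overall_median} ms, mean={overall_mean} ms")
for segment, median_ms in segment_medians.items():
    print(f"{segment}: median={median_ms} ms")
```

In this toy data the overall median is 200 ms and the overall mean is 216 ms, so both look healthy, while every Canadian request takes 1000 ms. Reporting the metric per segment, or tracking a high percentile, is what surfaces the problem.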
Skews and measuring skews
How misleading P50/Avg can be also depends on the nature of the data, and the way to gauge that is to look at the shape of the distribution. Skew measures the asymmetry of a statistical distribution. Kurtosis is a separate measure, not a unit of skew. In a nutshell: kurtosis is a measure of the size of the ‘tail’ in your data, and a fat tail (a large number of outliers) shows up as high kurtosis. I was fortunate to have worked at two companies in the same domain (e-commerce) but with very differently shaped data. At eBay, item prices had a fat tail (a large number of outliers) with positive skew (secondhand items cluster prices at the low end). At Amazon, a few luxury items with very high prices stretched the right tail, which is also positive skew rather than negative. All of this is, of course, a gross generalization. If your data has strong skew and/or high kurtosis, working only on P50/average does more harm to the customer.
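If you want to check skew and kurtosis on your own data, the following is a minimal sketch using plain population moments. The function name and the sample data are hypothetical, and statistical libraries may apply slightly different bias corrections, so treat the exact values as indicative.

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    # Subtract 3 so that a normal distribution scores 0.
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Made-up data: one symmetric set, one with a single large outlier.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_rt, kurt_rt = skewness_and_excess_kurtosis(right_tailed)

print(f"symmetric:    skew={skew_sym:.2f}, excess kurtosis={kurt_sym:.2f}")
print(f"right-tailed: skew={skew_rt:.2f}, excess kurtosis={kurt_rt:.2f}")
```

The symmetric set has zero skew and negative excess kurtosis (thin tails), while the set with one large outlier has both positive skew and positive excess kurtosis. The second shape is the one where P50/Avg hides the most.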


So... what do you suggest?
The Dunning-Kruger effect is real when you start measuring your project: it is easy to be overconfident about numbers you do not yet understand. Know your numbers, and their distribution, well. If in doubt, go with the industry standard of P50/P90/P99 (and P99.9). Sometimes it is better to publish nothing than to publish P50 alone.
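As a starting point, a P50/P90/P99 report can be produced with a few lines of Python. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring tools often interpolate or approximate instead, so their numbers may differ slightly. The `percentile` helper and the latency data are illustrative assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in ms: mostly fast, some slow, a couple very slow.
latencies_ms = [100] * 85 + [800] * 13 + [5000] * 2

report = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(report)
```

For this data P50 is 100 ms, P90 is 800 ms and P99 is 5000 ms. Published alone, the P50 would suggest that everything is fine.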
Summary
Do you remember that spelling mistake at the beginning? ‘paunchline’ vs ‘punchline’? Chances are that the one word you remembered is the one with the spelling mistake. The whole point is that by tracking only averages, you forget about the very things your customer is going to remember.
References
Barber-surgeons: http://broughttolife.sciencemuseum.org.uk/broughttolife/people/barbersurgeons
Kurtosis: https://community.sw.siemens.com/s/article/kurtosis
Skew: https://www.researchgate.net/figure/llustration-of-skewness-and-kurtosis_fig2_49619305
Note on Median vs Average
In some circumstances (high skew), the median can differ from the average (arithmetic mean) by a lot. E.g. India’s wealth distribution has positive skew (a long tail of very wealthy people pulls the mean above the median): the average wealth owned by an Indian is about $14k, but the real horror is that the median wealth per Indian is about $3k, meaning half of adult Indians own less than $3k. So if you have to use one of these two, P50 is usually the better choice over the average when the measure relates to human experience.