The three common measures of central tendency used in statistics are:
- 1. Mean
- 2. Median
- 3. Mode
There are of course other measures, as the Wikipedia page attests. However, the inspiration for this post was yet another John D. Cook blog post.
The point being: the measure you choose to describe your central tendency is key, and should be decided based on what you want to do with it. More precisely, what exactly do you want to optimize your process/setup/workflow for? Based on that, you'll have to choose the right measure. If you read that post above, you'll understand that:
- 1. Mean — Mean is a good choice when you want to minimize the variance (i.e., the sum of squared distances from the central measure, the second statistical moment about it). That's to say, your optimization objective is dominated by squared-distance terms. Think of lowering mean squared error, and how it's used in straight-line fitting.
- Note that even within "mean" there are multiple types (arithmetic, geometric, harmonic). For simplicity, I'll assume mean means the arithmetic mean within the context of this post.
- 2. Median — Median is more useful if your optimization objective has absolute-distance terms rather than squared ones. It is the choice when you want to minimize the total distance from the central measure.
- 3. Midrange — Midrange, i.e., (min + max) / 2, is useful when your objective looks like max(distance from central measure); it minimizes the worst-case deviation.
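The three loss functions above can be checked numerically with a brute-force sketch. The sample data here is made up purely for illustration:

```python
# Verify which value minimizes each loss: the mean minimizes squared
# distance, the median absolute distance, and the midrange max distance.
from statistics import mean, median

data = [1.0, 2.0, 2.0, 3.0, 10.0]

def argmin_over_grid(loss, lo=0.0, hi=11.0, steps=100001):
    """Brute-force the value c that minimizes loss(c) over a fine grid."""
    best_c, best_loss = None, float("inf")
    for i in range(steps):
        c = lo + (hi - lo) * i / (steps - 1)
        current = loss(c)
        if current < best_loss:
            best_c, best_loss = c, current
    return best_c

sse = lambda c: sum((x - c) ** 2 for x in data)  # squared distance -> mean
sae = lambda c: sum(abs(x - c) for x in data)    # absolute distance -> median
mxd = lambda c: max(abs(x - c) for x in data)    # max distance -> midrange

midrange = (min(data) + max(data)) / 2

print(argmin_over_grid(sse), mean(data))    # both ~3.6
print(argmin_over_grid(sae), median(data))  # both ~2.0
print(argmin_over_grid(mxd), midrange)      # both ~5.5
```

Note how the single outlier (10.0) drags the mean and midrange upward while the median stays put; that robustness difference is exactly the trade-off discussed below.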
If most of that sounded too abstract, here's a practical application I can think of right away. Imagine you're doing performance testing and optimization of a small API you've built. I don't want to go into what kind of API it is or the technology behind it; just assume you run it multiple times, calculate a measure of central tendency from the run times, and then try to improve the code's performance (with profiling, different libraries or data structures, whatever). So what measure of central tendency should you pick?
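As a sketch, the measurement loop could look like the following; `handler` is a hypothetical stand-in for one call to the API under test:

```python
# Collect run-time samples for a function, then summarize them with
# several measures of central tendency. `handler` is a placeholder.
import time
from statistics import mean, median

def handler():
    # Stand-in workload for the real API call.
    return sum(i * i for i in range(1000))

def sample_runtimes(fn, runs=100):
    """Time `fn` `runs` times and return the samples in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

samples = sample_runtimes(handler)
print("mean    :", mean(samples))
print("median  :", median(samples))
print("midrange:", (min(samples) + max(samples)) / 2)
```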
- Mean — Most engineers would pick the mean, and in a lot of cases it's enough. But think about it: it optimizes for the variance of run/execution time. That's important and useful in most cases, but in some cases it may not be what matters.
- Midrange — An example: your system is a small component of, say, a high-frequency trading platform, and its consumer has a timeout and fails if your API exceeds it (i.e., your API is mission-critical; it simply cannot time out). Then you want to make sure your program completes even in the slowest case. If the worst-case runtime is what you want to lower, pick the midrange, since it tracks the maximum deviation. (Note this is still a trade-off against optimizing the average/mean case, just like any hard choice.)
- Median — This is very similar to the mean, except it doesn't really care about variance or outliers. If you pick the median, your optimized program is sure to have the best performance in the typical run/case/dataset.
- Mode — Well, this is an interesting case. Think about it: even the previous timeout example can go this way. Suppose your API is not mission-critical (i.e., if it fails, the overall algorithm just throws out that data point and proceeds with other data sources), and you want to maximize the number of times your program finishes within the timeout. That is, you're purely counting the number of times you return a value within the timeout period, a 0/1 criterion, which is the kind of loss the mode minimizes. You don't care about the worst-case scenario.
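To make the trade-off concrete, here's a toy comparison of two hypothetical versions of the same API, with made-up runtimes in milliseconds. Version A is faster in the typical case but has an occasional long tail; version B is slower on average but bounded:

```python
# Made-up runtimes (ms) for two hypothetical versions of one API.
from statistics import mean, median

TIMEOUT_MS = 100.0

version_a = [20, 22, 21, 23, 20, 22, 21, 250]  # fast, one outlier misses
version_b = [60, 62, 61, 63, 60, 62, 61, 64]   # slower, but bounded

def report(name, runtimes):
    midrange = (min(runtimes) + max(runtimes)) / 2
    within = sum(1 for t in runtimes if t <= TIMEOUT_MS)
    print(f"{name}: mean={mean(runtimes):.1f} median={median(runtimes):.1f} "
          f"midrange={midrange:.1f} worst={max(runtimes)} "
          f"within_timeout={within}/{len(runtimes)}")

report("A", version_a)
report("B", version_b)
```

With these numbers, B is the only acceptable version if the API is mission-critical (its worst case stays under the timeout), while A has the better median: which one counts as "faster" depends entirely on the measure you chose to optimize.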
Originally published on Wordpress