The mean misleads, part 2: more data for the doubters
Some background
A few weeks ago, I published The mean misleads: why the minimum is the true measure of a function’s run time, where I explained that comparing the mean run times of two functions (with the aim of selecting the faster one) will produce incorrect results more often than comparing their minimums.
This was met with quite a few people telling me that I was wrong, which meant one of two things:
- I was wrong
- I did a poor job of explaining why the mean is a bad metric
To assess my wrongness, it would help if I knew in what way I was wrong. But here’s a curious thing: of all the people who told me I was wrong, not one of them offered any evidence.
But upon reflection, I suppose I didn’t really provide any convincing evidence either. The first post contained a lot of logic and reasoning and rather complicated charts. I suspect that the open-minded folks who took the time to soak it all up saw the point. But when skimmed, the article was easy to dismiss, particularly if a reader had been primed by Reddit comments, which were unanimous in their verdict that I was wrong.
The other thing I failed to get across was that I was comparing the mean and the min, not claiming that these were superior to analyzing your data and using your brain to make intelligent decisions. But many seemed to think that’s exactly what I was suggesting. So just to be clear: the best thing you can do is analyze your data and think about what really matters when comparing the functions. That’s the first prize.
This pair of articles is about the battle for second place. It’s for people who choose not to analyze the run times of functions and instead just want a single-value aggregation so that they can select the ‘faster’ function. For these people, my goal is to find out whether the mean, min, or something else is the best statistic.
What’s new in this article?
Unlike the first article, this one contains no logic or reasoning, no synthetic data, and no hypotheses about ideal run times, positive noise, or asymmetric distributions. Just real data and charts.
If you’re skeptical that the mean misleads, you can skip all the text in this article and just look at the pictures while pondering the question: how often is the green mean line on top?
A summary of the results is below, with deeper explanations further down the page.
Experimental setup
The data
To generate data, I created nine small functions performing various common tasks. I ran each function 100,000 times, recording the duration of each run.
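In case the setup isn’t clear, here’s roughly what the timing harness looks like (a simplified sketch; the names are illustrative, not the exact code):

```python
import time

def collect_run_times(fn, n_runs=100_000):
    """Run fn n_runs times and record each run's duration in seconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times
```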
I then cloned this dataset thrice to create variants with each function’s data shifted by 1%, 5% and 10% of the function means, to simulate slower functions. That is, if you took the total run time of one function in the 1% shifted dataset, you’d get a number 1% bigger than the total for that function in the base dataset.
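The shift itself is simple: adding a constant equal to 1% of a function’s mean to every one of its run times makes the total (and the mean) come out exactly 1% larger. Something like:

```python
import numpy as np

def shifted_variant(times, fraction):
    """Shift every run time up by `fraction` of the mean, so the
    total run time grows by exactly that fraction."""
    times = np.asarray(times)
    return times + fraction * times.mean()

# e.g. slow_1pct = shifted_variant(base_times, 0.01)
```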
For the analysis walkthrough below, I will use the base dataset and the one shifted by 1% to represent the pairs of functions being compared. At the end, I’ll show summaries using the 5% and 10% variants.
The analysis
For each of the nine functions, I will present three charts.
The first chart represents the accuracy of min, mean, and median statistics for the function, across a range of sample sizes.
The first of the nine functions, PyTorchCUDAMatMul, creates two random tensors and matrix-multiplies them on the GPU.
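A rough sketch of such a function (the tensor size is an arbitrary choice for illustration):

```python
import torch

def pytorch_cuda_matmul(size=1024):
    a = torch.rand(size, size, device="cuda")
    b = torch.rand(size, size, device="cuda")
    result = a @ b
    torch.cuda.synchronize()  # wait for the GPU to finish so the timing is honest
    return result
```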
The three very dull lines are the noisy raw data; the three brighter lines are a smoothed (10-step rolling average) view of the same data.
Note that the x-axis is on a log scale and that the colours rhyme perfectly with the data they represent: mean is green, min is orINge, median is, um, cerulean.
The process to calculate accuracy is as follows (there’s a code sketch after the list):
- For each function, I run experiments for different sample sizes, from 2 to 1,000.
- For each sample size n, I randomly select n run times for that function (from the pool of 100,000) and take the min/mean/median of those n values.
- I repeat this 100 times for each sample size, and repeat it all for the 1% slower variant of the function data.
- For each of these repeats I compare the aggregations and see if they report the correct results. ‘Correct’ means that the statistic for the slow function variant was larger than for the fast variant. That is, if I was asking “which function is faster”, what percentage of the 100 repeats reported “the fast function”.
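In code, the experiment loop for a single statistic looks roughly like this (a simplified sketch, assuming fast_pool and slow_pool hold the 100,000 recorded run times for the fast and slow variants):

```python
import numpy as np

def accuracy(fast_pool, slow_pool, stat, n, repeats=100):
    """Fraction of repeats in which `stat` correctly ranks the slow variant above the fast one."""
    rng = np.random.default_rng()
    correct = 0
    for _ in range(repeats):
        fast_sample = rng.choice(fast_pool, size=n, replace=False)
        slow_sample = rng.choice(slow_pool, size=n, replace=False)
        correct += stat(slow_sample) > stat(fast_sample)
    return correct / repeats

# e.g. accuracy(fast_pool, slow_pool, np.min, n=50)
#      accuracy(fast_pool, slow_pool, np.median, n=50)
```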
The next chart is the distribution. This is plotted from all 100,000 run times, so is generally a nice clear shape.
The distribution for all functions is right-skewed (a steep rise on the left, a long tail to the right), as we would expect (see the first post for the reasoning behind this). Here it is for PyTorchCUDAMatMul:
Since the skew of the distribution is reportedly a predictor for whether the mean or min will perform better, it’s tempting to think that we can just measure the skew and know which aggregation to use. This brings us to the third and final visualization:
This is on the same log x-axis, although there are no values for skew when the sample size is only 2. I will show all functions on the same y-axis scale of 0 to 10.
You might think that skewness is a property of the underlying distribution, and expect no correlation with sample size. This is roughly true for normal distributions, but not for lognormal ones (a reasonable fit for many function run time distributions): with a heavy right tail, small samples rarely include the extreme values that drive the skewness statistic, so measured skew climbs as the sample size grows.
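You can see the effect with a quick simulation (the lognormal parameters here are arbitrary):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
for n in (10, 100, 1_000, 10_000):
    # average the sample skewness over 200 draws at each sample size
    skews = [skew(rng.lognormal(sigma=1.0, size=n)) for _ in range(200)]
    print(f"n={n:>6}: mean sample skewness = {np.mean(skews):.2f}")
```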
If you’d like to question these methods, or my selection of functions, that’s good! Please run your own experiments with your own functions and show me some charts. I genuinely want to know what works best in different situations.
Results
I probably don’t need to show you all these charts, but if I did anything other than present each function I tested, that would be introducing some sort of bias. Plus I think the variance between them should give a good understanding of how misleading any aggregation statistic can be.
Remember that these all compare two functions only 1% apart; I will later show results for differences of 5% and 10%.
I will show the same triplet of charts for each of the remaining functions. I’ll keep my words to a minimum (as much as I’m capable) and leave you to soak up the information in the charts.
The WriteReadSmallFile function writes “Hi there” to a text file and reads it back.
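Roughly (a simplified sketch; the file name is illustrative):

```python
def write_read_small_file(path="hi.txt"):
    with open(path, "w") as f:
        f.write("Hi there")
    with open(path) as f:
        return f.read()
```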
This has the most interesting distribution, with a real step in the left curve. As a result, taking the min with a sample size of about 50 or more gave lower accuracy.
The StatisticCorrelation function calculates the correlation between two vectors.
The distribution is multimodal (many humps) with quite a long right tail. This is not good for mean/median, but curiously, the results for the min get worse for larger sample sizes. I can’t quite wrap my head around why this happens.
The PickleTensor function pickles a tensor (serializes a large array).
Of all the functions, this one would make the best name for a synth-pop duo.
As with StatisticCorrelation, there’s a steep left side to this distribution, making the min a good indicator. And as we’ll see in many charts, for higher sample sizes the median takes the lead.
The LocalHttpRequest function requests an HTML file from an HTTP server running locally.
Comparing this with the previous charts is a good example of skewness not being a reliable indicator. Imagine you had a sample size of 50: both cases report a skew of around 2, with wide margins. But PickleTensor has that steep left side to its distribution, making the min more reliable, whereas LocalHttpRequest tapers off toward the left, meaning the min has greater variance and therefore lower accuracy.
The ImportPandas function imports the Python Pandas module (clearing the module cache each time).
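Clearing the cache is the interesting part: evicting the module from sys.modules forces the import machinery to do real work each time. A sketch (evicting a big C-extension-backed package like Pandas this way is a little crude, but it serves for benchmarking):

```python
import importlib
import sys

def import_pandas():
    # evict pandas and its submodules so the next import isn't a cache hit
    for name in [m for m in sys.modules if m == "pandas" or m.startswith("pandas.")]:
        del sys.modules[name]
    importlib.import_module("pandas")
```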
There’s little difference at lower sample sizes; by sample size 1,000, the min and mean perform similarly poorly while the median reaches about 100% accuracy.
The Primes7k function calculates the primes up to 7,000.
With a relatively symmetric distribution, the min performs poorly while the median does much better.
The SpStatsFit1M function uses the SciPy Python package to fit a normal distribution to an array of 1 million values.
There’s quite a steep left side to this distribution, resulting in the min being the most accurate.
The SQLiteWrite function creates a connection to a local SQLite database, writes a single value, then closes the connection (recreating the table each time).
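Roughly (a simplified sketch using Python’s built-in sqlite3 module; the table and file names are illustrative):

```python
import sqlite3

def sqlite_write(db_path="bench.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS bench")
    conn.execute("CREATE TABLE bench (value INTEGER)")
    conn.execute("INSERT INTO bench (value) VALUES (?)", (42,))
    conn.commit()
    conn.close()
```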
This has some serious outliers (I have clipped the x-axis of the histogram at 100). Most of the time the function ran in about 60ms, but the longest run took 5,453ms, and 1,000 of the 100,000 runs took longer than 1,440ms.
You may want to consider this an invalid test due to the outliers, but I include it here because the results are quite interesting: the median can get to 100% accuracy while the mean’s accuracy stays below 60%.
Also I hope this drives home the point that first prize is to analyze your data. If you haven’t even looked at the values you’ve collected, maybe taking the mean to compare two functions is as accurate as flipping a coin. You don’t know if you don’t look.
Lastly, the Combined function adds together the run times of all the above functions, to represent a more complex function. I exclude the SQLiteWrite run times, since they would otherwise dominate the results and were a bit dodgy anyway.
Comparisons with other shifts
At the start I showed a summary of all 10 functions, reporting the accuracies in detecting a 1% increase in a function’s run times. Here it is again for reference:
And here is the same chart, but comparing variants of each function 5% apart (the scales are the same):
Predictably, we get more reliable results at smaller sample sizes than when the difference was only 1%. Also note that the min now beats the median (and of course the mean) in most cases. It seems that with very small differences between two functions the min struggles, but at larger differences the theory behind why the min is the best metric is borne out in the data.
And finally, the same chart again but with a difference of 10% between fast and slow:
This continues the trend of the min being more accurate for larger differences.
For those who will continue to use the mean (I acknowledge your existence and respect your choice to be different), you can view this chart as good news: when the gap between functions is large and your sample size is in the triple digits, even the mean is reasonably accurate a lot of the time.
Of course, you won’t know in advance whether you’re comparing two functions with a 1%, 5% or 10% difference. And as shown in the first post, the mean will report wildly inaccurate differences for sample sizes of just a few dozen.
Note: I ran these same tests with the harmonic mean and the geometric mean; in almost all cases they both fell between the median and the mean, so I have left them off for the sake of readability.
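(For reference, both are one-liners with SciPy; here’s the pattern on some made-up numbers:)

```python
import numpy as np
from scipy.stats import gmean, hmean

times = np.array([0.8, 1.0, 1.2, 5.0])  # made-up run times with a long right tail
print(np.median(times), hmean(times), gmean(times), np.mean(times))
# 1.1, ~1.22, ~1.48, 2.0: both land between the median and the arithmetic mean
```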
Summary
I don’t think these results really need much commentary, but I’ll tell you my takeaways from these experiments (my goal here was to work out what works best so I can make smart decisions; convincing you of the same is a secondary concern).
Keep in mind that this is only a small selection of functions and not necessarily representative of all functions, so extrapolate at your own risk.
My notes:
- There is no clear winner, but there is a clear loser.
- For sample sizes up to 10 (and a difference of 1%), the min is usually the most accurate, but there’s often not much in it.
- For sample sizes around 50 (and a difference of 1%), the median and min are about on par.
- For sample sizes over 500 (and a difference of 1%), the median is most accurate.
- For larger differences between functions (10%+), the min is almost always the most accurate.
- There were no cases — for any function, sample size, or difference — where the mean was the most accurate. Using the mean to aggregate function run times really is a bad idea.
Well, that’s it, thanks for reading.