Mean vs Median (And I Guess Mode…)

Time For Some More Math

If it isn’t apparent by now that I like math and think about it a lot, then I guess you don’t know me too well or haven’t been reading. To set the record straight, I most definitely have a strong affinity for math, and even more so for the results it creates for us in life.

I play with lots of simple math all the time. Of the simple math techniques learned way back in 5th grade, the ones that come up the most are typically mean and median.

What is the difference and which one is better? Well lemme tell ya.

This is how I learned them, and how I still remember what each of them signifies.

MEAN = The mean ole’ average

MEDIAN = The middle number

MODE = The most often

So that’s pretty simple, and as you learned in 5th grade, the calculations are relatively simple as well, depending on your data set. I’d highly recommend letting computers do those calculations, mostly because they are just way good at it.

MEAN aka Average: This is calculated by adding up all the numbers in your data set and dividing the total by how many numbers you have.

MEDIAN: This is calculated by lining up all the numbers in your data set from smallest to largest and then picking the middle number. If you have an even count of numbers there is no single middle one, so you take the average of the two middle numbers.

MODE: This is calculated by finding the number that repeats itself most often within your data set.
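
If you want to see those three recipes spelled out, here is a minimal Python sketch of them. The function names and the sample numbers are my own, made up purely for illustration, and in real life you would probably just lean on Python’s built-in statistics module instead of writing these yourself.

```python
from collections import Counter


def mean(values):
    # Add everything up, then divide by how many numbers there are.
    return sum(values) / len(values)


def median(values):
    # Line the numbers up in order and take the middle one.
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even count: no single middle number, so average the two middle ones.
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    # The value that shows up most often (on a tie, the one seen first wins).
    return Counter(values).most_common(1)[0][0]


sample = [3, 5, 5, 8, 9]  # made-up numbers, just for illustration
print(mean(sample), median(sample), mode(sample))
```

Note that this mode only hands back one value. If your data has several values tied for most frequent, you would need to decide how you want to treat that.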

We tend not to care about mode that often. It can be important, but we’ll save that for another day. The debate, if you can even call it one, is really between mean and median.

Typically everyone calculates the average. The average is great, except it’s average, and it’s not always indicative of what you are trying to conclude. It is a good one-number summary of a data set, but it doesn’t always accurately represent the truth of that data set. Here’s why: outliers.

If you remember what an outlier is, then you’ll know that it is a number that is substantially greater or lesser than the majority of your data set. Say our data set is 1, 16, 18, 20, 33, 22, 27, 29, 24, 17, 19, 88. I would consider 88 a definite outlier, and 1 could debatably be an outlier as well.

If we take the average of this data set we get 26.17. Averages include outliers, so by looking at this number we would be inclined to think that a typical value in this data set is about 26. We would be a bit off.

Since the MEDIAN is the middle number of the data set, it tends to shrug off outliers and give us a clearer picture of what we are really looking at. With 12 values there is no single middle number, so the median is the average of the two middle values, 20 and 22, which gives 21. That is quite a bit lower than our average. In fact, it’s about 19.7% lower.
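
Since I keep saying the computer should do the work, here is a quick sketch of that comparison using Python’s built-in statistics module on the data set above. The variable names are just ones I picked for the example.

```python
import statistics

# The data set from this post.
data = [1, 16, 18, 20, 33, 22, 27, 29, 24, 17, 19, 88]

avg = statistics.mean(data)
mid = statistics.median(data)

# How far below the mean the median sits, as a percentage of the mean.
pct_lower = (avg - mid) / avg * 100

print(f"mean:   {avg:.2f}")
print(f"median: {mid}")
print(f"median is {pct_lower:.1f}% lower than the mean")
```

Notice the percentage is measured against the mean. Flip it around and measure against the median, and the mean comes out closer to 25% higher.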

If this is a revenue modeling situation and you build your estimate on the average, your model starts out about 25% higher than the median says it should (26.17 vs. 21). You may be in for some big trouble when actual revenue comes in roughly 20% below what you modeled.

So what do I do? I always rely on the MEDIAN first and foremost. But as I mentioned before, I’m not doing these calculations by hand. The computer and models are running them for me, so of course I just calculate both. The metric that I have started looking at more is SKEW.

SKEW is just as it sounds: which way does the data lean? A quick way to gauge it is to take the MEAN and subtract the MEDIAN. This is a rough shortcut rather than the formal skewness statistic, but it does the job. If you get a positive value, your data leans to the “right” or positive side. If you get a negative value, your data leans “left”. This is important because it gives you a simple quantitative summary of the trend of your data.

In the case of our data set above, our skew is 26.17 - 21 = 5.17. It’s (a) positive, and (b) about 5 in size. So we know the average sits a bit higher than our median, which indicates there are some bigger numbers pulling the average up.
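
Here is a small sketch of that mean-minus-median check in Python. The simple_skew name is my own invention for this post, not something from a library, and dropping the 88 is only there to show how much one big number moves things.

```python
import statistics


def simple_skew(values):
    # Mean minus median: a rough lean indicator, not the formal skewness statistic.
    return statistics.mean(values) - statistics.median(values)


# The data set from this post, with and without the big outlier.
data = [1, 16, 18, 20, 33, 22, 27, 29, 24, 17, 19, 88]
without_88 = [x for x in data if x != 88]

for label, values in (("full data set", data), ("without 88", without_88)):
    skew = simple_skew(values)
    direction = "right" if skew > 0 else "left" if skew < 0 else "neither way"
    print(f"{label}: skew = {skew:.2f}, leans {direction}")
```

With the 88 removed, the gap between mean and median shrinks to well under 1, which is a decent hint that the single outlier was doing most of the pulling.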

MEDIAN is always a safer way to go. Calculate your median before you model anything, and make sure you are looking at the numbers that will give you the clearest picture. Don’t torture your data and then expect it to be nice to you moving forward. Look at an accurate picture of your data set instead.