Understand the reason why we calculate Mean, Median and Mode in Statistics

Femiloye Oyerinde
7 min read · May 26, 2020

Find answers to the questions of why and how to calculate the center of data

As a Machine Learning enthusiast, one of the ways you can put yourself on the global map is by having a sound understanding of both mathematics and statistics (although there is more to sound machine learning practice than just knowing statistics and mathematics).
As a matter of fact, those two are not supposed to pose a threat, right? They shouldn’t be a problem; we’ve all had encounters with them, little or big, during our days in senior school. But maybe I’m wrong! I’m pretty sure some of your younger ones in senior school are already wrapping their heads around formulas and Greek notation.

Sadly, the case is different with us. Yes, most of us studied them, but the questions remain: do we really know them? Do we know when and where to apply them? And ultimately, do we know why we calculate them? I doubt that we do.

Now let’s come back to Machine Learning. Like I said, you definitely need a solid knowledge of statistics and mathematics to thrive in this mysterious and wonderful field.

Before you think I am some genius with a PhD in mathematics, I’d like to add that the goal of this post is not to teach you mathematics or all you need to know about statistics. Rather, I’ll be explaining some basic and important terms in statistics that you’ve probably seen or used before without knowing what they truly mean and why we use them. I’ll try as much as possible not to litter this space with formulas - all concepts will be explained in words.

With that said, let us slowly dive into what we call the Mean, Median, and Mode.
As for the questions which I argued we don’t have answers to, this article provides answers to them in an understandable manner.

Measures of Center - When to use Mean, Median and Mode
In statistics, Mean, Median, and Mode are essentially the techniques we use to measure the center of our data. Suppose we are given a dataset of m × n dimensions (where m stands for the number of rows and n stands for the number of columns); it's very important that we know the centers of the data in our dataset.

But here is the question - why should we care about the center of our data?
The center of data is a numerical value that describes our data and gives us insight into where its values tend toward. I'm quite aware this definition still looks somewhat abstract, but follow me, you'll get it soon.

To simplify that, here is a small task for you: imagine that you find yourself working with a dataset of, let’s say, 1000 by 10 dimensions (where rows=1000 and columns=10). Could you picture that? If yes, I am sure you are now asking yourself questions like, ‘What am I going to do with this dataset? How will I know what the values in the dataset are preaching to me?’ Sure, anytime you look at your screen to check the data displayed, it is always saying things to you! The problem is that whatever it says is encoded, and it is waiting for you to decode it, lol.

You will need to perform some statistical and mathematical operations on your dataset in order to push those threatening questions out of your head.

Interestingly, one of the simplest operations you can carry out on your data is to calculate its center. It is highly important that for every column we have a numerical value that tells us where most of the values in it tend toward - this numerical value is sometimes called "the average". What am I saying? Every column should have a numerical value that describes where the values in it tend toward, and it is your job to find that value! These values give us information about the average of each individual column.
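To make the "one value per column" idea a bit more concrete, here is a minimal sketch in Python. The tiny table and its column names are made up purely for illustration (they are not from any real dataset), and the average is computed simply as the sum of a column divided by its length.

```python
# Hypothetical dataset: each inner list is a row, each position is a column.
column_names = ["temperature", "humidity", "wind_speed"]
rows = [
    [1, 40, 10],
    [3, 50, 12],
    [5, 60, 14],
    [7, 70, 16],
]

# zip(*rows) regroups the values column by column.
column_averages = {
    name: sum(values) / len(values)
    for name, values in zip(column_names, zip(*rows))
}

for name, average in column_averages.items():
    print(f"{name}: {average}")
```

Each column ends up with a single number that summarizes where its values sit, which is exactly the job described above.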

Wait, wait! I still suspect you're not getting what I'm saying. Is that true? It is not a problem if you don’t understand yet; I will give another, more practical example to make everything crystal clear.

Below is an array of data containing the temperature of a region for ten different days; what this means is each value in our data represents the temperature for a particular day:

Temperature = 1, 3, 5, 6, 8, 9, 7, 2, 8, 4

You remember the goal, right? Our goal is to obtain a numerical value that describes our data and tells us where all its values tend toward.
Now let's apply one of the techniques we use in statistics to measure the center of data - I've mentioned them above.

For this example we'll be using the "Mean" (I'll explain why we are using the mean in the latter part of this post).

From our basic knowledge of statistics, we can define the mean as the sum of all numerical values in a given set of data divided by the number of values in the set.

So we can calculate the mean of our data as follows:
Mean = (1 + 3 + 5 + 6 + 8 + 9 + 7 + 2 + 8 + 4) / 10

Mean = 53 / 10
Mean = 5.3

Did you notice I did not re-arrange the numbers in any order? That’s because the arrangement or order of the numbers does not affect our calculation of the mean.
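If you'd like to check that claim yourself, here is a small sketch that uses `mean` from Python's standard `statistics` module on the same temperature values in three different orders. The variable names are just illustrative.

```python
from statistics import mean

temperature = [1, 3, 5, 6, 8, 9, 7, 2, 8, 4]

# Same values, three different arrangements.
original_order = mean(temperature)
ascending = mean(sorted(temperature))
descending = mean(sorted(temperature, reverse=True))

print(original_order, ascending, descending)
```

All three results come out the same, since addition does not care about the order of the numbers being added.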

Basically we can say we performed two (2) mathematical operations to calculate our mean, and they are:

  • Firstly, we summed all our numerical values together; in our case we got 53.
  • Secondly, we divided that sum by the number of numerical values present in the data (10), which gave us 5.3.
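The two operations above can be sketched in a few lines of Python. This is only one simple way to write it; the variable names are my own.

```python
temperature = [1, 3, 5, 6, 8, 9, 7, 2, 8, 4]

# Step 1: sum all the values.
total = sum(temperature)

# Step 2: divide the sum by the number of values.
count = len(temperature)
mean_temperature = total / count

print(total, count, mean_temperature)  # 53 10 5.3
```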

Finally, here comes the whole point. Let's assume someone who knows you to be a data scientist comes to you (with the same temperature data above) and asks what the temperature on a particular day (within the given 10 days) will most likely be. Fortunately for you, you just found it: from the mean you calculated, which is the center of your data, you can say that the temperature would be around 5.3, because on average those numbers up there tend toward 5.3.

Before we move on, let me point out that the mean we calculated cannot accurately tell us what the temperature for that day will be unless we also calculate the variation of our data, a technique that tells us how spread out our data is. The goal of this post, however, is not to explain measures of variation but to explain why we calculate the center of data.
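Just to give a feel for what "spread" means, without going into the theory, here is a tiny sketch. It assumes the population standard deviation (`pstdev` from Python's `statistics` module) as the measure of variation, which is only one of several possible choices.

```python
from statistics import mean, pstdev

temperature = [1, 3, 5, 6, 8, 9, 7, 2, 8, 4]

center = mean(temperature)
# Assumption: population standard deviation as the measure of spread.
spread = pstdev(temperature)

print(f"center = {center}, spread = {spread:.2f}")
print(f"lowest = {min(temperature)}, highest = {max(temperature)}")
```

The center says where the values gather; the spread and the lowest/highest values hint at how far an individual day can sit from that center.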

How to choose what to use between mean, median and mode
Before we walk down the road of ‘how to choose between mean, median and mode’ for calculating our center, I'd like to point out that most of us who have worked with datasets before just go ahead and calculate the mean as the measure of center for any column without truly knowing the type of data in that column - that's not a good practice. Before you decide what measure to use, you should understand the types of data present in your dataset. At this point, let me point out that every piece of data falls into one of the 4 levels of measurement. The levels are:
1) Nominal
2) Ordinal
3) Interval, and
4) Ratio
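As a rough guide, the four levels can be paired with the measures of center that are usually considered meaningful for them. The mapping below is a common textbook rule of thumb, not something spelled out in this post, and the function name is hypothetical.

```python
# Rule-of-thumb mapping (an assumption for illustration, not from this post).
MEASURES_BY_LEVEL = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}


def suggested_measures(level):
    """Return the measures of center usually considered valid for a level."""
    return MEASURES_BY_LEVEL[level.lower()]


print(suggested_measures("Interval"))
```

Treat this as a starting point only; which measure describes a column best still depends on the data itself.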

You may need to use your favorite search engine to read up on the different levels of data measurement and get a grasp of what they are, but for the sake of this post I'll explain which level the temperature data we used belongs to and why we used the mean to calculate its center:

The temperature data is at the interval level, where the only arithmetic operations that make sense on the values themselves are addition and subtraction. The mean only adds the values together and divides the total by how many there are, so using the mean to calculate the center is a good choice. You can see this from the mean’s formula, can't you?
This is not to say that the median won't give us good information about the center of the data, but as we all know, in machine learning accuracy is worth more than diamonds, so you must be extra careful to choose the measure that most accurately describes your data.
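To see how the three measures compare on our temperature data, here is a short sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median, mode

temperature = [1, 3, 5, 6, 8, 9, 7, 2, 8, 4]

center_mean = mean(temperature)      # sum of the values divided by their count
center_median = median(temperature)  # middle of the sorted values
center_mode = mode(temperature)      # most frequent value

print("mean:  ", center_mean)    # 5.3
print("median:", center_median)  # 5.5
print("mode:  ", center_mode)    # 8
```

Here the mean and the median land close to each other, while the mode simply reports 8 because it is the only value that appears twice.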

Conclusion
As simple as that might look, I've discovered that most data scientists do not really know why we carry out some of these basic statistical operations while analyzing data, and that of course is not a good practice. I hope that by reading this little piece you'll be able to know when you should calculate the center of your data and the best way to calculate it.

A bit about my beliefs:
I believe the importance of mathematics and statistics does not lie in the complexity of their formulas and equations but, rather, in how they can be used to solve human problems.
In light of that, it becomes very important to have a sound understanding of the principles behind these formulas so as to apply them accurately.

Rather than explaining hard mathematical and statistical operations from formulas only, I’ll be here explaining the essence of those operations and the big ‘WHY’ behind them.

Femiloye Oyerinde

A Computer Vision Engineer with research interests in representation learning, self-supervision and vision-language.