Descriptive Statistics 1: Measures of Center

Marco Angelo
12 min read · Jul 3, 2019


Dealing with the Mean, Median, and Mode in statistics

When dealing with data analysis, questions surrounding the average case, or the “centralness” of the data, can pop up quite often.

The 4 main aspects of analyzing quantitative data are the measures of center, spread, shape, and outliers. For this article, we will be focusing solely on measures of center (and coming up for you: articles on the other 3 aspects!)

Say that there are 2 professors teaching the same class subject. You look at the grades of students in Professor A’s class and Professor B’s class. You then make the observation that people tend to get lower grades in Professor B’s class.

Professor A
96, 95, 84, 87, 54, 67, 61, 92, 81, 99
Professor B
60, 68, 43, 52, 67, 78, 90

Sure, you can say that “Professor B’s students, on average, get worse grades.” But how can you prove that? Quantify that?

This is where measures of center come in.

Measures of center give you an idea of the average case. They are used primarily to analyze what, on average, a given scenario would look like.

Measures of Center

The 3 most commonly used measures of center are the mean, median, and mode.

Measures of center are primarily used to analyze discrete and continuous quantitative data. (Discrete data is data that you can count, while continuous data is data that you measure, where any value can be broken down into smaller fractions.)

In order to calculate the mean, median, and mode and explore the idea of measures of center, let us examine the table below.

This mock table depicts how many students attended class in 1 given week. (Hmm, I wonder why nobody cares about attending The Sociology of Miley Cyrus?)

Mean

Suppose you are asked the following question, “How many students do you expect to show up on a given day?” The word expect usually calls for calculating the mean. Generally speaking, you calculate the mean value when the question requires you to make an estimated average.

The mean is the average value of a set of numbers. The mean is representative of the central tendency in the data — it is what all the other values center around.

To calculate the mean, simply add up all the values and then divide by the total number of values, n.

Our mean value is 2.8
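The same add-then-divide calculation can be applied to the two professors' grade lists from the introduction. Below is a minimal Python sketch; the function and variable names are illustrative choices, not part of the original article.

```python
# Grades copied from the Professor A / Professor B example in the introduction.
professor_a = [96, 95, 84, 87, 54, 67, 61, 92, 81, 99]
professor_b = [60, 68, 43, 52, 67, 78, 90]


def mean(values):
    """Add up all the values, then divide by the number of values, n."""
    return sum(values) / len(values)


mean_a = mean(professor_a)
mean_b = mean(professor_b)

print(f"Professor A mean: {mean_a:.1f}")  # 81.6
print(f"Professor B mean: {mean_b:.1f}")  # 65.4
```

The gap of roughly 16 points between the two means is one way to quantify the observation that Professor B's students tend to get lower grades.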

Intuitively, this mean appears correct. For an additional brain teaser, here is a great demonstration of how you can intuitively visualize the mean.

Notation

Below is the mathematical notation for the mean of a set of values:

notation for the calculation of the mean

For those unfamiliar or rusty with mathematical notation, this may appear a bit redundant or tricky. But let us break down the parts of the notation:

labeled notation for the calculation of the mean

This notation essentially says, “The mean is equal to the sum of all values from a specified starting value to an ending value, divided by the number of values.”

The Σ symbol is the Greek letter sigma, which represents “the sum of” in mathematics. The expression under Σ names the counter, i, and gives its starting value, and the value above Σ indicates the ending value, the point where you stop counting.
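For readers who want the formula in text form, here is one common way to write it in LaTeX. This assumes the values are labeled x_1 through x_n and the mean is written as x-bar; the symbols used in the original figure may differ.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

As a quick check with three made-up values, 2, 4, and 9: the sum is 15 and n is 3, so the mean is 15 / 3 = 5.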

Some food for thought: here is a cool visualization of what the mean represents in a K-Means Clustering Algorithm. This is an algorithm commonly used in machine learning — the blue dot represents the mean per cluster of points.

Median

The median can be seen as the exact middle value of a dataset where 50% of the values are larger and 50% of the values are smaller.

However, before going to find the median, there are 2 things you must account for: 1) the order of the data, and 2) whether n is even or odd.

For demonstration purposes, let us work with 2 different datasets: one set has an odd number of values, and the other has an even number of values. We will get back to our class attendance set afterward.

Now it’s time to find the median of each set.

Step 1: Rearrange

When dealing with the median, you need to rearrange the dataset into ascending/descending order.

Here, we will rearrange the arrays in ascending order.

Our values rearranged.

Voila! Both datasets are rearranged in ascending order.

Step 2: Odd or Even?

When n is odd, the median is the exact middle value.

When n is even, the median is the mean of the 2 middle values.

50 is smack dab in the middle of all the values in this odd-sized array.
14 is the mean of the middle 2 values in this even-sized array.
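The two steps above (rearrange, then check whether n is odd or even) can be sketched in Python as follows. The function name is an illustrative choice, and the sample data is the professors' grade lists from the introduction rather than the two arrays pictured here.

```python
def median(values):
    """Median via the two steps: sort, then branch on odd or even n."""
    if not values:
        raise ValueError("cannot take the median of an empty dataset")
    ordered = sorted(values)  # Step 1: rearrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Step 2, odd n: the exact middle value
        return ordered[mid]
    # Step 2, even n: the mean of the 2 middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


professor_a = [96, 95, 84, 87, 54, 67, 61, 92, 81, 99]  # n = 10 (even)
professor_b = [60, 68, 43, 52, 67, 78, 90]              # n = 7 (odd)

print(median(professor_a))  # 85.5, the mean of 84 and 87
print(median(professor_b))  # 67
```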

As can be seen in the first array, there are exactly (n-1)/2 values before and after the median. Translated to English, this mathematical notation says “half of all values, excluding one (the median), lie on each side of the median.”

Let’s look at that case for the odd-sized set:

  • n = 9
  • (n-1)/2 = 4 values above and below the median

In the second array, the same rule still applies: there are (n-1)/2 values above and below the median.

  • n = 8
  • (n-1)/2 = 3.5 values above and below the median

But wait, 3.5 isn’t a whole number! How can this be? You can’t have “half” of a value.

Turns out that in this case, yes you can. This is because the middle 2 values are treated like 1 value to account for the median, so they are treated as half a value each.

13 and 15 are considered 1/2 a value each because they make up the median.

If you believe you’ve found the median and there aren’t the same number of values on the left and right, then you have done something wrong.

This is where I introduce my method of median-spotting so that you can reduce the chances of committing mistakes when looking for the median.

Median-Spotting

A visual trick I personally like to use is to count by 1 from both ends of the array.

Keep incrementing from the beginning and decrementing from the end at the same time, and whatever value(s) you end up pointing to is automatically the median. To me, at least, my method has made median-spotting much more intuitive and fast.
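The counting trick maps naturally onto two index pointers that walk toward each other. The sketch below is one possible way to express it in Python; the function names are hypothetical, and the sample data is the professors' grade lists from the introduction.

```python
def spot_middle(values):
    """Return the value(s) pointed at after counting inward from both ends."""
    ordered = sorted(values)  # the trick only works on sorted data
    if not ordered:
        raise ValueError("cannot spot the median of an empty dataset")
    left, right = 0, len(ordered) - 1
    # Step inward by 1 from both ends until the pointers meet (odd n)
    # or sit right next to each other (even n).
    while right - left > 1:
        left += 1
        right -= 1
    if left == right:
        return [ordered[left]]
    return [ordered[left], ordered[right]]


def median_from_spotting(values):
    """The median is the mean of whatever value(s) the pointers land on."""
    middle = spot_middle(values)
    return sum(middle) / len(middle)


professor_a = [96, 95, 84, 87, 54, 67, 61, 92, 81, 99]  # even-sized
professor_b = [60, 68, 43, 52, 67, 78, 90]              # odd-sized

print(spot_middle(professor_b))           # [67]
print(spot_middle(professor_a))           # [84, 87]
print(median_from_spotting(professor_a))  # 85.5
```

In practice you would index the middle directly instead of looping, but the loop mirrors the visual counting more closely.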

Now let us go back to the Sociology of Miley Cyrus class attendance dataset, and find the median using my median-spotting technique.

Step 1: Rearrange

Step 2: Spot the median

Using our rearranged student attendance numbers, let us start counting from both ends.

Counting towards the middle by 1 from both ends

We end up pointing to the same value: 3. Therefore it is the median.

If we were dealing with an even-sized array, we’d be pointing at 2 values in the middle instead of 1.

Sidenote: it can be easy to confuse the mean with the median. The way I like to think about it is this:

  • the mean is the “average” point of the data, which all other points center around.
  • the median is the “pivot” point where half of all other values are below, and half of all other values are above.

Mode

The mode is the value that occurs the most often.

If we look at the student attendance set and count the number of occurrences for each number, we can see that 3 has the highest count, and therefore shows up the most often.

Here, I color coded the values so that it’s easier to visually correspond each value to the number of times it shows up.

3 shows up the most often: twice. It is also the only value that shows up more than once, so our mode is 3.

  • But what if you have a set of values where more than 1 value repeats, and they all show up the same number of times?
  • What if you don’t have any repeating values at all?

Mode or No Mode?

Can you think of any bugs that you could stumble over when looking for the mode?

Sidenote: when looking for the mean or the mode, it doesn’t matter whether your dataset is sorted or not.

Many repeating values

What if there is more than 1 value that has the highest frequency count?

It is acceptable to say that there is more than 1 mode to a dataset. Let us take the below sequence of numbers, for example.

Before scrolling down for the answer, can you visually spot the modes of the above dataset?

78 and 34 each show up twice, which is the highest frequency in the set, so our modes are 78 and 34.

Be aware that this is a simple example. In another case, there could be values that show up 5 times and others 3 times, 2 times, etc. The mode would still be the value that shows up the most frequently. So in this specific case, the mode would be the value that shows up 5 times.

No Repeating Values

What if there is no repeating value in our dataset?

In contrast to the case above, it is also acceptable to say that a dataset has no mode. Let us observe the below sequence of numbers.

Can you find any repeating values in the dataset given above?

There is no repeating value in this data set: every value occurs exactly once, so no value occurs the most and the set has no mode.

Constant Rate of Occurrence

What if all values occur at the same rate?

Now that we ruled out the cases of many repeating values and no repeating values, here is the last case. What if each value occurred more than 1 time, but all at the same rate? For example, what if all values appeared 3 times? Technically, they all appear more than once — but no value is the one that “occurs the most.”
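The three cases (one or more modes, no repeating values, and a constant rate of occurrence) can be handled by counting occurrences. The sketch below uses Python's `collections.Counter`. The function name and all sample lists are made up for illustration, and treating "all values occur at the same rate" as "no mode" is an assumption based on the reasoning above, not a universal convention.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes, or an empty list if there is no mode."""
    counts = Counter(values)  # maps each value to how often it occurs
    if not counts:
        return []
    highest = max(counts.values())
    # Assumption: if several distinct values all occur equally often,
    # no value "occurs the most", so we report no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical datasets, one per case:
print(modes([1, 2, 3, 3, 5]))            # [3]       one mode
print(modes([78, 34, 12, 78, 34, 56]))   # [34, 78]  more than 1 mode
print(modes([4, 8, 15, 16, 23, 42]))     # []        no repeating values
print(modes([7, 7, 7, 9, 9, 9]))         # []        constant rate of occurrence
```

Other tools make a different choice for the last two cases. For example, `statistics.multimode` in the Python standard library returns every value when they are all tied.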

But How Do I Pick the Mean, Median, or Mode?

Now that the basics of mean, median and mode are covered, the trickier part is knowing which one to use.

If I am given a statistical question about the average case in a data set, then I automatically know that I need to focus on using a measure of center.

Ruling Out the Mode

When dealing with questions of the “typical” or “average” case, it can sometimes become tricky to pick among the mean, median, or mode.

But first, let us rule out the mode, since it is the most distinct of the 3 measures. It is best to pick the mode in scenarios where you want to see which value repeats the most.

When talking about the average value in a dataset (e.g. average test score among a set of tests), we’re usually referring to the mean, which is representative of the sample.

On the other hand, the median is the middle value — smack dab in the middle — where 50% of the values are above, and 50% are below.

Both the mean and median are related to the “centralness” of a dataset. Because of this, it is easy to see why picking between the two would be tricky.

So Should I Pick the Mean or the Median?

The mean is the value that people tend to be most familiar with, but the median can also be used as an alternative to the mean.

Let us take that Miley Cyrus class attendance dataset for 1 week. In general, you should calculate both the mean and the median, and make a decision after that.

If your mean and median are not very different, go with the mean.

If your mean and median are considerably different, then that suggests your data is skewed, often because of an outlier. In this case, pick the median, as it is less affected by extreme values and more reflective of the data distribution.

Outliers and skews will be covered in my 3rd article in this series: Descriptive Statistics 3.

Now let us apply this to the Sociology of Miley Cyrus class attendance case. The question posed is: “How many students, on average, attended The Sociology of Miley Cyrus in this specific week?”

Our previously calculated mean and median.

The mean and the median are not too different. Also, you cannot have 2.8 of a person, so let’s round that to 3. When we do this, the mean is the same as the median.

Judging by our values, we opt for the mean to answer the question:

On average, 3 people attended The Sociology of Miley Cyrus in this specific week.

Sneak peek into outliers

Let’s change the class attendance so that there’s an extraordinarily large value. Let’s say 200 people attended on Friday instead of 3 because the professor said there’s a really important test coming up, and the Sociology of Miley Cyrus will determine whether or not you’ll graduate. (Apparently it’s a REALLY important class).

That changes our arranged dataset to this:

Now let’s calculate the mean and median:

The mean and median are considerably different. In fact, they’re not even close. After rounding down, the mean indicates that on average, 42 people attended the class. The median indicates that the middle attendance value is 3.

But if we look at the numbers, no more than 5 people attended class from Monday to Thursday. No number of people even remotely close to 42 attended class. Ever. So in this case, it is best to pick the median to say that on average, 3 people attended class this week. Alternatively, you could remove the outlier before calculating the mean, which is what people often do.
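To see how strongly a single outlier pulls the mean while leaving the median alone, here is a small sketch using Python's built-in `statistics` module. The attendance numbers are hypothetical: the original table is not reproduced here, so the list below was simply chosen to match the mean (2.8), median (3), and mode (3) reported above.

```python
from statistics import mean, median

# Hypothetical attendance counts, chosen to match the reported summary values.
original = [1, 2, 3, 3, 5]
# Same week, but one day's count of 3 is replaced by 200.
with_outlier = [1, 2, 3, 5, 200]

print(f"original:     mean={mean(original):.1f}, median={median(original)}")
print(f"with outlier: mean={mean(with_outlier):.1f}, median={median(with_outlier)}")

# Option mentioned above: drop the outlier before averaging.
trimmed = [x for x in with_outlier if x != 200]
print(f"outlier removed: mean={mean(trimmed):.2f}")  # 2.75
```

The mean jumps from 2.8 to 42.2, while the median stays at 3 in both cases.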

Conclusions

Big Picture

  • There are 4 ways to analyze quantitative data: through measures of center, spread, shape, and outliers. Here we’re focusing on measures of center.
  • The mean is the average value of a data set.
  • The median is the middle of a data set, given that the data is sorted in ascending order.
  • The mode is the value that shows up most frequently.

The Little Stuff

  • Before looking for the median, you have to order your dataset first.
  • If the number of values, n, is even, then the median is the mean of the middle 2 values.
  • If n is odd, then the median is the number that’s smack dab in the middle.
  • There are (n-1)/2 values on each side of the median.
  • A good way to spot the median quickly is to simultaneously count by 1 from both ends of the dataset.
  • The mean and median may appear similar at first. To make things clearer, you can think of the mean as the “average value” and the median as the “pivot point” where half of all other values fall below, and half of all other values are above.
  • It is acceptable to say that a set of values has more than 1 mode or no mode at all.

The concepts covered in this article may appear to be high school review. But it is still beneficial to review and understand how and why the mean, median, and mode are categorized as they are — as measures of center.
