Measuring the Heart of Your Data: Mastering Central Tendency — Statistical Symphony Series

Arun Prakash Asokan
11 min read · Jan 28, 2023

Are you ready to turn your data from bland and boring to sizzling and spectacular? Ladies and gentlemen, gather round because it’s time to get down and dirty with data! Well, buckle up folks, in this part of the Statistical Symphony Series we’re about to take a wild ride through the exciting world of measure of central tendency, measure of spread, and measure of shape.

That’s right, we’re talking about the statistical trifecta of data analysis, and it’s time to learn how to make your data dance like Michael Jackson, and look like a supermodel.

Let’s get started by zooming in on the measures of central tendency.

Measures of Central Tendency

A measure of central tendency is a single value that describes the center of a set of data. It’s a way to understand the “middle” or “average” value of the data.

Measure of Central Tendency is also known as Measure of Location.

It helps us to summarize a large amount of information into a single value that is easy to understand and interpret.

The two main types of averages are Mathematical Average and Positional Average.

Mathematical Average

There are 3 types of mathematical averages.

  1. Arithmetic Mean
  2. Geometric Mean
  3. Harmonic Mean

“Arithmetic Mean” or just “Mean”

The mean is the sum of all the values in a dataset divided by the number of values. It is also known as the average.

For example, if a researcher wants to know the average height of students in a school, they would add up all the heights of the students and divide by the number of students.

Mean can be affected by outliers, which are extreme values that are very different from the rest of the data. This can cause the mean to be skewed and not give an accurate representation of the central value.

For example, let’s say you have a group of 10 students and their test scores are as follows: [90, 92, 93, 95, 100, 100, 100, 100, 100, 100]. The mean score is 97. Now let’s say an eleventh student, who did not study for the test, scores a 20. The new scores are [90, 92, 93, 95, 100, 100, 100, 100, 100, 100, 20]. Now the mean score is 90, seven points lower than the original mean of 97 and no higher than the lowest score among the original ten students. This is an example of how a single outlier, in this case the student who scored a 20, can noticeably pull the mean score and skew the overall picture of the group’s performance.
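If you want to check the arithmetic yourself, here is a minimal sketch using Python’s standard `statistics` module. The variable names are just illustrative choices, not part of any fixed recipe.

```python
from statistics import mean

# The ten test scores from the example.
scores = [90, 92, 93, 95, 100, 100, 100, 100, 100, 100]

# The same scores plus the one student who scored 20.
scores_with_outlier = scores + [20]

mean_before = mean(scores)
mean_after = mean(scores_with_outlier)

print(f"Mean without the outlier: {mean_before:.1f}")
print(f"Mean with the outlier:    {mean_after:.1f}")
print(f"Drop caused by one score: {mean_before - mean_after:.1f}")
```

One extra score out of eleven is enough to move the mean by several points, which is exactly the sensitivity to outliers described above.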

Three statisticians went out hunting, and came across a large deer. The first statistician fired, but missed, by a meter to the left. The second statistician fired, but also missed, by a meter to the right. The third statistician didn’t fire, but shouted in triumph, “On the average we got it!”

Harmonic Mean

Harmonic mean is a type of average that is best suited to rates and ratios. It is the reciprocal of the arithmetic mean of the reciprocals of the values. The formula for harmonic mean is:

Harmonic Mean = n / (1/x1 + 1/x2 + 1/x3 + … + 1/xn)

where n is the total number of values in the set and x1, x2, x3, …, xn are the individual values.

A practical application of harmonic mean is in finding the average speed of a vehicle over stretches of equal distance. For example, if a car covers three stretches of the same length at 60 km/h, 80 km/h, and 100 km/h, its average speed is the harmonic mean of the speeds: 3/(1/60 + 1/80 + 1/100) ≈ 76.6 km/h. This is lower than the simple arithmetic mean of 80 km/h, because the car spends more time on the slower stretches. (If the stretches were of equal duration rather than equal distance, the time-weighted arithmetic mean would be the right average.)

Another example is in music theory: the harmonic mean of the frequencies of two notes an octave apart (ratio 1:2) is 4/3 of the lower frequency, which is the pitch of a perfect fourth above the lower note.

The intuition behind the formula is that it gives more weight to the lower values in the set, resulting in a value that is closer to the minimum value in the set. This is because the formula takes the reciprocal of each value (1/x) and then adds them up, which in turn gives more weight to the smaller numbers.

This is useful in situations where the smaller values dominate the overall result, as in the speed example above: the slow stretches of a trip take the most time, so they pull the average speed down.

It should be noted that harmonic mean is not suitable for sets of data with negative or zero values.
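To make the formula concrete, here is a small sketch that computes the harmonic mean by hand and cross-checks it against total distance divided by total time. It assumes a trip made of three stretches of equal length; the 120 km stretch length and the function name `harmonic_mean_by_formula` are made up for illustration.

```python
from statistics import harmonic_mean


def harmonic_mean_by_formula(values):
    """Return n / (1/x1 + ... + 1/xn); every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs strictly positive values")
    return len(values) / sum(1 / v for v in values)


speeds_kmh = [60, 80, 100]
stretch_km = 120  # hypothetical length of each stretch (assumption)

hm = harmonic_mean_by_formula(speeds_kmh)

# Cross-check: average speed is total distance divided by total time.
total_distance = stretch_km * len(speeds_kmh)
total_time = sum(stretch_km / s for s in speeds_kmh)
average_speed = total_distance / total_time

print(f"Harmonic mean by formula: {hm:.1f} km/h")
print(f"Distance / time:          {average_speed:.1f} km/h")
print(f"statistics.harmonic_mean: {harmonic_mean(speeds_kmh):.1f} km/h")
```

All three numbers agree, and the result does not depend on the stretch length chosen, as long as every stretch has the same length.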

Geometric Mean

The geometric mean is a type of average that is used to measure the central tendency of a set of positive values, especially values that combine by multiplication. It is calculated by taking the product of all the values in the data set, and then taking the nth root of that product, where n is the number of values in the data set.

For example, let’s say we have a data set of four values: 2, 4, 8, and 16. To find the geometric mean of this data set, we would first multiply the values together to get 1024. Then, we would take the fourth root of 1024 (since there are four values in the data set) to get approximately 5.66. So, the geometric mean of this data set is about 5.66. This value represents the typical or average value of the set, as it is the number that reproduces the product of all the numbers in the set when you multiply it by itself 4 times.
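Here is a quick sketch of that calculation in Python, done once by hand with `math.prod` and once with `statistics.geometric_mean` (both need Python 3.8 or newer). The variable names are only illustrative.

```python
import math
from statistics import geometric_mean

values = [2, 4, 8, 16]

# Step 1: multiply all the values together.
product = math.prod(values)

# Step 2: take the nth root, where n is the number of values.
gm = product ** (1 / len(values))

print(f"Product of the values:      {product}")
print(f"Geometric mean by hand:     {gm:.3f}")
print(f"statistics.geometric_mean:  {geometric_mean(values):.3f}")

# Multiplying the geometric mean by itself n times gives back the product.
print(f"gm ** n:                    {gm ** len(values):.1f}")
```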

The intuition behind the formula is that it answers a simple question: what single value, used in place of every value in the set, would give the same overall product?

This is useful in cases where the data represents quantities that are multiplied together, such as growth rates or rates of return. A practical application of geometric mean is in the stock market, where investors use it to calculate the average rate of return of a portfolio over time. By taking the geometric mean of the portfolio’s growth rates, investors can get a sense of the overall performance of the portfolio, rather than just looking at its highest or lowest values. This can be useful in determining the overall risk and return of an investment.

For example, if an investment has a return of 10% in the first year, 20% in the second year, and 30% in the third year, the geometric mean return would be ((1+0.1)(1+0.2)(1+0.3))^(1/3) - 1 ≈ 0.197 or about 19.7%, slightly below the arithmetic mean of 20%. This value represents the average rate of return for the investment over the three-year period, taking into account the compounding effect of the returns.
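The compounding logic can be sketched in a few lines of Python. This is only an illustration of the formula for yearly returns of 10%, 20% and 30%; names such as `geo_return` are my own.

```python
import math

yearly_returns = [0.10, 0.20, 0.30]
years = len(yearly_returns)

# Turn each return into a growth factor (10% becomes 1.10) and compound them.
growth_factors = [1 + r for r in yearly_returns]
total_growth = math.prod(growth_factors)

# Geometric mean return: the constant yearly rate with the same total growth.
geo_return = total_growth ** (1 / years) - 1

# Simple arithmetic mean of the returns, for comparison.
arith_return = sum(yearly_returns) / years

# Sanity check: compounding the geometric mean rate reproduces the total.
compounded = (1 + geo_return) ** years

print(f"Total growth over {years} years: {total_growth:.3f}")
print(f"Geometric mean return:        {geo_return:.2%}")
print(f"Arithmetic mean return:       {arith_return:.2%}")
print(f"Growth at the geometric rate: {compounded:.3f}")
```

The geometric mean comes out a little below the arithmetic mean, which is typical whenever the yearly returns are not all identical.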

Positional Averages

Positional averages are determined by where a value sits within the data, rather than by a calculation on all the values. In other words, a positional average depends on the position of each data point in an ordered set of numbers, rather than on the sum of the numbers. This makes them effective for ordinal data, and the mode works for nominal data as well. Let’s discuss the popular ones: median, mode, and quantiles.

Median

The median is a measure of central tendency that is used to find the middle value of a dataset when the data is arranged in numerical order.

It is a useful tool for understanding the center of a dataset, especially when there are outliers or extreme values present.

To find the median, you first need to arrange all of the values in the dataset in numerical order. If there is an odd number of values, the median will be the middle value.

For example, if the dataset is [1, 2, 3, 4, 5], the median is 3. If there is an even number of values, the median will be the average of the two middle values. For example, if the dataset is [1, 2, 3, 4, 5, 6], the median is (3+4)/2 = 3.5.

For example, if a researcher wants to know the median height of students in a school, they would arrange all the heights of the students in numerical order and find the middle value. Median is resistant to outliers as it is based on the middle value and is not affected by extreme values.

The intuition behind the median is that it gives an idea of the “typical” value of the dataset. It is barely affected by outliers or extreme values, unlike the mean. For example, if the dataset [1, 2, 3, 4, 5] gains an outlier value of 100, the mean jumps from 3 to about 19.2, but the median only moves from 3 to 3.5.
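A short sketch with Python’s `statistics` module shows both cases (odd and even number of values) and the resistance to outliers. The outlier of 100 appended to [1, 2, 3, 4, 5] is an illustrative assumption.

```python
from statistics import mean, median

odd_count = [1, 2, 3, 4, 5]
even_count = [1, 2, 3, 4, 5, 6]

# Odd number of values: the median is the single middle value.
median_odd = median(odd_count)

# Even number of values: the median is the average of the two middle values.
median_even = median(even_count)

# Add one extreme value and compare how the mean and the median react.
with_outlier = odd_count + [100]
mean_shift = mean(with_outlier) - mean(odd_count)
median_shift = median(with_outlier) - median(odd_count)

print(f"Median of {odd_count}: {median_odd}")
print(f"Median of {even_count}: {median_even}")
print(f"Mean moved by {mean_shift:.2f}, median moved by {median_shift:.2f}")
```

`median` sorts the data internally, so the input list does not have to be in order.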

A practical application of the median is in the field of real estate, where it is used to find the median price of homes in a particular area. This helps buyers and sellers understand the typical price range of homes in that area, and can also be used by researchers to study trends in housing prices over time.

Mode

Mode is the value that occurs most frequently in a dataset. It is a measure of central tendency that is commonly used to describe nominal or ordinal data. For example, in a dataset of students’ favorite colors, if the most common color is “blue,” then “blue” would be the mode.

For example, if a researcher wants to know the mode of the favorite colors of students in a school, they would find the color that is chosen most frequently by the students.

  1. One practical application of mode is in identifying the most popular product in a store or the most common complaint in a survey.
  2. Another example is in the field of genetics, where mode is used to find the most common allele (form of a gene) in a population.

It’s important to note that a dataset can have multiple modes, or no mode at all.

Mode is also resistant to outliers, as it only considers which value occurs most frequently, regardless of the magnitude of the values. A single extreme value changes the mode only if it becomes the most frequent value itself.
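Here is a minimal sketch of finding the mode in Python. The list of favorite colors is made-up sample data, and `statistics.multimode` (Python 3.8 or newer) is used to handle the case of several modes.

```python
from statistics import mode, multimode

# Hypothetical survey answers (made-up data for illustration).
favorite_colors = ["blue", "red", "blue", "green", "blue", "red"]

# mode() returns the single most frequent value.
most_common = mode(favorite_colors)

# multimode() returns every value tied for most frequent, in first-seen order.
tied = multimode(["blue", "red", "blue", "red", "green"])

# When every value occurs once, all values tie; many textbooks call this "no mode".
all_tied = multimode([1, 2, 3])

print(f"Mode: {most_common}")
print(f"Two modes: {tied}")
print(f"All values tied: {all_tied}")
```

Notice that the mode works on text labels, which is why it suits nominal data where a mean or median would not make sense.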

Image Courtesy — Sharon E. Robinson Kurpius, Mary E. Stafford & Jason Love

The Art of Splitting Data: Quantiles (Quartiles, Deciles and Percentiles)

Quantiles

Quantiles are the values that divide a sorted dataset into equal parts. Quartiles, deciles and percentiles are specific types of quantiles that divide a dataset into 4, 10 and 100 equal parts respectively.

Quartiles

Quartiles are a set of three values that divide a dataset into four equal parts, where each part represents 25% of the data. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the median (50th percentile), and the third quartile (Q3) is the 75th percentile.

Image Courtesy — Sigma Magic

For example, imagine you have a dataset of 10 numbers. To find the quartiles, you would first need to sort the data in ascending order and then split the data into 4 equal parts as shown in the picture below.

  • The first quartile (Q1) is the 25th percentile, which is the value that separates the lowest 25% of the data from the highest 75% of the data. In this case, Q1 = 3.
  • The second quartile (Q2) is the median (50th percentile), which is the middle value of the dataset. In this case, Q2 = 7.5 (the average of the two middle values, 7 and 8, as there is an even number of elements).
  • The third quartile (Q3) is the 75th percentile, which is the value that separates the lowest 75% of the data from the highest 25% of the data. In this case, Q3 = 11.
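The quartile steps can be sketched in Python as follows. Two things here are assumptions of mine: the ten numbers are a hypothetical dataset picked so that the quartiles come out as 3, 7.5 and 11, and the "median of each half" rule is only one of several common conventions for computing quartiles.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


# Hypothetical dataset of 10 numbers (made up for illustration).
numbers = [9, 1, 13, 3, 8, 2, 15, 5, 11, 7]

q1, q2, q3 = quartiles_by_halves(numbers)

print(f"Sorted data: {sorted(numbers)}")
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

Library functions such as `statistics.quantiles` interpolate between neighbouring values, so they may return slightly different quartiles for the same data.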

Deciles

Deciles are a set of values that divide a dataset into 10 equal parts. The first decile is the 10th percentile, the second decile is the 20th percentile, and so on.

For example, imagine you have a dataset of the heights of 100 people, and you want to find the 8th decile. The 8th decile is the value that separates the lowest 80% of the data from the highest 20% of the data. So, the 8th decile would be the height that 80 of the 100 people are shorter than.

Percentiles

Percentiles are a set of values that divide a dataset into 100 equal parts. The nth percentile is the value that separates the lowest n% of the data from the highest (100-n)% of the data.

For example, imagine you have a dataset of the weights of 100 babies, and you want to find the 95th percentile. The 95th percentile is the value that separates the lowest 95% of the data from the highest 5% of the data. So, the 95th percentile would be the weight that 95 of the 100 babies are lighter than.

As another example, imagine you have a dataset of exam scores of 20 students, and you want to find the 90th percentile. The 90th percentile is the score that separates the lowest 90% of the data from the highest 10% of the data.
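Here is one way to sketch percentiles and deciles in Python. It uses the "nearest rank" convention, which is an assumption on my part (other conventions interpolate between values), and the 20 exam scores are made-up data.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile using the nearest-rank convention (assumption)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    data = sorted(values)
    # Rank of the smallest value that has at least p% of the data at or below it.
    rank = math.ceil(p * len(data) / 100)
    return data[rank - 1]


def decile_nearest_rank(values, d):
    """The d-th decile is simply the (10 * d)-th percentile."""
    return percentile_nearest_rank(values, 10 * d)


# Hypothetical exam scores of 20 students (made up for illustration).
exam_scores = [41, 44, 47, 50, 53, 56, 59, 62, 65, 68,
               71, 74, 77, 80, 83, 86, 89, 92, 95, 98]

p90 = percentile_nearest_rank(exam_scores, 90)
d8 = decile_nearest_rank(exam_scores, 8)

print(f"90th percentile: {p90}")
print(f"8th decile:      {d8}")
```

With 20 scores, the 90th percentile lands on the 18th score in sorted order, because 18 out of 20 is 90% of the data.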

In summary, quartiles, deciles and percentiles are all quantiles: different ways of dividing a dataset into equal parts, used to understand the distribution of data. They are commonly used in statistics to understand the position of a value within the data and to compare the different values within a dataset.

In the next part of #StatisticalSymphonySeries we will discuss measures of spread and their nuances. I promise to keep it super interesting!

That’s a wrap! Clap if you like :) I’m Arun Prakash Asokan, follow me on Medium for more such content on AI, Statistics, Tech & Personal Finance. See you soon !
