Measures of central tendency

Madhuri Patil
7 min read · Jul 11, 2023

Hello, and welcome to the series — 7 days of statistics for data science. In this second article, you will learn the fundamental concepts of measures of central tendency.

Descriptive statistics are summary statistics that capture critical information about the data. Measures of central tendency are among the most commonly used summary statistics in data analysis.

In this article, you’ll learn about the different measures of central tendency (mean, median, and mode) in detail. Each one helps you understand the data by summarizing it with a single representative value. You’ll also learn how to select the best measure of central tendency for your data.

Selecting the best measure of central tendency depends on the type of data you have. Hence, it is essential to have a proper understanding of datatypes.

So, let’s start with that.

Types of Data

There are mainly two types of data:

  1. Qualitative Data (Categorical)
  2. Quantitative Data (Numerical)

These are further classified into four categories:

Nominal Data

Nominal data is the qualitative data type that does not have any order. It is a category of data that represents types. For example, the color of hair (black, brown, blond, etc.), or gender (male or female), or it can be just the person’s name.

Ordinal Data

Ordinal data is another categorical data type, but it has a specific order. For example, temperature (high, medium, or low) or students’ grades (A, B, C, D). Ordinal data has a sequential order, but the gaps between categories are not necessarily equal, so arithmetic operations such as averaging are not meaningful for it.

Discrete Data

Discrete data is a quantitative data type that we obtain by counting. Discrete means distinct or separate. The values are integers or whole numbers that can’t be divided further into parts. For example, we count an individual student as one student, not as 1.5 students.

Continuous Data

Continuous data is another quantitative data type, obtained by measuring, and it can take any value within a range. These values can be divided further into parts and are often stored as floating-point numbers. For example, height can be measured and stored as continuous data (5.4 or 5.11 ft).

You can find the datatype of each feature in the dataset using the pandas.DataFrame.dtypes attribute. However, pandas provides its own set of data types to represent different kinds of values, and these describe how a column is stored rather than its statistical type. It is worth reading about them in the pandas documentation to interpret the data accurately.
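
To connect the four data types to pandas, here is a minimal sketch. The DataFrame and its column names (hair_color, grade, num_courses, height_ft) are made up for illustration and are not part of the HR dataset. It shows one way to tell pandas that a text column is nominal or ordinal.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    "hair_color": ["black", "brown", "blond", "black"],  # nominal
    "grade": ["B", "A", "C", "A"],                       # ordinal
    "num_courses": [3, 5, 2, 4],                         # discrete
    "height_ft": [5.4, 5.9, 6.1, 5.6],                   # continuous
})

# Nominal: an unordered category
df["hair_color"] = df["hair_color"].astype("category")

# Ordinal: an ordered category, listed from lowest to highest grade
grade_dtype = pd.CategoricalDtype(categories=["D", "C", "B", "A"], ordered=True)
df["grade"] = df["grade"].astype(grade_dtype)

print(df.dtypes)

# Comparisons on an ordered category respect the declared order
print(df["grade"].min(), df["grade"].max())  # C A
```

Declaring the order explicitly matters because pandas cannot guess that "A" ranks above "B"; without it, the column is treated as plain text.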

To illustrate examples, I will be using the HR Analytics Case Study dataset. The dataset contains information about employees.

Using this data, we can analyze the reasons behind employee attrition, which might help the management to understand what changes they should make to their workplace to get most of their employees to stay. Using the measure of central tendency, we will try to understand the distribution of some of the features.

# Import pandas library
import pandas as pd

# Read Data
data = pd.read_csv("hr-analytics-data.csv")
# To check the datatype of each feature
>>> data.dtypes
Age                          int64
Attrition                   object
BusinessTravel              object
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeID                   int64
Gender                      object
JobLevel                     int64
JobRole                     object
MaritalStatus               object
MonthlyIncome                int64
NumCompaniesWorked         float64
Over18                      object
PercentSalaryHike            int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears          float64
TrainingTimesLastYear        int64
YearsAtCompany               int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object

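You can see that the given dataset has 24 columns, or features, representing information about each employee. It contains two floating-point columns, 14 integer columns, and eight object columns. In general, floating-point numbers represent continuous data, integers represent whole numbers, and the object type represents string or text data. Keep in mind that the dtype is only a hint: NumCompaniesWorked and TotalWorkingYears are stored as float64 even though their names suggest whole-number counts, which in pandas usually happens when a column has missing values.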
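
Instead of counting dtypes by eye, you can let pandas do it. The sketch below uses a tiny made-up table (the values are invented, only the column names mirror the HR dataset) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical miniature of the employee table; all values are invented
df = pd.DataFrame({
    "Age": [29, 41, 35],
    "Gender": ["Female", "Male", "Male"],
    "MonthlyIncome": [41890, 23420, 193280],
    "TotalWorkingYears": [6.0, None, 13.0],  # the missing value forces float64
})

# How many columns there are of each dtype
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Split the columns into numeric and non-numeric groups
numeric_cols = df.select_dtypes(include="number").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)  # ['Age', 'MonthlyIncome', 'TotalWorkingYears']
print(text_cols)     # ['Gender']
```

On the real dataset, the same two calls would be applied to data instead of df.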

Measure of Central Tendency

In statistics, the central tendency represents the center point or typical values of the data.

The most common measures of central tendency are:

  • The mean
  • The median
  • The mode

Each measure finds the center point, or the location where most values occur in the dataset. But the real question is, why do we need to know the central tendency?

In the above case study, the data includes employees who left the organization. Now let’s consider the mean. Suppose the average age of employees who left the organization is 35. It tells us that departures are centered around age 35, which hints that older employees may choose stability over changing jobs more often. Of course, other factors might be involved, but for now, management has an idea of which age group of employees to focus on to reduce attrition.

A measure of central tendency is a single value that is representative of the values in a dataset. It helps you understand where the data values fall relative to the center point.

With the help of the mean as a measure of central tendency, in the above example, you can form a hypothesis: employees older than 35 may be more likely to stay with an organization than younger employees. To confirm it, you would compare this figure with the mean age of employees who stayed.

Let’s look at each measure of central tendency and how to compute it using Python.
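
A single mean is easier to interpret next to a comparison group. As a rough sketch, the snippet below computes the mean age separately for employees who left and who stayed. The ages are invented so that the example is self-contained; only the column names Age and Attrition come from the dataset.

```python
import pandas as pd

# Hypothetical records; the ages are invented for illustration
df = pd.DataFrame({
    "Age":       [28, 31, 35, 46, 38, 44, 52, 30],
    "Attrition": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Mean age per attrition group
mean_age = df.groupby("Attrition")["Age"].mean()
print(mean_age["Yes"])  # 35.0
print(mean_age["No"])   # 41.0
```

In this toy data, leavers are younger on average than stayers, which is the kind of gap that would support focusing retention efforts on a younger age group.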

Mean

The mean is one of the most popular measures of central tendency and is often known as the arithmetic average. You can calculate the mean by dividing the sum of all the data points by the total number of observations in the dataset.

# Calculate the mean
>>> data['Age'].mean()
36.92
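
To make the definition concrete, here is a small sketch with made-up ages that computes the mean by hand and checks it against pandas. It also shows that Series.mean() skips missing values by default.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45])  # made-up values

# Sum of all data points divided by the number of observations
manual_mean = ages.sum() / len(ages)
print(manual_mean)   # 35.0
print(ages.mean())   # 35.0

# Missing values are ignored by default (skipna=True)
ages_with_gap = pd.Series([25, None, 35])
print(ages_with_gap.mean())  # 30.0
```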

Median

The median is the middle value of the dataset. It is the value that separates the higher half of the data from the lower half. To calculate the median, first sort the data in ascending order and then find the middle value.
If there is an odd number of data values, the median is the middle value. However, if there is an even number of data values, the median is the average of the two middle values.

#Find the median.
>>> data['Age'].median()
36
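
The odd and even cases described above can be sketched as a short function. The helper name median_by_hand and the sample ages are assumptions made for this illustration.

```python
import pandas as pd

def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # sort in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_ages = [41, 29, 35, 52, 33]
even_ages = [41, 29, 35, 52, 33, 38]

print(median_by_hand(odd_ages))    # 35
print(median_by_hand(even_ages))   # 36.5

# pandas gives the same answers
print(pd.Series(odd_ages).median())   # 35.0
print(pd.Series(even_ages).median())  # 36.5
```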

Mode

The mode is the most frequently occurring value in the dataset. A dataset can have no mode, meaning no value repeats, or it can have one or multiple modes. To find the mode, count the number of occurrences of each value (sorting the data first makes this easier by hand). The most frequent value is the mode.

#Find the mode
>>> data['Age'].mode()
[35]
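
Note that Series.mode() returns a Series rather than a single number, because a dataset can have several modes. The sketch below uses made-up values to show a two-mode case, a categorical case, and what pandas does when no value repeats.

```python
import pandas as pd

# Two values tie for most frequent, so there are two modes
ages = pd.Series([35, 29, 35, 41, 29, 50])
print(ages.mode().tolist())      # [29, 35]

# The mode also works on text (categorical) data
departments = pd.Series(["Sales", "R&D", "R&D", "HR"])
print(departments.mode()[0])     # R&D
print(departments.value_counts())  # frequency of every category

# When no value repeats, every value ties, so pandas returns all of them
unique_values = pd.Series([3, 1, 2])
print(unique_values.mode().tolist())  # [1, 2, 3]
```

value_counts() is often more informative than the mode alone, since it shows how far ahead the most frequent category is.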

The mode tells us the most frequent category in the data, and it is the only measure of central tendency that applies to nominal (unordered categorical) data.

How to Select the Right Measure of Central Tendency?

All of these measures locate the central point of the dataset using different methods. But picking the right measure of central tendency for your data depends on the data type and the distribution you’re working with.

Let’s observe the distribution for different datatypes using histograms. In each distribution, the most common values are at the peak.
Even though the distributions and datatypes are different, you can spot the most common values that represent the central tendency of that data.

Figure: Distribution of different datatypes (source: author’s Jupyter notebook)

The mean is the most commonly used measure of central tendency for numerical data when the distribution is symmetrical or normal and no outliers (extreme values) are present.

You can see in the above plot that the mean represents the central tendency accurately. In a symmetrical distribution, all measures (mean, median, and mode) are almost equal. You may have noticed that the mean, median, and mode for the age feature (36.92, 36, and 35) are close to each other, as the age data has an approximately symmetrical distribution.

The mean considers all the values in the data. If you change any value, the mean will change. If there is an extreme value present in the data, the mean will be highly affected by that value.

In a skewed distribution, the mean does not locate the central tendency accurately, as shown in the figure below. Extreme values pull the mean away from the center, toward the long tail of the distribution. Because outliers affect the mean so strongly, it’s best to use the mean as a measure of central tendency when you have a symmetric distribution.

However, the median is a more appropriate measure of central tendency for skewed distributions because it is robust to outliers, meaning that outliers affect the median far less than they affect the mean.
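
A quick experiment illustrates the difference. The incomes below are invented (in thousands) and a single extreme value is appended to show how each measure reacts.

```python
import pandas as pd

# Made-up incomes in thousands, roughly symmetric
incomes = pd.Series([30, 32, 35, 37, 40])

# The same data plus one extreme value
with_outlier = pd.concat([incomes, pd.Series([400])], ignore_index=True)

print(incomes.mean(), incomes.median())       # 34.8 35.0
print(round(with_outlier.mean(), 2))          # 95.67
print(with_outlier.median())                  # 36.0

# A positive skew value indicates a long right tail
print(with_outlier.skew() > 0)                # True
```

One outlier nearly triples the mean, while the median moves by only one unit.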

The mode is typically used as the measure of central tendency for nominal, ordinal, and discrete data. For nominal data, the mode does not represent a center in the numerical sense; it is just the most common value present in the data.

In summary, when you have a symmetrical distribution for continuous data, you can use the mean, median, or mode, since they are all approximately equal. However, analysts generally recommend the mean as the measure of central tendency for a normal distribution because it includes all the data in the calculation.
For a skewed distribution, the median is often the best measure of central tendency. For nominal data, you must use the mode.
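
These rules of thumb can be written down as a small helper. This is only a sketch: the function name suggest_measure and the skewness cutoff of 1.0 are assumptions made for this example, not a standard rule, and a histogram is still the best way to check a distribution.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def suggest_measure(series, skew_threshold=1.0):
    """Suggest a measure of central tendency for a pandas Series.

    skew_threshold is an arbitrary cutoff chosen for illustration.
    """
    if not is_numeric_dtype(series):
        return "mode"                      # text or categorical data
    if abs(series.skew()) > skew_threshold:
        return "median"                    # strongly skewed numeric data
    return "mean"                          # roughly symmetric numeric data

print(suggest_measure(pd.Series(["Sales", "R&D", "R&D"])))     # mode
print(suggest_measure(pd.Series([30, 32, 35, 37, 40])))        # mean
print(suggest_measure(pd.Series([30, 32, 35, 37, 40, 400])))   # median
```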

In this article, we learned about the different measures of central tendency. Knowing the central tendency is useful in analysis for making decisions and drawing conclusions.

I hope you found this article helpful and informative. Thank you so much for reading! 🤗🙏🏽
