Statistics for Data Science Part 1: Use of Central Tendency for Data Analysis.

Explaining how the most common measures of central tendency (mean, median, and mode) work and how they can help us deal with our data.

Krati Agarwal
Analytics Vidhya
5 min read · Jun 22, 2020


As we know, dealing with data involves several steps, such as data extraction, data cleaning, handling missing data, and exploratory data analysis, and statistics plays a very important role in many of them. So today, let's start with the basics and look at the role that central tendency methods play in some of these steps.

What is Central Tendency?

Central tendency refers to a set of very basic but very useful statistical measures that represent a central point or typical value of a dataset. It helps indicate the point around which most values in the distribution fall, that is, the central location of the distribution. The most common central tendency methods used for the analysis of numerical data are the mean, median, and mode.

Mean

The mean is the most common and well-known method for measuring central tendency and can be used with both discrete and continuous data. We calculate the mean as the sum of all the values in the dataset divided by the number of values in the dataset, and it is denoted as 'µ'.

The mean is often not one of the actual values observed in your dataset, but it is one of the most important properties because it minimizes the error in predicting any value in the dataset: no other single value gives a smaller sum of squared deviations from the data. It also includes every value in your dataset as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
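As a rough illustration of these properties, here is a minimal Python sketch. The sample values and the helper name sum_squared_error are made up for this example and do not come from the article's figures.

```python
from statistics import mean

# Hypothetical sample, chosen only for illustration.
values = [2, 4, 4, 5, 7, 8]

# Mean: sum of all values divided by the number of values.
mu = mean(values)

# Deviations of each value from the mean; these sum to zero.
deviations = [x - mu for x in values]
sum_dev = sum(deviations)

def sum_squared_error(center, data):
    """Total squared error when `center` is used to predict every value."""
    return sum((x - center) ** 2 for x in data)

# The squared error at the mean is smaller than at another candidate (here 4).
sse_at_mean = sum_squared_error(mu, values)
sse_at_4 = sum_squared_error(4, values)

print(mu, sum_dev, sse_at_mean, sse_at_4)
```

Here the "error" is measured as the sum of squared deviations, which is the sense in which the mean is the best single-number prediction.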

In the image below, we can see the histogram for an array of values; the mean is calculated by summing all the values on the x-axis and dividing by the number of values, i.e., 12.

However, the disadvantage of using the mean is that it is particularly susceptible to the influence of outliers. Outliers are values that are very unusual compared to the rest of the data, that is, very small or very large relative to the other values. When the data is perfectly normal, the mean, median, and mode are identical. When the data is skewed, however, the mean loses its ability to provide the best central location because the skewed data drags it away from the typical value.
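The effect of a single outlier can be sketched in a few lines of Python. The salary figures below are hypothetical and are only meant to show how far the mean moves compared to the median.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; not data from the article.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]  # one very large value added

mean_before = mean(salaries)
median_before = median(salaries)

mean_after = mean(with_outlier)
median_after = median(with_outlier)

# The mean is dragged toward the outlier, the median barely moves.
print(mean_before, median_before)
print(mean_after, median_after)
```

In this sample the mean jumps from about 34.6 to 95.5, while the median only shifts from 35 to 35.5.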

The histogram below shows a skewed dataset, in which the mean, median, and mode are no longer equal to each other.

Median

The median is the middle value of your observations when the values in the dataset are ordered from smallest to largest. If the number of values in the dataset is odd, then the middle value is the median. But if the dataset has an even number of values, then to find the median, we take the average of the two middle values.
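The rule can be written out as a short Python sketch: an odd count returns the middle value, and an even count returns the average of the two middle values. The function name median_of is a hypothetical helper for illustration; in practice, statistics.median from the standard library does the same job.

```python
def median_of(values):
    """Return the median: the middle value of the sorted data, or the
    average of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median_of([7, 1, 5])       # sorted: 1, 5, 7
even_result = median_of([7, 1, 5, 3])   # sorted: 1, 3, 5, 7
print(odd_result, even_result)
```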

The histogram below shows the relationship between the mean and mode for symmetric data.

The median is less affected by outliers and skewed data and hence can be used when our dataset has outliers or is skewed. This is because the median represents a point value (not necessarily a score in the dataset) above and below which 50% of the scores or observations fall, and therefore how far any individual value lies from the median is inconsequential.

The disadvantage of the median is that it does not take into account the precise value of each observation and hence does not use all the information available in the data. In addition, if we combine two datasets, the median of the combined dataset cannot be expressed in terms of the individual medians of the datasets.
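The point about combining datasets can be checked with a small Python sketch. All four lists below are hypothetical. The two pairs have the same sizes and the same individual medians, yet their combined medians differ, whereas the combined mean can always be recovered from the individual means and sizes.

```python
from statistics import mean, median

# Two hypothetical pairs of datasets with identical sizes and medians.
a, b = [1, 2, 3], [10, 20, 30, 40, 50]
c, d = [1, 2, 3], [28, 29, 30, 40, 50]

same_medians = (median(a) == median(c)) and (median(b) == median(d))

# Combined medians differ even though the individual medians match.
combined_median_ab = median(a + b)
combined_median_cd = median(c + d)

# The combined mean, by contrast, follows from the means and sizes alone.
weighted_mean_ab = (mean(a) * len(a) + mean(b) * len(b)) / (len(a) + len(b))
combined_mean_ab = mean(a + b)

print(same_medians, combined_median_ab, combined_median_cd)
print(weighted_mean_ab, combined_mean_ab)
```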

Mode

The mode is mostly used when the dataset contains nominal or ordinal values. We can describe the mode as the most frequently occurring value in the dataset.

The histogram below shows the service quality rating by the number of customers, and we can find the mode by seeing which bar is the tallest; in this case, the mode is "very satisfactory", the rating given by the largest number of customers.

Some data sets do not have a mode because each value occurs only once. On the other hand, some data sets can have more than one mode. This happens when the data set has two or more values of equal frequency, which is greater than that of any other value. Mode is rarely used as a summary statistic except to describe a bimodal distribution. In a bimodal distribution, the taller peak is called the major mode, and the shorter one is the minor mode.
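These cases (one mode, several modes, no mode) can be sketched in Python. The helper modes_of and the sample data are hypothetical. It follows the convention described above, where a dataset in which every value occurs only once has no mode; note that statistics.multimode from the standard library (Python 3.8+) instead returns every value in that situation.

```python
from collections import Counter
from statistics import multimode

def modes_of(values):
    """Return all most-frequent values, or an empty list when every
    value occurs only once (treated here as 'no mode')."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]

# Hypothetical survey ratings (nominal/ordinal data).
ratings = ["satisfactory", "very satisfactory", "poor",
           "very satisfactory", "satisfactory", "very satisfactory"]

single_mode = modes_of(ratings)           # one most-frequent rating
two_modes = modes_of([1, 2, 2, 3, 4, 4])  # two values tie for the top count
no_mode = modes_of([1, 2, 3])             # every value occurs once

# For comparison: the standard library reports all tied values.
stdlib_all_unique = multimode([1, 2, 3])

print(single_mode, two_modes, no_mode, stdlib_all_unique)
```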

The disadvantage of the mode is that it is not algebraically defined, which makes further statistical analysis difficult, and that it can fluctuate considerably when the sample size is very small, because the frequencies themselves fluctuate.

Now the question arises: which is best among the mean, median, and mode?

If we have categorical data (a nominal or ordinal dataset), the mean cannot be calculated, and for nominal data the median cannot either because the values have no order, so the best choice in this case is the mode.

But if we have quantitative data, then the best practice is to use the mean or the median, and if the data has outliers or is skewed, then the median is the best measure of central tendency.

In all other cases, the best practice is to use the mean, as it gives the least error.
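The decision rule above can be summarized as a small Python sketch. The function central_value and its two flags are hypothetical names introduced for illustration; deciding whether data is categorical, skewed, or contains outliers is left to the caller.

```python
from statistics import mean, median, mode

def central_value(values, categorical=False, skewed_or_outliers=False):
    """Pick a measure of central tendency following the rule of thumb:
    mode for categorical data, median for skewed data or data with
    outliers, and the mean otherwise."""
    if categorical:
        return mode(values)
    if skewed_or_outliers:
        return median(values)
    return mean(values)

# Hypothetical examples of each branch.
category_center = central_value(["a", "b", "a"], categorical=True)
skewed_center = central_value([1, 2, 3, 100], skewed_or_outliers=True)
plain_center = central_value([1, 2, 3])

print(category_center, skewed_center, plain_center)
```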
