Measures of variability and spread

Dhrubjun
Published in Nerd For Tech
7 min read · Sep 12, 2021
Photo by HempCrew on Unsplash
  • Why do we need measures of variability?

The three main measures of central tendency, the three ‘M’s of statistics (the mean, the median, and the mode), each summarize a data set with a single value that identifies its central position or typical value. These measures indicate where most of the values in a distribution lie; one can picture a cluster of points around a middle value. But sometimes they are not enough to get real information out of the data set. For example, the scores of two batsmen in their last 10 matches are as follows:

Player 1 : 2, 104, 25, 12, 74, 6, 9, 83, 32, 3

Player 2 : 42, 32, 16, 50, 38, 18, 63, 26, 31, 34

The coach of the team has to select only one batsman for the final game based on performance in the last 10 matches.

Image by author

But the coach is confused because both batsmen have the same average, i.e. 35. Since this is the final game, he has to select the batsman who has a better chance of scoring big or who is more reliable. Here the central tendency, in this case the mean, doesn’t provide complete information. We also need to understand the variability around the mean to get the full picture.
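The coach's observation is easy to verify. Below is a minimal sketch using only Python's standard library; the variable names `player1` and `player2` are chosen here for illustration and are not from the original article.

```python
from statistics import mean

# Scores from the last 10 matches, as listed above.
player1 = [2, 104, 25, 12, 74, 6, 9, 83, 32, 3]
player2 = [42, 32, 16, 50, 38, 18, 63, 26, 31, 34]

# Both batsmen scored 350 runs in 10 matches, so the means are identical.
print(f"Player 1 mean: {mean(player1):.1f}")
print(f"Player 2 mean: {mean(player2):.1f}")
```

The mean alone cannot separate the two players, which is why the rest of the article looks at spread.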

The Range :

The range is the simplest measure of variability in terms of calculation and understanding. It is the difference between the largest and the smallest values in a specific data set. For example, the range for player 1 and player 2 of the above example is as follows :

Player 1 : 104–2 = 102

Player 2 : 63–16 = 47

Player 1 has a higher range than Player 2, so by this measure Player 1's scores are more variable than Player 2's.
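The same calculation as a short Python sketch. The helper name `value_range` is hypothetical, introduced only for this example.

```python
def value_range(scores):
    """Return the range: the largest value minus the smallest value."""
    return max(scores) - min(scores)

player1 = [2, 104, 25, 12, 74, 6, 9, 83, 32, 3]
player2 = [42, 32, 16, 50, 38, 18, 63, 26, 31, 34]

print(value_range(player1))  # 104 - 2
print(value_range(player2))  # 63 - 16
```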

Although the range is easy to calculate, it is very susceptible to outliers because it is calculated from the two most extreme values of the data set. If our data set has outliers, using the range to describe how the values are dispersed will be misleading. Let’s take a simple data set.

Image by author

In this data set, the numbers are evenly distributed between the most extreme values, i.e. the lower bound and the upper bound. There are no outliers here. The range of the data set is 4. But if we add an outlier to the data set, the picture changes completely.

Image by author

In both data sets the lowest value is the same, whereas the highest value of the second data set has gone up to 11 because of the added outlier. This gives a new range of (11–1) = 10 for the second data set. The range has increased by 6 just because of the introduction of a single number, the outlier.
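A small sketch of this effect in Python. The original figures are not reproduced in the text, so the values 1 to 5 below are an assumption, chosen to match the stated lowest value of 1 and range of 4; only the outlier 11 is taken from the article.

```python
# Assumed example values: evenly spread, lowest value 1, range 4.
evenly_spread = [1, 2, 3, 4, 5]

# The same data with a single outlier added.
with_outlier = evenly_spread + [11]

range_before = max(evenly_spread) - min(evenly_spread)
range_after = max(with_outlier) - min(with_outlier)

print(f"Range without outlier: {range_before}")
print(f"Range with outlier:    {range_after}")
print(f"Increase:              {range_after - range_before}")
```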

So, the range always reflects outliers, because by definition it is built from the extreme values. Instead of taking the whole range of the data set, we can take a smaller "mini range" of the same data set that excludes the outliers. Here the quartiles come to our rescue.

The Quartiles :

As the name suggests, the quartiles divide the data set into 4 equal parts. It is like finding the median, except that instead of one value that splits the data set into two halves, the quartiles are the values that split the data set into quarters.

Image by author

In the above picture, the quartiles split the data set into 4 parts, each containing 25% of the data. The first or lower quartile Q1 is the value below which the lowest 25% of the data lies, whereas the third or upper quartile Q3 is the value above which the highest 25% of the data lies. The middle quartile Q2 is nothing but the median, as it divides the data set into two halves.

The range between the lower quartile (Q1) and the upper quartile (Q3) is called the interquartile range (IQR).

IQR = Q3-Q1

The IQR covers only the central 50% of the data set. Since outliers are values that lie at the extremes of the data set, the interquartile range is largely unaffected by them: it depends only on the values around the center. So we can compare two data sets more fairly using the IQR, which leaves the outliers out.
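To see how little an outlier moves the IQR compared with the range, here is a hedged sketch using `statistics.quantiles` from Python's standard library (Python 3.8+). Two assumptions: the example values 1 to 5 are illustrative (the article's figure is not reproduced in the text), and the `"inclusive"` quartile method is just one of several common conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range Q3 - Q1, using the 'inclusive' quartile method."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Assumed example values, with and without a single outlier.
evenly_spread = [1, 2, 3, 4, 5]
with_outlier = evenly_spread + [11]

for label, data in [("without outlier", evenly_spread),
                    ("with outlier", with_outlier)]:
    data_range = max(data) - min(data)
    print(f"{label}: range = {data_range}, IQR = {iqr(data)}")
```

Under these assumptions the range jumps from 4 to 10 while the IQR only moves from 2.0 to 2.5.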

We can use the box and whisker diagram, or box plot, to visualize the ranges and quartiles. This diagram shows the range, interquartile range, and median of the data set. One can easily compare two data sets using this diagram. We will take the scores of both batsmen again for comparison.

Image by author

From the above diagram, we can see that:

  1. Player 2 has a relatively small range, and his median is also higher than Player 1's.
  2. The range and interquartile range of Player 1 are much higher than those of Player 2. Sometimes he scores a lot higher than Player 2, but sometimes a lot lower.
  3. Player 2 is more consistent and usually scores higher than Player 1.
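The numbers behind a box plot can be sketched as a five-number summary. The helper `five_number_summary` below is hypothetical, and the quartiles again use the `"inclusive"` method of `statistics.quantiles`; the plotting tool used for the figure above may follow a different quartile or whisker convention, so its values could differ slightly.

```python
from statistics import quantiles

def five_number_summary(scores):
    """Return min, Q1, median, Q3 and max: the values a box plot draws."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "median": median,
            "q3": q3, "max": max(scores)}

player1 = [2, 104, 25, 12, 74, 6, 9, 83, 32, 3]
player2 = [42, 32, 16, 50, 38, 18, 63, 26, 31, 34]

summary1 = five_number_summary(player1)
summary2 = five_number_summary(player2)

for name, s in [("Player 1", summary1), ("Player 2", summary2)]:
    iqr = s["q3"] - s["q1"]
    print(f"{name}: {s}, IQR = {iqr}")
```

With this convention Player 1's median is 18.5 against 33 for Player 2, and Player 1's IQR is several times larger, which agrees with the three observations above.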

One of the disadvantages of the range and the interquartile range is that they only tell us the difference between high and low values of the data set. They don't tell us how often the values lie near the center. We don’t just want the distance between two points of the data set; we also want to measure how much the values vary overall. This is where variance and standard deviation come in.

Variance :

A measure of variability tells us how much the values deviate from the mean. The lower the variability, the more closely the values are concentrated near the mean. We can measure the variability by looking at how far each value is from the mean of the data set. A natural first attempt is the average distance of each value from the mean. For example, take the scores of Player 1.

Image by author

The average distance comes out to be zero. How is that possible?? Actually, the positive and negative distances of each value from the mean cancel each other out, so the average distance becomes zero. To nullify this effect we can square the distances and then take the average.
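The cancellation can be checked directly. A minimal sketch in Python, with variable names chosen here for illustration:

```python
player1 = [2, 104, 25, 12, 74, 6, 9, 83, 32, 3]

mean1 = sum(player1) / len(player1)

# Signed distance of each score from the mean (negative below, positive above).
deviations = [score - mean1 for score in player1]
print(deviations)

# The negative and positive deviations cancel exactly.
average_deviation = sum(deviations) / len(deviations)
print(average_deviation)
```

This holds for any data set, not just this one: the deviations from the mean always sum to zero, which is why they have to be squared (or made positive in some other way) before averaging.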

The variance of Player 1 (Image by author)

Bravo!! This time we get a meaningful number. This method of measuring spread is called variance. This is one of the most important and widely used methods for describing the spread of the data set.

Similarly, we can calculate the variance of Player 2.

The variance of Player 2 (Image by author)

So, the variance of Player 1 (1285.4) is much higher than that of Player 2 (182.4). The values are more widely spread around the mean in the case of Player 1.
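Both variances can be reproduced with a few lines of Python. The function name `population_variance` is hypothetical. Note one assumption made explicit here: the squared distances are divided by n (the population variance), which is what matches the article's numbers; dividing by n − 1 instead gives the sample variance, which is slightly larger.

```python
from statistics import pvariance

def population_variance(scores):
    """Average of the squared distances from the mean (divides by n)."""
    mean = sum(scores) / len(scores)
    return sum((score - mean) ** 2 for score in scores) / len(scores)

player1 = [2, 104, 25, 12, 74, 6, 9, 83, 32, 3]
player2 = [42, 32, 16, 50, 38, 18, 63, 26, 31, 34]

var1 = population_variance(player1)
var2 = population_variance(player2)
print(f"Player 1 variance: {var1:.1f}")
print(f"Player 2 variance: {var2:.1f}")

# Cross-check against the standard library's population variance.
print(f"pvariance check: {pvariance(player1):.1f}, {pvariance(player2):.1f}")
```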

But variance is expressed in the square of the original unit (here, runs squared rather than runs). A higher variance does indicate higher variability within the data set, but a specific value of the variance has no intuitive interpretation. To resolve this problem we have the standard deviation.

Standard Deviation :

Standard deviation is nothing but the square root of the variance.

The standard deviation of Player 1 = √1285.4 ≈ 35.85

The standard deviation of Player 2 = √182.4 ≈ 13.51

The standard deviation tells us how far typical values are from the mean of the data set, in the same unit as the data. It is more intuitive than variance. The smaller the standard deviation, the more tightly the values cluster around the mean. The standard deviation is never negative: it is 0 only when all the values are identical, and it has no upper limit.
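As a quick check, the standard deviation can be computed as the square root of the population variance. This sketch also compares the result with `statistics.pstdev`, the standard library's population standard deviation; the helper name `population_std` is hypothetical.

```python
import math
from statistics import pstdev

def population_std(scores):
    """Square root of the population variance (which divides by n)."""
    mean = sum(scores) / len(scores)
    variance = sum((score - mean) ** 2 for score in scores) / len(scores)
    return math.sqrt(variance)

player1 = [2, 104, 25, 12, 74, 6, 9, 83, 32, 3]
player2 = [42, 32, 16, 50, 38, 18, 63, 26, 31, 34]

sd1 = population_std(player1)
sd2 = population_std(player2)
print(f"Player 1 standard deviation: {sd1:.2f} runs")
print(f"Player 2 standard deviation: {sd2:.2f} runs")

# Cross-check against the standard library.
print(f"pstdev check: {pstdev(player1):.2f}, {pstdev(player2):.2f}")
```

Unlike the variance, these values are in runs, so they can be read directly against the mean of 35: Player 1's typical score sits about 36 runs away from his average, Player 2's only about 14.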

So, the coach is going to choose Player 2 as he is more reliable.

Hope you guys like my article. That’s all for today. Have a nice day :)
