Running into Variability

Alicia Guzman

Published in

Women Data Greenhorns

5 min read · Aug 10, 2018

Have you ever wondered:

How many people are taller than you?
How many people are faster than Usain Bolt?

Image Credits : Freepik — Sports picture by ilovehz

Well, the standard deviation can help us measure how far a data value is from the Mean. Do you remember the Mean? Yes? The Mean is the average of a dataset and a measure of central tendency. We are going to use it a lot in this post to calculate Variability!

But before studying variability in depth, we have to understand “Ranges”.

First, we must define the Range, which is the distance between the maximum value and the minimum value in the dataset.

Range = Max Value - Min Value

The Range sometimes changes when we add new data to the dataset, depending on where the new value falls. For example, a value below the current minimum or above the current maximum will change the Range, while a value in between will not.
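Here is a minimal Python sketch of that idea. It uses the five values from the runners example later in this post; the helper name value_range and the two added values (45 and 90) are made up for illustration.

```python
# The five values from the runners example; 45 and 90 are hypothetical additions.
values = [20, 30, 50, 60, 40]

def value_range(data):
    """Range = max value - min value."""
    return max(data) - min(data)

range_before = value_range(values)            # 60 - 20
range_inside = value_range(values + [45])     # 45 sits between min and max
range_outside = value_range(values + [90])    # 90 becomes the new maximum

print(range_before, range_inside, range_outside)  # 40 40 70
```

Adding 45 leaves the Range untouched, while adding 90 stretches it from 40 to 70.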

Datasets might contain outliers, and we have to learn how to manage them so the range doesn’t change too much. In that case we use the IQR (Interquartile Range), the distance between the third quartile (Q3) and the first quartile (Q1).

IQR = Q3 - Q1

To calculate ranges with outliers we must cut off the tails, that is, cut off the lowest and the highest values in the dataset so that only the middle 50% of the data remains.

To determine whether a value is an outlier, we can use the following formulas:

Outlier < Q1 - 1.5 × IQR
Outlier > Q3 + 1.5 × IQR

The result can be represented using a Boxplot, like the one below.
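As a rough sketch, the two formulas above can be checked in Python with the standard library. The dataset is hypothetical (one extreme value, 200, was added on purpose), and the quartiles use the "inclusive" method of statistics.quantiles; other tools may use slightly different quartile conventions and give slightly different cut-offs.

```python
import statistics

# Hypothetical dataset with one suspiciously large value.
data = [20, 30, 40, 50, 60, 200]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Anything outside these two fences is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

With this data only 200 lands outside the fences, so it is the only value flagged.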

Boxplot representation — Image Credits : The Data Visualisation Catalogue

Variability is the extent to which a distribution is stretched or squeezed. In other words, it is the extent to which data points in a statistical distribution or dataset diverge from the average or mean.

The most common measures of variability are: Range, Interquartile Range, Variance, and Standard Deviation.

Let’s apply this topic to a sample of 5 people who are runners.

Ouh Ouh Ouh WAIT!

First we have the dataset…

So, let’s calculate the mean for this dataset:

Mean = (20 + 30 + 50 + 60 + 40) / 5 = 40

Now we need to calculate the deviation of each value, which tells us the distance between that value and the mean. The deviation is the value minus the mean.
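A small Python sketch of this step, using the five values from the mean calculation above (the variable names are only illustrative):

```python
values = [20, 30, 50, 60, 40]

mean = sum(values) / len(values)

# Deviation = value minus the mean, one per data point.
deviations = [v - mean for v in values]
average_deviation = sum(deviations) / len(deviations)

print(deviations)          # [-20.0, -10.0, 10.0, 20.0, 0.0]
print(average_deviation)   # 0.0
```

Notice that the deviations add up to zero, which is why their plain average is not a useful measure of spread.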

Now, if we calculate the average of these deviations we get 0 (zero), but why? This is because the negative deviations cancel out the positive ones. To resolve this, we can do one of two things:

- Ignore the negative sign, that is, take the absolute value. Be happy and get rid of Negatives! =)

- Square each deviation

The formula for the Average of Absolute Deviations is:

Average Absolute Deviation = Sum of |Value - Mean| / Number of Values
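To make the first option concrete, here is a short Python sketch that takes the absolute value of each deviation and averages them, again with the five values from this post (variable names are illustrative):

```python
values = [20, 30, 50, 60, 40]
mean = sum(values) / len(values)

# abs() drops the sign, so negative deviations no longer cancel positive ones.
absolute_deviations = [abs(v - mean) for v in values]
average_absolute_deviation = sum(absolute_deviations) / len(values)

print(absolute_deviations)          # [20.0, 10.0, 10.0, 20.0, 0.0]
print(average_absolute_deviation)   # 12.0
```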

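To calculate the Variance, we square each deviation to eliminate the negative sign, because a negative number multiplied by itself is positive. Then we average the squared deviations. So, the Variance would be:

Variance = (400 + 100 + 100 + 400 + 0) / 5 = 1,000 / 5 = 200

The Variance is the average of the squared distances of each value from the mean. You can picture each squared distance as the area of a square whose side is that distance.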
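The same calculation can be sketched in Python. The manual steps mirror the text, and statistics.pvariance (the population variance from the standard library) is used only as a cross-check:

```python
import statistics

values = [20, 30, 50, 60, 40]
mean = sum(values) / len(values)

# Square each deviation, then average the squares.
squared_deviations = [(v - mean) ** 2 for v in values]
variance = sum(squared_deviations) / len(values)

print(squared_deviations)             # [400.0, 100.0, 100.0, 400.0, 0.0]
print(variance)                       # 200.0
print(statistics.pvariance(values))   # population variance, also 200
```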

Image Credits : Screenshot from Udacity Video

So, to calculate the Standard Deviation, we take the square root of the variance. The standard deviation is represented by the lowercase Greek letter sigma (σ).

Standard Deviation = √200 ≈ 14.1421

The standard deviation is important because it tells us how spread out the data is around the center, which for normally distributed data is approximately equal to the mean, median, and mode.

If we set the center to 0 (zero), the interval from one standard deviation below to one standard deviation above the center covers approximately 68% of the data, and two standard deviations on each side cover approximately 95% of the data.
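In Python this last step is a one-liner. The sketch below assumes the population variance of 200 computed above and compares the result with statistics.pstdev, the standard library's population standard deviation:

```python
import math
import statistics

values = [20, 30, 50, 60, 40]

variance = statistics.pvariance(values)   # 200, as computed by hand above
std_dev = math.sqrt(variance)

print(round(std_dev, 4))                  # 14.1421
print(statistics.pstdev(values))          # same result in a single call
```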

Normal Distribution — Image Credits : Wikipedia
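These percentages can be checked with a short Python sketch. It assumes an ideal normal distribution with mean 0 and standard deviation 1, modelled with statistics.NormalDist (Python 3.8+); the helper share_within is a made-up name for illustration.

```python
from statistics import NormalDist

# Ideal normal distribution centered at 0 with standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of the distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = share_within(1)
within_two = share_within(2)
within_three = share_within(3)

print(f"{within_one:.1%}, {within_two:.1%}, {within_three:.1%}")
# 68.3%, 95.4%, 99.7%
```

Real datasets are only roughly normal, so the shares you observe in practice will be close to these numbers rather than identical.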

This is the standard deviation of the entire population. When we are working with the standard deviation of a sample, we need to use Bessel’s correction, which divides by n - 1 instead of n, to obtain a number closer to that of the entire population.

This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation. However, the correction often increases the mean squared error in these estimations.
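As an illustration, here is a Python sketch that treats the five values from this post as a sample rather than a whole population. Bessel's correction divides the sum of squared deviations by n - 1 instead of n; statistics.variance and statistics.stdev apply the same correction and are used as a cross-check.

```python
import statistics

values = [20, 30, 50, 60, 40]
n = len(values)
mean = sum(values) / n
sum_of_squares = sum((v - mean) ** 2 for v in values)   # 1,000

population_variance = sum_of_squares / n        # divide by n
sample_variance = sum_of_squares / (n - 1)      # Bessel's correction: n - 1

print(population_variance, sample_variance)     # 200.0 250.0
print(statistics.variance(values))              # sample variance, also 250
print(round(statistics.stdev(values), 4))       # sample standard deviation
```

The corrected variance is always a little larger than the uncorrected one, and the gap shrinks as the sample gets bigger.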

So, Variability is used to determine how “spread out” a group of scores or a dataset is. We can use Ranges or, even more commonly, the Standard Deviation to determine how far a value is from the Mean. When you use Ranges and find that the dataset has outliers, you have to cut off the tails, that is, drop the values that are considered outliers (thankfully, we have formulas to identify outliers in a dataset). We can also represent the dataset and its ranges using a Boxplot.

We also studied the Standard Deviation, which indicates the extent of deviation for a dataset as a whole and is represented by the lowercase Greek letter sigma. The standard deviation helps us determine how much the data varies from the mean: a low standard deviation indicates that the data is closely clustered around the mean, while a high standard deviation indicates that the data is dispersed over a wider range of values. It can also help us decide whether a value is standard/expected or unusual/unexpected. Now we also know that, in a normal distribution, approximately 68% of the values fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% fall within three standard deviations.

We have different tools to simplify the work. For example, you can work in Google Sheets and use its built-in formulas, which help a lot when you need the standard deviation of a large amount of data!
