#3 | Measures of Variability | 7-Days of Statistics for Data Science

Madhuri Patil
12 min read · Aug 8, 2023


Hey, welcome to the series — 7 days of statistics for data science. In this third article, you will learn the fundamental concepts of measures of variability.

Data Science is the process of extracting insights from data using various methods, such as statistics, visualization, and machine learning. One of the most important aspects of Data Science is understanding the characteristics of the data, such as its distribution, central tendency, and variability.

In the previous article, we studied the measures of central tendency, which describe a central point or typical value of the dataset and indicate where most values in a distribution fall.

The measures of variability, combined with the measures of central tendency, provide a better understanding of the data and its distribution than measures of central tendency alone. There are different ways to measure variability, depending on the type and shape of the data. The most common measures of variability are the range, the interquartile range, the variance, and the standard deviation.

In this article, we will focus on the measures of variability. We will also learn why measures of variability are important for data science, along with some examples that explain the common measures. By the end of this article, you will also be able to answer the question: which measure of variability should you select?

To illustrate the examples, we will use the same HR Analytics Case Study dataset as in the previous article and continue our statistical analysis using the pandas library.

Pandas is a powerful Python library for data analysis and manipulation. It provides various methods and functions to calculate measures of variability for a given dataset.

So, let's get started!

What are Measures of Variability?

In statistics, variability is also known as dispersion, scatter, or spread. It describes how much the data values vary within a distribution, telling us how far values fall from the center.

Low dispersion means the data are clustered around the center, while high dispersion means the data are spread widely away from the center.

Why is understanding measures of variability necessary?

Central tendency describes the typical or central value in a dataset. For instance, the mean is often used to represent the central point of a dataset. However, it does not tell us how far a specific data point lies from that center.

Low variation means the data are clustered around the center, so the values are similar to one another. High variation means the data points are scattered, so the values are dissimilar and may include extreme values.

Figure: A normal distribution vs. a skewed distribution with extreme values.
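To make this concrete, here is a minimal sketch with made-up numbers (not from the HR dataset): two samples that share the same mean but have very different spread.

import numpy as np

# Two small samples with the same center but very different spread (illustrative values)
low_var = np.array([48, 49, 50, 51, 52])   # clustered around the center
high_var = np.array([10, 30, 50, 70, 90])  # widely spread around the center

print(np.mean(low_var), np.mean(high_var))  # 50.0 50.0 -> same central tendency
print(np.std(low_var), np.std(high_var))    # ~1.41 vs ~28.28 -> very different variability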

Outliers are data points that differ significantly from the rest of the data in a dataset; they are usually unusually large or small observations. An outlier or extreme value can be caused by various factors, such as measurement errors or natural variability.

Outliers can have a negative impact on machine learning models, as they can distort the statistical properties of the data, such as the mean, variance, and correlation, and affect the performance and accuracy of the models. Measures of dispersion help you detect such cases.

Let’s learn further in detail about different measures of variability.

The Range

The range is a measure of dispersion that indicates how much the values of a feature vary across the data set.

It is the difference between the maximum and the minimum values.

Range = Highest Value - Lowest Value

In machine learning, it's important to consider the range of the data. It helps you understand how the data is distributed and its scale, and it is also useful when applying normalization or standardization techniques.

Normalization and standardization are two techniques that are often used in machine learning to rescale data values. Normalization transforms the data values to a range between 0 and 1, while standardization transforms the data values to have a mean of 0 and a standard deviation of 1. These methods are commonly used to improve the performance of algorithms.
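As a quick illustration, here is a minimal sketch of both techniques on a small pandas Series (the values are made up for illustration, not taken from the HR dataset):

import pandas as pd

s = pd.Series([18, 25, 32, 47, 60])  # illustrative values

# Normalization (min-max scaling): values end up between 0 and 1
normalized = (s - s.min()) / (s.max() - s.min())

# Standardization (z-score): mean of 0 and standard deviation of 1
standardized = (s - s.mean()) / s.std()

print(normalized.min(), normalized.max())                   # 0.0 1.0
print(round(standardized.mean(), 10), standardized.std())   # ~0.0 ~1.0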

If you look at the HR analytics dataset, you may notice that the feature ‘Age’ ranges from 18 to 60, while the feature ‘MonthlyIncome’ ranges from 10k to 200k. The income values are roughly a thousand times larger than the age values, so they may dominate the model simply because of their scale, even though that doesn’t make income the most important feature.

To avoid this, normalization is used to rescale the data into similar ranges or a common scale, for example from 0 to 1. There are different scaling methods, for example simple feature scaling, the min-max scaler, and the z-score/standard scaler. To compute the range, you can use pandas' max() and min() methods.

>>> df['MonthlyIncome']
0 131160
1 41890
2 193280
3 83210
4 23420
...
4405 60290
4406 26790
4407 37020
4408 23980
4409 54680
Name: MonthlyIncome, Length: 4410, dtype: int64

>>> max_val = df['MonthlyIncome'].max()
>>> min_val = df['MonthlyIncome'].min()

# To calculate the range
>>> max_val - min_val
189900

# Normalization using simple feature scaling.
>>> x_new = df['MonthlyIncome'] / max_val
>>> x_new.min(), x_new.max()
(0.0, 1.0)

After scaling, all features have similar ranges and hence a comparable influence on the machine learning model.

The range is simple to understand, but it has some limitations you need to consider before using it. The range is sensitive to outliers, as it uses only the two most extreme values. If one value in the dataset is extremely low or high, it can distort the entire range, which can lead to misleading interpretations of variability.

The Interquartile Range (IQR)

The interquartile range represents the spread of the middle 50% of your data. To understand the interquartile range (IQR), consider a dataset sorted in ascending order and divided into four equal parts, or quarters; the cut points are known as quartiles and are denoted, from low to high, Q1, Q2, and Q3. The lowest quarter (below Q1) contains the smallest 25% of values, while the highest quarter (above Q3) contains the largest 25%.

The interquartile range is the middle half of the data that lies between the upper and lower quartiles.

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data, which covers 50% (75 - 25) of the data.

Interquartile range (IQR) = Q3 - Q1

The interquartile range is robust to outliers because it excludes the extreme values; outliers have little effect on the interquartile range and the median, since these measures don't depend on every value in the dataset. Additionally, like the median, the interquartile range works well for skewed distributions.

Understand IQR Using Boxplots

Boxplots are a great way to visualize the interquartile range, the shape of a distribution, and its relation to the median of the data. These graphs show the range of values based on the quartiles, with individual points for outliers.

The boxplot below displays the different data distributions divided into quartiles.

The box in a boxplot contains the middle 50% of the data, which is the interquartile range. Different box sizes represent different amounts of variability: the wider the box, the more dispersion in the distribution.

The line inside the box represents the median. If the median is near the center of the box, the data is roughly symmetric (as in a normal distribution). However, if the median lies closer to one side of the box, the data has a skewed distribution.
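As a minimal sketch (assuming the HR data is already loaded into the DataFrame df used in the earlier snippets, and that matplotlib is available, which the article doesn't show elsewhere), you can draw such a boxplot for ‘MonthlyIncome’ with pandas:

import matplotlib.pyplot as plt

# Boxplot of a single feature; the box spans Q1 to Q3 and the line inside is the median
df.boxplot(column='MonthlyIncome')
plt.title('MonthlyIncome distribution')
plt.ylabel('Monthly income')
plt.show()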

Use IQR to Find Outliers

The interquartile range is often used to find outliers in data. You can use the IQR to calculate the upper and lower extreme values; generally, outliers are the observations that fall outside these limits.

Upper extreme = Q3 + 1.5 * IQR

Lower extreme = Q1 - 1.5 * IQR

In a boxplot, these limits are indicated by the whiskers of the box (the bars at the ends of the whiskers), and values outside of them are outliers, drawn as individual points.

You can use these values to find outliers. These outliers can then be removed, replaced, or rescaled depending on the problem and the data.

Let’s find the extreme values for the feature ‘MonthlyIncome’ to discover if there are any outliers present in the data.

# Find the first & third quartiles using pandas' `quantile()` method.
import numpy as np

Q1 = df['MonthlyIncome'].quantile(q=0.25)
Q3 = df['MonthlyIncome'].quantile(q=0.75)

# Interquartile range
IQR = Q3 - Q1

# Upper and lower extremes
upper_extreme = Q3 + 1.5 * IQR
lower_extreme = Q1 - 1.5 * IQR

# Flag outliers (1 = outlier, 0 = not an outlier)
income_arr = np.array(df['MonthlyIncome'])
upper_outliers = np.where(income_arr > upper_extreme, 1, 0)
lower_outliers = np.where(income_arr < lower_extreme, 1, 0)

# Is an outlier present in the data?
>>> any(upper_outliers) | any(lower_outliers)
True

Instead of going through this entire process, you can use graphs such as boxplots or histograms to check whether a particular variable has outliers. However, this method is useful when you want to replace or cap the extreme values instead of removing them completely.
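For example, one way to replace the extreme values is to cap them at the IQR limits computed above; this is just a sketch of one option (clipping), not the only way to handle outliers.

# Cap 'MonthlyIncome' at the IQR-based limits instead of dropping rows
capped_income = df['MonthlyIncome'].clip(lower=lower_extreme, upper=upper_extreme)

# After capping, no value falls outside the limits
print(capped_income.min() >= lower_extreme, capped_income.max() <= upper_extreme)  # True True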

Variance (Var)

Until now, we have used ranges to measure dispersion and understand the distribution of the dataset. Unlike these measures, the variance takes every data point into account.

Variance is the average squared difference between each data value and the mean.

Variance = (1 / n) * Σ (xᵢ - x̄)²

Here, n is the total number of observations, xᵢ is the i-th observation of x, and x̄ is the mean of the x observations.

Variance describes how far the values deviate from the mean on average. When there is no variability in the data, the variance is zero, since all values equal the mean. As the data values spread further from the mean, the variance increases; a higher variance means larger deviations in the dataset.
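As a minimal sketch (again assuming the df DataFrame from the earlier snippets), you can compute the variance of ‘MonthlyIncome’ directly; note that pandas and NumPy use slightly different defaults.

import numpy as np

# pandas uses the sample variance by default (divides by n - 1)
print(df['MonthlyIncome'].var())

# NumPy uses the population variance by default (divides by n)
print(np.var(df['MonthlyIncome']))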

You can use the variance to understand how much the predictions of a machine learning model vary from the mean of the predictions.

A high variance model is prone to overfitting, which means it captures the noise and the specific patterns of the training data but fails to generalize well to new and unseen data.

A low-variance model is more stable and consistent, but it may underfit the data, which means it misses some significant features or relationships that could improve its performance.

The goal of any Machine Learning model is to find a balance between variance and bias, which is another source of error that occurs when a model makes incorrect assumptions about the data.

Check out my article Finding a Balance in Bias-Variance Trade-off for insight on Variance and Bias.

Let’s predict the ‘MonthlyIncome’ of an employee using the Linear Regression algorithm to understand the variance in predicted values.

For that, we’ll perform feature selection and prepare our data for the model training before making predictions.

# Import required libraries
import numpy as np
from sklearn.model_selection import train_test_split

# Selecting features
num_variables = ['Age', 'Education', 'YearsAtCompany']
cat_variables = ['BusinessTravel', 'Department', 'Gender', 'MaritalStatus', 'Attrition']
features = num_variables + cat_variables
label = ['MonthlyIncome']

# Split the data into training & test datasets
df_train, df_test = train_test_split(df[features + label], test_size=0.3, random_state=42)

# Split the training data into X and y
X = df_train.drop(label, axis=1)
y = df_train['MonthlyIncome']

### --- Data Preprocessing ---
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([
    ('encoder', OneHotEncoder(), cat_variables),
    ('normalization', MinMaxScaler(), num_variables)
])

# Fit the transformer on the training data
transformer.fit(X)

# Transform the data
X_transform = transformer.transform(X)

### ---- Linear Regression Model training ----
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
lr_model.fit(X_transform, y)

# Make predictions and measure their spread
predictions = lr_model.predict(X_transform)
mean = np.mean(predictions)
var = np.var(predictions)

print(f"Mean:: {round(mean)}")
print(f"Variance in predictions:: {round(var)}")

"""
Mean:: 63938
Variance in predictions:: 28896912
"""

The variance calculation involves squared differences, which results in squared units that differ from the original data units. This makes the variance harder to interpret than other measures of variability. The standard deviation addresses this issue, so let's continue reading.

Standard Deviation (STD)

The standard deviation is a summary statistic to measure the variability in a dataset. It represents the typical distance between each data point and the mean.

A low standard deviation indicates that the data points lie close to the mean and the values are similar. Conversely, a higher standard deviation indicates that the values fall further from the mean and are more dissimilar.

The standard deviation is the square root of the variance.

Standard deviation (STD) = √Variance = √( (1 / n) * Σ (xᵢ - x̄)² )

It has the same unit as the data values, which makes interpretation easy. Hence, the standard deviation is the most commonly used measure of variability.
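As a minimal sketch (assuming the same df DataFrame), you can compute the standard deviation of ‘MonthlyIncome’ and confirm it is just the square root of the variance:

import numpy as np

std = df['MonthlyIncome'].std()   # sample standard deviation (divides by n - 1)
var = df['MonthlyIncome'].var()   # sample variance (same n - 1 convention)

print(std, np.sqrt(var))          # the two values match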

The standard deviation, together with the mean, is commonly used when you have normally distributed data. It tells you what proportion of the observations fall within certain distances of the mean; for example, in a normal distribution about 68% of values lie within one standard deviation of the mean.

In machine learning, the standard deviation can be used to measure the uncertainty or variability of a model’s predictions or to compare the performance of different models on a given dataset.

A common way to do this is to use cross-validation, a technique that splits the dataset into multiple subsets and trains and tests each model on different subsets. The performance metric (e.g., accuracy, error rate, etc.) of each model can then be calculated for each subset, and the mean and standard deviation of these metrics can be used to compare the models.

Let's continue with the previous example to analyze the performance of several models using the mean and standard deviation. For that, we will train the models and evaluate them with the Mean Absolute Error metric using cross-validation.

# Model evaluation using cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

models = {
    'linearReg': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=42),
    'randomForest': RandomForestRegressor(random_state=42),
    'knn': KNeighborsRegressor()
}

# Model training and evaluation
for name, model in models.items():
    scores = cross_val_score(model, X_transform, y, scoring='neg_mean_absolute_error', cv=10, n_jobs=-1)
    mean = round(np.mean(-scores))
    std = round(np.std(-scores))
    print(f'{name} - MAE: {mean} (+-{std})')

"""
linearReg - MAE: 35970 (+-1686)
tree - MAE: 8383 (+-1449)
randomForest - MAE: 14726 (+-1288)
knn - MAE: 31065 (+-1413)
"""

The mean absolute error is a metric that calculates the average magnitude of the errors between predicted and actual values. It helps us judge how accurately a model will perform on new, unseen data. For accurate predictions, we want the model with the lowest error and the lowest standard deviation.

In this case, the decision tree model has the lowest error. So, if you use the decision tree to predict an employee's monthly salary on new input data, its typical error ranges roughly from $8383 - $1449 = $6934 to $8383 + $1449 = $9832.

Which method for measuring variability should you choose?

When you have a small dataset with similar values, consider using the range as the measure of variability. Keep in mind that the range is sensitive to outliers, as it depends only on the two extreme values in the dataset, and it does not reflect how the data is distributed around the central tendency.

So, the range is most commonly used for smaller datasets where outliers are unlikely and when you don't have enough data to calculate the other measures of variability.

The standard deviation and interquartile range provide more information about the variability of the data: the standard deviation is based on the average distance of the values from the mean, while the interquartile range is based on percentiles around the median.

The standard deviation is commonly used with the mean as a measure of variability for Normally Distributed or almost bell-shaped data.

When you have a skewed distribution, the median is a better measure of central tendency, and it is commonly used with the interquartile range or other percentiles, as these are robust to outliers; extreme values have almost no effect on them.
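As a minimal sketch (assuming the same df DataFrame, and that ‘MonthlyIncome’ is right-skewed, as income features often are), you can compare the two summaries side by side:

# Mean +/- standard deviation (suited to roughly normal data)
mean, std = df['MonthlyIncome'].mean(), df['MonthlyIncome'].std()

# Median and interquartile range (robust summaries for skewed data)
median = df['MonthlyIncome'].median()
iqr = df['MonthlyIncome'].quantile(0.75) - df['MonthlyIncome'].quantile(0.25)

print(f"Mean +/- STD : {mean:.0f} +/- {std:.0f}")
print(f"Median / IQR : {median:.0f} / {iqr:.0f}")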

Although the variance can be difficult to interpret as a measure of variability, it is still commonly used in various statistical tests. However, it is rarely used on its own to describe the dispersion of data.

Summary

Measures of variability are important statistics that help us understand how data are spread and how much the values differ from each other. They can also help us compare the performance of different models and make informed decisions based on data. In this post, you learned the meaning of variability and how to calculate some common measures of variability, such as the range, interquartile range, variance, and standard deviation.

🙏 Thank you for reading this blog post about measures of variability. I hope you learned something new and useful from it.

If you enjoyed this post, please leave a comment below and share your thoughts. I would love to hear your feedback on this post. What did you learn from it? How do you use measures of variability in your own data science projects?

I appreciate your time and attention. Stay tuned for more posts on statistics and data science. Have a great day!
