Data’s Dance of Diversity: The Essence of Dispersion

Akash Srivastava
7 min read · Sep 22, 2023


To understand data well, it is not enough to study only central tendency or probability. We also need to understand how varied the data is, that is, how far the values spread or scatter around the central tendency. That is what dispersion measures. In this blog we will look at what dispersion is and how it works. Let’s start.

What is Dispersion?

Dispersion describes how the values in a dataset are distributed. It quantifies the variation in the data and tells us how tightly or loosely the values cluster.

When studying data dispersion, we aim to answer questions like:

  • How spread out are the data values?
  • Are the data points concentrated or widely scattered?
  • Are there any outliers in the dataset?

Measures of Data Dispersion

A measure of dispersion is a way to tell how spread out or scattered data is in a set of numbers. It helps us understand how much the numbers vary from one another. In simple terms, it tells us if the numbers are close together or far apart.

Let’s understand it with an example:

Imagine you’re a teacher, and you want to know how well your students did on a recent math test. You have two classes: Class A and Class B. Here are their scores:

Class A: 80, 82, 83, 81, 84

Class B: 60, 95, 70, 75, 65

Now, you want to compare the two classes. To do that, you can use a measure of dispersion. Let’s calculate the range for each class:

  • Class A Range: 84 (highest score) - 80 (lowest score) = 4
  • Class B Range: 95 (highest score) - 60 (lowest score) = 35

In this case, Class B has a much larger range than Class A, which means the scores in Class B are more spread out or dispersed. This helps you understand that in Class B, some students did exceptionally well, while others didn’t do as well. On the other hand, in Class A, the scores are closer together, indicating more consistent performance among the students.
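The comparison above can be reproduced in a few lines of Python. This is a minimal sketch using the two score lists from the example; the helper name `value_range` is made up for illustration and does not come from any library.

```python
# Scores from the two classes in the example above
class_a = [80, 82, 83, 81, 84]
class_b = [60, 95, 70, 75, 65]


def value_range(scores):
    """Return the difference between the highest and the lowest value."""
    return max(scores) - min(scores)


range_a = value_range(class_a)
range_b = value_range(class_b)

print(f"Class A range: {range_a}")  # prints "Class A range: 4"
print(f"Class B range: {range_b}")  # prints "Class B range: 35"
```

Because the range only looks at the two extreme values, a single unusual score can change it a lot, which is one reason the other measures below are useful.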

Most common measures of data dispersion:

  1. Range: The range is the difference between the highest and the lowest value in the dataset. It gives a quick sense of how widely the data is spread, and it is easy to compute.

Range = Highest value - Lowest value

Let’s understand it with an example:

Imagine we have temperature data for a particular location over a week. We’ll calculate the range of temperature values and visualize it using a line plot. Please note that this example is hypothetical, and you can replace the data with real-world data for your specific analysis.

Let’s use Python to calculate and visualize it.

import matplotlib.pyplot as plt

# Sample temperature data for a week (in Celsius)
temperature_data = [23, 24, 22, 26, 27, 25, 28]

# Calculate the range
temperature_range = max(temperature_data) - min(temperature_data)

# Create a line plot to visualize the temperature data
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

plt.figure(figsize=(8, 4))
plt.plot(days_of_week, temperature_data, marker='o', color='green', linestyle='-', linewidth=2, markersize=8)
plt.xlabel("Day of the Week")
plt.ylabel("Temperature (°C)")
plt.title("Weekly Temperature Variation")
plt.grid(True)
plt.ylim(min(temperature_data) - 2, max(temperature_data) + 2) # Adjust y-axis limits for better visualization

# Annotate the plot with the temperature range
plt.annotate(f"Range: {temperature_range}°C", xy=(3, max(temperature_data) - 1), fontsize=12, color='red')

plt.show()
Output: a line plot of the seven daily temperatures, annotated with "Range: 6°C".

  2. Variance: Variance measures how spread out a set of data points is. It tells us how much the individual data points differ, on average, from the mean. The population variance is denoted by σ² (sigma squared) and the sample variance by s². The formulas are:

Population variance: σ² = Σ (X - μ)² / N

Sample variance: s² = Σ (X - x̄)² / (n - 1)

Here N is the population size and μ is the population mean; n is the sample size and x̄ (X-bar) is the sample mean. In both formulas, X stands for the individual data points.

Let’s understand it with an example:

Imagine you have a group of students, and you want to know how different their test scores are from the class average. You calculate the average score, and then you look at each student’s score and see how much it deviates (differs) from that average. Variance quantifies this deviation.

Let’s compute the variance in Python and visualize it.

import matplotlib.pyplot as plt
import numpy as np

# Sample test scores for a group of students
test_scores = [85, 92, 88, 78, 90, 95, 87, 82, 91, 89]

# Calculate the mean (average) score
mean_score = np.mean(test_scores)

# Calculate variance (np.var defaults to the population variance, ddof=0)
variance = np.var(test_scores)

# Create a histogram to visualize the distribution of scores
plt.hist(test_scores, bins=5, color='lightblue', edgecolor='black')
plt.xlabel("Test Scores")
plt.ylabel("Frequency")
plt.title("Distribution of Test Scores")
plt.axvline(mean_score, color='red', linestyle='dashed', linewidth=2,
label='Mean Score')
plt.legend()

# Display variance as a text annotation on the plot
# (axes-fraction coordinates keep the label inside the plot area)
plt.annotate(f"Variance: {variance:.2f}", xy=(0.05, 0.75), xycoords='axes fraction',
fontsize=12, color='green')

plt.show()
Output: a histogram of the test scores with a dashed red line at the mean score.

From the plot above, we can see how the test scores are distributed with respect to the mean score.
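To make the calculation itself visible, here is a minimal sketch that computes the variance of the same test scores step by step, without NumPy. It uses only Python's standard `statistics` module as a cross-check; the variable names are chosen for illustration. It also shows the difference between dividing by N (population variance, which is what `np.var` returns by default) and dividing by n - 1 (sample variance).

```python
import math
import statistics

# Same test scores as in the example above
test_scores = [85, 92, 88, 78, 90, 95, 87, 82, 91, 89]

n = len(test_scores)
mean_score = sum(test_scores) / n

# Squared deviation of each score from the mean
squared_deviations = [(x - mean_score) ** 2 for x in test_scores]

population_variance = sum(squared_deviations) / n    # divide by N
sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1

print(f"Mean: {mean_score:.2f}")                          # Mean: 87.70
print(f"Population variance: {population_variance:.2f}")  # Population variance: 22.41
print(f"Sample variance: {sample_variance:.2f}")          # Sample variance: 24.90

# Cross-check against the standard library
assert math.isclose(population_variance, statistics.pvariance(test_scores))
assert math.isclose(sample_variance, statistics.variance(test_scores))
```

The sample variance is always a little larger than the population variance for the same numbers, because it divides the same sum by a smaller number.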

  3. Standard Deviation: The standard deviation is the square root of the variance, which brings the measure back to the original units of the data. A low standard deviation indicates that the data points lie close to the mean, while a high one indicates that they are spread far from it. It measures how much the values in a group typically differ from the average, and so tells us how spread out or consistent the data is.

Standard Deviation (σ) = √(σ²)

Let’s understand it with an example:

Imagine you have a bunch of points on a line, and you want to know how far each point is from the center of that line. The standard deviation gives you a single number that represents, on average, how far away each point is from the center.

Let’s compute it in Python and visualize the data.

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Generate a sample dataset
np.random.seed(42)
data = np.random.normal(loc=0, scale=1, size=1000)

# Create a DataFrame
df = pd.DataFrame({'Data': data})

# Visualize the data using a histogram
sns.histplot(df['Data'], kde=True)
plt.title('Histogram of Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

# Calculate and print variance and standard deviation
# (pandas defaults to the sample versions, ddof=1, which divide by n - 1)
variance = df['Data'].var()
std_deviation = df['Data'].std()
print(f'Variance: {variance}')
print(f'Standard Deviation: {std_deviation}')
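Output: a histogram of the generated data with a KDE curve, followed by the printed variance and standard deviation.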
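For a smaller, hand-checkable case, the sketch below applies the standard deviation to the Class A and Class B scores from the beginning of the article. It assumes we treat each class as a full population, so it uses `statistics.pstdev` and `statistics.pvariance` from Python's standard library; the variable names are illustrative.

```python
import statistics

# Scores from the Class A / Class B example earlier in the article
class_a = [80, 82, 83, 81, 84]
class_b = [60, 95, 70, 75, 65]

# Population standard deviation: square root of the population variance
std_a = statistics.pstdev(class_a)
std_b = statistics.pstdev(class_b)

print(f"Class A: variance = {statistics.pvariance(class_a):.2f}, std = {std_a:.2f}")
# Class A: variance = 2.00, std = 1.41
print(f"Class B: variance = {statistics.pvariance(class_b):.2f}, std = {std_b:.2f}")
# Class B: variance = 146.00, std = 12.08
```

The standard deviations are in the same unit as the scores (points), so they are easier to interpret than the variances: scores in Class A typically sit within a point or two of the mean, while scores in Class B typically sit about twelve points away.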

  4. Interquartile Range (IQR): The interquartile range measures how spread out the middle portion of a dataset is. It is the range of values within which the middle 50% of the data falls.

IQR = Third Quartile (Q3) - First Quartile (Q1)

Let’s understand it with an example:

To explain it easily, think of a group of numbers arranged in order from smallest to largest. The IQR focuses on the middle part of these numbers, specifically the range between the 25th percentile and the 75th percentile. In other words, it tells us how much the middle 50% of the data spreads out.

Let’s compute the IQR in Python and visualize it.

import matplotlib.pyplot as plt
import numpy as np

# Sample data (you can replace this with your own dataset)
data = [12, 15, 18, 22, 24, 26, 28, 30, 35, 40, 60]

# Calculate quartiles
first_quartile = np.percentile(data, 25)
third_quartile = np.percentile(data, 75)

# Calculate IQR
iqr = third_quartile - first_quartile

# Create a box plot of the full dataset: the box spans Q1 to Q3,
# so its width is the IQR
plt.boxplot(data, vert=False)
plt.xlabel("Data")
plt.title("Box Plot with IQR (Middle 50%)")
# Axes-fraction coordinates keep the label inside the plot area
plt.annotate(f"IQR: {iqr:.2f}", xy=(0.05, 0.9), xycoords='axes fraction', fontsize=12, color='green')

plt.show()
Output: a horizontal box plot of the data.

If the IQR is small, it means that most of the data points are close together in the middle. If the IQR is large, it means the middle data points are more spread out.
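The IQR is also a common starting point for answering the outlier question raised at the beginning. The sketch below uses the widely used 1.5 × IQR rule of thumb, which is an assumption added here and not something the example above relies on. It uses `statistics.quantiles` from the standard library with `method="inclusive"`, which for this dataset gives the same quartiles as `np.percentile`; the names `lower_fence` and `upper_fence` are illustrative.

```python
import statistics

# Same sample data as in the IQR example above
data = [12, 15, 18, 22, 24, 26, 28, 30, 35, 40, 60]

# Quartiles: n=4 splits the data into four parts and returns Q1, median, Q3
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Rule of thumb (an assumption here): points more than 1.5 * IQR
# outside the quartiles are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")          # Q1 = 20.0, Q3 = 32.5, IQR = 12.5
print(f"Fences: [{lower_fence}, {upper_fence}]")     # Fences: [1.25, 51.25]
print(f"Potential outliers: {outliers}")             # Potential outliers: [60]
```

Unlike the range, the IQR is hardly affected by the value 60, which is why it is often preferred when a dataset contains extreme values.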

You can connect with me through the social media links below:

https://www.linkedin.com/in/akash-srivastava-1595811b4/

https://www.instagram.com/black_knight______________/

https://www.facebook.com/akash.shrivastava.963871
