Descriptive Statistics with Python — Learning Day 3

Describing Data with Averages

Gianpiero Andrenacci
Data Bistrot
15 min read · Jul 17, 2024


Descriptive Statistics with Python — All rights reserved

One of the fundamental aspects of data analysis is summarizing data sets using descriptive statistics. Among these statistics, measures of central tendency — often referred to as averages — are key in providing a snapshot of the data. In this article, we will explore the three primary types of averages: mode, median, and mean. We will also discuss which average to use in various scenarios and how to apply these concepts to qualitative and ranked data using Python.

Mode

The mode is the value that appears most frequently in a data set. It is the simplest measure of central tendency and can be used with both numerical and categorical data.

import statistics

data = [1, 2, 2, 3, 4, 4, 4, 5]
mode_value = statistics.mode(data)
print(f"The mode is: {mode_value}")
The mode is: 4

In this example, the mode of the data set is 4, as it appears more frequently than any other value.

Handling Multiple Modes

Sometimes a dataset might have more than one mode. This occurs when multiple values have the same highest frequency.

For example: {1,2,2,3,3,4,5}

In this dataset:

  • The number 2 appears 2 times.
  • The number 3 also appears 2 times.

Both 2 and 3 are modes of the dataset, making it a bimodal dataset. If there are more than two modes, the dataset is described as multimodal.
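
Python's statistics module can report all modes at once. Below is a minimal sketch, assuming Python 3.8 or later, where statistics.multimode is available (plain statistics.mode returns only a single value):

import statistics

data = [1, 2, 2, 3, 3, 4, 5]

# multimode returns every value tied for the highest frequency
modes = statistics.multimode(data)
print(f"The modes are: {modes}")
The modes are: [2, 3]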

Importance of the Mode

The mode is particularly useful in situations where the most common item, score, or event is of interest. For instance:

  • In a survey of preferred ice cream flavors, the mode will show the most popular flavor.
  • In retail, the mode can indicate the most frequently sold product size or color.
  • In education, the mode of test scores can help identify the most common performance level among students.

Median

The median is the middle value of a data set when it is ordered in ascending or descending order. If the data set has an even number of observations, the median is the average of the two middle numbers. The median is particularly useful when dealing with skewed distributions, as it is not affected by outliers.

In a dataset, the median provides a better measure of central tendency than the mean when the data contains extreme values. For example, in a dataset of incomes where most values are clustered around a certain range but a few values are extremely high, the mean will be disproportionately high due to these outliers. The median, on the other hand, will remain representative of the central part of the dataset, providing a more accurate reflection of the typical value.

data = [1, 2, 2, 3, 4, 4, 4, 5]
median_value = statistics.median(data)
print(f"The median is: {median_value}")
The median is: 3.5
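
To illustrate the robustness to outliers described above, here is a small sketch with a hypothetical income dataset (the numbers are invented purely for illustration):

import statistics

# Hypothetical incomes: most values cluster together, one is an extreme outlier
incomes = [30000, 32000, 35000, 36000, 38000, 1000000]

print(f"Mean income: {statistics.mean(incomes):.0f}")      # pulled upward by the outlier
print(f"Median income: {statistics.median(incomes):.0f}")  # stays close to the typical value
Mean income: 195167
Median income: 35500

A single extreme value drags the mean far above every typical income, while the median remains representative of the bulk of the data.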

Understanding Median in Percentile Terms

In percentile terms, the median corresponds to the 50th percentile: 50% of the data values lie below the median, and 50% lie above it.

Percentiles are measures that indicate the relative standing of a value within a data set. They divide the data into 100 equal parts, with each part representing 1% of the data. For example:

  • The 25th percentile (also known as the first quartile, Q1) indicates that 25% of the data values are below this point.
  • The 50th percentile (median) indicates that 50% of the data values are below this point.
  • The 75th percentile (also known as the third quartile, Q3) indicates that 75% of the data values are below this point.
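
These quartiles can be computed directly with NumPy. A quick sketch (NumPy's default linear interpolation is assumed; other interpolation methods can give slightly different values):

import numpy as np

data = [1, 2, 2, 3, 4, 4, 4, 5]

q1 = np.percentile(data, 25)
median = np.percentile(data, 50)
q3 = np.percentile(data, 75)

print(f"Q1 (25th percentile): {q1}")
print(f"Median (50th percentile): {median}")
print(f"Q3 (75th percentile): {q3}")
Q1 (25th percentile): 2.0
Median (50th percentile): 3.5
Q3 (75th percentile): 4.0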

Median as the 50th Percentile

When we say that the median is the 50th percentile, we mean that it is the value below which 50% of the data falls. This is particularly useful when dealing with skewed distributions, as the median provides a better measure of central tendency than the mean, which can be affected by extreme values.

To further illustrate, consider a data set visualized as a sorted list or on a number line. The median is the point that splits this ordered list into two equal halves.

import matplotlib.pyplot as plt
import numpy as np

data = sorted([1, 2, 2, 3, 4, 4, 4, 5])
median_value = np.percentile(data, 50)

plt.plot(data, 'o')
plt.axhline(y=median_value, color='r', linestyle='-')
plt.title('Data Points and Median')
plt.xlabel('Index')
plt.ylabel('Value')
plt.show()

In the plot, the red horizontal line represents the median (50th percentile), clearly showing that it divides the data into two equal halves.

If you do not know what a skewed distribution is, see the dedicated article in this series.

Mean

The mean, commonly known as the average, is calculated by summing all the values in a data set and dividing by the number of values. The mean is sensitive to outliers, which can skew the result. This sensitivity means that even a single extreme value can significantly affect the mean, making it less representative of the central tendency of the dataset.

Importance of the Mean

Despite its sensitivity to outliers, the mean is widely used due to its mathematical properties and ease of calculation. It is particularly useful in scenarios where all values are equally important and the data distribution is approximately normal (symmetric).

Applications of the Mean

  • Finance: Calculating average returns on investments.
  • Education: Determining the average score of students in a class.
  • Economics: Estimating the average income of a population.

Limitations of the Mean

  • Not Robust: As illustrated, the mean can be heavily influenced by outliers.
  • Not Always Representative: In skewed distributions, the mean may not accurately reflect the central tendency of the data.

Numerical Data and the Mean

Numerical data can be classified into two main types: interval data and ratio data.

Both types allow for the calculation of the mean, but they differ in the properties and operations that can be performed: ratio data has a true zero point (e.g., height, income), so ratios between values are meaningful, while interval data (e.g., temperature in Celsius) has no true zero.

Calculating the Mean for Numerical Data

The mean is particularly useful for interval and ratio data as it provides a central measure of tendency that summarizes the entire dataset. The mean is calculated by summing all the values in the dataset and dividing by the number of values.

Example of Mean Calculation with Python

Here’s a practical example of how to calculate the mean using Python:

import numpy as np

# Example dataset of numerical data (e.g., test scores)
test_scores = [88, 92, 76, 81, 95, 68, 74, 89, 93, 85]

# Calculate the mean using numpy
mean_score = np.mean(test_scores)

print(f"The mean test score is: {mean_score:.2f}")
The mean test score is: 84.10

NumPy is a powerful library for numerical computations in Python. It provides a convenient mean function to calculate the mean of a dataset.

The mean is a fundamental measure of central tendency used extensively with numerical data, especially interval and ratio data. Its calculation in Python is straightforward using libraries like NumPy, which facilitate efficient and accurate data analysis. However, it’s essential to be aware of the mean’s sensitivity to outliers and consider other measures like the median and mode for a comprehensive understanding of the dataset.

Which Average to Use?

Choosing the appropriate average depends on the nature of the data and the specific context of the analysis:

  • Mode: Use the mode for categorical data or to identify the most common value in a data set.
  • Median: Use the median for skewed distributions or when you want a measure that is not influenced by extreme values.
  • Mean: Use the mean for normally distributed data where all values are considered equally important.
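
As a quick illustration of how the three averages can diverge, here is a small sketch on a right-skewed toy dataset (the values are made up for illustration):

import statistics

# Right-skewed toy data: a single large value pulls the mean upward
data = [2, 3, 3, 3, 4, 5, 6, 20]

print(f"Mode: {statistics.mode(data)}")
print(f"Median: {statistics.median(data)}")
print(f"Mean: {statistics.mean(data)}")
Mode: 3
Median: 3.5
Mean: 5.75

Here mode < median < mean, the typical ordering for positively skewed data discussed later in this article.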

Population and Sample in Descriptive Statistics

In descriptive statistics, we often encounter the terms population and sample. These concepts are fundamental to understanding how data is collected, analyzed, and interpreted. The difference between these two is crucial for accurately describing and making inferences about data.

Population

A population refers to the entire group of individuals or instances about whom we are seeking information. It includes every member of the group that we are studying. For example, if we are studying the heights of all students in a university, the population would be every single student enrolled in that university.

Population Mean (μ): This is the average of all data points in the population. For instance, if we could measure the height of every student in the university, the average height we calculate would be the population mean.

Sample

A sample is a subset of the population, chosen to represent the population. It is usually selected randomly to ensure that it is representative. Sampling is often necessary because it is impractical or impossible to collect data from every member of the population.

Sample Mean: This is the average of the data points in the sample. For example, if we measure the heights of 100 students out of the entire university, the average height of these 100 students would be the sample mean.

Why Does the Distinction Between Population and Sample Matter?

Scope:

  • Population: Involves the entire group.
  • Sample: Involves only a part of the group.

Calculation of Means:

Population Mean (μ): Computed using all data points in the population. It is a fixed value (a parameter of the population).

Sample Mean: Computed using data points in the sample. It can vary depending on which members of the population are included in the sample.

Perspective: The same set of data points can be viewed as either a population or a sample, depending on the context. For example, if we are analyzing the heights of all current students at a university, this is our population. However, if we consider these heights as part of a larger study involving future students, they could be viewed as a sample.

Practical Usage: In many cases, the population mean is unknown because it is impractical to measure every member of the population. Instead, researchers rely on sample means to estimate the population mean. This approach is the basis of inferential statistics.

Practical Example

Imagine we are studying the test scores of students in a large high school. If we collect scores from every student, we are dealing with the population, and the average score calculated is the population mean. However, if we randomly select 50 students and calculate the average of their scores, this is the sample mean.
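
A minimal simulation of this scenario, using synthetic test scores generated with NumPy (the numbers and sizes are assumptions, chosen only to make the example concrete):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: test scores of every student in the school
population_scores = rng.normal(loc=75, scale=10, size=2000)

# A random sample of 50 students, drawn without replacement
sample_scores = rng.choice(population_scores, size=50, replace=False)

print(f"Population mean: {population_scores.mean():.2f}")
print(f"Sample mean: {sample_scores.mean():.2f}")

The sample mean will generally land close to, but not exactly on, the population mean, and it changes if a different sample is drawn.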

Importance in Descriptive Statistics

The distinction between population and sample is vital for these reasons:

  • Representation: A sample needs to accurately represent the population to make valid inferences. Random sampling helps ensure this.
  • Accuracy: While the population mean is a fixed value, the sample mean can vary depending on the sample. This variability is accounted for in statistical analyses.
  • Generalization: In many studies, we collect data from a sample to make generalizations about the population. The methods used for analysis can differ depending on whether we are dealing with a population or a sample.

Averages for Qualitative Data

Qualitative data, or categorical data, can be summarized using the mode since it represents the most frequent category.

data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
mode_value = statistics.mode(data)
print(f"The mode is: {mode_value}")

Here, the mode is ‘apple’ as it appears most frequently.
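
If you also want the full frequency table for categorical data, not just the most frequent category, collections.Counter from the standard library is one option (a small sketch):

from collections import Counter

data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

counts = Counter(data)
print(counts)                # Counter({'apple': 3, 'banana': 2, 'orange': 1})
print(counts.most_common(1)) # [('apple', 3)]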

Skew and Central Tendency

Understanding the relationship between skewness and measures of central tendency (mean, median, and mode) is important in data analysis. Skewness refers to the asymmetry of the probability distribution of a real-valued random variable about its mean. As we have seen, there are three primary types of distributions concerning skewness: normal (no skew), positively skewed (tail on the right), and negatively skewed (tail on the left).

Remember that:

  • Normal Distribution: The distribution will show a symmetric shape with the peak in the center, where the mean, median, and mode coincide.
  • Positively (Right) Skewed Distribution: The distribution will have a peak on the left with a tail extending to the right. Here, the mode is at the peak, the median is to the right of the mode, and the mean is to the right of the median.
  • Negatively (Left) Skewed Distribution: The distribution will have a peak on the right with a tail extending to the left. Here, the mean is at the leftmost, the median is to the right of the mean, and the mode is at the peak on the right.

Here’s how these distributions relate to central tendency measures:

Normal Distribution: Mean = Median = Mode

In a normal distribution, the data is symmetrically distributed around the mean. This means that the left and right sides of the distribution are mirror images of each other.

Because the distribution is symmetrical:

  • The mean, median, and mode are all equal.
  • They all lie at the center of the distribution.

This equality makes the normal distribution a convenient and frequently used model in statistics.

The graph for this case (generated with Python; the full code appears later in this section) uses simulated data, so the distribution is not perfectly normal and the mode in the plot differs slightly from the mean and median. In an ideal normal distribution, all three measures coincide exactly.

Positively (Right) Skewed Distribution: Mode < Median < Mean

A positively skewed distribution, also known as right-skewed, has a longer tail on the right side. This means that there are a number of unusually high values pulling the mean to the right.

  • Mode: The highest peak of the distribution, representing the most frequent value, lies toward the left and is the leftmost of the three measures.
  • Median: The middle value that separates the lower half from the upper half of the data is less affected by the extreme values but still shifts to the right.
  • Mean: The average is most influenced by the higher values in the tail and is thus pulled further to the right than the median.

In summary, for positively skewed distributions:

  • Mode is less than the median.
  • Median is less than the mean.

Negatively (Left) Skewed Distribution: Mean < Median < Mode

A negatively skewed distribution, also known as left-skewed, has a longer tail on the left side. This means that there are a number of unusually low values pulling the mean to the left.

  • Mode: The highest peak of the distribution, representing the most frequent value, lies toward the right and is the rightmost of the three measures.
  • Median: The middle value that separates the lower half from the upper half of the data is less affected by the extreme values but still shifts to the left.
  • Mean: The average is most influenced by the lower values in the tail and is thus pulled further to the left than the median.

In summary, for negatively skewed distributions:

  • Mean is less than the median.
  • Median is less than the mode.

Example with Python for Normal and Skewed Distributions

The graphs above have been generated with the following code. This code uses the matplotlib library for plotting and the scipy.stats library to generate skewed distributions. The histogram_mode function calculates the mode based on the histogram bin with the highest frequency, and the plot_distribution function plots the data along with the mean, median, and mode.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm

# Generate data for normal, positive skewed, and negative skewed distributions
np.random.seed(0)

# Normal distribution
normal_data = np.random.normal(loc=50, scale=10, size=1000)

# Positive skewed distribution (right skewed)
positive_skew_data = skewnorm.rvs(a=10, loc=50, scale=10, size=1000)

# Negative skewed distribution (left skewed)
negative_skew_data = skewnorm.rvs(a=-10, loc=50, scale=10, size=1000)

# Function to calculate mode using histogram bin with the highest frequency
def histogram_mode(data):
    counts, bins = np.histogram(data, bins=30)
    max_index = np.argmax(counts)
    mode_value = (bins[max_index] + bins[max_index + 1]) / 2
    return mode_value

# Function to plot data with mode, median, and mean
def plot_distribution(data, title):
    mean_value = np.mean(data)
    median_value = np.median(data)
    mode_value = histogram_mode(data)

    plt.figure(figsize=(10, 6))
    plt.hist(data, bins=30, alpha=0.6, color='skyblue', density=True)

    plt.axvline(mean_value, color='r', linestyle='--', linewidth=1, label=f'Mean: {mean_value:.2f}')
    plt.axvline(median_value, color='b', linestyle='-', linewidth=1, label=f'Median: {median_value:.2f}')
    plt.axvline(mode_value, color='k', linestyle='-.', linewidth=1, label=f'Mode: {mode_value:.2f}')

    plt.title(title)
    plt.legend()
    plt.show()

# Plot the distributions
plot_distribution(normal_data, 'Normal Distribution: Mean = Median')
plot_distribution(positive_skew_data, 'Positively Skewed (Right) Distribution: Mode < Median < Mean')
plot_distribution(negative_skew_data, 'Negatively Skewed (Left) Distribution: Mean < Median < Mode')

How to Calculate Skewness in Python

Pearson’s second skewness coefficient, also known as the median skewness, is a simple measure of skewness that utilizes the mean, median, and standard deviation of a data set. The formula for Pearson’s second skewness coefficient is:

Skewness = 3 × (mean − median) / standard deviation

Here’s how you can calculate this skewness coefficient in Python:

  1. Calculate the mean of the data.
  2. Calculate the median of the data.
  3. Calculate the standard deviation of the data.
  4. Use these values in the Pearson’s second skewness coefficient formula.

Below is a Python code snippet that demonstrates this calculation.

import numpy as np

# Example data
data = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 8, 9]

# Calculate mean, median, and standard deviation
mean_value = np.mean(data)
median_value = np.median(data)
std_dev = np.std(data)

# Calculate Pearson's second skewness coefficient
skewness = 3 * (mean_value - median_value) / std_dev

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Standard Deviation: {std_dev}")
print(f"Pearson's Second Skewness Coefficient: {skewness}")
Mean: 4.583333333333333
Median: 4.0
Standard Deviation: 2.3964673074247345
Pearson's Second Skewness Coefficient: 0.7302415495417566
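
For comparison, SciPy also provides a moment-based skewness measure (the Fisher-Pearson coefficient), which is defined differently from Pearson's second coefficient and will generally give a different value for the same data:

from scipy.stats import skew

data = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 8, 9]

# scipy.stats.skew computes the Fisher-Pearson moment coefficient (biased by default)
print(f"Moment-based skewness: {skew(data):.4f}")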

Kurtosis and Its Types

When we delve into statistical measures that describe the shape and characteristics of data distributions, there is another important term: “kurtosis”. While skewness provides insight into the asymmetry of a distribution, kurtosis offers a different perspective, focusing on the tailedness of the distribution. This measure can reveal crucial information about the likelihood of extreme values or outliers in your data.

What is Kurtosis?

The concept of kurtosis, derived from the Greek word for “curved” or “arched,” was introduced by Karl Pearson, a British mathematician who spent his life exploring probability distributions. Kurtosis is a statistical measure that describes the shape of a distribution’s tails in relation to its overall shape. Tailedness refers to how often outliers occur.

In simple terms, it answers the question: how tall and sharp is the peak of the distribution, and how heavy are the tails?

High Kurtosis

High kurtosis in a dataset signals a distribution with heavy tails, meaning there is a higher frequency of extreme values or outliers. Imagine a distribution where the tails are thicker and more pronounced. While such a distribution might also exhibit a sharp, distinct peak, the key characteristic of high kurtosis is the weight of the tails. This pattern often indicates a greater likelihood of extreme events, which is crucial in fields like finance for risk assessment.

Low Kurtosis

Conversely, low kurtosis depicts a distribution with lighter tails, meaning extreme values are less frequent. Think of a distribution with a flatter, broader peak. In this scenario, the distribution has fewer outliers compared to a normal distribution. This means the likelihood of extreme events is lower. Understanding this can be essential in contexts where predicting and managing extremes is vital, as it indicates a more uniform spread of data around the mean.

Types of Kurtosis

To further appreciate the nuances of kurtosis, it’s helpful to classify distributions based on their kurtosis values. There are three primary types:

  1. Mesokurtic Distribution:

With a kurtosis value of exactly 3 (or excess kurtosis of 0), a mesokurtic distribution resembles a perfect normal distribution. The peak is moderate, and the tails are neither heavy nor light, representing a balanced scenario. This type is the benchmark against which other distributions are compared.

  2. Leptokurtic Distribution:

A leptokurtic distribution has a kurtosis value greater than 3 (excess kurtosis > 0). Characterized by heavy tails, it suggests a high concentration of data points around the mean and a significant presence of outliers. This type of distribution is often observed in datasets where extreme values are more common, such as financial returns or certain biological measurements. Leptokurtic distributions tend to have a more pronounced peak compared to the normal distribution.

  3. Platykurtic Distribution:

With a kurtosis value less than 3 (excess kurtosis < 0), a platykurtic distribution features a flatter peak and lighter tails. This implies fewer data points around the mean and a lower frequency of extreme values. Such distributions are less prone to outliers and are often seen in uniform or gently varying datasets.

Visualization with Python

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generating data for different distributions
np.random.seed(0)
data_mesokurtic = np.random.normal(0, 1, 1000)
data_leptokurtic = np.random.laplace(0, 1, 1000)
data_platykurtic = np.random.uniform(-1, 1, 1000)

# Plotting the distributions
plt.figure(figsize=(12, 8))

# Mesokurtic distribution
plt.subplot(3, 1, 1)
plt.hist(data_mesokurtic, bins=30, density=True, alpha=0.6, color='g')
plt.title('Mesokurtic Distribution (Normal Distribution)')

# Leptokurtic distribution
plt.subplot(3, 1, 2)
plt.hist(data_leptokurtic, bins=30, density=True, alpha=0.6, color='r')
plt.title('Leptokurtic Distribution (Laplace Distribution)')

# Platykurtic distribution
plt.subplot(3, 1, 3)
plt.hist(data_platykurtic, bins=30, density=True, alpha=0.6, color='b')
plt.title('Platykurtic Distribution (Uniform Distribution)')

plt.tight_layout()
plt.show()
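
The plot does not report the kurtosis values themselves, but they can be computed with scipy.stats.kurtosis, which by default returns excess kurtosis (kurtosis minus 3, so 0 for a normal distribution). A quick sketch, continuing from the arrays generated in the previous snippet:

# fisher=True (the default) returns excess kurtosis
print(f"Mesokurtic (normal): {stats.kurtosis(data_mesokurtic):.2f}")    # close to 0
print(f"Leptokurtic (Laplace): {stats.kurtosis(data_leptokurtic):.2f}") # clearly positive
print(f"Platykurtic (uniform): {stats.kurtosis(data_platykurtic):.2f}") # negative (about -1.2 in theory)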

Kurtosis Key Points

High Kurtosis:

  • Sharp peakedness at the distribution’s center.
  • More values concentrated around the mean compared to a normal distribution.
  • Heavier tails due to a higher concentration of extreme values or outliers.
  • Greater likelihood of extreme events.

Low Kurtosis:

  • Flat peak.
  • Fewer values concentrated around the mean than in a normal distribution; the data is spread more evenly.
  • Lighter tails.
  • Lower likelihood of extreme events.
