Descriptive Statistics with Python — Learning Day 4

Describing Variability

Gianpiero Andrenacci
Data Bistrot
19 min read · Jul 24, 2024


Descriptive Statistics with Python — All rights reserved

Statistics thrives because variability is inherent in the world around us; no two entities are exactly the same, and some differences are more pronounced than others. When we summarize data, we need to account for both the central tendency, such as the mean, and the variability, which indicates how spread out the data points are in a distribution. This section explores several measures of variability, including the range, interquartile range, variance, and the all-important standard deviation.

Why Measure Variability?

  1. Capturing Diversity: In any dataset, elements can vary significantly. For example, in a classroom, students’ heights may range from 150 cm to 190 cm. By measuring variability, we can capture this diversity, providing a more complete picture of the dataset.
  2. Identifying Outliers: Variability helps us spot outliers — data points that differ significantly from the rest. For instance, if most students score between 70 and 90 on a test, but one student scores 30, this outlier affects the overall understanding of the class’s performance. Measures like the interquartile range and standard deviation help in identifying such anomalies.
  3. Assessing Reliability: Understanding variability is crucial for assessing the reliability of data. For example, in manufacturing, if the diameter of produced screws varies too much, it indicates quality control issues. A low standard deviation suggests the process is stable and reliable.
  4. Comparing Distributions: Variability allows us to compare different datasets. For instance, if two cities have the same average temperature but different variabilities, one city might experience more extreme weather conditions than the other. Comparing standard deviations helps in understanding these differences.

Examples of variability

Example 1: Student Test Scores Consider two classes with the same average test score of 75. Class A has scores [71, 73, 75, 77, 79], while Class B has scores [55, 65, 75, 85, 95]. Despite having the same mean, Class B’s scores are more spread out, indicating higher variability. This suggests that performance consistency differs significantly between the two classes.

Example 2: Daily Temperatures Imagine recording the daily temperatures in two cities over a month. City X has temperatures ranging from 68°F to 72°F, while City Y ranges from 60°F to 80°F. Although both cities may have similar average temperatures, City Y experiences more variability, indicating more extreme temperature fluctuations.

Example 3: Investment Returns Consider the returns of two investment portfolios over a year. Portfolio A has monthly returns close to 5%, while Portfolio B’s returns vary between -10% and 20%. Measuring the variability of returns helps investors understand the risk associated with each portfolio, with Portfolio B exhibiting higher risk due to greater variability.

Variability is a key aspect in data science as it provides insights into the distribution and spread of data points. Three fundamental measures used to describe variability are range, variance, and standard deviation.

Range

Range is the simplest measure of variability and represents the difference between the maximum and minimum values in a dataset.

Consider the dataset: [5, 10, 15, 20, 25]. To calculate the range:

Range = maximum − minimum = 25 − 5 = 20

The range is 20.

The range provides a quick sense of the spread of the data. However, it only considers the extreme values and may not be representative of the overall distribution.

Variance

Variance measures the average squared deviation of each data point from the mean. It provides a sense of how data points are spread around the mean.

Variance for a population:

σ² = Σ (xᵢ − μ)² / N

where:

N is the number of data points in the population

μ is the population mean

Variance for a sample:

s² = Σ (xᵢ − x̄)² / (n − 1)

where:

x̄ is the sample mean

n is the number of data points in the sample

Variance provides a measure of the dispersion of data points. Larger variance indicates that data points are more spread out from the mean, while smaller variance indicates they are closer to the mean.

Adjusting the Variance for Samples

When calculating variance, it’s important to understand the subtle but important distinction between formulas used for populations and those used for samples. Although the sum of squares term remains essentially the same in both cases, there is a critical adjustment in the denominator when dealing with samples.

Population vs. Sample: The Denominator Difference

For a population, the formulas for variance and standard deviation use N, the total number of data points in the population. However, for samples, the formulas use n−1 instead of n, where n is the sample size. This adjustment is essential for obtaining accurate estimates of population variability.

Why Use n−1 for Samples?

The reason for using n−1 instead of n in the denominator is to correct for the bias in the estimation of the population variance and standard deviation. This correction, known as Bessel’s correction, compensates for the fact that a sample tends to underestimate the true variability of the population. In general, Bessel’s correction is an approach to reduce the bias due to finite sample size.
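
To see Bessel’s correction at work, here is a quick simulation sketch (the population parameters and sample size are made up for illustration): we repeatedly draw small samples from a known population and compare the average of the n-divided estimates with the average of the (n−1)-divided estimates.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(50, 10, 100_000)  # true variance is about 100
true_var = population.var()  # population variance (divides by N)

# Draw many small samples and average the two variance estimators
biased, unbiased = [], []
for _ in range(10_000):
    sample = rng.choice(population, size=5, replace=False)
    biased.append(sample.var(ddof=0))    # divides by n
    unbiased.append(sample.var(ddof=1))  # divides by n - 1 (Bessel's correction)

print(f"True population variance:        {true_var:.1f}")
print(f"Average of n-divided estimates:  {np.mean(biased):.1f}")    # systematically too low
print(f"Average of (n-1)-divided ones:   {np.mean(unbiased):.1f}")  # close to the true value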

Standard Deviation

Standard deviation is the square root of the variance and provides a measure of the average distance of each data point from the mean. It is the most intuitive and widely used measure of spread.

You might find it helpful to think of the standard deviation as a rough measure of the average amount by which data points deviate from the mean. In other words, it gives you an idea of how much the values in your dataset differ from the average value.

Why Standard Deviation Matters

Imagine you have two sets of exam scores for two different classes. Both classes have the same mean score of 75. However, in the first class, most scores are close to 75, while in the second class, the scores vary widely, with some students scoring much higher or lower.

Here’s a quick comparison:

  • Class A Scores: [71, 73, 75, 77, 79]
  • Class B Scores: [55, 65, 75, 85, 95]

Although both classes have the same mean score, the variability in the scores is different. The standard deviation helps quantify this variability. In Class A, the scores are tightly clustered around the mean, resulting in a low standard deviation. In Class B, the scores are more spread out, leading to a higher standard deviation.

A smaller standard deviation indicates that the data points are closer to the mean, suggesting less variability within the dataset. Conversely, a larger standard deviation indicates more spread out data points, implying greater variability.

For example, if the standard deviation of exam scores in Class A is 3.16 and in Class B is 15.81, it clearly shows that Class B has more variability in scores compared to Class A (calculation for a sample).
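
A quick check of these numbers in Python, using the sample formula (ddof=1):

import numpy as np

class_a = [71, 73, 75, 77, 79]
class_b = [55, 65, 75, 85, 95]

print(np.mean(class_a), np.std(class_a, ddof=1))  # 75.0, ~3.16
print(np.mean(class_b), np.std(class_b, ddof=1))  # 75.0, ~15.81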

Standard deviation helps you grasp the consistency of your data. For instance, in quality control processes, a low standard deviation in product measurements can indicate a high level of consistency and reliability. In finance, a high standard deviation in investment returns may signal higher risk.

Population or Sample?

Standard deviation and variance can be calculated for an entire population or for a sample, and the distinction between the two is pivotal for accurate data analysis and interpretation.

For a population the notation used is: σ (variance: σ²)

For a sample the notation used is: s (variance: s²)

Standard deviation is widely used in data analysis because it is expressed in the same units as the original data, making it easier to interpret. It provides a more intuitive measure of variability than variance.

Remember that a population encompasses all members of a specified group. For example, if we are studying the heights of all students in a university, the population includes every student enrolled in the university.

A sample is a subset of the population, selected to represent the population. Sampling is necessary when it is impractical or impossible to collect data from every member of the population.

Consider a dataset of exam scores for all students in a school (population) and a randomly selected group of students (sample):

  • Population: If we have scores for all 200 students, we calculate the population variance and standard deviation.
  • Sample: If we randomly select 30 students, we calculate the sample variance and standard deviation to estimate the population parameters.

Practical Calculation in Python

import numpy as np

# Sample dataset
data = np.array([4, 8, 6, 5, 3])

# Range
data_range = np.max(data) - np.min(data)

# Variance
variance = np.var(data, ddof=1) # ddof=1 for sample variance

# Standard Deviation
standard_deviation = np.std(data, ddof=1) # ddof=1 for sample standard deviation

print(f"Range: {data_range}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {standard_deviation}")
Range: 5
Variance: 3.7
Standard Deviation: 1.9235384061671346

The ddof parameter stands for "Delta Degrees of Freedom." It is used in the calculation of the standard deviation and variance to adjust the divisor during the computation.

In the formula for standard deviation (and variance), the divisor is N−ddof, where N is the number of observations. The default value for ddof is 0, which means the divisor is N. However, when calculating the sample standard deviation or sample variance (as opposed to the population standard deviation or variance), the divisor is N−1.
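
A minimal sketch of the effect of ddof, using the dataset from the code above:

import numpy as np

data = np.array([4, 8, 6, 5, 3])

print(np.var(data))          # ddof=0 (default): divides by N -> 2.96
print(np.var(data, ddof=1))  # divides by N - 1 -> 3.7 (the sample variance above)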

The Role of Standard Deviation in Data Distribution

Majority of Scores Within One Standard Deviation

In many frequency distributions, a significant proportion of data points fall within one standard deviation of the mean. Specifically, for a normal distribution, approximately 68% of all scores lie within one standard deviation (σ) on either side of the mean (μ). This means that most of the data points are clustered around the mean, indicating a relatively low level of variability within this range.

For example, if the mean score of a test is 75 with a standard deviation of 5, then approximately 68 percent of the test scores will fall between 70 (75 − 5) and 80 (75 + 5).

A Small Minority of Scores Deviate More Than Two Standard Deviations

Conversely, a small minority of data points lie beyond two standard deviations from the mean. For a normal distribution, only about 5 percent of all scores deviate more than two standard deviations (2σ) on either side of the mean. This indicates that extreme values, or outliers, are relatively rare.

Using the same test score example with a mean of 75 and a standard deviation of 5, only about 5 percent of the scores will be below 65 (75 − 10) or above 85 (75 + 10). These scores are considered unusual.
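
If you want to verify these percentages exactly, scipy (not used elsewhere in this article) can compute the areas under the normal curve; the mean of 75 and standard deviation of 5 below come from the test-score example.

from scipy.stats import norm

mu, sigma = 75, 5

# Area within one standard deviation of the mean
within_1sd = norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma)

# Area beyond two standard deviations (both tails combined)
beyond_2sd = 2 * norm.cdf(mu - 2 * sigma, mu, sigma)

print(f"Within 1 SD: {within_1sd:.1%}")  # about 68.3%
print(f"Beyond 2 SD: {beyond_2sd:.1%}")  # about 4.6%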

The empirical rule

The empirical rule, or the 68–95–99.7 rule, tells you where most of your values lie in a normal distribution:

  • Around 68% of values are within 1 standard deviation from the mean.
  • Around 95% of values are within 2 standard deviations from the mean.
  • Around 99.7% of values are within 3 standard deviations from the mean.

Let’s create an example using Python to illustrate the concept of a normal distribution and standard deviation. We’ll generate a dataset that follows a normal distribution and visualize it, showing the standard deviations as in the empirical rule (68–95–99.7 rule).

import numpy as np
import matplotlib.pyplot as plt

# Step 1: Generate a normal distribution dataset
np.random.seed(42) # For reproducibility
mu = 1150 # Mean
sigma = 100 # Standard deviation
data = np.random.normal(mu, sigma, 1000)

# Step 2: Calculate the mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)

# Step 3: Visualize the data
plt.figure(figsize=(10, 6))
count, bins, ignored = plt.hist(data, bins=30, density=True, alpha=0.6, color='#66ffb3', edgecolor='white')

# Plot the normal distribution curve
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = (1 / (np.sqrt(2 * np.pi) * std_dev)) * np.exp(-0.5 * ((x - mean) / std_dev)**2)
plt.plot(x, p, 'k', linewidth=1)

# Highlight the standard deviations
plt.axvline(mean, color='blue', linestyle='dashed', linewidth=1)
plt.axvline(mean + std_dev, color='gray', linestyle='dashed', linewidth=1)
plt.axvline(mean - std_dev, color='gray', linestyle='dashed', linewidth=1)
plt.axvline(mean + 2*std_dev, color='skyblue', linestyle='dashed', linewidth=1)
plt.axvline(mean - 2*std_dev, color='skyblue', linestyle='dashed', linewidth=1)
plt.axvline(mean + 3*std_dev, color='purple', linestyle='dashed', linewidth=1)
plt.axvline(mean - 3*std_dev, color='purple', linestyle='dashed', linewidth=1)

# Adding annotations
plt.text(mean, -0.0005, 'Mean', rotation=90, verticalalignment='bottom', color='blue')
plt.text(mean + std_dev, -0.0005, '+1 SD', rotation=90, verticalalignment='bottom', color='gray')
plt.text(mean - std_dev, -0.0005, '-1 SD', rotation=90, verticalalignment='bottom', color='gray')
plt.text(mean + 2*std_dev, -0.0005, '+2 SD', rotation=90, verticalalignment='bottom', color='skyblue')
plt.text(mean - 2*std_dev, -0.0005, '-2 SD', rotation=90, verticalalignment='bottom', color='skyblue')
plt.text(mean + 3*std_dev, -0.0005, '+3 SD', rotation=90, verticalalignment='bottom', color='purple')
plt.text(mean - 3*std_dev, -0.0005, '-3 SD', rotation=90, verticalalignment='bottom', color='purple')

plt.title('Normal Distribution with Standard Deviations')
plt.ylabel('Density')
plt.grid(False)
plt.show()

Explanation

Generating Data:

We use np.random.normal to create a dataset of 1000 values with a mean of 1150 and a standard deviation of 100.

Calculating Mean and Standard Deviation:

We calculate the mean and standard deviation using np.mean and np.std.

Visualizing Data:

  • We plot a histogram of the data.
  • We overlay a normal distribution curve using the calculated mean and standard deviation.
  • We highlight the mean and each of the standard deviations with vertical dashed lines and annotate them for clarity.
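
As a sanity check, we can count how many of the generated values actually fall within one, two, and three standard deviations of the mean; the fractions should come out close to the empirical rule’s 68%, 95%, and 99.7%. This snippet assumes data, mean, and std_dev from the code above are still in scope.

# Empirical check of the 68-95-99.7 rule on the generated data
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mean) <= k * std_dev)
    print(f"Within {k} SD: {within:.1%}")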

The Difference Between Mean and Standard Deviation

There is an essential distinction between the mean and the standard deviation:

  • Mean (μ): The mean is a measure of central tendency, indicating the average value of a dataset. It provides a position around which the data points are distributed.
  • Standard Deviation (σ): The standard deviation, on the other hand, is a measure of dispersion. It quantifies the average distance of each data point from the mean, indicating how spread out the data points are around the mean.

This distinction highlights that while the mean gives us the central point of the data, the standard deviation tells us about the spread or variability around this central point.

Outliers

Outliers are data points that are significantly different from the majority of observations in a dataset. They can arise due to measurement errors, data entry errors, or genuine variability in the data. Outliers can have a substantial impact on statistical analyses, often skewing results and leading to misleading interpretations. Identifying and handling outliers appropriately is critical for accurate data analysis.


Sources of Outliers

  1. Measurement Errors: In a temperature dataset, a recording of 1000°C instead of 10°C could be due to a malfunctioning sensor. Such extreme values are unrealistic and indicate a measurement error.
  2. Data Entry Errors: In a survey dataset, a respondent’s income might be mistakenly entered as $1,000,000 instead of $10,000. This typographical error results in an outlier that doesn’t represent the true population.
  3. Genuine Variability: In financial data, a stock price might surge or plummet drastically due to market events. Such data points, while extreme, reflect real-world variability.

Impact of Outliers

Outliers can have a substantial impact on statistical analyses. Their presence can skew results, leading to misleading interpretations and potentially incorrect conclusions. Here’s how outliers can affect different aspects of data analysis:

  1. Mean and Standard Deviation: In a dataset of house prices, if most houses are priced between $200,000 and $300,000 but one mansion is priced at $10 million, the mean price will be disproportionately high. This extreme value also increases the standard deviation, making the data appear more dispersed than it actually is.
  2. Regression Analysis: When performing a linear regression to predict sales based on advertising spend, a single outlier with exceptionally high sales but low advertising spend can distort the regression line. This can lead to a model that poorly fits the majority of the data.
  3. Clustering: In customer segmentation using clustering algorithms, outliers can form their own clusters or pull cluster centroids towards them. This can result in inaccurate segments that do not represent the true grouping of the majority of customers.

Identifying Outliers

Identifying outliers is a crucial step in data preprocessing. Various methods can be used to detect outliers, such as:

Visual Inspection: Plotting data using scatter plots or box plots can help visually identify data points that stand out from the rest. For instance, a box plot of salaries might show a few points far above the upper whisker, indicating potential outliers.

Statistical Methods: Calculating the Z-score (see later) for each data point to measure how many standard deviations it is from the mean.

IQR Method: Using the Interquartile Range (IQR) to define outliers (see later).

Handling Outliers

Once identified, handling outliers appropriately is essential for robust analysis. Different strategies can be employed depending on the context:

Removing Outliers: In a dataset of physical measurements, if outliers are identified as errors (e.g., negative height values), they can be removed to ensure the integrity of the analysis.

Transforming Data: Applying a log transformation to a positively skewed dataset (e.g., income data) can reduce the impact of outliers and make the data more normally distributed.

Imputation: Replacing outliers with the median or mean of the dataset. In a dataset of test scores, an outlier score of 0 might be replaced with the median score to avoid skewing the analysis.

Using Robust Statistical Methods: Employing robust regression techniques that are less sensitive to outliers. For instance, using the Huber loss function instead of the mean squared error in regression analysis to reduce the influence of outliers.
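
As a small illustration of the imputation strategy, here is a sketch with made-up test scores in which a suspicious score of 0 is replaced by the median (the |Z| > 2 threshold it uses is explained in the next section):

import pandas as pd

# Hypothetical test scores with one suspicious entry (0)
scores = pd.Series([78, 82, 75, 0, 88, 80, 76])

# Z-score of each value (pandas' std() uses the sample formula by default)
z = (scores - scores.mean()) / scores.std()

# Replace extreme values (|Z| > 2) with the median
cleaned = scores.mask(z.abs() > 2, scores.median())
print(cleaned.tolist())  # the 0 becomes 78.0, the rest are unchanged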

Detecting Outliers Using Z-Score

What is a Z-Score?

A Z-score (or standard score) is a statistical measurement that describes a data point’s relationship to the mean of a group of data points. It is expressed as the number of standard deviations away from the mean. The formula for calculating the Z-score of a data point is:

Z = (X − μ) / σ

where:

  • Z is the Z-score we’re calculating.
  • X is the specific data point we want to evaluate.
  • μ (mu) is the mean of the dataset.
  • σ (sigma) is the standard deviation.

A Z-score tells you how many standard deviations a data point is from the mean.

A positive Z-score indicates the data point is above the mean, while a negative Z-score indicates it is below the mean.
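
Using the test-score example from earlier (mean 75, standard deviation 5), a score of 85 works out as follows:

# Worked example: how far is a score of 85 from the mean?
X, mu, sigma = 85, 75, 5
Z = (X - mu) / sigma
print(Z)  # 2.0 -> the score sits exactly two standard deviations above the mean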

Threshold Value in Z-Score

A threshold value is a predetermined cutoff point that helps determine what is considered an anomaly or outlier within a dataset. This value determines the significance level at which a Z-score is deemed unusual or different from the rest.

Common Threshold Values

Z-Score Greater than 2 (or Less than -2):

Moderate Outliers: Data points with Z-scores greater than 2 or less than -2 are considered unusual. They are significantly different from the mean but not extremely so. This threshold is often used for detecting moderate outliers.

If |Z| > 2, then the data point is a moderate outlier.

Z-Score Greater than 3 (or Less than -3):

Extreme Outliers: Data points with Z-scores greater than 3 or less than -3 are considered highly unusual. This stricter criterion is used to identify extreme outliers.

If |Z| > 3, then the data point is an extreme outlier.

Practical Application of the Z-Score to Detect Outliers

The choice of threshold depends on the specific needs of your analysis and the level of sensitivity you want in detecting outliers. Using a higher threshold like 3 (or -3) will identify fewer data points as outliers, focusing on the most extreme cases. A lower threshold like 2 (or -2) will flag more data points as potentially unusual, allowing for a broader detection of anomalies.

Example of Z-score calculation and outlier detection in Python

Let’s consider a dataset of reaction times (in milliseconds) of airline pilots to a cockpit alarm, which we have already used in a previous article of this series.

import matplotlib.pyplot as plt
import pandas as pd

# Sample data: Reaction times in milliseconds (including clear outliers)
reaction_times = [300, 320, 330, 340, 350, 360, 1000, 370, 380, 390, 400, 1600, 1700]

# Create a DataFrame
df_rt = pd.DataFrame(reaction_times, columns=['Reaction_Time'])

# Calculate basic statistics
mean_rt = df_rt['Reaction_Time'].mean()
std_rt = df_rt['Reaction_Time'].std()

print(f"Mean Reaction Time: {mean_rt:.2f}")
print(f"Standard Deviation of Reaction Time: {std_rt:.2f}")

# Identify outliers using the Z-score method with a threshold of 2
df_rt['Z_Score'] = (df_rt['Reaction_Time'] - mean_rt) / std_rt
outliers = df_rt[df_rt['Z_Score'].abs() > 2]
print("Outliers detected:\n", outliers)

# Visualize the reaction times with outliers
plt.figure(figsize=(10, 6))

# Plot all data points
plt.plot(df_rt['Reaction_Time'], marker='o', linestyle='-', color='blue', label='Reaction Time')

# Highlight the outliers in red
plt.plot(outliers.index, outliers['Reaction_Time'], marker='o', linestyle='', color='red', label='Outliers')

# Add mean line
plt.axhline(y=mean_rt, color='green', linestyle='--', label='Mean')

# Add labels and title
plt.xlabel('Observation Index')
plt.ylabel('Reaction Time (ms)')
plt.title('Reaction Times with Outliers')

# Add legend
plt.legend()

# Show plot
plt.show()

Mean Reaction Time: 603.08
Standard Deviation of Reaction Time: 498.69
Outliers detected:
     Reaction_Time   Z_Score
12            1700  2.199618

In this example, we calculate the mean and standard deviation of reaction times and use the Z-score method to identify outliers. We then visualize the reaction times, highlighting the outliers in red.

Step-by-Step Explanation of the code

Calculate Statistics

First, we need to compute the mean and standard deviation of the reaction times. The mean provides us with the average reaction time, giving us a central value around which the data points are distributed. As we have seen, the standard deviation measures the amount of variation or dispersion from the mean. A small standard deviation indicates that the data points are close to the mean, while a large standard deviation indicates that they are spread out over a wider range of values.

To compute these statistics in Python, we use the mean() and std() methods of a pandas Series. Note that pandas’ std() uses ddof=1 (the sample standard deviation) by default, unlike NumPy’s np.std().

Identify Outliers

With the mean and standard deviation in hand, we proceed to identify outliers by calculating the Z-score for each reaction time. The Z-score tells us how many standard deviations a data point is from the mean.

We set the Z-score threshold to 2 in order to detect moderate outliers: a flagged reaction time is more than two standard deviations away from the mean, indicating that it deviates significantly from the rest of the data. Note that the extreme values themselves inflate the mean and standard deviation, so the 1000 ms reaction time (Z ≈ 0.80) is not flagged at all and the 1600 ms one (Z ≈ 1.999) just misses the threshold; this masking effect is a known limitation of the Z-score method on small samples.

By computing the mean and standard deviation, identifying outliers using the Z-score method, and visualizing the results, we can gain valuable insights into the distribution of reaction times and detect any anomalies. Understanding outliers is a fundamental aspect of data analysis as they can significantly impact statistical measures and models.

The Interquartile Range (IQR)

In statistics, the Interquartile Range (IQR) is a measure of statistical dispersion, which is the spread of the data points in a data set. The IQR specifically measures the range within which the central 50% of the values lie, offering a focused look at the middle portion of the distribution.

What is the Interquartile Range (IQR)?

The interquartile range (IQR) spans the middle half of your data set, between the first quartile and the third quartile. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):

IQR = Q3 − Q1

Here’s a breakdown of the key concepts:

  • First Quartile (Q1): Also known as the 25th percentile, it is the value below which 25% of the data falls.
  • Third Quartile (Q3): Also known as the 75th percentile, it is the value below which 75% of the data falls.

Why Use the IQR?

The IQR is particularly useful because it focuses on the middle 50% of the data, which is less affected by outliers and extreme values compared to the full range. This makes it a robust measure of dispersion for skewed distributions or data with outliers.
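
A short sketch makes this robustness concrete: adding a single extreme score barely moves the IQR, while the standard deviation changes substantially. The data are the test scores used in the example below.

import pandas as pd

scores = pd.Series([55, 61, 63, 68, 70, 72, 78, 80, 85, 90, 95, 98])
with_outlier = pd.concat([scores, pd.Series([150])], ignore_index=True)

# Compare a non-robust measure (std) with a robust one (IQR)
for label, s in [("without outlier", scores), ("with outlier", with_outlier)]:
    iqr = s.quantile(0.75) - s.quantile(0.25)
    print(f"{label}: std = {s.std():.2f}, IQR = {iqr:.2f}")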

Calculating the IQR

Let’s go through a step-by-step example of how to calculate the IQR using Python. We will use a sample data set and the numpy and pandas libraries for this purpose.

Example Data Set

Consider the following data set of test scores, with one extreme score (150) included so we can illustrate outlier detection later:

scores = [55, 61, 63, 68, 70, 72, 78, 80, 85, 90, 95, 98, 150]

Step-by-Step Calculation

  1. Sort the Data: Arrange the data in ascending order (if not already sorted).
  2. Find Q1 and Q3: Calculate the first and third quartiles.
  3. Compute the IQR: Subtract Q1 from Q3.

import pandas as pd

# Sample data
scores = [55, 61, 63, 68, 70, 72, 78, 80, 85, 90, 95, 98, 150]

# Convert to a pandas Series
scores_series = pd.Series(scores)

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = scores_series.quantile(0.25)
Q3 = scores_series.quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

print("First Quartile (Q1):", Q1)
print("Third Quartile (Q3):", Q3)
print("Interquartile Range (IQR):", IQR)
First Quartile (Q1): 68.0
Third Quartile (Q3): 90.0
Interquartile Range (IQR): 22.0

Interpretation

From the output, the IQR of the sample data set is 22.0. This means that the middle 50% of the test scores are spread out over a range of 22 points.

Importance of IQR in Data Science

In data science, the IQR is widely used for identifying outliers. Any value that lies below Q1 − 1.5 * IQR or above Q3 + 1.5 * IQR is often considered an outlier. This method provides a systematic way to detect and handle outliers in data preprocessing.

Outlier Detection with IQR

Let’s see how to detect outliers using the IQR in Python.

# Define the outlier cutoff thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = scores_series[(scores_series < lower_bound) | (scores_series > upper_bound)]

print("Outliers:", outliers.tolist())
Outliers: [150]

In this example, the score of 150 is flagged as an outlier by the IQR method.

Visualizing Interquartile Range (IQR) with a Box Plot

A box plot (also known as a box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It is particularly useful for visualizing the Interquartile Range (IQR) and identifying outliers.

Let’s create a box plot using Python to visualize the IQR and identify potential outliers in a data set. We’ll use the matplotlib and seaborn libraries for plotting.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
scores = [55, 61, 63, 68, 70, 72, 78, 80, 85, 90, 95, 98, 150]

# Convert to a pandas Series
scores_series = pd.Series(scores)

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = scores_series.quantile(0.25)
Q3 = scores_series.quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

print("First Quartile (Q1):", Q1)
print("Third Quartile (Q3):", Q3)
print("Interquartile Range (IQR):", IQR)

# Create a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x=scores_series, color="skyblue")

# Add titles and labels
plt.title('Box Plot of Test Scores')
plt.xlabel('Scores')

# Display the plot
plt.show()

The box plot will visualize the distribution of the test scores, highlighting the median, Q1, Q3, and any potential outliers.

Interpretation of the Box Plot

  • Box: The central box represents the IQR, showing the range within which the central 50% of the values lie.
  • Median Line: A line inside the box indicates the median (50th percentile) of the data.
  • Whiskers: The whiskers extend from the box to the smallest and largest values within 1.5 * IQR from Q1 and Q3, respectively.
  • Outliers: Data points outside the whiskers are considered outliers.
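
If you want to check where the whiskers end, they can be computed directly: each whisker stops at the most extreme data point that still lies within 1.5 * IQR of the box. This snippet reuses Q1, Q3, IQR, and scores_series from the code above.

# Whisker ends: the most extreme data points within 1.5 * IQR of the box
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

lower_whisker = scores_series[scores_series >= lower_fence].min()
upper_whisker = scores_series[scores_series <= upper_fence].max()

print("Whiskers reach from", lower_whisker, "to", upper_whisker)  # 55 to 98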
