Z-Score to identify and remove outliers | Exploratory Data Analysis

Rina Mondal
3 min read · Jul 8, 2024

A z-score, also known as a standard score, is a statistical measure that indicates how many standard deviations a data point is from the mean of the dataset. It is calculated using the formula:

Z = (X - mean) / standard deviation

A positive z-score indicates that the data point is above the mean, a negative z-score indicates that it is below the mean, and a z-score of 0 means the data point equals the mean.
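
As a rough illustration of the formula, the sketch below computes z-scores by hand with NumPy. The `values` array is made up for this example, and the standard deviation is assumed to be the population version (NumPy's default, `ddof=0`).

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([60.0, 70.0, 80.0, 90.0, 100.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Apply Z = (X - mean) / standard deviation to every value at once
z = (values - mean) / std

for x, score in zip(values, z):
    position = "above" if score > 0 else "below" if score < 0 else "at"
    print(f"x = {x:5.1f}  z = {score:+.2f}  ({position} the mean)")
```

After this transformation the z-scores themselves have a mean of 0 and a standard deviation of 1, which is why they are called standard scores.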

Using Z-Scores to Identify Outliers:
- Z-scores help identify outliers by flagging data points that lie far from the mean. Typically, values whose absolute z-score exceeds a chosen threshold (commonly 3.0, sometimes 2.5) are treated as outliers.

Steps for Identifying Outliers with Z-Scores:

1. Calculate Z-Scores:
- Calculate the z-score for each data point using the formula mentioned earlier.

2. Set a Threshold:
- Determine a threshold beyond which z-scores are considered outliers. A common threshold is ±3.0, but this can be adjusted based on the specific characteristics of your data.

3. Flag Outliers:
- Identify data points whose absolute z-score exceeds the chosen threshold. These data points are considered outliers.

4. Consider Context:
- It's important to interpret outliers in the context of your data and the goals of your analysis. Some outliers may be genuine data points that carry important information, while others may be errors or anomalies.
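
The steps above can be sketched as a small helper that makes the threshold adjustable. The function name `flag_outliers` and the `readings` array are hypothetical, introduced only for this illustration, and the sketch assumes the data is not constant (otherwise the standard deviation is zero).

```python
import numpy as np

def flag_outliers(data, threshold=3.0):
    """Return a boolean mask that is True where |z-score| > threshold."""
    data = np.asarray(data, dtype=float)
    z_scores = (data - data.mean()) / data.std()  # step 1: calculate z-scores
    return np.abs(z_scores) > threshold           # steps 2 and 3: threshold and flag

# Hypothetical measurements: most are close to 50, one is 56
readings = np.array([50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
                     49, 52, 48, 50, 51, 49, 50, 52, 48, 56])

for threshold in (3.0, 2.5):
    flagged = readings[flag_outliers(readings, threshold)]
    print(f"threshold = {threshold}: flagged values = {flagged}")
```

With these made-up values, 56 has a z-score of about 2.8, so it is flagged at a threshold of 2.5 but not at 3.0. Whether it should be treated as an outlier is the judgment call described in step 4.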

Example:

Let’s say you have a dataset of exam scores:

- Mean score = 75
- Standard deviation = 10

Suppose you want to identify outliers using z-scores with a threshold of ±3.0. The z-score of a score x is:

z = (x - 75) / 10

If a particular score x results in a z-score greater than +3.0 or less than -3.0, that is, if x is above 105 or below 45, you would consider x to be an outlier.
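
To make the arithmetic concrete, here is a minimal sketch that uses the mean and standard deviation from this example. The individual scores being checked and the helper name `exam_z` are hypothetical.

```python
mean_score = 75.0
std_dev = 10.0
threshold = 3.0

# Any score outside these cutoffs has |z| > threshold
lower_cutoff = mean_score - threshold * std_dev
upper_cutoff = mean_score + threshold * std_dev
print(f"Scores below {lower_cutoff} or above {upper_cutoff} count as outliers")

def exam_z(x):
    """z-score of an exam score x, given the mean and standard deviation above."""
    return (x - mean_score) / std_dev

# Hypothetical scores to check
for x in (40, 60, 75, 98, 110):
    z = exam_z(x)
    label = "outlier" if abs(z) > threshold else "not an outlier"
    print(f"score = {x:3d}  z = {z:+.1f}  {label}")
```

Here only 40 (z = -3.5) and 110 (z = +3.5) fall outside the cutoffs.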

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Example dataset
data = np.array([12, 15, 18, 22, 25, 30, 32, 35, 5000, 38, 40])

# Calculate z-scores for the dataset (zscore uses the population standard deviation by default)
z_scores = zscore(data)

# Set a threshold for identifying outliers
outlier_threshold = 3.0

# Identify outliers based on threshold
outliers_mask = np.abs(z_scores) > outlier_threshold

# Visualize the dataset with outliers highlighted in red
indices = np.arange(len(data))
plt.scatter(indices[~outliers_mask], data[~outliers_mask], c='b', label='Data')
plt.scatter(indices[outliers_mask], data[outliers_mask], c='r', label='Outliers')
plt.legend()
plt.show()

# Remove outliers from the dataset
filtered_data = data[~outliers_mask]

print("Original dataset:", data)
print("Filtered dataset (without outliers):", filtered_data)

Z-scores provide a standardized way to identify outliers and assess the distribution of data points. They are particularly useful when you want to compare different datasets or variables with different scales, and they help in understanding the relative position of individual data points within a dataset. When using z-scores for outlier detection, it’s crucial to interpret the results in the context of your specific analysis and domain knowledge.
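
As a small sketch of comparing variables on different scales, the example below standardizes two made-up measurements for the same five people. The arrays `heights_cm` and `weights_kg` and the helper `standardize` are assumptions introduced only for illustration.

```python
import numpy as np

def standardize(values):
    """Convert raw values to z-scores using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical measurements for the same five people, in different units
heights_cm = np.array([150, 160, 170, 180, 190])
weights_kg = np.array([55, 60, 65, 70, 100])

height_z = standardize(heights_cm)
weight_z = standardize(weights_kg)

# Once both variables are in standard-deviation units, they can be compared directly
for h, w, hz, wz in zip(heights_cm, weights_kg, height_z, weight_z):
    print(f"{h} cm (z = {hz:+.2f})   {w} kg (z = {wz:+.2f})")
```

In this made-up data, the last person is about 1.4 standard deviations above the mean height but about 1.9 standard deviations above the mean weight, so their weight is the more unusual of the two values even though the raw numbers are in different units.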

