ChatGPT Solution to Find Outliers Using Python

G.S. Baloch, PhD
4 min read · May 3, 2023

I asked ChatGPT to provide me with a straightforward explanation of outliers and a Python code to address them. As a result, it presented me with the following solution.

Outliers are values that are significantly different from the rest of the data in a dataset. They can have a significant impact on statistical analyses and machine learning algorithms, which is why it’s important to identify and handle them appropriately. In this article, we’ll explore three statistical concepts — mean, standard deviation, and z-score — that are commonly used to identify outliers, and show you how to implement them in Python using the Iris dataset.

Mean and Standard Deviation

The mean is a measure of central tendency that represents the average value of a dataset. It’s calculated by summing up all the values in the dataset and dividing the result by the number of values. The mean summarizes where the data is centered, but it tells us nothing about how spread out the data is.

That’s where standard deviation comes in. The standard deviation is a measure of dispersion that indicates how much the values in a dataset deviate from the mean. It’s calculated by first calculating the variance — which is the average of the squared differences between each value and the mean — and then taking the square root of the variance. A higher standard deviation indicates that the values in the dataset are more spread out, while a lower standard deviation indicates that they’re closer together.
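
The two definitions above can be worked through by hand. The following is a minimal sketch in plain Python; the numbers are made up for illustration and are not taken from the Iris dataset.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample, chosen so the results are round numbers

# Mean: sum of the values divided by how many there are
mean = sum(values) / len(values)

# Variance: average of the squared differences from the mean
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Standard deviation: square root of the variance
std = math.sqrt(variance)

print(mean, variance, std)  # 5.0 4.0 2.0

# The standard library's population standard deviation gives the same value
assert math.isclose(std, statistics.pstdev(values))
```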

Together, the mean and standard deviation can be used to identify outliers in a dataset. One common method is to define an outlier as any value that lies more than a certain number of standard deviations from the mean. For example, with a cutoff of three standard deviations, any value outside the range (mean - 3 × standard deviation, mean + 3 × standard deviation) is considered an outlier.
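
As a rough sketch of this range check, the snippet below applies a three-standard-deviation cutoff to a small made-up array (19 ordinary readings and one extreme value). The data and the variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up measurements: 19 ordinary readings and one extreme value
values = np.array([10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
                   11, 9, 10, 12, 10, 9, 11, 10, 10, 50], dtype=float)

k = 3  # number of standard deviations allowed on each side of the mean
mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Anything outside (lower, upper) is treated as an outlier
lower, upper = mean - k * std, mean + k * std
outliers = values[(values < lower) | (values > upper)]

print(outliers)  # [50.]
```

Note that the extreme value itself inflates the standard deviation, so in very small samples a single odd value may never reach a three-standard-deviation cutoff.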

Z-score

Another way to identify outliers is by using the z-score. The z-score is a measure of how many standard deviations away from the mean a value is. It’s calculated by subtracting the mean from the value and then dividing the result by the standard deviation. A z-score of 0 indicates that the value is equal to the mean, while a z-score of 1 indicates that the value is one standard deviation above the mean.

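As with the mean and standard deviation method, we can use a threshold z-score to identify outliers. A common threshold is 3: any value with a z-score greater than 3 or less than -3 is considered an outlier. This is equivalent to the range check above, since an absolute z-score above 3 means the value lies more than three standard deviations from the mean.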
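
To make the z-score formula concrete, here is a small sketch that computes it directly with NumPy and applies a threshold of 3. The helper name `z_scores` and the data are made up for this example.

```python
import numpy as np

def z_scores(x):
    """Return (x - mean) / std for each element, using the population std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Made-up measurements: 19 ordinary readings and one extreme value
values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 50]

z = z_scores(values)

# Indices of values whose absolute z-score exceeds the threshold
flagged = np.flatnonzero(np.abs(z) > 3)
print(flagged)  # [19]
```

By construction the z-scores have mean 0 and standard deviation 1, which is what makes a single threshold usable across features measured in different units.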

Implementing in Python using Iris dataset

Step-01: Import Dataset. Find mean, standard deviation and z-score.

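import numpy as np
from sklearn.datasets import load_iris
# Load the Iris dataset (assuming scikit-learn's bundled copy)
iris = load_iris()
# Extract the sepal length data (first column)
sepal_length = iris.data[:, 0]
# Calculate the mean and standard deviation
mean = np.mean(sepal_length)
std = np.std(sepal_length)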

We first load the Iris dataset and calculate the mean and standard deviation for the sepal length feature.
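
One detail of the np.std call is worth a quick sketch: by default NumPy divides by n (the population standard deviation, ddof=0), whereas passing ddof=1 divides by n - 1 (the sample standard deviation). The numbers below are made up for illustration and are not from the Iris dataset.

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)  # made-up sample

population_std = np.std(values)      # divides by n (NumPy's default, ddof=0)
sample_std = np.std(values, ddof=1)  # divides by n - 1

print(population_std)        # 2.0
print(round(sample_std, 3))  # 2.138
```

scipy.stats.zscore also defaults to ddof=0, so its z-scores are consistent with a standard deviation computed by a plain np.std call.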

from scipy import stats
# Calculate the z-score for each data point
z_score = stats.zscore(sepal_length)

We then calculate the z-score for each sepal length value. The threshold for outliers is set in the next step.

Step-02: Find and print the outliers.

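# Set a threshold for outlier detection (e.g. z-score > 2 or < -2)
threshold = 2
# Find the indices where the absolute z-score exceeds the threshold
outliers = np.where(np.abs(z_score) > threshold)
# Print the indices and values of the outliers
print('Indices of outliers:', outliers[0])
print('Values of outliers:', sepal_length[outliers[0]])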

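We find and print the outliers using a boolean mask that compares the absolute z-scores to the threshold value. Note that this step uses a stricter threshold of 2 rather than the 3 discussed earlier, so more values can be flagged.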
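
The indexing with outliers[0] can look odd at first. As a small sketch with made-up z-scores, this is how a boolean mask and np.where relate: np.where called with only a condition returns a tuple holding one index array per dimension, so for one-dimensional data the indices are in element 0.

```python
import numpy as np

z = np.array([0.1, -2.4, 0.7, 2.9, -0.3])  # made-up z-scores
threshold = 2

# Boolean mask: True wherever the absolute z-score exceeds the threshold
mask = np.abs(z) > threshold

# np.where with a single argument returns a tuple of index arrays
outliers = np.where(mask)

print(outliers[0])  # [1 3]
print(z[mask])      # the mask can also be used directly to select the values
```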

Indices and Values of Outliers

Step-03: Visualize the outliers

import matplotlib.pyplot as plt
# Plot the distribution of sepal length with outliers highlighted
plt.hist(sepal_length, bins=20)
plt.axvline(mean, color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(mean - std, color='orange', linestyle='dashed', linewidth=2, label='Standard Deviation')
plt.axvline(mean + std, color='orange', linestyle='dashed', linewidth=2)
plt.axvline(mean - threshold*std, color='purple', linestyle='dashed', linewidth=2, label='Outlier Threshold')
plt.axvline(mean + threshold*std, color='purple', linestyle='dashed', linewidth=2)
plt.scatter(sepal_length[outliers[0]], np.zeros_like(sepal_length[outliers[0]]), color='red', label='Outliers')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.title('Distribution of Sepal Length in Iris Dataset')
plt.legend()
plt.show()

To visualize the outliers, we plot a histogram of the sepal length values and mark the outliers as red points along the x-axis. We also add vertical dashed lines to show the mean, one standard deviation on either side of it, and the threshold values. The resulting plot should look something like this:

Visualizing Outliers

Conclusion

As you can see, there are outliers that fall outside the threshold range (mean +/- 2*standard deviation). By identifying and handling these outliers appropriately, we can improve the accuracy and reliability of any statistical analyses or machine learning models that we build with this dataset.
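
The article does not say how the outliers should be handled once they are found. As one possible sketch, assuming that either dropping or clipping is acceptable for the analysis at hand, the snippet below shows both options on made-up data with the same threshold of 2. Which option fits, if any, depends on whether the flagged values are errors or genuine observations.

```python
import numpy as np

# Made-up measurements: 19 ordinary readings and one extreme value
values = np.array([10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
                   11, 9, 10, 12, 10, 9, 11, 10, 10, 50], dtype=float)

threshold = 2
mean, std = values.mean(), values.std()
lower, upper = mean - threshold * std, mean + threshold * std

# Option 1: drop every value outside the allowed range
kept = values[(values >= lower) & (values <= upper)]

# Option 2: keep every value but clip it to the range boundaries
clipped = np.clip(values, lower, upper)

print(len(values), len(kept))  # 20 19
print(clipped.max() <= upper)  # True
```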

G.S. Baloch, PhD

Dedicated and proactive data science professional with 10+ years of teaching and research experience.