How to deal with outliers!

Piyush Borhade
3 min read · Jul 11, 2023

But first of all, what are outliers? 🤔

🧵 👇

An outlier is a data point that lies far away from the rest of the data.

Outliers

Outliers can have a significant impact on the performance and accuracy of machine learning models. If we do not handle them, the model can report a misleading (usually lower) accuracy score, which can eventually affect our conclusions.
They can be caused by measurement errors, data entry errors, or even natural variations in the data.

How can we detect outliers?

Outliers can be detected with statistical plots, most commonly the scatter plot and the box plot, and with numerical methods such as the interquartile range (IQR) and the Z-score.

We will look into these methods.
Here you have 3️⃣ techniques to detect them 👇

1️⃣ Box Plots:

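A box plot provides a summary of key statistical measures (the median, the quartiles, and the spread of the data) and displays potential outliers as individual points beyond the whiskers.

We can clearly see outliers by using box plots.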
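To make this concrete, here is a minimal sketch of drawing a box plot and reading back the points it marks as outliers. The article does not name a plotting library, so using matplotlib is an assumption, and the `ages` list is made-up sample data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Made-up sample data: nine typical ages and one suspicious value.
ages = [22, 25, 27, 28, 30, 31, 33, 35, 36, 95]

fig, ax = plt.subplots()
parts = ax.boxplot(ages)  # by default the whiskers reach 1.5 * IQR from the box
ax.set_title("Ages")
ax.set_ylabel("Age (years)")

# Points drawn beyond the whiskers ("fliers") are the potential outliers.
flagged = sorted(float(v) for v in parts["fliers"][0].get_ydata())
print(flagged)  # [95.0]

fig.savefig("ages_boxplot.png")
```

The saved image shows a single point far above the upper whisker, which is the value 95.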

2️⃣ Interquartile range test (IQR):

In this method, we calculate the first quartile (Q1) and the third quartile (Q3) and take their difference to get the interquartile range (IQR = Q3 - Q1). We then use Q1 minus 1.5 times the IQR as the lower limit and Q3 plus 1.5 times the IQR as the upper limit of the data; any point outside these limits is flagged as an outlier. If the data is left-skewed or right-skewed, use the IQR method.

IQR
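The IQR rule can be sketched in a few lines using only Python's standard library. The function name `iqr_outliers`, the parameter `k`, and the sample data are illustrative assumptions, and the quartiles are computed with the "inclusive" method of `statistics.quantiles`; other quartile conventions can give slightly different limits.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower, upper)) using the k * IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Made-up sample data: nine typical ages and one suspicious value.
ages = [22, 25, 27, 28, 30, 31, 33, 35, 36, 95]

outliers, (lower, upper) = iqr_outliers(ages)
print(lower, upper)  # 16.375 45.375
print(outliers)      # [95]
```

Here Q1 is 27.25 and Q3 is 34.5, so the IQR is 7.25 and anything below 16.375 or above 45.375 is flagged.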

3️⃣ Z-score test:

To calculate the Z-score for a data point, you need to know the mean (μ) and standard deviation (σ) of the dataset.
The formula for calculating the Z-score is as follows: Z = (X - μ) / σ

Where:
Z is the Z-score.
X is the data point.
μ is the mean of the dataset.
σ is the standard deviation of the dataset.

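A point whose Z-score is far from zero is treated as an outlier; a common rule of thumb is an absolute Z-score greater than 3.

If the data is normally distributed, use the Z-score method.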
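As a rough sketch of the Z-score test, the snippet below applies the formula Z = (X - μ) / σ to every point using only the standard library. The cutoff of 3 is a common convention rather than something fixed by the method, and the function name `z_score_outliers` and the sample data are illustrative assumptions.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = statistics.mean(values)       # mean of the dataset
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs((v - mu) / sigma) > threshold]

# Made-up sample data: twenty scores close to 50 and one extreme value.
scores = [48, 49, 50, 51, 52] * 4 + [90]

flagged = z_score_outliers(scores)
print(flagged)  # [90]
```

Note that with very small samples no point can reach an absolute Z-score of 3, so a lower threshold or the IQR rule may be more practical there.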

Things to keep in mind while choosing between the Z-score and IQR techniques:

~ If the data is normally distributed, use the Z-score method

~ If the data is left-skewed or right-skewed, use the IQR method

Identifying outliers in data is crucial, but removing them requires caution.

⚠️ Outliers hold valuable information, so consider their origin. While measurement errors may justify removal, natural variations should be retained to avoid misleading insights.

Also follow me on Linkedin: https://www.linkedin.com/in/piyush-borhade/
