Outlier Detection in Machine Learning

Liston Tellis
Published in Analytics Vidhya
5 min read · Dec 13, 2021

What are Outliers?

Outliers are data points that differ significantly from the rest of the data points in the dataset. Outliers increase the variance in the dataset, which in turn reduces its statistical power. So it is very important to identify these outliers and treat them accordingly.

Reasons for the occurrence of Outliers

Outliers can occur for various reasons. Some of the most common include:

  • Error in data entry.
  • Inappropriate scaling of datapoints.
  • Errors caused during measurement.
  • Existence of genuine extreme data points.

Importance of Outlier Detection

Now that we know the reasons for the occurrence of outliers, it is also important to understand why identifying them matters. The simple reason is that some measures of central tendency and measures of variability are affected by outliers.

  • Mean: Since the mean is the average of all the values in the dataset, it is affected by the presence of outliers; the mean shifts towards the outlier.
  • Median: The median is the middle value of the dataset and is not affected by the presence of outliers. So we should use the median instead of the mean when dealing with datasets containing outliers.
  • Mode: The mode is the value that occurs the maximum number of times in the dataset and is not affected by outliers.
  • Variance & Standard Deviation: Since the mean is used to calculate both the variance and the standard deviation, both are affected by outliers.
  • Range: Since the range is the difference between the minimum and maximum data points, it is the measure most affected by the presence of outliers.

Therefore, except for the median and mode, most of the other important measures are affected by the presence of outliers. Apart from this, outliers also cause problems while fitting models and inflate the errors. So it is very important to identify these outliers.
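A quick numeric check makes this concrete. This minimal sketch uses only Python's standard library; the sample values are invented for illustration:

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 14]   # a small, well-behaved sample
with_outlier = values + [100]            # the same sample plus one extreme value

print(statistics.mean(values), statistics.median(values))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
# the mean jumps from about 11.86 to 22.875, while the median stays at 12
```

Adding a single outlier roughly doubles the mean but leaves the median untouched, which is exactly why the median is preferred for skewed or outlier-laden data.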

Methods to identify the Outliers

1. IQR Method


Inter Quartile Range (IQR) is the middle 50% of the dataset. In other words, it is the difference between the third quartile (75th percentile) and the first quartile (25th percentile) of the dataset.

IQR = Q3 - Q1

Lower Bound = Q1 - 1.5 * IQR

Upper Bound = Q3 + 1.5 * IQR

The IQR method uses these lower and upper bounds to identify outliers: any value below the lower bound or above the upper bound is flagged as an outlier.
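The bounds above translate directly into code. Here is a minimal sketch with NumPy; the helper name `iqr_outliers` and the sample data are my own illustration:

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return the points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    data = np.asarray(data)
    q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data < lower) | (data > upper)]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 19, 107]
print(iqr_outliers(data))  # the two extreme values, 102 and 107, are flagged
```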

2. Z-Score


The Z-score tells us how many standard deviations above or below the mean a data point lies. It assumes that the data points follow a Gaussian (normal) distribution.

Z-Score = (X - mean) / standard deviation

For a normal distribution:

  • 68% of the data lies within 1 standard deviation of the mean.
  • 95% of the data lies within 2 standard deviations of the mean.
  • 99.7% of the data lies within 3 standard deviations of the mean.

Since the vast majority of data points (99.7%) lie within 3 standard deviations of the mean, any data point with a Z-score greater than +3 or less than -3 is considered an outlier.
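The rule can be sketched in a few lines of NumPy (the function name and the toy data are my own illustration):

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the points whose |Z-score| exceeds the threshold."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()   # standardize each point
    return data[np.abs(z) > threshold]

data = [10, 11, 12, 13, 12, 11, 10, 12, 13, 11] * 3 + [40]
print(zscore_outliers(data))  # only the extreme value 40 exceeds |Z| = 3
```

One caveat: extreme outliers inflate the standard deviation itself and can thereby mask their own Z-scores, so the method is most reliable on reasonably large samples.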

3. Visualization

The presence of outliers can also be detected using various visualization methods. Some commonly used plots include:

  • Scatter plot
  • Box and Whisker plot
  • Histogram
  • Distribution Plot
  • QQ plot
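As an illustration, a box-and-whisker plot takes only a few lines with Matplotlib (the sample data here is invented); points drawn beyond the whiskers are the IQR-style outlier candidates:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 19, 107]

fig, ax = plt.subplots()
ax.boxplot(data)            # fliers beyond the whiskers mark potential outliers
ax.set_ylabel("value")
fig.savefig("boxplot.png")
```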

4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a clustering method used to separate clusters of high density from clusters of low density. It divides the data points into core points, border points, and noise points, where the noise points are the outliers.

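A minimal sketch with scikit-learn's DBSCAN (the toy 2-D data and the `eps`/`min_samples` values are my own choices): noise points come back with the cluster label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two tight clusters plus one isolated point
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0], [8.1, 8.2],
              [25.0, 25.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(X[labels == -1])  # noise points (label -1) are the outliers
```

Unlike the IQR and Z-score methods, DBSCAN works on multi-dimensional data and makes no distributional assumption, though `eps` and `min_samples` must be tuned to the data's scale.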

5. Hypothesis Testing

We can also use hypothesis testing to identify outliers in the dataset. Some well-known hypothesis tests for outlier detection include:

  • Grubbs’ test
  • Chi-square test
  • Dixon’s Q test

Each of the above tests uses a different method to identify outliers. In Grubbs' test, we assume that:

Null hypothesis: All data points in the sample were drawn from a single, normally distributed population.

Alternate hypothesis: One data point in the sample was not drawn from the same normally distributed population as the other data points.

If the p-value is less than the significance level, we can reject the null hypothesis and conclude that one of the values is an outlier.
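Grubbs' test is not shipped directly in SciPy, but the test statistic and its critical value can be computed from the t-distribution in a few lines. This is a sketch of the two-sided version under the stated normality assumption; the data and significance level are illustrative:

```python
import numpy as np
from scipy import stats

def grubbs_test(data, alpha=0.05):
    """Two-sided Grubbs' test: returns (G, G_critical, outlier_detected)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # test statistic: largest absolute deviation in units of the sample std dev
    g = np.max(np.abs(data - data.mean())) / data.std(ddof=1)
    # critical value derived from the t-distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, g > g_crit

g, g_crit, is_outlier = grubbs_test([12, 13, 14, 19, 21, 23, 406])
print(is_outlier)  # the extreme value 406 is flagged as an outlier
```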

Handling Outliers

So now that we know how and why to detect outliers, the next question is what to do with them.

One simple option is to just drop the outlier, but this is not appropriate in all scenarios. Based on the use case, we need to decide whether or not to drop an outlier.

When to drop an outlier?

  • When we know for sure that the outlier is completely wrong.
  • When we have large amount of data.
  • When we can revert to the original data if, at a later stage, we find that dropping the outlier wasn't a good idea.

When not to drop an outlier?

  • When there are a lot of outliers.
  • When dealing with a sensitive or critical use case.

What to do with the undroppable outliers?

  • Imputation: We can replace outlier values with the mean, median, or mode, depending on the use case.
  • Quantile-based Flooring and Capping: In this technique, we floor the lower values (e.g., replacing anything below the 10th percentile with the 10th percentile) and cap the higher values (e.g., replacing anything above the 90th percentile with the 90th percentile).
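Flooring and capping is easy to express with NumPy. A minimal sketch using the percentile choices from the example above (the helper name is mine):

```python
import numpy as np

def floor_and_cap(data, lower_pct=10, upper_pct=90):
    """Clip values below the lower percentile and above the upper percentile."""
    data = np.asarray(data, dtype=float)
    lower, upper = np.percentile(data, [lower_pct, upper_pct])
    return np.clip(data, lower, upper)   # floor low values, cap high values

data = [1, 5, 6, 7, 8, 9, 10, 11, 12, 100]
print(floor_and_cap(data))  # 1 is floored to the 10th percentile, 100 capped to the 90th
```

Unlike dropping, this keeps the dataset the same size while limiting the influence of the extremes.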

Conclusion

Outliers are hard to handle, but cannot be ignored

Outlier detection and handling is one of the main steps in data preprocessing and cannot be ignored. Ignoring outliers will skew the data, and we may not end up with the desired output.

Please do clap and share if you like this article! Happy reading!
