What is an Outlier? How to handle and remove them? Algorithms that are affected by outliers.

Shubhangi Dabral
Published in Analytics Vidhya
8 min read · Sep 14, 2020

In statistics, an outlier is an observation point that is distant from other observations.

These extreme values do not necessarily impact model performance or accuracy, but when they do, they are called "influential" points.

Note: An outlier is a data point that diverges from an overall pattern in a sample. An influential point is any point that has a large effect on the slope of a regression line.

Now the question arises: how can we detect these outliers, and how should we handle them?

Before jumping straight into the solution, let's explore how outliers end up in our dataset in the first place. What is their root cause?

Most common causes of outliers on a data set:

  • Data entry errors (human errors)
  • Measurement errors (instrument errors)
  • Experimental errors (data extraction or experiment planning/executing errors)
  • Intentional (dummy outliers made to test detection methods)
  • Data processing errors (data manipulation or data set unintended mutations)
  • Sampling errors (extracting or mixing data from wrong or various sources)
  • Natural (not an error, novelties in data)

Common Methods for Detecting Outliers

There are multiple methods to identify outliers in the dataset

  • Box plot
  • Scatter plot
  • Z-score method
  • IQR score

Box-Plot

In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

The quickest and easiest way to identify outliers is by visualizing them using plots. If your dataset is not huge (approx. up to 10k observations & 100 features), I would highly recommend you build scatter plots & box-plots of variables. If there aren’t outliers, you’ll definitely gain some other insights like correlations, variability, or external factors like the impact of world war/recession on economic factors. However, this method is not recommended for high dimensional data where the power of visualization fails.

The box plot uses inter-quartile range to detect outliers. Here, we first determine the quartiles Q1 and Q3.

The interquartile range is given by IQR = Q3 - Q1

Upper limit = Q3 + 1.5*IQR

Lower limit = Q1 - 1.5*IQR

Anything below the lower limit or above the upper limit is considered an outlier.
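The limits above can be computed directly. Here is a minimal sketch with NumPy on a made-up sample (the numbers are placeholders, not data from this article):

```python
import numpy as np

# Made-up sample with one extreme value
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Any value outside the fences is flagged as an outlier
outliers = data[(data < lower_limit) | (data > upper_limit)]
print(outliers)  # [102]
```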

Scatter plot

A scatter plot is a type of plot or mathematical diagram that uses Cartesian coordinates to display values for, typically, two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

As the definition suggests, a scatter plot is a collection of points showing the values of two variables. We can draw a scatter plot for two variables from our dataset.

Looking at the plot above, we can see that most of the data points lie in the bottom-left, but there are points far from the rest of the population, such as those in the top-right corner.
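Such a plot can be drawn in a few lines. Here is a minimal sketch with matplotlib on made-up data (the variable names and values are placeholders):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Made-up data: most points cluster in the bottom-left,
# while two points sit far away in the top-right
x = [1, 2, 2, 3, 3, 4, 4, 5, 18, 19]
y = [2, 1, 3, 2, 4, 3, 5, 4, 17, 19]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("variable 1")
ax.set_ylabel("variable 2")
fig.savefig("scatter.png")
```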

Z-score

The Z-score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.

This method assumes that the variable has a Gaussian distribution. The Z-score represents the number of standard deviations an observation is away from the mean:

z = (x - mean) / standard deviation

Here, we normally define outliers as points whose modulus of z-score is greater than a threshold value. This threshold value is usually greater than 2 (3 is a common value).

The intuition behind the Z-score is to describe any data point in terms of its relationship with the mean and standard deviation of the group of data points. Computing Z-scores rescales the data to have mean 0 and standard deviation 1, i.e. the scale of the standard normal distribution.

You might be wondering how this helps in identifying outliers. While calculating the Z-score, we re-scale and center the data, then look for data points that are too far from zero; those points are treated as outliers. In most cases a threshold of 3 (or -3) is used: if the Z-score is greater than 3 or less than -3, the data point is identified as an outlier.

We will use the Z-score function defined in the scipy library to detect the outliers.
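Here is a sketch of that approach on synthetic data, with an outlier planted at row 55, column 1. `scipy.stats.zscore` standardizes each column, and `np.where` returns the positions where the threshold is exceeded:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic dataset: 100 rows, 2 columns, with one extreme value planted
data = rng.normal(10, 1, size=(100, 2))
data[55, 1] = 50.0  # plant an outlier at row 55, column 1

z = np.abs(stats.zscore(data))   # per-column Z-scores
rows, cols = np.where(z > 3)     # positions whose |z| exceeds 3
print(rows, cols)
```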

Don't be confused by the results. The first array contains the row numbers and the second array the respective column numbers, which means z[55][1] has a Z-score higher than 3.

IQR score

The "interquartile range", abbreviated "IQR", is just the width of the box in the box-and-whisker plot. That is, IQR = Q3 - Q1. The IQR can be used as a measure of how spread out the values are.

Statistics assumes that your values are clustered around some central value. The IQR tells how spread out the “middle” values are; it can also be used to tell when some of the other values are “too far” from the central value. These “too far away” points are called “outliers”, because they “lie outside” the range in which we expect them.

The IQR is the length of the box in your box-and-whisker plot. An outlier is any value that lies more than one and a half times the length of the box from either end of the box.

That is, if a data point is below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, it is viewed as being too far from the central values to be reasonable.

Code in Python to calculate the IQR score:
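A minimal sketch with pandas, on a made-up single-column dataset (the column name `value` and the numbers are placeholders):

```python
import pandas as pd

# Made-up dataset with one extreme value
df = pd.DataFrame({"value": [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]})

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

# Boolean mask: True wherever a value falls outside the 1.5*IQR fences
is_outlier = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
print(is_outlier)
```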

Correcting, removing the Outliers

Z-Score

In the previous section, we saw how to detect outliers using the Z-score; now we want to remove or filter them out to get clean data. This can be done with just one line of code, since we have already calculated the Z-score.
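For example, on made-up numbers (not the dataset from earlier sections), the filtering step is a single boolean-indexing line:

```python
import numpy as np
from scipy import stats

# Made-up data with one extreme value (60.0)
data = np.array([10.0, 11.0, 9.5, 10.2, 9.8, 10.5, 9.9, 10.1, 10.4, 9.7,
                 10.3, 9.6, 10.0, 10.6, 9.4, 10.2, 9.9, 10.1, 10.0, 60.0])

z = np.abs(stats.zscore(data))
clean = data[z < 3]  # keep only values whose |z| is below the threshold
print(clean.shape)   # (19,) -- one value removed
```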

IQR Score -

Just like with the Z-score, we can use the previously calculated IQR score to filter out the outliers, keeping only the valid values.
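A sketch of the IQR-based filter, again on made-up numbers with a placeholder column name:

```python
import pandas as pd

# Made-up dataset with one extreme row
df = pd.DataFrame({"value": [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]})

Q1 = df["value"].quantile(0.25)
Q3 = df["value"].quantile(0.75)
IQR = Q3 - Q1

# Keep only the rows that fall inside the 1.5*IQR fences
df_clean = df[(df["value"] >= Q1 - 1.5 * IQR) & (df["value"] <= Q3 + 1.5 * IQR)]
print(df_clean.shape)  # (9, 1) -- one row removed
```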

Algorithms that are sensitive to outliers

  • Linear Regression
Impact of outliers on linear regression

Outliers have a dramatic impact on linear regression. They can change the model equation completely, i.e. lead to bad predictions or estimates. Above we can see that the value of r changes with the addition of outliers.
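The effect on r can be demonstrated with a small sketch (made-up numbers, not the data from the figure; `scipy.stats.linregress` reports the correlation coefficient as `rvalue`):

```python
from scipy import stats

# Ten points lying exactly on the line y = 2x + 1
x = list(range(10))
y = [2 * xi + 1 for xi in x]

r_clean = stats.linregress(x, y).rvalue            # perfect fit: r = 1.0
r_outlier = stats.linregress(x + [10], y + [100]).rvalue  # one extreme point added

print(round(r_clean, 3), round(r_outlier, 3))  # r drops well below 1
```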

  • Logistic Regression
Impact of outliers on logistic regression

Logistic regression is also affected by outliers, as we can see in the diagram above.

  • SVM
SVM without outliers
SVM with an outlier

SVM is not very robust to outliers. The presence of even a few outliers can lead to very bad global misclassification.

  • K-Nearest Neighbours (KNN)

The algorithm is sensitive to outliers, since a single mislabeled example can dramatically change the class boundaries. Anomalies affect the method significantly, because k-NN gets all its information from the input points themselves, rather than from a model that tries to generalize the data.

Proposal: avoid a very small number of neighbours (k=1, for example), especially if your data is noisy, which in practice it always is.

  • Naive Bayes

Yes, outliers affect Naive Bayes. If a word appears in the test data that was never seen in training, its estimated probability in a given class is zero. Since Naive Bayes multiplies the probabilities of the words for that class, the whole product becomes zero, which leads to a wrong result.
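The zero-frequency problem can be seen in a toy calculation (the word counts below are made up). The standard fix is Laplace (add-one) smoothing, which sklearn's `MultinomialNB`, for instance, applies by default:

```python
# Made-up word counts for one class, e.g. "spam"; "prize" was never seen
word_counts = {"win": 4, "money": 5, "free": 3}
total = sum(word_counts.values())  # 12

def word_prob(word, smoothing=0):
    """P(word | class), optionally with Laplace (add-one) smoothing."""
    vocab_size = len(word_counts) + 1  # training vocabulary plus the unseen word
    return (word_counts.get(word, 0) + smoothing) / (total + smoothing * vocab_size)

test_doc = ["win", "money", "prize"]  # "prize" is unseen in training

unsmoothed = 1.0
smoothed = 1.0
for w in test_doc:
    unsmoothed *= word_prob(w)
    smoothed *= word_prob(w, smoothing=1)

print(unsmoothed)  # 0.0 -- one unseen word wipes out the whole product
print(smoothed)    # small but non-zero
```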

  • Decision Tree

Decision trees are robust to outliers. Trees divide items using splitting lines, so it makes no difference how far a point is from those lines.

  • Random Forest

Random forest handles outliers by essentially binning them.

  • K-Means

The k-means algorithm updates the cluster centers by taking the average of all the data points that are closer to each cluster center. When all the points are packed nicely together, the average makes sense. However, when you have outliers, this can affect the average calculation of the whole cluster. As a result, this will push your cluster center closer to the outlier.

Example

The mean of 2, 2, 2, 3, 3, 3, 4, 4, 4 is 3.

If we add a single 23 to that, the mean becomes 5, which is larger than any of the other values.

Since in k-means, you’ll be taking the mean a lot, you wind up with a lot of outlier-sensitive calculations.

That’s why we have the k-medians algorithm. It just uses the median rather than the mean and is less sensitive to outliers.
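The arithmetic above can be checked in a few lines of plain Python (no assumptions beyond the numbers in the example):

```python
data = [2, 2, 2, 3, 3, 3, 4, 4, 4]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    return s[len(s) // 2]  # middle element (upper median for even lengths)

print(mean(data), median(data))  # 3.0 3

with_outlier = data + [23]
print(mean(with_outlier))    # 5.0 -- pulled up by the single outlier
print(median(with_outlier))  # 3  -- the median barely moves
```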

There are many more ways to detect and correct outliers, but here I have covered the basic and most important techniques.

Happy reading :)
