Detecting Outliers and its Treatment

Experiments and observations

Nilesh Barla

Published in

Analytics Vidhya

8 min readMar 13, 2020

Source: https://www.pexels.com/photo/photo-of-herd-of-sheep-on-grassland-3379771/

Introduction

Outliers are data points in a population that does not belong to that population. For example, a black sheep in the herd of white sheep.

Outliers can skew your results. Or we can say that it shifts the behavior of the data from the results what was supposed to be true to results that are not true. We call that shift as the Error.

Error is the difference between the actual results and the predicted results. Predicted results that come from the Machine Learning model can be affected if there are unwanted data present it in. To make sure that the Machine Learning model makes correct predictions, we have to ensure that we deal with the outliers appropriately.

In this article, I will try to help you to build an intuition as to how to approach a Machine Learning problem while dealing with outliers.

To understand the problem I will be using a univariant dataset — which will have one independent variable and a dependent variable.

Note: The codes are available in my Github Repo. Feel free to check. And another important point is that I am not trying to explore the ML model itself but rather to design and build a model to understand the effects of outliers in a ML model and also to obverse which model make the best relationship between the independent and dependent variables.

Finding Outliers

Outliers are not difficult to find. We can use a statistical method to find traces of outliers or outliers itself. The tool that we will be studying is Standard Deviation, inter-quartile range, and boxplot.

Standard Deviation is a measure of variance — or the spread of data — from the mean which is the center of the scale also denoted as ‘0’. Most of the data reside between the first and second standard deviation. Beyond that everything is considered to be an outlier.

In general, we try to remove those data points which are on and beyond the 2nd standard deviations.

Note: When we are talking about standard deviation we take into consideration both positive and negative standard deviation together.

The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively — i.e. 25, 50 and 75 percentile.

Once we have these values we can then set a threshold value beyond and below which are the outliers.

The image below is of a box plot. “A boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points” — Wikipedia.

Once we find the outliers we can then treat them accordingly.

Dealing with Outliers

There are several ways in which we can deal with outliers. But before we take any sort of measures to deal with outliers we should remember certain points.

Outliers can be good and bad as well. It is subjective to the problem that we are facing. Some problems need outliers for example anomaly detection while others don’t because it skews the data.
Removing outliers can also reduce the number of observations which can lead to overfitting — where the model is too complex to learn the simplicity of the data provided. We have to ensure that the parameters of the model and the data are correctly balanced.

We will be focusing on the second point as the first point is out of the scope of this article.

Now, there are certain ways of dealing with outliers.

Removing.
Scaling.
Using a non-parametric model such as Random Forest to reduce the effect.
Using deep neural networks for the same.

We will be covering all the points mentioned above to see which one the above method will provide us a good outcome.

Removing The Outliers

We can use the interquartile range to remove the outliers.

q25, q75 = np.percentile(data.carat, 25), np.percentile(data.carat, 75)
iqr = q75 - q25; iqr
cut_off = iqr * 2 
lower, upper = q25 - cut_off, q75 + cut_off
outliers = [x for x in data.carat if x < lower or x > upper]#removing the data from the datasetdata = data.drop(data[(data.carat > upper) | (data.carat<lower)].index)

Note: The cut-off variable is the product of mean and standard deviation.

To check whether the unnecessary data point has been removed from the data. We will use two model one is a parametric model — Stochastic Gradient Descent Regressor — and a non-parametric model.

I will be using the vanilla form of all the models that I will be implementing. The aim is to see whether the preprocessing technique works or not.

After fitting the models with data that does not contain outliers here’s what we got.

import mathdef rmse(X,y): return math.sqrt(((X-y)**2).mean())def print_score(model):
    return [rmse(model.predict(X_train), y_train),rmse(model.predict(X_test), y_test), 
     model.score(X_train, y_train), model.score(X_test, y_test)]print('The scores for Stochastic Gradient Descent are: ', print_score(SGD))print('The scores for Random Forest are: ',print_score(rf))>>The scores for Stochastic Gradient Descent are: [0.33437466833442886, 0.33492309110522206, 0.8814293026043725, 0.8808833158341957]>>The scores for Random Forest are:  
[0.2521855459326567, 0.25436238295812885, 0.93255480422885, 0.9312950253266471]

The first two indexes tell us the error of the model performed on the training and testing dataset respectively. Likewise, the last two indexes tell us the accuracy of the model performed on the same.

Hence we can see that there is no overfitting or underfitting. Which is a good sign.

Scaling The Outliers

Scaling is a technique where the data points are brought to a common range of values, where the shape of the distribution remains the same.

So, why scaling?

The primary task of the Machine Learning model is to find a value for each feature such that they could converge faster to reach a unique solution. If the numbers are larger then finding that particular value becomes harder for the machine learning model. Hence we try to bring them to a range such that finding that unique number is easier and faster.

Scaling is most effective when the numbers are bigger and the population of the same has a lot of variance in it.

from sklearn.model_selection import train_test_splitX, y = data_raw.carat, np.log(data_raw.price)from sklearn.preprocessing import RobustScalerscaling = RobustScaler()X = scaling.fit_transform(np.array(X).reshape(-1,1))X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=23, test_size=0.3)X_train = np.array(X_train).reshape(-1,1)
X_test = np.array(X_test).reshape(-1,1)

I am using Robust Scalar because of the outliers present in the dataset. Robust scalar reduces the effect of the outlier because it uses the interquartile range.

After fitting the models with scaled data here’s what we got.

print('The scores for Stochastic Gradient Descent are: ',print_score(SGD))print('*'*20)print('The scores for Random Forest are: ',print_score(rf))>>The scores for Stochastic Gradient Descent are:  [0.3968648110452676, 0.3995578537520062, 0.8475738040346006, 0.8435787127417848]
********************
>>The scores for Random Forest are:  [0.2505442962295736, 0.2532557461978344, 0.9392503939398784, 0.9371572577031487]

Although the scores have dropped for Stochastic Gradient Descent Random forest the scores bumped up by a fraction. This may be because the variance of the input data was not high. Most of the input data were at the range of 0–2 hence the after scaling it didn’t make that much sense. But we know for sure that after scaling the values lingers around the interquartile range.

Neural Networks

Neural Networks are state of art mathematical models inspired by the functionality of the brain to make predictive models and to find the underlying structure of the data.

We will be building two Neural Networks. Both the Deep Learning model will differ in terms of the activation function. We will compare how they perform against each other.

Again, the goal here is to see which model is robust to outliers. We will not be performing any of the preprocessing techniques that we performed earlier.

The outline of the model is very simple. It will layer with 300 neurons which will have a Sigmoid activation function.

model = keras.models.Sequential([
    keras.layers.Dense(300, activation="sigmoid", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1),
])
model.compile(loss="mean_squared_error", optimizer=keras.optimizers.SGD(lr=1e-3))
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)

After 20 iterations the vanilla neural network with sigmoid activation performed quite well. There was very little overfitting which is negligible.

Coming to the next deep learning model with the LeakyRelu activation function this is what the results showed. The model with the LeakyRelu activation function learned faster and performed better than the model with Sigmoid function activation.

Conclusion

Removing the outliers can cause a reduction of observation causing the model to lose important data points which can be vital to make a relationship between the independent variable and the dependent variable.
Scaling is a better option when it comes to reducing the effects of outliers. In that also we should use Robust Scalar because it works best when it comes to working with outliers.
Other scaling methods are also available there like Standard Scalar, Min-Scalar, Normalisation and so forth. Read this article to know more about Scaling methods.
We also found that the Parametric model worked better when the outliers were removed and not scaled, mostly because most of the values are clustered in the range of 0–2 making it less effective to feature scaling.
The non-parametric model showed some different results it bumped up by a fraction, this can be because the non-parametric relies more on pruning of the features rather than assigning weights to the features.
Deep Neural Networks are less prone to outliers, especially if we take care of the activation functions.
Deep Neural Networks undergo multiple iterations that allow them to correct the weights and then apply it. This is more effective when there are lots of nodes with each node contributing to get one unique solution. This makes Deep Neural Networks more robust to outliers.

Note: Our aim was not to design a complex model but to design and build a model to understand the effects of removing outliers, scaling the outliers, and also to obverse which models make the best relationship between the independent and dependent variables.