Neural networks are often considered black boxes: the mathematical function that maps input variables to outputs is too complex to derive by hand. Unfortunately, the same is true of how neural networks handle outliers.

Before getting into this post, my advice is to have a basic understanding of neural networks and some aspects of machine learning, so that you get the complete picture of this blog.

In case you are new to neural networks, refer to these:

I always wondered how neural networks deal with outliers, especially when we use the Rectified Linear Unit (ReLU) as the activation function. You may ask why only ReLU and not other activation units like sigmoid, tanh, and arctan. The main reason is that ReLU is piecewise linear (it is linear for positive inputs), so it can be affected by the presence of outliers. The second reason is that it is the most widely used activation as of 2019.

Why am I focusing only on ReLU?

For the answer, we should look at a concept called squashing in logistic regression. Let's first understand how logistic regression works and how it handles outliers using the sigmoid activation function. I am taking logistic regression into consideration because a single neuron (perceptron) is the same as logistic regression if a sigmoid activation is used; without any activation function, it is linear regression.

Logistic regression and Single Neuron(Perceptron)

Logistic regression is a classification technique: it tries to differentiate between two classes (YES or NO, ZERO or ONE), and it can be extended to differentiate between multiple classes as well (ZERO, ONE, TWO, or THREE).

Examples:

1. Two-class: the animal is a cat or a dog.

2. Multi-class: the weather is sunny, rainy, or windy.

How logistic regression works

Broadly speaking, it tries to draw lines or planes that separate two or more categories. I am going to explain logistic regression in the simplest possible way, which is enough to understand this whole blog. In case you want more details about logistic regression, click here.

Objective: find a line or plane 'w' that minimizes the error.

Working: Consider a 2-class classification problem with the features height and weight. The task is to determine whether a given animal is a cat or a dog.

Label cat as -1 and dog as +1. The output is y = +1 or -1, and the input vector x holds the weight and height.

The assumption is that the given data is linearly separable.

Let's understand logistic regression from a geometric point of view, which is easy to grasp. Plot the weights and heights.

Looking at the plot, anyone can say that points with higher weights and heights are dogs (marked as orange circles) and those with lower weights and heights are cats (marked as red circles). But how does logistic regression solve this? It simply draws a line to divide the two classes.

After drawing the line, we can easily say that points above the line are dogs and points below it are cats.

Let's get into a simplified procedure.

Step 1: Logistic regression initially draws a random line in space.

Step 2: It calculates the distance from every point to the line and updates the line so that the resulting total distance is maximized.

The distance d from a point to the plane is wx, where w is the unit normal to the plane and x is the input vector (assuming the plane passes through the origin). Points above the plane have a positive distance (wx > 0), and points below the plane have a negative distance (wx < 0).

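The mathematical equation becomes:

$$w^{*} = \arg\max_{w} \sum_{i=1}^{n} y_i \, w^{T} x_i$$

where w is the line or plane, y_i is the output and x_i is the input vector of the i-th point (so w^T x_i is the distance wx from above), the summation from i = 1 to n runs over all the input data points, and argmax over w means we want the w that gives the maximum total distance.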

If all points are correctly classified, every term y·wx is positive, so the resulting summation is more positive than in a situation where there are some misclassifications.
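To make the objective concrete, here is a minimal NumPy sketch that evaluates the sum of y·wx for two candidate planes. The toy points, the two candidate normals, and the function name `signed_distance_objective` are made up for illustration and are not part of the original experiment; the planes are assumed to pass through the origin.

```python
import numpy as np

def signed_distance_objective(w, X, y):
    """Sum of y_i * (w . x_i); w is assumed to be a unit normal through the origin."""
    return float(np.sum(y * (X @ w)))

# Hypothetical toy data: two cats (label -1) and two dogs (label +1)
X = np.array([[-1.0, -1.0], [-2.0, -1.5], [1.0, 1.0], [2.0, 1.5]])
y = np.array([-1, -1, 1, 1])

# Candidate planes, each described by its unit normal vector
w_good = np.array([1.0, 1.0]) / np.sqrt(2.0)   # puts every point on its correct side
w_bad = np.array([1.0, -2.0]) / np.sqrt(5.0)   # puts every point on the wrong side

score_good = signed_distance_objective(w_good, X, y)
score_bad = signed_distance_objective(w_bad, X, y)
print(score_good, score_bad)
```

The plane that classifies every point correctly gets a positive score, while the plane that misclassifies them gets a negative one, which is why maximizing this sum pushes the plane toward fewer misclassifications.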

Logistic regression in case of outliers

The above equation gives 100 percent accuracy as long as there are no outliers or extreme points and no misclassifications. Take a look at the image below, where outliers are introduced into the data and the line/plane shifts as a result.

When we apply the same equation, the resulting plane is the one in the image above, which accepts 3 misclassifications in order to maximize the distance. To handle this problem we should bring in the concept mentioned earlier in this blog: squashing. Squashing decreases the impact of extreme points/outliers. Thanks to squashing, the line/plane is less affected by outliers, which reduces misclassifications.

From this image (left image, after squashing) we can see that only 1 point is misclassified, which is better than what we got above. The effect of squashing comes from the underlying mathematical function called the sigmoid.

The sigmoid f(x) = 1/(1 + e^(-x)) lies between 0 and 1 for all values of x. When we apply the sigmoid to the distances, we strike a balance between outliers located at the extremes and normal points, so outliers have less impact on the line/plane being adjusted, resulting in fewer misclassifications. This is what squashing is all about.
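The squashing effect is easy to see numerically. The sketch below uses hypothetical values of y·wx for five points under some candidate plane, where the last point is a misclassified outlier far from the plane; the numbers are invented for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical y_i * (w . x_i) values: four correctly classified points
# and one misclassified outlier lying very far from the plane
signed = np.array([1.0, 1.5, 2.0, 1.2, -100.0])

raw_sum = signed.sum()                 # dominated by the single outlier
squashed_sum = sigmoid(signed).sum()   # every point contributes between 0 and 1
print(raw_sum, squashed_sum)
```

Without squashing, the single outlier outweighs the four correctly classified points and drags the total below zero. After the sigmoid, the outlier can cost at most about one point's worth, so the plane has little incentive to move toward it.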

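After applying the sigmoid, the equation becomes the following, which is less prone to outliers:

$$w^{*} = \arg\max_{w} \sum_{i=1}^{n} \sigma\left(y_i \, w^{T} x_i\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$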


Before moving on to neural networks, let us look at ReLU

ReLU is defined as f(x) = max(0, x). As we can see, for any value of x ≥ 0, f(x) depends linearly on x (f(x) = x), and it is zero otherwise.

However, when we apply ReLU to the above problem, the resulting line/plane is similar to what we got in logistic regression without squashing, because ReLU simply returns the linear function wx when wx ≥ 0 and zero when wx < 0. That means ReLU cannot squash the impact of outliers or, to be more precise, extreme points. So, from the analysis so far, we can say that ReLU is more prone to outliers than sigmoid.
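A quick side-by-side comparison shows the difference. This is only an illustrative sketch with made-up input values, not part of the experiment that follows.

```python
import numpy as np

def relu(z):
    """ReLU: passes positive values through unchanged and zeroes out the rest."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic function; output always stays within [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up pre-activation values; the last one plays the role of an extreme point
z = np.array([-5.0, 0.0, 1.0, 10.0, 1000.0])

relu_out = relu(z)
sigmoid_out = sigmoid(z)
print(relu_out)
print(sigmoid_out)
```

The extreme input passes through ReLU at full size, while the sigmoid output never exceeds 1 no matter how large the input gets.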

Now let us come back to neural networks

Let's simply experiment with neural networks using ReLU and sigmoid on a regression dataset.

I am using the California Housing dataset (for details click here).

The reason for using this dataset is that it has some extreme points or outliers, so we can perform our analysis better. The analysis is done in two stages.

First stage: apply various neural network architectures to the data and check the resulting error.

Second stage: remove outliers from the data, then apply the same neural networks and check the resulting error.

Fetching the data set

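import pandas as pd
from sklearn.datasets import fetch_california_housing

d = fetch_california_housing()
# 'da' is used below but never defined in the original snippet;
# it is assumed to be the feature matrix as a DataFrame
da = pd.DataFrame(d.data, columns=d.feature_names)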

Splitting dataset in train and test

from sklearn.model_selection import train_test_split

# Hold out 30 percent of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(da, d.target, test_size=0.30)

Standardizing the data

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Scale the test split with the statistics learned from the training split
X_test = sc.transform(X_test)

Stage 1: Applying NN without removing outliers

Model 1: Architecture: input-output layered NN (1–1)

from keras.models import Sequential
from keras.layers import Dense

# A single Dense unit maps the 8 input features directly to the output
model = Sequential()
model.add(Dense(1, activation='relu', input_shape=(8,)))
model.compile(optimizer='adam', loss='mean_squared_error')
# The test split doubles as validation data during training
history = model.fit(X_train, y_train, batch_size=32, epochs=600, validation_data=(X_test, y_test))

Model 1 with ReLU as the activation function: at the end of training, the MSE (loss) is 0.50 on train and 0.509 on test.

Model 1 with sigmoid as the activation function: at the end of training, the MSE (loss) is 0.389 on train and 0.402 on test.

As expected, the loss is higher with ReLU than with sigmoid because we had not removed the outliers from the dataset. ReLU has no squashing property, which explains its higher loss, whereas sigmoid does well by squashing those outliers.

However, model 1 used only a single neuron and had no hidden layers. If we change the architecture by adding hidden layers, the results may be different. Let's see.

Model 2: Architecture: input, 2 hidden, and output layered NN (64–32–16–1)

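model = Sequential()
# Only the first layer appeared in the original snippet; the remaining layers follow the stated 64-32-16-1 layout, with activations and a linear output assumed
model.add(Dense(64, activation='relu', input_shape=(8,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train, y_train, batch_size=32, epochs=600, validation_data=(X_test, y_test))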

Model 2 with ReLU as the activation function: at the end of training, the MSE (loss) is 0.2334 on train and 0.279 on test.

Model 2 with sigmoid as the activation function: at the end of training, the MSE (loss) is 0.257 on train and 0.309 on test.

Now the results are surprising, since ReLU performs better than sigmoid. What do you think is the reason behind this? There is a lot to say in answer to this question, so let me put it in the simplest way. When we go deeper into neural networks, the loss depends not only on outliers; there are many other aspects to consider. The most important one is the vanishing gradient problem, which is mainly observed with sigmoid activations. Vanishing gradient is a phenomenon in backpropagation where the gradients become so small that the network hardly learns anything and its weights (w) stay almost constant.
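The vanishing gradient effect can be illustrated with the derivatives alone. The sketch below is a simplification that ignores the weight matrices and looks only at the activation derivatives that get multiplied together during backpropagation; the depth of 3 is an assumption chosen to match the three activated layers of model 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its largest value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

depth = 3  # assumed number of stacked activations

# Best case for sigmoid: every layer sits exactly at its steepest point
sigmoid_factor = sigmoid_grad(np.array([0.0]))[0] ** depth
# ReLU with positive pre-activations passes the gradient through unchanged
relu_factor = relu_grad(np.array([2.0]))[0] ** depth
print(sigmoid_factor, relu_factor)
```

Even in the best case, three sigmoid layers shrink the gradient by a factor of 0.25 cubed, while active ReLU units leave it untouched. This is one reason the deeper ReLU model can train better despite being more exposed to outliers.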

In logistic regression, sigmoid is mainly used for squashing, but in neural networks it no longer plays only that role: it now acts as an activation function that helps activate particular neurons.

The above image is an example of how a small number of hidden layers and neurons lets the mapping function (blue line) be affected by outliers, which happens mainly with the ReLU activation function and less with sigmoid. Having said that, let's get into stage 2.

Stage 2: Applying NN after removing outliers

Since we have few features, we can analyze each feature individually using box plots to detect outliers in the dataset. The image below shows box plots of 6 features from the dataset.

The points marked with yellow circles are outliers. So let's see if removing these points from the dataset can reduce the MSE (loss).
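For reference, box plots conventionally draw as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly with NumPy; the function name `boxplot_outliers` and the sample values are hypothetical, and this is not the exact procedure used for the plots above.

```python
import numpy as np

def boxplot_outliers(values, whis=1.5):
    """Return the values a standard box plot would draw as individual outlier points."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr   # lower whisker limit
    upper = q3 + whis * iqr   # upper whisker limit
    return values[(values < lower) | (values > upper)]

# Hypothetical feature values with one extreme point
sample = [1.0, 1.1, 0.9, 1.2, 1.0, 0.95, 1.05, 12.0]
flagged = boxplot_outliers(sample)
print(flagged)
```

Only the extreme value falls outside the whisker limits here, which mirrors how the isolated points appear in the box plots.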

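Calculating percentiles for each feature

import numpy as np

# Compare the 99th percentile with the maximum (100th percentile) of one feature
print('99TH AND 100TH PERCENTILES OF FEATURE AVEBEDRMS:', np.percentile(da.AveBedrms, [99, 100]))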

Similarly for the remaining features: if there is a large difference between the 99th and 100th percentiles, select the 99th percentile as the threshold and remove the points above it.
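The thresholding step could look roughly like the sketch below. The helper `drop_above_percentile` and the toy DataFrame are hypothetical stand-ins, and the sketch applies the same 99th-percentile cut to every listed column, whereas the procedure above picks the threshold per feature after inspecting it.

```python
import numpy as np
import pandas as pd

def drop_above_percentile(df, columns, q=99):
    """Keep rows whose value in every listed column is at or below that column's q-th percentile."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        threshold = np.percentile(df[col], q)
        mask &= df[col] <= threshold
    return df[mask]

# Hypothetical stand-in data: 99 ordinary values and one extreme point
toy = pd.DataFrame({"AveBedrms": [1.0] * 99 + [34.0]})
cleaned = drop_above_percentile(toy, ["AveBedrms"])
print(len(toy), len(cleaned))
```

All thresholds are computed on the full DataFrame before any row is dropped, so the order of the columns does not change the result.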


After playing around and removing the extreme points, which constitute about 2.4 percent of the whole dataset, we can again split our data into train and test and standardize it.

Applying the same architectures discussed above to this data:

Model 1 with ReLU as the activation function: at the end of training, the MSE (loss) is 0.431 on train and 0.395 on test.

Model 1 with sigmoid as the activation function: at the end of training, the MSE (loss) is 0.394 on train and 0.353 on test.

After removing outliers, model 1 with ReLU performed significantly better than model 1 with ReLU in stage 1. Even model 1 with sigmoid improved somewhat on the test set, because sigmoid only squashes the impact of outliers and does not completely eliminate it; that is what brings the change in loss.

Model 2 with ReLU as the activation function: at the end of training, the MSE (loss) is 0.242 on train and 0.246 on test.

Model 2 with sigmoid as the activation function: at the end of training, the MSE (loss) is 0.258 on train and 0.257 on test.

Model 2 with ReLU seems to perform a little better here, mainly because it converges faster than sigmoid and is less prone to the vanishing gradient problem.


From the whole experiment: ReLU is affected by outliers when the neural network is shallow. When the architecture goes deeper, ReLU behaves like other activation functions with respect to outliers, and it even tends to regularize better and converge faster than the others.

Full code : https://github.com/santoshketa/handling-outliers-in-Neural-in-nn-

