Regularization: Batch Normalization and Dropout

Batch normalization and dropout act as regularizers to overcome overfitting in deep learning models.

aditi kothiya
Analytics Vidhya
5 min read · Jul 16, 2020



Have you come across a model that overfits your dataset?

One of the reasons for overfitting is large weights in the network. Large weights can be a sign of an unstable network, where small changes in the input lead to large changes in the output. A solution to this problem is to update the learning algorithm to encourage the network to keep its weights small. This is called regularization.

In this article, we will discover how regularization helps to train networks faster, reduce overfitting, and make better predictions with deep learning models.

Two techniques that are commonly used for regularization are mentioned below:

  • Batch Normalization
  • Dropout

Let’s talk about batch normalization first.

Batch Normalization

Batch normalization, also known as batch norm, is a technique for improving the speed, performance, and stability of artificial neural networks. The idea is to normalize the inputs of each layer in such a way that they have zero mean and unit standard deviation.
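
In practice, batch norm is usually added as a layer between the existing layers of a network. Here is a rough, illustrative sketch, not from the original article; the tf.keras API choice and layer sizes are my own assumptions:

```python
import tensorflow as tf

# A small fully connected network with a batch-norm layer after each hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),   # standardizes this layer's outputs per mini-batch
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```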

Why should we normalize the input?

Let’s say we have 2D data with two features, X1 and X2. The X1 feature has a very wide spread, roughly between -200 and 200, whereas the X2 feature has a very narrow spread. The left graph shows the data with its original, very different ranges. The right graph shows the same data after normalization: the values lie between -2 and 2 and are normally distributed with zero mean and unit variance.

[Figure (source): data with very different feature ranges on the left, and the same data after normalization on the right]
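
As a small illustration of this standardization step (my own sketch, with made-up feature ranges mirroring X1 and X2), each feature is shifted by its mean and divided by its standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data: X1 has a wide spread (about -200 to 200), X2 a narrow one
X = np.column_stack([
    rng.uniform(-200, 200, size=1000),   # X1
    rng.uniform(-2, 2, size=1000),       # X2
])

# Standardize each feature: subtract its mean, divide by its standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.mean(axis=0), X.std(axis=0))            # very different scales
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # roughly [0, 0] and [1, 1]
```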

Essentially, scaling the inputs through normalization gives the error surface a more spherical shape, where it would otherwise be a very elongated, high-curvature ellipse. Having an error surface with high curvature means that we take many steps that aren’t necessarily in the optimal direction. When we scale the inputs, we reduce the curvature, which makes methods that ignore curvature, like gradient descent, work much better. When the error surface is circular or spherical, the gradient points right at the minimum.

[Figure (source): oscillations of gradient descent are large in high-curvature regions of the objective landscape]
[Figure (source): oscillations of gradient descent are moderate when the objective landscape is more spherical]

Why is it called batch normalization?

In batch normalization, just as we standardize the inputs, we standardize the activations at all the layers so that, at each layer, we have zero mean and unit standard deviation.

This is how we can normalize the activations at every layer: for a layer’s activation h over a mini-batch, we compute

ĥ = (h - μ) / σ

where μ and σ are the mean and standard deviation computed over the current mini-batch.

The final activation will then be

h_final = γ · ĥ + β

The parameters γ and β are learned along with the other parameters of the network: because h_final depends on γ and β, they contribute to the loss and are updated by gradient descent just like the weights.

Now, if γ (gamma) becomes σ (the standard deviation) and β (beta) becomes μ (the mean), the equation becomes

h_final = σ · ĥ + μ

and because ĥ has σ = 1 and μ = 0, we can say that

h_final = h

so the normalization is completely undone and the original activation is recovered.

So the network has flexibility in both directions (a small code sketch follows the list below):

  • If keeping the activations normalized helps to reduce the loss, the network can learn γ = 1 and β = 0, so the activations stay standardized.
  • Otherwise, the learnable parameters γ and β will be adjusted in whatever way decreases the overall loss, up to and including undoing the normalization entirely.
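
Here is a rough NumPy sketch of the computation described above (the function name and shapes are my own choices, not from the article). It also checks the two special cases: γ = 1, β = 0 leaves the activations standardized, while γ = σ, β = μ recovers the original activations:

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Batch norm over a mini-batch of activations h with shape (batch_size, features)."""
    mu = h.mean(axis=0)                  # per-feature mini-batch mean
    sigma = h.std(axis=0)                # per-feature mini-batch standard deviation
    h_hat = (h - mu) / (sigma + eps)     # zero mean, unit standard deviation
    return gamma * h_hat + beta          # learnable scale and shift

rng = np.random.default_rng(1)
h = rng.normal(5.0, 3.0, size=(32, 4))   # a mini-batch of activations

# gamma = 1, beta = 0  ->  plain standardization
out = batch_norm_forward(h, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1

# gamma = sigma, beta = mu  ->  the original activations are recovered
out_id = batch_norm_forward(h, gamma=h.std(axis=0), beta=h.mean(axis=0))
print(np.allclose(out_id, h, atol=1e-3))                     # True (up to eps)
```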

How does batch normalization work as a regularizer?

We compute the mean and standard deviation from a mini-batch, not from the entire dataset, so they are noisy estimates of the true statistics. At every layer, this effectively adds random noise to the activations: each mini-batch is normalized slightly differently, which introduces a small covariate shift into the activations at every training step. This acts as a regularizer: the network has to predict the right labels from slightly less reliable information, which also makes the model more robust.
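
A small sketch of where this noise comes from (my own illustration, not from the article): the statistics of each random mini-batch differ slightly from those of the full dataset, so the normalization applied at each training step is slightly different.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these are the activations of one unit over the whole training set
activations = rng.normal(10.0, 4.0, size=10_000)
print("full-data mean/std:", activations.mean().round(2), activations.std().round(2))

# Each mini-batch has slightly different statistics, so the normalization
# applied by batch norm changes a little at every training step (the "noise").
for _ in range(3):
    batch = rng.choice(activations, size=64, replace=False)
    print("mini-batch mean/std:", batch.mean().round(2), batch.std().round(2))
```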

Ensemble Method:

Ensemble learning helps improve machine learning results by combining several models. Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions.

Idea of Dropout

Training one deep neural network with a large number of parameters on the data might lead to overfitting.

Can we train multiple neural networks with different configurations on the same dataset and take the average value of these predictions?


Such ensembles of neural networks with different model configurations are known to reduce overfitting, but they require the additional computational expense of training and maintaining multiple models.

This is where the dropout layer comes into the picture.

A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout, and it offers a very computationally cheap and effective regularization method to reduce overfitting and improve generalization.
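
As a hedged usage sketch (not from the article; the tf.keras API choice, layer sizes, and drop rate of 0.5 are my own assumptions), dropout is typically added as a layer after the layers whose outputs should be dropped:

```python
import tensorflow as tf

# Dropout layers randomly zero out part of the previous layer's output during
# training only; at inference time they pass activations through unchanged.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # drop each activation with probability 0.5
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```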


During training, some of a layer’s outputs are randomly dropped with a certain probability p. This has the effect of making the layer look like, and be treated like, a layer with a different number of nodes and different connectivity to the prior layer. In effect, each update during training is performed with a different view of the configured layer.
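
To make the mechanism concrete, here is a rough sketch of inverted dropout, the variant used by most frameworks (the function name and the default p = 0.5 are my own assumptions): the surviving activations are scaled by 1 / (1 - p) during training so their expected value matches what the layer produces at test time, when nothing is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p=0.5, training=True):
    """Inverted dropout: drop each activation with probability p during training."""
    if not training:
        return h                            # the layer does nothing at test time
    mask = rng.random(h.shape) >= p         # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)             # rescale so the expected value is unchanged

h = np.ones((4, 6))                         # a toy mini-batch of layer outputs
print(dropout_forward(h, p=0.5))            # about half the entries zeroed, survivors scaled to 2.0
print(dropout_forward(h, training=False))   # unchanged at inference time
```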

Conclusion

  1. Batch normalization keeps the gradients from being distracted by outliers and keeps the activations within a common range across each mini-batch (by normalizing them), which accelerates the learning process.
  2. Dropout is an approach to regularization in neural networks which helps to reduce interdependent learning amongst the neurons.

Citation Note: The content and the structure of this article are based on the deep learning lectures from One-Fourth Labs — PadhAI.

Feel free to comment if you have any feedback for me to improve on, or if you want to share any thoughts or experience on the same.

Do you want more? Follow me on LinkedIn and GitHub.
