Dropout in (Deep) Machine learning

This blog post is also part of the series of Deep Learning posts. I wrote two other posts before — one on Weight Initialization and another one on Language Identification of Text. All the posts are self contained, so you can go ahead and read this one and check-out the others if you like to.

In this post, I will primarily discuss the concept of dropout in neural networks, specifically deep nets, followed by an experiments to see how does it actually influence in practice by implementing a deep net on a standard dataset and seeing the effect of dropout.

What is Dropout in Neural Networks?

According to Wikipedia
The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.

Simply put, dropout refers to ignoring units (i.e. neurons) during the training phase of certain set of neurons which is chosen at random. By “ignoring”, I mean these units are not considered during a particular forward or backward pass.

More technically, At each training stage, individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Why do we need Dropout?

The answer to these questions is “to prevent over-fitting”.

A fully connected layer occupies most of the parameters, and hence, neurons develop co-dependency amongst each other during training which curbs the individual power of each neuron leading to over-fitting of training data.

Dropout — Revisited

In machine learning, regularization is way to prevent over-fitting. Regularization reduces over-fitting by adding a penalty to the loss function. By adding this penalty, the model is trained such that it does not learn interdependent set of features weights. Those of you who know Logistic Regression might be familiar with L1 (Laplacian) and L2 (Gaussian) penalties.

Dropout is an approach to regularization in neural networks which helps reducing interdependent learning amongst the neurons.

Training Phase:

Testing Phase:

Srivastava, Nitish, et al. ”Dropout: a simple way to prevent neural networks from
overfitting”, JMLR 2014

Some Observations:

  1. Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.
  2. With H hidden units, each of which can be dropped, we have
    2^H possible models. In testing phase, the entire network is considered and each activation is reduced by a factor p.

Experiment in Keras

I took ReLU as the activation function for hidden layers and sigmoid for the output layer (these are standards, didn’t experiment much on changing these). Also, I used the standard categorical cross-entropy loss.

Finally, I used dropout in all layers and increase the fraction of dropout from 0.0 (no dropout at all) to 0.9 with a step size of 0.1 and ran each of those to 20 epochs. The results look like this:

Accuracy vs Dropout (Left) and Loss vs Dropout

From the above graphs we can conclude that with increasing the dropout, there is some increase in validation accuracy and decrease in loss initially before the trend starts to go down.
There could be two reasons for the trend to go down if dropout fraction is 0.2:

  1. 0.2 is actual minima for the this dataset, network and the set parameters used
  2. More epochs are needed to train the networks.

For the code on the Keras experiment, please refer to the jupyter notebook hosted on github.

If you would like to know more about me, please check my LinkedIn profile.