Explaining Papers

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

This article explains the paper ‘Dropout’ by Srivastava et al. (2014).

We will be learning about dropout, a technique to prevent overfitting in neural networks, by walking through the paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting, published by Srivastava et al. in 2014.

Deep neural networks are very powerful machine learning systems, but they are prone to overfitting. Large neural nets trained on relatively small datasets can overfit the training data. This is because the model learns the statistical noise in the training data, which results in poor performance when the model is evaluated on a test dataset. Dropout is a technique for addressing this problem. The key idea is to randomly drop nodes (along with their connections) from the neural network during training. This prevents nodes from co-adapting too much.

Dropout — Srivastava et al. (2014)

Dropout is a regularization technique that we can use to reduce the effective capacity of a model so that it achieves a lower generalization error. It approximates training a large number of neural networks with different architectures in parallel. During training, a random subset of layer nodes is ignored, or “dropped out”.

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in the figure.

Regularization

Various techniques are used to avoid overfitting, one of them being regularization, such as L1 and L2 regularization. We use regularization to lower the model capacity and thereby reduce the gap between training and test error. Regularization adds a penalty term to the loss function, incorporating a measure of model complexity into the quantity being minimized.

Loss function with L2 regularization
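Concretely, the quantity being minimized with L2 regularization can be written as follows (a standard formulation; the symbols below are my notation, not the paper's):

```latex
\tilde{L}(\mathbf{w}) = L(\mathbf{w}) + \lambda \sum_{i} w_i^{2}
```

Here L is the original loss, w are the network weights, and λ controls how strongly large weights are penalized.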

Using regularization, we become biased toward simpler models, on the basis that they are capturing something more fundamental, rather than some artifact of the specific data set. — Tim Roughgarden

L2 regularization in a neural network is related to the concept of weight decay. A more detailed intuition of L2 regularization is presented here: Understanding the scaling of L² regularization in the context of neural networks.
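To see the connection, consider a plain gradient-descent step on the L2-regularized loss above (a standard derivation, not something covered in the dropout paper):

```latex
w \leftarrow w - \eta \left( \frac{\partial L}{\partial w} + 2\lambda w \right)
  = (1 - 2\eta\lambda)\, w - \eta \frac{\partial L}{\partial w}
```

Each step first shrinks, or decays, every weight by a constant factor and then applies the usual gradient update, which is why L2 regularization is often referred to as weight decay.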

Model Ensemble

Another technique used to reduce generalization error is combining several different models, which is often called a model ensemble. This makes sense because one model can make errors on one part of the test data while another model makes errors on a different part. Thus, by combining several models we can get a more robust result, since the parts that most models already get right won't change and the overall error will be reduced. However, training many different models is hard, and training each large network requires a lot of computation. Moreover, large networks normally require large amounts of training data, and there may not be enough data available to train different networks on different subsets of the data.
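As a toy illustration of ensembling (a minimal sketch under my own assumptions, not code from the paper), averaging the predicted class probabilities of several independently trained models is one common way to combine them:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the predicted class probabilities of several trained models.

    `models` is assumed to be a list of objects exposing a hypothetical
    `predict_proba(x)` method that returns an array of shape
    (n_samples, n_classes); the ensemble prediction is simply their mean.
    """
    probs = np.stack([m.predict_proba(x) for m in models], axis=0)
    return probs.mean(axis=0)
```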

Dropout combines ideas from both of these approaches. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.

It might seem crazy to randomly remove nodes from a neural network to regularize it. But it has been shown to greatly improve the performance of neural networks. So, why does it work so well?

Dropout means that the neural network cannot rely on any single input node, since any node may be randomly removed during training. Therefore, the network becomes reluctant to assign large weights to particular features, because those features might disappear.

Training Phase

The intuition for the dropout training phase is quite simple. We turn off some nodes of the neural network at training time, so that the effective architecture is different at each training iteration. The way the nodes are turned off is shown below: we multiply each layer output y(l) element-wise with a random vector r(l) whose entries are drawn from a Bernoulli distribution, so each entry is either 0 or 1.
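These are the paper's feed-forward equations for a dropout network (Section 4 of the paper), reproduced in LaTeX for any hidden unit i in layer l:

```latex
r_j^{(l)} \sim \mathrm{Bernoulli}(p)
\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \ast \mathbf{y}^{(l)}
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}
y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)
```

Here ∗ denotes element-wise multiplication and f is the activation function.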

Here, r(l) is a vector of independent Bernoulli random variables, each of which has probability p (the retention probability) of being 1. This vector is sampled and multiplied element-wise with the outputs of that layer, y(l), to create the thinned outputs ỹ(l).

In the second line, we can see that the random variable r either keeps a node by multiplying its output with 1 (with probability p) or drops it by multiplying its output with 0 (with probability 1 − p); the rest of the forward pass proceeds exactly as it would without dropout.
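A minimal NumPy sketch of this training-time forward pass (the two-layer setup, layer sizes, and ReLU activation are my own assumptions for illustration, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, p):
    """Sample a Bernoulli mask: 1 (keep) with probability p, 0 (drop) otherwise."""
    return rng.binomial(n=1, p=p, size=shape)

# Assumed toy setup: one hidden layer (W1, b1) feeding an output layer (W2, b2).
x = rng.normal(size=(4, 10))                  # batch of 4 examples, 10 features
W1, b1 = 0.1 * rng.normal(size=(10, 5)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(5, 2)), np.zeros(2)

p = 0.5                                       # retention probability
y1 = np.maximum(0, x @ W1 + b1)               # hidden activations y(l)
y1_thinned = dropout_mask(y1.shape, p) * y1   # thinned outputs y_tilde(l)
out = y1_thinned @ W2 + b2                    # rest of the forward pass is unchanged
```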

A new hyperparameter p is introduced that specifies the probability with which the outputs of a layer are retained; its complement, 1 − p, is the probability with which they are dropped out (many libraries expose this complement as the dropout rate).

In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks.

Testing Phase

Now, in the testing phase, we could in principle use ensemble averaging: run the test data through all the possible thinned networks and average their predictions. However, explicitly averaging the predictions of exponentially many models is not feasible and would be computationally expensive. Hence, the paper proposes approximating this average with a single forward pass through the full network, without dropout.

Dropout is not used while making a prediction, so every unit is active at test time and each unit receives a larger total input than it saw, on average, during training. Therefore, before finalizing the network for testing, the weights are first scaled by the retention probability p. This method is called weight scaling.

If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time.

A unit is present with probability p at training time; at test time the unit is always present and its outgoing weights are scaled by the same probability p — Srivastava et al. (2014)
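Continuing the NumPy sketch from the training phase (same assumed toy layers, not code from the paper), the test-time forward pass drops nothing and instead scales the outgoing weights of the hidden units by p:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same assumed toy layers as in the training sketch above.
x = rng.normal(size=(4, 10))
W1, b1 = 0.1 * rng.normal(size=(10, 5)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(5, 2)), np.zeros(2)

def predict(x, W1, b1, W2, b2, p):
    """Test-time forward pass with weight scaling: no units are dropped;
    the outgoing weights of the hidden units (W2) are multiplied by the
    retention probability p instead."""
    y1 = np.maximum(0, x @ W1 + b1)   # no dropout mask at test time
    return y1 @ (p * W2) + b2

out_test = predict(x, W1, b1, W2, b2, p=0.5)
```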

In practice, the rescaling can instead be performed at training time: the retained activations are divided by the retention probability p after the mask is applied, so that no rescaling of the weights is needed at test time. This is commonly called inverted dropout, and it is what most deep learning libraries implement.
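A minimal sketch of inverted dropout on some assumed hidden activations (again illustrative only, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(y, p):
    """Training-time dropout that divides the surviving activations by p,
    so their expected value matches the no-dropout network and the weights
    can be used unchanged at test time."""
    mask = rng.binomial(n=1, p=p, size=y.shape)
    return (mask * y) / p

# Example on some assumed hidden activations.
y = np.maximum(0, rng.normal(size=(4, 5)))
y_train = inverted_dropout(y, p=0.5)
```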

We are not going to touch on all the experimental results showcased in the paper, but they show that dropout improves generalization performance on all of the evaluated data sets compared to neural networks that did not use dropout. Feel free to dive into the details of how dropout performs on each data set in the paper.

Do read Appendix A: A Practical Guide for Training Dropout Networks from the paper.

References:

  1. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, 15:1929–1958, 2014.
  2. A Gentle Introduction to Dropout for Regularizing Deep Neural Networks
