
Understanding Dropout in Deep Neural Networks

Venkata Sasank Mudigonda · Published in CodeX · 6 min read · Jan 6, 2021

This article aims to provide an understanding of a very popular regularization technique called Dropout. It assumes a prior understanding of concepts like model training, creating training and test sets, overfitting, underfitting, and regularization.

The article starts by setting the context for dropout and making the case for it. It then explains how dropout works and how it affects the training of deep neural networks. Finally, it goes over Keras's Dropout layer and how to use it.

1. Background

Deep neural networks are heavily parameterized models. Typically, they have tens of thousands or even millions of parameters to be learned, which gives them a great amount of capacity to fit a diverse set of complex datasets. This is not always a good thing: such capacity often leads to overfitting, a scenario where the training set performance is high but the test set performance is much worse (low bias, high variance). The model ends up with a higher test error rate because it is too dependent on the training data. To avoid this, we reduce the effective learning capacity of the model using various regularization techniques, which help the model generalize well to unseen data. One such technique is Dropout.

Fig. 1. The contrast between good fit and overfitting. Source: Wikipedia

Fig. 1 shows the contrast between an overfitted model, represented by the green decision boundary, and a regularized model, represented by the black decision boundary. Even though the green boundary seems to fit the training data better, it is unlikely to perform as well on unseen instances (the test set).

2. Enter Dropout

Dropout is one of the most popular regularization techniques. It was proposed by Geoffrey Hinton and his colleagues in the 2012 paper “Improving neural networks by preventing co-adaptation of feature detectors”. It is a fairly simple idea, but a very potent one.

At every training step, each neuron has a probability ‘p’ of temporarily not participating in that step (‘dropping out’). Here, ‘p’ is a hyperparameter called the dropout rate, which can be tuned.

Fig. 2. The above image shows how implementing dropout affects the network connections. (Left) Standard feed-forward network with dense connections. (Right) The number of connections is drastically cut down due to dropout. Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” paper.

For instance, if p=0.5, every neuron has a 50% chance of dropping out at each training step. If a neuron does not participate in a training step, all of its connections are temporarily severed, which affects the downstream layers. This drastically reduces the density of connections in the network (shown in Fig. 2). Dropout can be applied to the input and hidden layers but not to the output layer, because the model must always produce an output for the loss function in order to train. The dropout process is carried out only during the training phase; all neurons in the network fully participate during inference.
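To make the mechanics concrete, below is a minimal NumPy sketch of “inverted” dropout, the variant most frameworks (including Keras) implement: during training, a random binary mask zeroes each activation with probability p, and the surviving activations are scaled up by 1/(1 - p) so that nothing needs to be rescaled at inference time. The function name and array shapes are purely illustrative.

import numpy as np

def dropout(activations, p, training=True):
    """Inverted dropout: zero each unit with probability p during training
    and scale the survivors by 1/(1 - p); do nothing at inference time."""
    if not training or p == 0.0:
        return activations  # inference: every neuron participates
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

# A batch of 4 examples with 8 hidden activations each
h = np.random.randn(4, 8).astype(np.float32)
h_train = dropout(h, p=0.5)                  # roughly half the units are zeroed
h_infer = dropout(h, p=0.5, training=False)  # unchanged at inference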

It may be astonishing that turning off neurons arbitrarily works at all. It is reasonable to assume that it might make training highly unstable, but in practice it has proven very effective at reducing the complexity of the model. To understand why, I would like to quote an example from the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow”.

Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks.

Similarly, in deep neural networks, the network architecture seen at each training step is different from the previous one. Also, each neuron is forced not to rely too heavily on a few input connections and instead to pay attention to all of its inputs. This makes neurons more resilient to changes in their input connections and results in a more robust network that generalizes better.

The tunable hyperparameter in dropout is the dropout rate, denoted by p. Tuning it is fairly straightforward.

  • Increase p when your model is overfitting
  • Decrease it when your model is underfitting
  • Keep it high for large layers and low for small ones (see the short sketch after this list)
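For instance, a purely illustrative Keras fragment for the last point might pair a wide layer with a higher rate and a narrow layer with a lower one (the layer sizes and rates below are assumptions, not values from the original experiments):

from tensorflow import keras

# Wider layer -> higher dropout rate; narrower layer -> lower dropout rate.
layers = [
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dropout(0.5),   # large hidden layer
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),   # small hidden layer
]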

3. Demonstrating the Effect of Dropout

Fig. 3. Effect of dropout on the loss function of the network trained on MNIST dataset
Fig. 4. Effect of dropout on the accuracy of the network trained on MNIST dataset

The effect of dropout can be clearly seen in the above graphs (Figs. 3 & 4). They come from a simple experiment using Keras in which a feed-forward neural network was trained on the MNIST dataset with and without dropout, keeping all other factors constant. The blue lines indicate the model with dropout and the orange lines indicate the model without dropout. In Fig. 3, it can be clearly observed that dropout increases the loss of the model during training. That is not necessarily a bad thing, but the model may take longer to converge. Correspondingly, the accuracy in Fig. 4 is lower for the model with dropout. The network architecture used in the experiment is given below.

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) multiple 401920
_________________________________________________________________
dropout (Dropout) multiple 0
_________________________________________________________________
dense_1 (Dense) multiple 131328
_________________________________________________________________
dropout_1 (Dropout) multiple 0
_________________________________________________________________
dense_2 (Dense) multiple 32896
_________________________________________________________________
dropout_2 (Dropout) multiple 0
_________________________________________________________________
dense_3 (Dense) multiple 1290
=================================================================
Total params: 567,434
Trainable params: 567,434
Non-trainable params: 0
_________________________________________________________________

GitHub: The code that generated the above graphs is available here.
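For reference, the parameter counts in the summary above correspond to a fully connected network of shape 784 → 512 → 256 → 128 → 10 with a dropout layer after each hidden layer. A rough Keras reconstruction is sketched below; the activations, dropout rate, and optimizer are assumptions and may differ from the actual code on GitHub.

from tensorflow import keras

# Rough reconstruction of the summary above; inputs are assumed to be
# MNIST images flattened to 784-dimensional vectors.
model = keras.models.Sequential([
    keras.layers.Dense(512, activation="relu", input_shape=(784,)),  # 784*512 + 512 = 401,920
    keras.layers.Dropout(0.5),
    keras.layers.Dense(256, activation="relu"),                      # 512*256 + 256 = 131,328
    keras.layers.Dropout(0.5),
    keras.layers.Dense(128, activation="relu"),                      # 256*128 + 128 = 32,896
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax"),                    # 128*10 + 10 = 1,290
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
# The dropout-free baseline is the same model with the Dropout layers removed.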

4. Keras Implementation

Keras provides a dropout layer using tf.keras.layers.Dropout. It takes the dropout rate as the first parameter. You can find more details in Keras’s documentation. Below is a small snippet showing the use of dropout from the Hands-on ML book.
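As a minimal sketch of that usage (the layer sizes and activations here are illustrative and may differ slightly from the book’s example), a Dropout layer with rate=0.2 can be placed after the input layer and after each hidden layer:

from tensorflow import keras

# `rate` is the fraction of inputs randomly set to zero at each training step.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax"),
])

Keras applies dropout only during training; model.evaluate() and model.predict() run with all neurons active, so nothing extra is needed at inference time.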

5. Other Regularization Techniques

In addition to dropout, other regularization techniques can also be applied to neural networks. Some of the most popular ones are ℓ1 and ℓ2 regularization, early stopping, max-norm regularization, and data augmentation.

6. References

Below are the references used to write this article. The original papers on Dropout cover the theory behind it and the experiments demonstrating its effectiveness in great detail.

  • G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors”, arXiv:1207.0580, 2012.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
  • A. Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow”, O’Reilly Media.

Thank you for your time. Please leave any suggestions in the comments section.
