Aman Oberoi and Nisha McNealis | UCLA ACM AI
What is Dropout?
“Dropout” in machine learning refers to the process of randomly ignoring certain nodes in a layer during training.
In the figure below, the neural network on the left represents a typical neural network where all units are active. On the right, the red units have been dropped out of the model — their activations, along with their associated weights and biases, are ignored for that training step.
Dropout is used as a regularization technique — it prevents overfitting by ensuring that no units are codependent (more on this later).
Other Common Regularization Methods
When it comes to combating overfitting, dropout is definitely not the only option. Common regularization techniques include:
- Early stopping: stop training automatically when a specific performance measure (e.g., validation loss or accuracy) stops improving
- Weight decay: incentivize the network to use smaller weights by adding a penalty to the loss function (this keeps the weight norms relatively evenly distributed across the network, preventing just a few weights from dominating the network's output)
- Noise: allow some random fluctuations in the data through augmentation (which makes the network robust to a larger distribution of inputs and hence improves generalization)
- Model combination: average the outputs of separately trained neural networks (requires a lot of computational power, data, and time)
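Of these alternatives, weight decay is the easiest to sketch directly. Below is a minimal NumPy illustration (not code from the original post; the function name and sample values are ours) of adding an L2 penalty to a mean-squared-error loss:

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam=0.01):
    """Mean-squared-error loss plus an L2 (weight-decay) penalty.

    The penalty lam * ||w||^2 pushes the optimizer toward smaller,
    more evenly distributed weights.
    """
    preds = X @ w
    mse = np.mean((preds - y) ** 2)
    penalty = lam * np.sum(w ** 2)
    return mse + penalty

# Two weight vectors with the same L1 norm: the "spiky" one, where a
# single weight dominates, incurs a larger L2 penalty
w_spread = np.array([0.6, 0.6, 0.6])
w_spiky = np.array([1.8, 0.0, 0.0])
spread_penalty = np.sum(w_spread ** 2)  # 1.08
spiky_penalty = np.sum(w_spiky ** 2)    # 3.24
```

This is why weight decay discourages a few weights from heavily influencing the output: concentrating the same total magnitude into fewer weights always costs more under a squared penalty.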
Despite the plethora of alternatives, dropout remains an extremely popular protective measure against overfitting because of its efficiency and effectiveness.
How Does Dropout Work?
When we apply dropout to a neural network, we’re creating a “thinned” network with unique combinations of the units in the hidden layers being dropped randomly at different points in time during training. Each time the gradient of our model is updated, we generate a new thinned neural network with different units dropped based on a probability hyperparameter p. Training a network using dropout can thus be viewed as training loads of different thinned neural networks and merging them into one network that picks up the key properties of each thinned network.
This process allows dropout to reduce the overfitting of models on training data.
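The per-update sampling of a thinned network can be sketched in a few lines of NumPy (an illustration of the idea, not code from the paper). On each training-time forward pass, every unit is independently kept with probability p; a fresh mask is drawn each time, so every gradient update trains a different thinned sub-network:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_keep=0.5, training=True):
    """Zero out each unit independently with probability 1 - p_keep.

    A fresh binary mask is sampled on every call, so each gradient
    update effectively trains a different 'thinned' sub-network.
    """
    if not training:
        return activations
    mask = rng.random(activations.shape) < p_keep
    return activations * mask

hidden = np.ones(10)
thinned = dropout_forward(hidden, p_keep=0.5)   # some units zeroed
unchanged = dropout_forward(hidden, training=False)  # test time: no-op
```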
This graph, taken from the paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Srivastava et al., compares the change in classification error of models without dropout to the same models with dropout (keeping all other hyperparameters constant). All the models have been trained on the MNIST dataset.
It is observed that the models with dropout had a lower classification error than the same models without dropout at any given point in time. A similar trend was observed when the models were trained on other vision datasets, as well as on speech recognition and text analysis tasks.
The lower error is because dropout helps prevent overfitting on the training data by reducing the reliance of each unit in the hidden layer on other units in the hidden layers.
These diagrams taken from the same paper show the features learned by an autoencoder on MNIST with one layer of 256 units without dropout (a) and the features learned by an identical autoencoder that used a dropout of p = 0.5 (b). It can be observed in figure a that the units don’t seem to pick up on any meaningful feature, whereas in figure b, the units seem to have picked up on distinct edges and spots in the data provided to them.
This indicates that dropout helps break co-adaptations among units: each unit learns to act more independently when dropout regularization is used. In other words, without dropout, the network has no way to detect that one unit A is merely compensating for another unit B's flaws. With dropout, unit A is eventually ignored, the training accuracy drops as a result, and unit B's inaccuracy is exposed.
How to Use Dropout
A CNN without dropout could be represented by code similar to this:
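The original post shows this code as an image; a comparable sketch in Keras might look like the following (the layer sizes and dataset shape here are illustrative assumptions, not the post's exact architecture):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     Flatten, Dense)

# A small CNN for 28x28 grayscale images (e.g. MNIST), with no dropout
model = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```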
To add a dropout layer, a programmer could add a line like this:
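In Keras, for instance, the added line would be a `Dropout` layer placed after a hidden layer. The small model below is an illustrative sketch (not the post's exact code) showing where such a line fits:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential([
    Input(shape=(784,)),
    Dense(128, activation="relu"),
])
# the added line: each incoming unit is dropped with probability 0.5
# during training (Dropout is a no-op at test time)
model.add(Dropout(0.5))
model.add(Dense(10, activation="softmax"))
```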
The first parameter is the dropout rate: the probability that a given unit will drop out. In this example, the rate is 0.5, which means that roughly half of the given units will drop out. (Note that in the Srivastava et al. paper, p instead denotes the probability that a unit is *retained*; at 0.5 the two conventions coincide.) The value 0.5 has been experimentally found to work well for a wide range of models, but feel free to experiment with other probabilities!
Adjusting Weights During Testing
Since dropout removes some of the units from a layer, a network with dropout will weigh the remaining units more heavily during each training run to compensate for the missing inputs. However, at test time all units are present, so using the trained weights in their exaggerated states would inflate the layer's output. To compensate, each weight is scaled down by multiplying it by p, the probability that a unit is retained. This can be observed in the example below.
Let’s look at a network with four units in a layer (image a). The weight on each unit will initially be ¼ = 0.25.
If we apply dropout with p = 0.5 to this layer, it could end up looking like image b. Since only two units are considered, they will each have an initial weight of ½ = 0.5. However, dropout is only used in training, so we don’t want these weights to be fixed at this high a number during testing.
To fix this issue, when we move to the testing stage we multiply the weights by p (as seen in the image below), ending up with 0.5*0.5 = 0.25, the correct initial weight.
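The arithmetic in this example can be checked directly. In the NumPy sketch below (our illustration; p is the retention probability, as in the paper), averaging the training-time output over many random masks matches the scaled-down test-time output:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                       # probability that a unit is retained
train_w = np.full(4, 0.5)     # surviving units trained with weight 1/2
x = np.ones(4)                # example inputs to the layer

# Test time: scale each trained weight by p
test_w = train_w * p          # each weight becomes 0.5 * 0.5 = 0.25
test_output = x @ test_w      # 4 units * 0.25 = 1.0

# Sanity check: averaged over many random dropout masks, the
# training-time output (with inflated weights) matches the scaled
# test-time output
samples = [(x * (rng.random(4) < p)) @ train_w for _ in range(50000)]
avg_train_output = np.mean(samples)
```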
Hyperparameters in Dropout Regularization
Hyperparameter settings that have been found to work well with dropout regularization include a large decaying learning rate and a high momentum. This is because restricting our weight vectors using dropout enables us to use a large learning rate without worrying about the weights blowing up. The noise produced by dropout coupled with our large decaying learning rate helps us explore different regions of our loss function and hopefully reach a better minimum.
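As an illustration of these settings (a minimal NumPy sketch on a toy quadratic; the hyperparameter values are chosen for the example, not taken from the paper), an SGD update with a large, exponentially decaying learning rate and high momentum looks like:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.95):
    """One SGD-with-momentum update."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2 with a large, exponentially decaying learning
# rate and high momentum
w, velocity = np.array([5.0]), np.array([0.0])
lr0, decay = 0.5, 0.99
for step in range(200):
    lr = lr0 * decay ** step     # learning rate shrinks each step
    grad = 2 * w                 # gradient of w^2
    w, velocity = sgd_momentum_step(w, grad, velocity, lr)
```

The high momentum lets the iterate coast through noisy updates (such as those dropout produces), while the decay reins the step size in so the run settles near a minimum rather than bouncing around it.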
The Downside of Dropout
Although dropout is clearly a highly effective tool, it comes with certain drawbacks. A network with dropout can take 2–3 times longer to train than a standard network. One way to attain the benefits of dropout without slowing down training is by finding a regularizer that is essentially equivalent to a dropout layer. For linear regression, this regularizer has been proven to be a modified form of L2 regularization. For more complex models, an equivalent regularizer has yet to be identified. Until then, when in doubt: dropout.
Try it Yourself!