Understanding Dropout
When training deep neural networks, we commonly run into two major problems:
(1) the network overfits easily
(2) training is time-consuming
Dropout can effectively alleviate overfitting and, to a certain extent, acts as a regularizer.
The dropout model is described in the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" and illustrated in Fig 1: on the left is a standard fully connected neural net, and on the right is the same net with dropout applied. A neural net contains two kinds of quantities: parameters (also called weights) and activations. In Fig 1, each line represents an activation multiplied by a weight, and each activation is the sum of these products over all of its inputs.
With dropout, we randomly throw away some percentage of the activations, so all of the weights connected to a dropped unit are effectively gone as well. In Fig 1, activations in both the input layer and the hidden layers are deleted at random, but in practice we normally only drop activations in the hidden layers. A minimal sketch of this masking is shown below.
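To make the mechanics concrete, here is a minimal sketch of masking hidden activations at random, using PyTorch purely for illustration (the tensor h and the probability p_drop are made-up values):

```python
import torch

torch.manual_seed(0)
p_drop = 0.5                 # probability of dropping an activation (illustrative)
h = torch.randn(8)           # hidden-layer activations (illustrative)

# Draw a binary mask: 1 keeps an activation, 0 drops it, so every weight
# leaving a dropped unit contributes nothing on this forward pass.
mask = torch.bernoulli((1 - p_drop) * torch.ones_like(h))
h_dropped = h * mask
print(h_dropped)
```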
A neural net has a training time and a test time. At training time, when the weights are being updated through backpropagation, dropout is active; at test time, we turn dropout off. If p = 0.5 at training time, half of the activations are removed, so at test time, with every unit present, each activation would be roughly twice as large as it was during training. The paper therefore suggests multiplying all of your weights by p at test time, as shown in Fig 2.
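Most modern frameworks implement the equivalent "inverted dropout" instead: surviving activations are scaled by 1/(1-p) during training so that nothing needs to be rescaled at test time. A short PyTorch sketch of the two modes (note that PyTorch's p is the probability of dropping a unit, not the retention probability used in the paper):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # in PyTorch, p is the probability of *dropping* a unit
x = torch.ones(6)

drop.train()               # training time: units are zeroed, survivors scaled by 1/(1 - p)
print(drop(x))             # e.g. tensor([2., 0., 2., 0., 2., 2.]) -- varies per run

drop.eval()                # test time: dropout is a no-op
print(drop(x))             # tensor([1., 1., 1., 1., 1., 1.])
```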
In each mini-batch, overfitting can be significantly reduced by ignoring some of the activations; on the next mini-batch, those come back and a different random set is thrown away. How much is dropped depends on the probability p, which is usually 0.5, because dropout generates the largest variety of network structures at 0.5 (a quick check of this claim is sketched below). This reduces co-adaptation between activations, where some activations only work when specific other activations are present.
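A back-of-the-envelope check of the "most random structures at 0.5" claim: the number of distinct thinned networks that keep exactly k out of n units is the binomial coefficient C(n, k), which peaks at k = n/2 (the value of n below is arbitrary):

```python
import math

n = 10  # number of hidden units (arbitrary, for illustration)

# C(n, k): number of distinct thinned networks that keep exactly k of the n units.
for k in range(n + 1):
    print(f"keep {k:2d} units -> {math.comb(n, k):4d} sub-networks")
# The count peaks at k = n // 2, so a keep probability of 0.5 yields
# the largest variety of thinned networks.
```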
There are two main reasons why dropout works well.
(1) It reduces complex co-adaptation between neurons. With dropout, two given neurons do not always appear together in the same thinned network, so weight updates no longer depend on hidden units acting jointly in fixed combinations. This prevents some features from being effective only in the presence of other specific features, and it forces the network to learn more robust features that remain useful within random subsets of the other neurons. In this sense, dropout is a bit like L1 and L2 regularization: reducing the reliance on individual weights makes the network more robust to the loss of any specific connection.
(2) It has an averaging effect. Dropping out different hidden neurons is similar to training different networks: each random deletion of half of the hidden neurons produces a different network structure, so the whole dropout procedure is roughly equivalent to averaging over many different neural networks (a small numerical illustration follows).
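As a small numerical sketch of this averaging view, with made-up activation values, averaging one layer's thinned activations over many random masks lands close to the single scaled network used at test time:

```python
import torch

torch.manual_seed(0)
a = torch.randn(5)            # activations of one hidden layer (illustrative)
p_keep = 0.5                  # retention probability, as in the paper

# Average the thinned activations over many independent dropout masks.
masks = torch.bernoulli(p_keep * torch.ones(10000, 5))
mc_average = (masks * a).mean(dim=0)

print(mc_average)             # ≈ p_keep * a
print(p_keep * a)             # the single "weights times p" network used at test time
```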
It should also be noted that too much dropout reduces the capacity of the model and leads to underfitting, so you will often want a different dropout value for each layer. In code, you can pass in a list and each value will be applied to the corresponding layer. In a CNN it is a little different: if you pass a single number, it is used for the last layer and half of that value for the earlier layers. A library-agnostic sketch of per-layer dropout follows.
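The list/int interface above belongs to the library the text has in mind; as a library-agnostic illustration, here is a minimal PyTorch sketch that places a different (made-up) dropout probability on each fully connected layer:

```python
import torch.nn as nn

# Hypothetical per-layer dropout probabilities: lighter early, heavier near the output.
ps = [0.25, 0.5]

model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(), nn.Dropout(ps[0]),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(ps[1]),
    nn.Linear(256, 10),
)
print(model)
```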