Improving neural networks by preventing co-adaptation of feature detectors

Michael L. Peng
May 7, 2018


This blog post aims to give readers some insight into deep neural networks and an intuition for the dropout technique.

Deep Neural Networks

Deep neural networks are models composed of multiple layers of simple, non-linear neurons. By composing enough of these neurons, a model can learn extremely complex functions and accurately perform tasks that are prohibitively difficult to hand-code, such as image classification, translation, and speech recognition. The key property of deep neural networks is that they automatically learn the data representations needed for feature detection or classification, without any a priori knowledge¹.
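To make "composing simple non-linear neurons" concrete, here is a minimal NumPy sketch of a forward pass through a small stack of fully connected layers. The layer sizes, initialization, and function names are illustrative choices, not taken from any particular model:

```python
import numpy as np

def relu(x):
    # A simple non-linearity: max(0, x) applied element-wise.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    # Compose several simple layers: each one is a linear map
    # followed by a non-linear activation.
    h = x
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
    return h

# A toy 3-layer network: 4 inputs -> 8 hidden -> 8 hidden -> 3 outputs.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [0.1 * rng.normal(size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

print(forward(rng.normal(size=(2, 4)), weights, biases).shape)  # (2, 3)
```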

For example, VGG16 (shown below) is a convolutional neural network trained on data from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It has 16 weight layers and 138 million parameters, and it can classify 1,000 different object categories with high accuracy.

VGG16 architecture (image by Google)

Models of such complexity can have exponentially many combinations of active neurons that achieve high performance on the training set, but not all of those combinations are robust enough to generalize well to unseen (test) data. This problem is known as overfitting.

Co-adaptation and Dropout

One of the most prominent causes of overfitting is co-adaptation. According to Wikipedia, at the genetic level, co-adaptation is the accumulation of interacting genes in the gene pool of a population by selection: selection pressure on one of the genes affects its interacting proteins, after which compensatory changes occur. In a neural network, co-adaptation means that some neurons become highly dependent on others. If the neurons they depend on receive "bad" inputs, the dependent neurons are affected as well, and the model's performance can change significantly; this fragility is exactly what happens when a model overfits.

The simplest remedy for overfitting is dropout, which works as follows: during training, each neuron in the network is randomly omitted with some fixed probability between 0 and 1 (typically 50% for hidden units).

Dropout in action (GIF by Michael Pearce)
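To make this concrete, below is a minimal NumPy sketch of the "inverted dropout" variant that most modern libraries implement: surviving activations are scaled up by 1/(1−p) during training so that nothing needs to change at test time. (The original paper instead halves the outgoing weights at test time; the two approaches are equivalent in expectation. The function and argument names here are illustrative.)

```python
import numpy as np

def dropout(h, p_drop, training, rng):
    # Inverted dropout: during training, omit each unit with
    # probability p_drop and scale survivors by 1 / (1 - p_drop).
    if not training or p_drop == 0.0:
        return h  # at test time the full network is used
    keep = rng.random(h.shape) >= p_drop  # independent Bernoulli mask
    return h * keep / (1.0 - p_drop)

rng = np.random.default_rng(42)
activations = rng.normal(size=(2, 6))
print(dropout(activations, p_drop=0.5, training=True, rng=rng))
```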

By applying this technique, the model is forced to develop neurons that learn good features (the data representations needed) on their own, without relying on the presence of particular other neurons. The resulting model is therefore more robust on unseen data. Hinton et al. also make an interesting observation about the similarity between dropout and the role of sex in evolution²:

One possible interpretation of mixability articulated in […] is that sex breaks up sets of co-adapted genes and this means that achieving a function by using a large set of co-adapted genes is not nearly as robust as achieving the same function, perhaps less than optimally, in multiple alternative ways, each of which only uses a small number of co-adapted genes. This allows evolution to avoid dead-ends in which improvements in fitness require co-ordinated changes to a large number of co-adapted genes. It also reduces the probability that small changes in the environment will cause large decreases in fitness, a phenomenon which is known as “overfitting” in the field of machine learning.

Additionally, dropout makes it feasible to regularize very deep neural networks such as VGG16 in a reasonable amount of time: instead of training exponentially many separate models and averaging their predictions, each training pass samples and updates a single randomly thinned sub-network, so the cost remains comparable to training one network.
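In modern frameworks, dropout is a one-line layer whose behavior is switched by the training mode. Here is a quick PyTorch illustration (PyTorch is my choice for the examples; the post and the paper are framework-agnostic):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # each unit is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()    # training mode: random units are dropped,
print(drop(x))  # survivors are scaled by 1 / (1 - p)

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # the full network is used
```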

Empirical Evidence

Hinton et al. conducted a number of experiments to validate the effectiveness of dropout. Below is one of them, using the MNIST data set, which consists of 60,000 28x28 training images of handwritten digits (plus a 10,000-image test set).

MNIST test error with and without dropout (image by Geoffrey Hinton)

The horizontal line shows the best previously published result for this task using backpropagation without pre-training, weight-sharing, or enhancements of the training set. The upper set of lines shows the test error rate on the MNIST test set for a variety of neural network architectures trained with 50% dropout on all hidden layers. The lower set of lines additionally uses 20% dropout on the input layer.
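As a sketch, one of the smaller fully connected architectures in the family the figure evaluates (two hidden layers of 800 units) might look like this in PyTorch, with 20% dropout on the inputs and 50% on the hidden layers. Training details such as the learning-rate schedule and the paper's max-norm weight constraints are omitted here:

```python
import torch.nn as nn

# A 784-800-800-10 MNIST classifier in the spirit of the paper:
# 20% dropout on the input pixels, 50% on every hidden layer.
model = nn.Sequential(
    nn.Flatten(),          # 28x28 image -> 784-dimensional vector
    nn.Dropout(p=0.2),     # drop 20% of input units
    nn.Linear(784, 800),
    nn.ReLU(),
    nn.Dropout(p=0.5),     # drop 50% of hidden units
    nn.Linear(800, 800),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(800, 10),    # one logit per digit class
)
```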

Below is the result of another experiment using TIMIT, a widely used benchmark for recognition of clean speech with a small vocabulary.

TIMIT frame classification results (image by Anna Vital)

The figure shows the frame classification error rate on the core test set of the TIMIT benchmark, comparing standard fine-tuning with dropout fine-tuning for different network architectures. Dropout of 50% of the hidden units and 20% of the input units improves classification.

Conclusion

If overfitting persists, try dropout: it is simple to add and often improves generalization.

References

  1. Y. LeCun, Y. Bengio & G. E. Hinton. Deep learning. Nature 521, 436–444 (2015).
  2. G. E. Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 (2012). https://arxiv.org/abs/1207.0580
