Understanding Deep Learning Requires Rethinking Generalization

Kalp Panwala · Published in Analytics Vidhya · Aug 6, 2020 · 4 min read

In this article, I will walk you through the details of the paper “Understanding Deep Learning Requires Rethinking Generalization” by Chiyuan Zhang et al.

Convolutional Neural Network (CNN)

I assume you are already well acquainted with Convolutional Neural Networks (CNNs). If not, I suggest reading about them on Wikipedia first.

The paper “Understanding Deep Learning Requires Rethinking Generalization” caused quite a stir in the Deep Learning and Machine Learning research communities. It is that rare paper which has high research merit, judging from it being awarded one of the three Best Paper awards at ICLR 2017, while also being very readable. It also received the most comments of any ICLR 2017 submission on OpenReview.

Key Takeaway Points

  1. Deep neural networks easily fit random labels.
  2. Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.

But what are generalization and the generalization curve?

As per developers.google.com, “generalization refers to your model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model”. In simple terms, it is the gap between the “training error” and the “test error”; for example, a model with 1% training error and 8% test error has a generalization gap of 7%.

A generalization curve plots the loss on both the training set and the validation set, and it can help you detect possible overfitting. For example, a generalization curve in which the loss for the validation set ultimately becomes significantly higher than the loss for the training set suggests overfitting. (refer)
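As a minimal sketch of such a curve (not from the paper), the snippet below plots hypothetical per-epoch training and validation losses; the numbers are made-up placeholders purely to illustrate the overfitting pattern of a widening gap.

```python
# Plotting a generalization curve from per-epoch losses recorded during training.
# The loss values are hypothetical placeholders used only for illustration.
import matplotlib.pyplot as plt

train_losses = [2.0, 1.2, 0.8, 0.5, 0.3, 0.2, 0.1, 0.05]  # hypothetical training loss per epoch
val_losses   = [2.1, 1.4, 1.0, 0.9, 0.9, 1.0, 1.2, 1.40]  # hypothetical validation loss per epoch

epochs = range(1, len(train_losses) + 1)
plt.plot(epochs, train_losses, label="training loss")
plt.plot(epochs, val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Generalization curve: a widening gap suggests overfitting")
plt.show()
```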

But what is Regularization?

Regularization applies a penalty to a model’s complexity and thus helps prevent overfitting. It plays an interesting role in deep learning: it acts more like a tuning parameter that reduces the final test error of a model. (refer)

Regularization comes in two flavors (a short code sketch of both follows the list):

  1. Explicit Regularization (weight decay, dropout and data augmentation)
  2. Implicit Regularization (early stopping, batch normalization and SGD)
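The following is a minimal PyTorch sketch, under my own assumptions and not the authors’ code, showing where these knobs typically enter a training setup: data augmentation in the input transform, dropout in the model, weight decay in the optimizer, and batch normalization / SGD / early stopping acting implicitly.

```python
# Where common regularizers enter a PyTorch training pipeline (illustrative only).
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation (explicit): random crops and flips applied to training images.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# A small hypothetical network with dropout as an explicit regularizer.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),          # batch normalization: an implicit regularizer
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.5),           # dropout: an explicit regularizer
    nn.Linear(32 * 32 * 32, 10), # 32x32 CIFAR-style inputs, 10 classes
)

# Weight decay (an explicit L2 penalty) is passed directly to the optimizer;
# SGD itself and early stopping (halting when validation loss stops improving)
# act as implicit regularizers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```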

“Deep neural networks easily fit random labels”

They performed randomization tests in which they trained several standard architectures on a copy of the data where the true labels were replaced by random labels, and surprisingly the networks still achieved zero training error.
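A minimal sketch of this label-randomization test (my own approximation, not the authors’ code) is to load CIFAR-10 and overwrite every training label with a uniformly random class before training as usual:

```python
# Replace CIFAR-10 training labels with uniformly random classes.
import torch
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())

num_classes = 10
random_labels = torch.randint(0, num_classes, (len(train_set),))
train_set.targets = random_labels.tolist()   # overwrite the true labels

loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
# Training any standard architecture on `loader` for long enough should still
# drive the training error to zero, even though the labels carry no signal.
```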

They also established this fact for several different standard architectures trained on the CIFAR10 and ImageNet classification benchmarks and concluded that:

  1. The effective capacity of neural networks is sufficient for memorizing the entire data set.
  2. Training time increases by only a small constant factor compared with training on the true labels.
  3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

Extending these experiments, they also replaced the true images with completely random pixels (e.g., Gaussian noise) and observed that convolutional neural networks continued to fit the data with zero training error.
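A small sketch of this random-pixel variant (again my own illustration) simply generates Gaussian noise of the same shape as the CIFAR-10 images:

```python
# Build a dataset of pure Gaussian-noise "images" with random labels.
import torch

num_images, channels, height, width = 50_000, 3, 32, 32
noise_images = torch.randn(num_images, channels, height, width)  # pure Gaussian noise
random_labels = torch.randint(0, 10, (num_images,))

noise_set = torch.utils.data.TensorDataset(noise_images, random_labels)
noise_loader = torch.utils.data.DataLoader(noise_set, batch_size=128, shuffle=True)
# A standard CNN trained on `noise_loader` can still reach zero training error.
```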

This shows that, despite their structure, convolutional neural networks can fit random noise. Moreover, when the noise level is increased gradually, the networks still capture whatever signal remains in the data while fitting the noisy part by brute force.

Figure: the training loss in the various experimental settings decays with the number of training steps.

Explicit regularization

“Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.”

Explicit forms of regularization, such as weight decay, dropout, and data augmentation, do not adequately explain the generalization error of neural networks. The authors found that regularization plays a rather different role in deep learning: it appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error.

Implicit Regularization

For linear models, SGD always converges to a solution with small norm. Hence, the algorithm itself is implicitly regularizing the solution.
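As a small numerical illustration of this point (my own sketch, not from the paper), the snippet below runs SGD from a zero initialization on an underdetermined least-squares problem and compares the result with the minimum-norm interpolating solution given by the pseudoinverse:

```python
# SGD on an underdetermined linear regression ends up near the minimum-norm solution.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 100            # more parameters than data points
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

w = np.zeros(n_features)                   # start SGD at zero
lr = 0.01
for epoch in range(2000):
    for i in rng.permutation(n_samples):
        grad = (X[i] @ w - y[i]) * X[i]    # gradient of 0.5 * (x_i . w - y_i)^2
        w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y         # minimum-norm interpolating solution
print("norm of SGD solution      :", np.linalg.norm(w))
print("norm of min-norm solution :", np.linalg.norm(w_min_norm))
print("distance between them     :", np.linalg.norm(w - w_min_norm))
```

Because the SGD updates always stay in the span of the training inputs when started from zero, the interpolating solution it reaches is the one with the smallest norm, which is the sense in which the algorithm regularizes implicitly.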

  • This simply leads us to the conclusion that more understanding is needed of how SGD implicitly regularizes a model, and of how it affects the model’s other properties.
Table: the training and test accuracy (in percentage) of various models on the CIFAR10 dataset, with and without data augmentation and weight decay; the results of fitting random labels are also included.

To conclude, even with my growing experience with Deep Learning, I find their experimental results surprising. Perhaps this paper will be useful as a starting point for understanding generalization in Deep Learning.

Thanks for reading! If you liked the story, do give it a clap.

Connect with me on LinkedIn: https://www.linkedin.com/in/kalp-panwala-72284018a

Follow me on Twitter: https://twitter.com/PanwalaKalp

Follow me on GitHub: https://github.com/kpanwala

If you have any queries regarding the story, or see any room for improvement, mail me at kpanwala33@gmail.com.
