Understanding Deep Learning requires Re-Thinking Generalization

Sarthak Chakraborty · Published in Analytics Vidhya · 5 min read · Aug 16, 2020


The following analysis is my interpretation of the ICLR 2017 paper Understanding Deep Learning requires Re-Thinking Generalization (arXiv link). The paper received one of the three Best Paper Awards at ICLR 2017. I enjoyed reading it because it questions the conventional understanding of generalization in learning models and argues that regularization isn't the only cause of generalization.

Key Findings

The paper provides insight through a comprehensive experimental study showing that our conventional notion of generalization might be flawed. On training with random labels, “Intuition suggests that this impossibility should manifest itself clearly during training, e.g., by training not converging or slowing down substantially.” However, the paper experimentally shows that this is not the case.

The two main findings of the paper are quoted below.

1. ‘Deep neural networks easily fit random labels.’

2. ‘Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.’

Following the first statement, it was initially unclear what the relationship between the number of parameters and the dataset might be, since any network can overfit when the number of data samples is very small compared to the number of trainable parameters. So the results might not be well justified unless a clear comparison is made with respect to the number of data samples. This is later made clear by the theoretical guarantee that a network with p >= 2n + d parameters can represent any function on a sample of size n in d dimensions. This O(n + d) guarantee was insightful and new to me; it is quite surprising that the required number of parameters is only linear in ‘n’ and ‘d’.

The work shows that even when the data is randomly labelled, or when the true pixels are randomly shuffled, standard classification models like Inception and AlexNet are able to achieve more than 99% train accuracy. The same was observed even with regularization, which corroborates the claim that regularization isn’t the only factor behind the generalization of a model.

Coming to the second point, the authors highlight an important statement: ‘in contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error.’

Effective Capacity of Neural Networks

To gain insight into the capacity of a network model, extensive experiments were performed in which the true labels were replaced with random labels and, separately, the true pixels of the images were randomly shuffled. Defying the intuition that random labelling should noticeably change convergence, these transformations of the labels hardly interfered with the training process, even when explicit and implicit regularization were used.
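
To make the randomization test concrete, below is a minimal sketch of fitting purely random labels with a small over-parameterized network in PyTorch. This is my own illustration, not the paper's setup: the MLP, data sizes, and hyperparameters are placeholders standing in for Inception/AlexNet on CIFAR-10.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
n, d, num_classes = 2000, 32 * 32 * 3, 10
X = torch.randn(n, d)                            # stand-in for flattened CIFAR-10-sized inputs
y_random = torch.randint(0, num_classes, (n,))   # labels drawn independently of the inputs

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, num_classes))
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(X, y_random), batch_size=128, shuffle=True)

for epoch in range(200):
    correct = 0
    for xb, yb in loader:
        opt.zero_grad()
        logits = model(xb)
        loss_fn(logits, yb).backward()
        opt.step()
        correct += (logits.argmax(dim=1) == yb).sum().item()
    if correct == n:                             # the network has memorized the random labels
        print(f"reached 100% train accuracy at epoch {epoch}")
        break
```

Even though the labels carry no information about the inputs, the training accuracy keeps climbing toward 100%, mirroring the paper's point that the effective capacity of these models is large enough to memorize the data.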

Theoretical bounds like the VC dimension and Rademacher complexity are commonly used as measures of the capacity of learning models. However, the authors argue from their experimental results that such bounds do not seem to explain generalization.

Role of Regularization


Commonly used explicit regularization techniques like weight decay, data augmentation, and dropout were evaluated on standard architectures, and their results are shown in a table in the paper. It can be seen that even with regularization, the architectures achieve a train accuracy of over 99%, leading to the conclusion that regularization is not enough to explain the generalization of models, or simply that our naive understanding of generalization is flawed.
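
For reference, this is roughly how the three explicit regularizers discussed above appear in training code. The snippet is only an illustrative sketch; the hyperparameters and the tiny model are my own placeholders and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
    nn.Dropout(p=0.5),                     # dropout
    nn.Linear(512, 10),
)

# weight decay (an l2 penalty) is applied through the optimizer
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

# data augmentation: random crops and horizontal flips applied to the input images
augment = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```

Even with these switches turned on, the randomization tests show that the networks can still drive the training accuracy on random labels to (nearly) 100%.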

Early stopping was also shown to act as an implicit regularizer on some convex problems and hence can be a potential technique to improve generalization. In summary, observations on both explicit and implicit regularizers consistently suggest that regularizers, when properly tuned, can help improve generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers are removed.

Finite-Sample Expressivity

Coming down to the important question: how can the expressiveness of an architecture be defined? The authors take a non-traditional approach, claiming that what is more relevant in practice is the expressive power of neural networks on a finite sample of size n. Theoretically, they deduce that

There exists a two-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions.

However, the proof is sketched for the ReLU activation function, and it would be interesting to know what the corresponding result is for ‘tanh’ or ‘sigmoid’ activations, or whether ‘Leaky ReLU’ or ‘ELU’ gives similar results. The statement “It is difficult to say that the regularizers count as a fundamental phase change in the generalization capability of deep nets” was again a new learning for me, since regularization was thought of as a way to make neural nets more general. In addition, it was unclear why AlexNet failed to converge when regularization was added, despite the theoretical guarantees.
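
The construction behind the theorem can also be sketched numerically. The code below is my reconstruction of the idea, following the paper's proof sketch (the sizes are arbitrary): a width-n ReLU layer whose d + 2n weights are chosen so that the hidden activations form a triangular system, which can then be solved exactly for any assignment of labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10                        # sample size and input dimension
X = rng.normal(size=(n, d))          # n points in d dimensions
y = rng.normal(size=n)               # arbitrary (e.g. random) real-valued targets

a = rng.normal(size=d)               # random projection direction: d weights
z = X @ a                            # projections are distinct with probability 1
order = np.argsort(z)
X, y, z = X[order], y[order], z[order]

# thresholds b_j interleave the sorted projections, so relu(z_i - b_j) > 0 only for i >= j
b = np.concatenate(([z[0] - 1.0], (z[:-1] + z[1:]) / 2))   # n weights

# hidden activations: a lower-triangular matrix with a positive diagonal
H = np.maximum(z[:, None] - b[None, :], 0.0)

# solve exactly for the n output weights w: total parameters = d + n + n = 2n + d
w = np.linalg.solve(H, y)

print("max fitting error:", np.max(np.abs(H @ w - y)))      # numerically ~0: exact interpolation
```

With only a linear-in-(n, d) number of weights, the network interpolates any labels on the n samples, which is exactly why counting parameters alone cannot explain generalization.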

Implicit Regularization: An Appeal to Linear Models

In order to understand the source of generalization, the authors turn to simple linear models, from which insights can be drawn. They pose a serious question: do all global minima generalize equally well, or is there a way to determine whether one global minimum will be better than another in terms of generalization?

Applying the simple ‘kernel trick’ to uniquely identify a solution, and using preprocessing such as random convolution layers, yields quite surprising results: the test errors are fairly low even in the absence of regularization. This poses a grave question about traditional notions of generalization. However, some questions remain unanswered; for instance, computing the Gram matrix (for the kernel trick) over n data samples requires a lot of computation and can be infeasible, which is the exact reason we shifted to iterative procedures in the first place.
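
To make this concrete, here is a small sketch (my own example, not the authors' code) of the minimum-norm solution that the kernel view singles out: in an over-parameterized linear problem, the solution can be written through the n x n Gram matrix, which is exactly the object whose size and cost become prohibitive for large n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 500                    # fewer samples than features: infinitely many exact fits
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

K = X @ X.T                        # Gram matrix, n x n
alpha = np.linalg.solve(K, y)      # the "kernel trick": work in sample space
w_min_norm = X.T @ alpha           # minimum-norm solution interpolating the data

print("train residual:", np.linalg.norm(X @ w_min_norm - y))   # ~0: fits the data exactly
print("solution norm:", np.linalg.norm(w_min_norm))
```

Forming and solving with K costs O(n^2) memory and roughly O(n^3) time, which is why iterative methods such as SGD are used in practice; the paper's observation is that, on such linear least-squares problems, SGD converges to this same minimum-norm solution.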

Conclusion

In the end, the paper provides great insights into how our pre-conceived notion of generalization is flawed and cannot be used to explain the behaviour of different architectures. The models used in practice are rich enough to memorize the data. This poses a conceptual challenge to statistical learning theory, as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. A formal measure of generalization is thus yet to be found. This paper can be a great precursor for research into the explainability of neural architectures.
