Rethinking Generalization in Deep Learning
The ICLR 2017 submission “Understanding Deep Learning Requires Rethinking Generalization” [ICLR-1] is certainly going to disrupt our understanding of Deep Learning. Here is a summary of what the authors discovered through experiments:
1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
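The first point can be tried out at toy scale. What follows is only a minimal sketch, not the paper’s experiment: it assumes a fixed random ReLU layer with a least-squares readout in place of an SGD-trained deep network, and synthetic Gaussian inputs in place of a real image data set. All sizes and names are made up for illustration. With far more features than samples, the model should fit purely random labels on the training set, while accuracy on fresh random labels stays near chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_train, n_test, n_inputs, n_classes = 50, 200, 20, 10
width = 500  # far more random features than training samples

# Inputs and labels are drawn independently, so the labels carry no signal.
x_train = rng.standard_normal((n_train, n_inputs))
y_train = rng.integers(0, n_classes, size=n_train)
x_test = rng.standard_normal((n_test, n_inputs))
y_test = rng.integers(0, n_classes, size=n_test)

# Fixed random ReLU layer (never trained); only the linear readout is fitted.
w_hidden = rng.standard_normal((n_inputs, width))


def features(x):
    """Map inputs through the fixed random ReLU layer."""
    return np.maximum(0.0, x @ w_hidden)


# Least-squares fit of the readout to one-hot encoded random labels.
targets = np.eye(n_classes)[y_train]
readout, *_ = np.linalg.lstsq(features(x_train), targets, rcond=None)


def accuracy(x, y):
    """Fraction of samples whose arg-max prediction matches the label."""
    return float(np.mean(np.argmax(features(x) @ readout, axis=1) == y))


train_acc = accuracy(x_train, y_train)
test_acc = accuracy(x_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two numbers is the point: fitting the training set perfectly says nothing by itself about generalization.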
The authors actually introduce two new definitions to express what they are observing. They talk about “explicit” and “implicit” regularization. Dropout, data augmentation, weight sharing, and conventional regularization are all explicit regularization. Implicit regularization is early stopping, batch norm, and SGD. It is an extremely odd definition that we’ll discuss.
I understand regularization (see: http://www.deeplearningpatterns.com/doku.php/regularization) as being of two types. I use the terms “Regularization by Construction” and “Regularization by Training”. Regularization by Training is the conventional use of the term. Regularization by Construction is a consequence of the model choices we make as we construct the elements of our network. The reason for the distinction, even though mathematically both appear as equivalent constraint terms, is that conventional regularization is not present after training, that is, in the inference path. Regularization by Construction is always present, in both the training and the inference stages.
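A small sketch may make the distinction concrete. It assumes dropout as a stand-in for Regularization by Training and a shared-kernel 1-D convolution as a stand-in for Regularization by Construction; that mapping, and the function names `dropout` and `shared_weight_layer`, are assumptions made purely for illustration.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: active only while training, identity at inference."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def shared_weight_layer(x, kernel):
    """1-D convolution: one small kernel is reused at every position."""
    return np.convolve(x, kernel, mode="valid")


rng = np.random.default_rng(0)
x = np.arange(1.0, 9.0)                 # eight input values: 1, 2, ..., 8
kernel = np.array([0.25, 0.5, 0.25])    # three shared weights

# Training path: both the dropout mask and the shared kernel are in play.
train_out = shared_weight_layer(dropout(x, 0.5, True, rng), kernel)

# Inference path: dropout disappears, the shared kernel is still there.
infer_out = shared_weight_layer(dropout(x, 0.5, False, rng), kernel)
```

The dropout mask vanishes once `training` is false, whereas the constraint imposed by the shared kernel (3 weights instead of the 48 an unconstrained 6-by-8 dense layer would need) shapes the output in both paths.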
Now the paper draws its distinction between explicit and implicit regularization according to whether the main intent of the method is to regularize. One does dropout to regularize, so it is explicit. One does batch normalization (BN) to normalize the activations across the different input samples, but it happens to also regularize, so it is implicit regularization. The distinction between the two is whether regularization is the purpose or not, the latter being implicit regularization. The meaning is that regularization is an unintended consequence of the technique. So when a researcher does not think that a method would lead to regularization and to his surprise it does, then that is what they call “implicit” regularization. I don’t think, however, that Hinton expected dropout to lead to regularization. This is why I think the definition is extremely fuzzy; however, I understand why they introduced the idea.
The goal of regularization, however, is to improve generalization. That is also what BN does. In fact, for inception architectures, BN is favored over dropout. Speaking of normalization, there are several kinds; Batch and Layer normalization are the two popular versions. The motivation for BN is supposed to be Domain Adaptation. Is Domain Adaptation different from Generalization? Is it not just a specific kind of generalization? Are there other kinds of generalization? If so, what are they?
The authors have made the surprising discovery that methods that don’t seem to regularize, more specifically SGD, in fact do. Another ICLR 2017 paper, “An Empirical Analysis of Deep Network Loss Surfaces” [ICLR-2], adds confirmation of this SGD property. This paper shows empirically that the loss surfaces for different SGD methods differ from each other. This tells you that what is happening is very different from traditional optimization.
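A well-known toy case shows how an optimizer can regularize without any penalty term. This is only a sketch under stated assumptions: it uses full-batch gradient descent rather than true SGD, an underdetermined linear least-squares problem rather than a deep network, and synthetic Gaussian data. Started from zero, gradient descent on such a problem converges to the minimum-norm solution among the infinitely many that fit the data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 20, 100   # more parameters than equations
x = rng.standard_normal((n_samples, n_params))
y = rng.standard_normal(n_samples)

# Plain gradient descent on 0.5 * ||x @ w - y||^2, starting from zero.
# No penalty term appears anywhere in the objective.
w = np.zeros(n_params)
step = 1.0 / np.linalg.norm(x, 2) ** 2   # 1 / largest eigenvalue of x.T @ x
for _ in range(2000):
    w -= step * x.T @ (x @ w - y)

# Minimum-norm interpolating solution, computed via the pseudo-inverse.
w_min_norm = np.linalg.pinv(x) @ y

# Another interpolating solution: add a direction from the null space of x.
null_dir = rng.standard_normal(n_params)
null_dir -= np.linalg.pinv(x) @ (x @ null_dir)   # strip the row-space part
w_other = w_min_norm + null_dir

print("norm reached by gradient descent:", np.linalg.norm(w))
print("norm of another exact fit:       ", np.linalg.norm(w_other))
```

Both `w` and `w_other` fit the training data exactly, yet the optimizer lands on the smaller-norm one. The choice of solution comes from the learning method, not from the loss.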
It reminds one of quantum mechanics, where probes affect observation. Here the learning method affects what is learned. In this new perspective of neural networks, that of brute-force memorization or alternatively holographic machines, perhaps ideas from quantum mechanics may need to come into play. Quantum mechanics emerges when the Poisson brackets of classical dynamics are replaced by operators that do not commute. We have two variables, position and momentum, that are inextricably tied together. In Deep Learning I have a hunch that there are more than two variables that are tied together and that lead to regularization. We have at least 3 variables: learning method, network model and generative model, which all seem to have an effect on generalization. The troubling discovery, however, is how ineffective conventional regularization appears to be. “Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.”
I think right now we have a very blunt instrument when it comes to our definition of Generalization. I wrote here that there are at least 5 different notions of generalization ( http://www.deeplearningpatterns.com/doku.php/risk_minimization ).
Definition 1: Error Response to Validation and Real Data
We can define it as the behavior of our system in response to validation data, that is, data that we have not included as part of the training set. We can be a bit more ambitious and define it as the behavior when the system is deployed to analyze real-world data. We essentially would like to see our trained system perform accurately on data it has never seen.
Definition 2: Sparsity of Model
A second definition is based on the idea of Occam’s Razor. That is, the simplest explanation is the best explanation. Here we make certain assumptions about the form of the data and we drive our regularization to constrain the solution toward our assumptions. So, for example, in the field of compressive sensing, we assume that a sparse basis exists. From there we can derive an optimization problem that searches for solutions that have a sparse basis.
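The compressive sensing idea can be sketched as an L1-penalized least-squares problem. The example below is an illustration under assumptions: it uses iterative soft-thresholding (ISTA) as one common solver, a synthetic Gaussian measurement matrix, and a made-up 3-sparse signal; the penalty weight `lam` and the iteration count are arbitrary choices.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t; small entries become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


rng = np.random.default_rng(0)
n_measurements, n_dims = 40, 100   # fewer measurements than unknowns
a = rng.standard_normal((n_measurements, n_dims)) / np.sqrt(n_measurements)

# Hypothetical sparse signal: only three of the 100 coefficients are non-zero.
w_true = np.zeros(n_dims)
w_true[[5, 37, 80]] = [2.0, -3.0, 1.5]
y = a @ w_true

# ISTA for: minimize 0.5 * ||a @ w - y||^2 + lam * ||w||_1
lam = 0.05
step = 1.0 / np.linalg.norm(a, 2) ** 2
w = np.zeros(n_dims)
for _ in range(3000):
    grad = a.T @ (a @ w - y)
    w = soft_threshold(w - step * grad, step * lam)

print("non-zero coefficients found:", np.count_nonzero(w))
print("distance to true signal:    ", np.linalg.norm(w - w_true))
```

The L1 penalty is what encodes the sparsity assumption: without it, the 40 equations alone could not single out one of the infinitely many 100-dimensional solutions.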
Definition 3: Fidelity in Generating Models
A third definition is based on the system’s ability to recreate or reconstruct the features. This is the approach taken by generative models. If a neural network is able to accurately generate realistic images, then it is able to capture the concept of images in its entirety. We see this approach taken by researchers working on generative methods.
Definition 4: Effectiveness in Ignoring Nuisance Features
A fourth definition involves the notion of ignoring invariant features or nuisance variables. That is, a system is able to generalize well if it is able to ignore features that are irrelevant to its task. Remove as many features as possible until you can’t remove any more. This is somewhat similar to the third definition; however, it tackles the problem from another perspective.
Definition 5: Risk Minimization
A fifth generalization definition revolves around the idea of minimizing risk. When we train our system, there is an uncertainty in the context in which it will be deployed. So we train our models with mechanisms to anticipate unpredictable situations. The hope is that the system is robust to contexts that have not been previously predicted. This is kind of a game theoretic definition. We can envision an environment where information will always remain imperfect and generalization effectively means executing a particular strategy within the environment. This may be the most abstract definition of generalization that we have.
I’m sure there are many more as we move to a more game theoretic framework of learning. In fact, one effective strategy for learning is driven by the notion of “Curiosity”.
Update: December 20, 2016: According to “An early overview of ICLR 2017”, “Understanding deep learning requires rethinking generalization” is the highest rated submission.
I would like to dissect and discuss this further. Connect with me at http://www.linkedin.com/in/ceperez or send me an email at ceperez AT intuitonmachine.com