Key Learnings & Insights from ICLR’s Best Paper 2017 — “UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION” by Chiyuan Zhang et al. (2017)
The research in this work focuses on one central question,
“What is it then that distinguishes neural networks that generalize well from those that don’t?”.
Generalization, model capacity & regularization are the central concepts in this work. Generalization error is defined as the difference between the training error & the test error. As the model fits its training data more and more closely, this gap keeps growing, and that is the problem of over-fitting.
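The generalization gap can be seen in miniature with a toy sketch: a hypothetical 1-D regression task and a deliberately over-flexible polynomial standing in for a high-capacity network (all names and numbers here are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: y = sin(x) + noise, with few training points.
x_train = rng.uniform(-3, 3, 20)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(20)
x_test = rng.uniform(-3, 3, 200)
y_test = np.sin(x_test) + 0.1 * rng.standard_normal(200)

# High-degree polynomial: enough capacity to drive training error very low.
coeffs = np.polyfit(x_train, y_train, deg=12)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# Generalization error = test error minus training error.
gen_gap = test_err - train_err
```

The over-fit model drives its training error far below its test error, and the gap widens as capacity grows relative to the data.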
The authors also talk about model capacity, which means the flexibility of a network, i.e., the range of inputs and functions it can fit. An important starting point here is the Universal Approximation Theorem, which states,
“A feed-forward network with a single hidden layer containing a finite number of neurons, can approximate any continuous function on compact subsets of Euclidean Space”.
George Cybenko proved this theorem for the sigmoid activation function. The only problem is that the theorem gives no information about the algorithmic learnability of those parameters: we can approximate any continuous function with a single hidden layer, but we might not be able to learn the weights of that layer effectively. Vapnik–Chervonenkis theory (VC dimension) is another important theory, which defines shattering as follows:
“A classification model f with some parameter vector Θ is said to shatter a set of data points (x1, x2, ….xN) if, for all assignments of labels to those points, there exists a Θ such that the model f makes no errors when evaluating that set of data points”.
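The shattering definition can be made concrete for linear classifiers in the plane: three points in general position can be shattered, but the four XOR-style points cannot. Below is a minimal sketch using a perceptron as a separability check (the helper names are mine, and treating a non-converged perceptron as "not separable" is a bounded-iteration heuristic).

```python
import itertools
import numpy as np

def linearly_separable(points, labels, steps=10_000):
    """True if some hyperplane w.x + b realizes `labels` on `points`.
    Heuristic: if the perceptron fails to converge in `steps` updates,
    we report False (for these tiny examples that is reliable)."""
    X = np.hstack([points, np.ones((len(points), 1))])  # absorb bias
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        wrong = np.flatnonzero(np.sign(X @ w) != labels)
        if wrong.size == 0:
            return True
        w += labels[wrong[0]] * X[wrong[0]]  # perceptron update
    return False

def shatters(points):
    """A set is shattered iff every +/-1 labelling is realizable."""
    return all(
        linearly_separable(points, np.array(lab))
        for lab in itertools.product([-1, 1], repeat=len(points))
    )

tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # 3 points, general position
xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout
```

Since three points can be shattered but four cannot, the VC dimension of linear classifiers in 2-D (with bias) is 3.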
The VC dimension of a model f is the maximum number of data points that can be arranged so that f shatters them. In other words, the largest point set the model f can label in every possible way is the VC dimension of our model, and that is how it relates to model capacity. The VC dimension gives a probabilistic upper bound on the test error, which helps bound the generalization error. A standard form of the bound says, roughly, that with high probability the test error is at most the training error plus a term of order sqrt(d_VC · log N / N) for N samples. This relation only carries weight when the number of parameters is far smaller than the number of samples. But in neural networks the parameters far outnumber the samples, so the bound loses its validity.

The paper also discusses regularization and its techniques: explicit regularization (weight decay, dropout, data augmentation) & implicit regularization (early stopping). In L2 (weight decay) regularization, we add a weight-decay term to the standard weight update formula. This causes an additional subtraction in the standard weight update, which lowers the weights; hence the name weight decay, because the layer weights become small. Dropout is another important technique, where certain neurons are dropped randomly from the layers of the network. This removes reliance on individual neurons: the model learns redundant representations and depends less on any single unit.
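The two update rules just described can be sketched in a few lines of NumPy (a minimal illustration with hypothetical helper names; `wd` is the weight-decay parameter, and the dropout shown is the common "inverted" variant, an assumption on my part since the paper does not spell out an implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step_weight_decay(w, grad, lr=0.1, wd=0.01):
    # L2 regularization adds wd * w to the gradient, i.e. an extra
    # subtraction that shrinks ("decays") the weights on every step.
    return w - lr * (grad + wd * w)

def dropout(activations, p=0.5, training=True):
    # Randomly zero each unit with probability p; scale survivors by
    # 1/(1-p) so the expected activation matches test time.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

w = np.ones(4)
w = sgd_step_weight_decay(w, grad=np.zeros(4))  # with zero gradient, only decay acts
kept = dropout(np.ones(4), training=False)      # at test time, dropout is a no-op
```

With a zero gradient the update reduces to pure decay, `w ← w · (1 − lr · wd)`, which is exactly the shrinking effect described above.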
Another type of regularization is data augmentation, which applies domain-specific transformations to the input data. For example, if a model is trained only on original images of human faces captured under identical conditions, it will fail to give accurate results at test time when the input images are heavily zoomed in, flipped, or rotated. To tackle this issue, the model should be trained on transformed variants of the inputs as well. This reduces error and increases accuracy.
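A minimal sketch of such label-preserving transformations, using plain NumPy array ops on a grayscale image (real pipelines would use a library such as torchvision; the function name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Yield simple label-preserving variants of a (H, W) image:
    the original, a horizontal flip, and 90/180-degree rotations."""
    yield image
    yield np.fliplr(image)       # mirror left-right
    yield np.rot90(image)        # rotate 90 degrees
    yield np.rot90(image, 2)     # rotate 180 degrees

image = rng.random((8, 8))       # stand-in for a real training image
augmented = list(augment(image))
```

Each variant keeps the same label, so one training example becomes four, which is where the regularizing effect comes from.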
In the experimental randomization tests, the trained model is assessed on several variants of the labels and inputs: true labels, partially corrupted labels, random labels, shuffled pixels, random pixels, and a Gaussian variant. The results of these tests show that accuracy was maintained only on the true-labelled dataset and dropped gradually as the labels were corrupted, shuffled, or randomly assigned. In the Gaussian variant, the researchers replaced the images with random noise drawn from a Gaussian distribution matched to the image statistics, so every input exhibited a similar noise distribution.
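Building the dataset variants for such a randomization test is straightforward; below is a sketch on an arbitrary array dataset (the paper uses CIFAR10/ImageNet with deep networks, so the function and its parameters here are my own illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomization_variants(images, labels, num_classes, corrupt_frac=0.5):
    """Build the input/label variants used in randomization tests."""
    n = len(labels)
    variants = {"true_labels": (images, labels)}

    # Partially corrupted labels: a fraction is replaced at random.
    partial = labels.copy()
    idx = rng.choice(n, size=int(corrupt_frac * n), replace=False)
    partial[idx] = rng.integers(0, num_classes, size=idx.size)
    variants["partial_corruption"] = (images, partial)

    # Fully random labels.
    variants["random_labels"] = (images, rng.integers(0, num_classes, size=n))

    # Shuffled pixels: one fixed permutation applied to every image.
    perm = rng.permutation(images[0].size)
    shuffled = images.reshape(n, -1)[:, perm].reshape(images.shape)
    variants["shuffled_pixels"] = (shuffled, labels)

    # Gaussian: noise matched to the dataset's mean and std.
    gauss = rng.normal(images.mean(), images.std(), size=images.shape)
    variants["gaussian"] = (gauss, labels)
    return variants

images = rng.random((10, 4, 4))           # tiny stand-in dataset
labels = rng.integers(0, 3, size=10)
variants = randomization_variants(images, labels, num_classes=3)
```

Training the same model on each variant and comparing the learning curves reproduces the structure of the paper's experiment.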
The results of these experiments were categorized into three parts,
1. Learning curves (average_loss v/s thousands of steps), where true labels were easy to fit and showed negligible loss after a few steps, while shuffled pixels, random pixels, and the Gaussian variant all followed a similar, slower trend. The astonishing discovery was that the model could even fit completely random labels; it took longer, but after a few thousand steps it succeeded.
2. Convergence slowdown (time to overfit v/s label corruption), where on the label-corruption axis completely correct labels sit at the left and completely random labels at the right. The networks chosen were Inception, AlexNet & MLP 1x512. Even when labels were random, the networks still reached zero training error; convergence merely slowed down by a small constant factor as corruption increased.
3. Generalization error growth (test_error v/s label corruption). Once training error reaches zero, the y-axis here exactly represents the generalization error. The researchers used the same parameters for this plot, which helped them see that the choice of network plays a huge role in the resulting generalization error.

The conclusions & implications from this experiment showed that deep neural networks can easily fit random labels.
The effective capacity of neural networks was sufficient for memorizing the entire data set. It was even observed that optimization on random labels remains easy: memorizing random labels took only slightly longer (a small constant factor) than learning the true labels.
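This "memorization is easy when capacity is large" claim can be mimicked in miniature with a linear model on random features, a toy stand-in for the paper's networks (all names and sizes here are hypothetical): as long as parameters far outnumber samples, every corruption level gets fit to zero training error.

```python
import numpy as np

rng = np.random.default_rng(0)

def steps_to_zero_train_error(X, y, max_steps=100_000):
    # Perceptron updates until every training point is fit. This relies on
    # d >> n making the (random-label) data linearly separable.
    w = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            return step
        w += y[wrong[0]] * X[wrong[0]]
    return None  # did not fit within the budget

n, d = 30, 200                      # far more parameters than samples
X = rng.standard_normal((n, d))
y_true = np.sign(X @ rng.standard_normal(d))

results = {}
for frac in [0.0, 0.5, 1.0]:        # label corruption level
    y = y_true.copy()
    idx = rng.choice(n, size=int(frac * n), replace=False)
    y[idx] = rng.choice([-1.0, 1.0], size=idx.size)
    results[frac] = steps_to_zero_train_error(X, y)
```

Every corruption level, including fully random labels, reaches zero training error; only the step count varies.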
In the explicit regularization tests there were two settings, Inception on ImageNet and Inception on CIFAR10. Generalization showed surprisingly little difference between using regularization and not using it at all: regularization certainly helps, but not as much as they thought it would. Another thing they noticed was that using or not using batch normalization had very little effect on generalization; the only problem was that training without batch normalization tended to be noisier. This helped them realize that augmenting data is more powerful than weight decay, and they saw much bigger gains from changing the model architecture. This implies that explicit regularization may improve generalization, but it is neither necessary nor by itself sufficient.

The implicit regularization findings showed that early stopping could potentially improve generalization, which means that having patience while running iterations, and stopping at the right moment, can give better results. Another observation was that batch normalization improves generalization despite the fact that it was not designed as a regularization technique. The authors' conclusion on this point is that both explicit & implicit regularizers can help improve generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization: the paper shows that even without regularizers, the models perform well.
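Early stopping itself is a simple rule; a common patience-based recipe can be sketched as follows (this is a standard formulation, not the paper's exact procedure, and the names are mine):

```python
def early_stopping_index(val_errors, patience=3):
    """Return the index of the checkpoint to keep: training halts once the
    validation error has not improved for `patience` consecutive checks."""
    best, best_i, waited = float("inf"), 0, 0
    for i, err in enumerate(val_errors):
        if err < best:
            best, best_i, waited = err, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

# Validation error falls, then rises as the model starts to overfit.
curve = [0.9, 0.6, 0.4, 0.35, 0.37, 0.4, 0.45, 0.5]
stop_at = early_stopping_index(curve)
```

The rule keeps the checkpoint at the bottom of the validation curve instead of the final, overfit one.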
The authors have also added two side notes. The first is the finite-sample expressivity of neural networks, which concerns performance on a finite amount of data, whereas previous literature talked more about the population level, where depth-k networks are typically more powerful than depth k-1 networks. Given a finite sample of size n, even a two-layer neural network can represent any function once the number of parameters p exceeds n. The authors prove a theorem which states: there exists a two-layer neural network with ReLU activations and 2n + d weights that can represent any function on a sample of size n in d dimensions. This can be seen as a finite-sample counterpart of the Universal Approximation Theorem.

The second side note is a discussion of linear models, because they are easy to analyze and can inform the design of other architectures. Imagine n data points (x, y), where each x is a d-dimensional feature vector and y its label; we fit a weight vector w by minimizing the loss between the predictions Xw and the labels y. SGD solves this, and since every update is a combination of data points, the weight vector always lies in the span of the data. That span constraint lets the solution be written through the kernel (Gram) matrix, and it turns out to be exactly the minimum l2-norm solution of Xw = y. Hence SGD converges to the minimum-norm solution; the only problem is that minimum norm is not predictive of generalization performance.
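The minimum-norm claim can be verified numerically in the underdetermined case (a small sketch with made-up sizes; full-batch gradient descent stands in for SGD, which shares the same span property):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # fewer samples than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on ||Xw - y||^2 starting from w = 0: every update is a
# combination of rows of X, so w stays in their span throughout.
w = np.zeros(d)
for _ in range(20_000):
    w -= 0.01 * X.T @ (X @ w - y)

# Closed form via the kernel (Gram) matrix K = X X^T: solve K a = y,
# then w* = X^T a is the minimum l2-norm solution of Xw = y.
alpha = np.linalg.solve(X @ X.T, y)
w_min_norm = X.T @ alpha
```

The iterative solution matches the kernel-derived minimum-norm solution, even though infinitely many w satisfy Xw = y.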
The final conclusion of this paper is a deeper understanding of the effective capacity of neural networks. Their representational capacity is strong enough to shatter the training data even with corrupted or entirely random labels. Optimization continues to be easy even when generalization is poor. The authors also show how SGD may act as an implicit regularizer by converging to solutions with minimum l2-norm. The last point is that traditional measures of model complexity struggle to explain the generalization of large neural networks.
Original Authors info: