Modern Theory of Deep Learning: Why Does It Work so Well
What can we learn from the latest research on the paradoxical effectiveness of Deep Learning “Alchemy”.
Deep Learning is currently being used for a variety of different applications. But is frequently criticised for lacking a fundamental theory that can fully answer why does it work so well. It’s only recently that the winner of the Test-of-Time award at the Conference on Neural Information Processing (NIPS) compared Deep Learning to Alchemy.
Although Generalization Theory that explains why Deep Learning generalizes so well is an open problem, in this article we would discuss most recent theoretical and empirical advances in the field that attempt to explain it.
The Paradox of Deep Learning
An “apparent paradox” with Deep Learning is that it can generalize well in practice despite its large capacity, numerical instability, sharp minima, and nonrobustness.
In a recent paper “Understanding deep learning requires rethinking generalization” it was shown that Deep Neural Networks (DNN) has enough capacity to memorize ImageNet and CIFAR10 datasets with random labels. And it is unclear why they find generalizable solutions on real data.
Another important issue with deep architectures is numerical instabilities. Numerical instabilities in derivative-based learning algorithms are commonly called exploding or vanishing gradients. Additional difficulties stem from instabilities of the underlying forward model. That is the output of some networks can be unstable with respect to small perturbations in the original features. In machine learning it is called non-robustness. An example of it is an adversarial attack illustrated in Figure 1.
Several studies based their arguments for generalization in deep learning on the flatness of minima of the loss function found by Stochastic Gradient Descent (SGD). However, recently it was shown that “Sharp Minima Can Generalize For Deep Nets” as well. More specifically, flat minima can be turned into sharp minima via re-parameterization without changing the generalization. Consequently, generalization can not be explained merely by the robustness of parameter space only.
The goal of generalization theory is to explain and justify why and how improving accuracy on a training set improves accuracy on a test set. The difference between these two accuracies is called generalization error or “generalization gap”. More rigorously generalization gap can be defined as a difference between the non-computable expected risk and the computable empirical risk of a function f on a dataset Sm given a learning algorithm A:
Essentially if we bound the generalization gap with a small value it would guarantee that Deep Learning algorithm f generalizes well in practice. Multiple theoretical bounds exist for generalization gap based on model complexity, stability, robustness etc.
Two measures of model complexity are Rademacher complexity and Vapnik‑Chervonenkis (VC) dimension. Unfortunately, for the deep learning function f known bounds based on Radamacher complexity grow exponentially in depth of DNN. Which contradicts practical observation that deeper nets fit better to the training data and achieve smaller empirical errors. Similarly, the generalization gap bound based on VC dimension grows linearly in the number of trainable parameters and cannot account for the practical observations with deep learning. In other words, these two bounds are too conservative.
A much more useful approach was recently proposed by K Kawaguchi, LP Kaelbling, and Y Bengio. Unlike others, they embrace the fact that usually deep learning models are trained using train-validation paradigm. Instead of bounding the non-computable expected risk with train error, they use validation error. In this view, they propose the following intuition on why deep learning generalizes well in practice: “we can generalize well because we can obtain a good model via model search with validation errors”. And prove the following bound for any δ > 0, with probability at least 1 − δ:
Importantly |Fval| is the number of times we use the validation dataset in our decision making to choose a final model, m is the validation set size. This bound explains why deep learning had the ability to generalize well, despite possible nonstability, nonrobustness and sharp minimas. Somewhat an open question still remains why we are able to find architectures and parameters that result in low validation errors. Usually, architectures are inspired by real-world observations and good parameters are searched using SGD that we would discuss next.
Stochastic Gradient Descent
SGD is an inherent part of modern Deep Learning and apparently one of the main reasons behind its generalization. So we would discuss its generalization properties next.
In a recent paper “Data-Dependent Stability of Stochastic Gradient Descent” authors were able to prove that SGD is on-average stable algorithm under some additional loss conditions. These conditions are fulfilled in commonly used loss functions such as logistic/softmax losses in neural nets with sigmoid activations. Stability, in this case, means how sensitive is SGD to small perturbations in the training set. They take it further to prove a data-dependant on-average bound for generalization gap of SGD in non-convex functions such as deep neural nets:
where m is a training set size, T number of training steps and γ characterizes how the curvature at the initialization point affects stability. Which lead to at-least two conclusions. First, the curvature of the objective function around the initialization point has crucial influence. Starting from a point in a less curved region with low risk should yield higher stability, i.e faster generalization. In practice it can be a good pre-screen strategy for picking good initialization params. Second, considering the full pass, that is m = O(T), we simplify the bound to O(m⁻¹). That is the bigger training set, the smaller generalization gap.
Interestingly, there is a number of studies investigating learning curves. Most of them show power-law generalization error, scaling as ε(m) ∼ mᵝ with exponent β = −0.5 or −1. Which is also aligned with the previously discussed paper. However, it’s important to mention the large-scale research by Baidu that was able to empirically observe this power-law (see Figure 2). However, the exponent β in real applications was between−0.07 and −0.35. Which has to be still explained by theory.
Additionally, there is both theoretical and empirical evidence of batch size effect on generalization of SGD. Intuitively a small batch training introduces noise to the gradients, and this noise drives the SGD away from sharp minima, thus enhancing generalization. In a recent paper from Google it was shown that the optimum batch size is proportional to the learning rate and the training set size. Or simply put in other words, “Don’t Decay the Learning Rate, Increase the Batch Size”. Similarly scaling rules were derived for SGD with momentum: Bopt ~1/(1 − m), where Bopt is an optimal batch size and m is momentum. Alternatively, all conclusions can be summarised by the following equation:
where 𝜖 is learning rate, N is training set size, m is momentum and B is batch size.
In the last few years, there has been an increasingly growing interest in the fundamental theory behind the paradoxical effectiveness of Deep Learning. And although there are still open research problems, modern Deep Learning by any means is far from being called Alchemy. In this article we discussed the generalization perspective on the problem which leads us to a number of practical conclusions:
- Select initialization params in a less curved region and lower risk. Curvature can be efficiently estimated by Hessian-vector multiplication method.
- Rescale the batch size, when changing momentum.
- Don’t decay the learning rate, increase the batch size.