Double Descent

Arjun Ahuja
8 min read · Apr 11, 2021

Breakthroughs in machine learning are rapidly changing society, but the bias-variance tradeoff, one of the most fundamental concepts in machine learning, seems to be at odds with what is observed in more modern models like neural networks and gradient boosting. Two recent papers, [1] and [2], attempt to bridge this gap by reconciling the classical understanding and modern practice within a unified performance curve.

Conventional Wisdom — “Larger models are worse”

Conventional wisdom in statistics is that “larger models are worse”, as shown by the following bias-variance tradeoff curve:

As we increase the capacity of the hypothesis class H, i.e. as we make our model more complex, the model over-fits the training data and the training risk/error approaches 0. The test risk/error follows the same downward trend until we reach the “sweet spot”, which describes the best model for our use case; beyond this point, the test risk increases.

Conventional wisdom in machine learning suggests controlling the capacity of the function class H based on the bias-variance trade-off by balancing under-fitting and over-fitting.

  • If H is too small, all predictors in H may under-fit the training data (i.e., have large empirical risk) and hence predict poorly on new data.
  • If H is too large, the empirical risk minimizer may over-fit spurious patterns in the training data resulting in poor accuracy on new examples (small empirical risk but large true risk).
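To make the classical picture concrete, here is a minimal sketch (my own toy example with scikit-learn, not code from either paper) that sweeps capacity via the degree of a polynomial regression. Training error keeps falling as the degree grows, while test error traces the familiar U shape:

```python
# Toy illustration of the classical U-shaped test-error curve:
# sweep capacity (polynomial degree) on a small noisy regression task.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(3 * x)

x_train = np.sort(rng.uniform(-1, 1, size=30)).reshape(-1, 1)
y_train = true_fn(x_train).ravel() + 0.3 * rng.standard_normal(30)
x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
y_test = true_fn(x_test).ravel()

for degree in [1, 3, 5, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    # Train error keeps shrinking; test error falls and then rises again.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```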

The textbook corollary of this curve is:

a model with zero training error is overfit to the training data and will typically generalize poorly

Yet practitioners routinely use larger and larger neural networks to improve test-set accuracy without changing anything else. Why does increasing model size improve accuracy? Do modern machine learning methods not follow the bias-variance curve? In fact, many practitioners try to achieve 0 training error, which classically suggests the model is overfit. So what actually changed?

Double Descent

Modern machine learning methods exhibit a slightly different phenomenon as we keep increasing the complexity of our hypothesis class. The initial U-shaped curve aligns with the classical understanding, but beyond a certain point (coined the interpolation threshold) the test risk starts to decrease again. The paper by Mikhail Belkin et al. [1] describes this phenomenon as “double descent”.

The paper breaks the curve into two regions: a classical regime (matching the classical understanding), also termed the under-parameterized regime, and a modern interpolating regime, also termed the “over-parameterized” regime. Refer to the figure below for the modern bias-variance tradeoff curve:

Double Descent Curve

In modern machine learning methods (for example, deep neural networks), the complexity of the model is also linked to the number of learnable parameters in the model. If we increase the number of learnable parameters beyond a certain threshold, we reach this modern interpolating regime, where the test risk/error shows a downward trend.
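As a rough illustration (a toy setup of my own, loosely in the spirit of the random-features experiments in [1], not a reproduction of them), the snippet below fits a minimum-norm least-squares model on random ReLU features and sweeps the number of features p past the number of training samples n. Depending on the seed and noise level, the test error typically peaks near p ≈ n and then falls again in the over-parameterized regime:

```python
# Toy double descent: random ReLU features + minimum-norm least squares.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 10, 2000
beta = rng.standard_normal(d)              # ground-truth direction for the labels

def make_data(m):
    X = rng.standard_normal((m, d))
    y = np.sign(X @ beta) + 0.2 * rng.standard_normal(m)
    return X, y

X_train, y_train = make_data(n)
X_test, y_test = make_data(n_test)

def relu_features(X, W):
    return np.maximum(X @ W, 0.0)          # p random ReLU features

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Phi_train, Phi_test = relu_features(X_train, W), relu_features(X_test, W)
    # lstsq returns the minimum-norm solution once p > n (the interpolating regime).
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"p={p:5d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```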

How many parameters are required to reach the interpolation threshold?

Though it depends on the input features and many other factors, the paper observes that the peak of the classical U-shaped bias-variance curve occurs when the number of parameters is roughly equal to the number of samples; beyond this point we can observe the modern double descent regime.

It is common to use neural networks with an extremely large number of parameters. But to achieve interpolation for a single output (regression or two-class classification) one expects to need at least as many parameters as there are data points. Moreover, if the prediction problem has more than one output (as in multi-class classification), then the number of parameters needed should be multiplied by the number of outputs. This is indeed the case empirically for neural networks. Thus, for instance, data sets as large as ImageNet, which has ∼1,000,000 examples and ∼1,000 classes, may require networks with ∼1,000,000,000 parameters to achieve interpolation; this is larger than many neural network models for ImageNet.
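A back-of-envelope helper (purely illustrative, just restating the n·K heuristic above):

```python
# Rough interpolation-threshold estimate: about n * K parameters,
# where n is the number of training samples and K the number of classes.
def interpolation_threshold(n_samples: int, n_classes: int) -> int:
    return n_samples * n_classes

print(interpolation_threshold(4_000, 10))           # MNIST subset below: ~40,000
print(interpolation_threshold(1_000_000, 1_000))    # ImageNet: ~1,000,000,000
```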

Results on real examples

The paper ran several experiments to show the existence of this double descent curve. One such experiment was run on the famous MNIST dataset, and the authors observed the following graph of error vs. number of parameters:

For MNIST, n (the number of samples) = 4,000, d (the input dimension) = 784, and K (the number of classes) = 10. The authors observed the interpolation threshold (denoted by the dashed line in the above graph) at roughly the predicted number of parameters, n·K, mentioned in the previous section.

Also, as expected, the test error starts decreasing after the interpolation threshold.

Is Double Descent observed only for Neural Networks?

The paper also points out that a similar phenomenon is observed for some other machine learning methods, like AdaBoost and random forests. More generally, the authors note that there is evidence of similar behavior in the families of functions explored by boosted decision trees and random forests.

A journal article by Abraham J. Wyner et al. [4] gives empirical evidence that, when AdaBoost and random forests are used with maximally large (interpolating) decision trees, the flexibility of the fitting methods yields interpolating predictors that are more robust to noise in the training data than the predictors produced by rigid, non-interpolating methods (e.g., AdaBoost or random forests with shallow trees). This in turn is said to yield better generalization. The averaging of the (near-)interpolating trees ensures that the resulting function is substantially smoother than any individual tree, which aligns with an inductive bias that is compatible with many real-world problems.

Results for Random Forests

The authors again used a subset of the MNIST dataset for classification and observed the following graph of error vs. number of parameters:

The complexity of a random forest is controlled by the number of trees (N_tree) and the maximum number of leaves allowed for each tree (N_max_leaf). As can be seen from the above graph, after a certain model complexity (the interpolation threshold) the test error starts to go down.
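A minimal scikit-learn sketch of the same idea (my own toy version on the smaller digits dataset, not the paper’s MNIST setup): complexity is grown first by relaxing the leaf limit and then by adding more trees, comparing train and test error along the way:

```python
# Rough sketch: grow a random forest toward interpolation by relaxing the
# leaf limit, then adding more trees, and watch train/test error.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

settings = [(1, 10), (1, 100), (1, 1000), (10, 1000), (100, 1000)]  # (trees, max leaves)
for n_trees, max_leaves in settings:
    rf = RandomForestClassifier(n_estimators=n_trees, max_leaf_nodes=max_leaves,
                                random_state=0)
    rf.fit(X_tr, y_tr)
    print(f"trees={n_trees:3d}  max_leaves={max_leaves:4d}  "
          f"train err={1 - rf.score(X_tr, y_tr):.3f}  "
          f"test err={1 - rf.score(X_te, y_te):.3f}")
```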

A More Formal Approach

The paper by Preetum Nakkiran et al. [2] shows that double descent is a robust phenomenon that occurs across a variety of tasks, architectures, and optimization methods. In addition, the paper proposes a more general notion of double descent that goes beyond the dependence on the number of parameters.

Effective Model Complexity (EMC)

The paper defines the effective model complexity of a training procedure as the maximum number of samples on which the model will achieve close to 0 training error. The effective model complexity depends not just on the data distribution and the architecture of the classifier but also on the training procedure; in particular, increasing training time will increase the EMC.

Formally, effective model complexity is defined as follows:
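This is my approximate transcription of the definition from [2]; the notation may differ slightly from the paper:

$$\mathrm{EMC}_{\mathcal{D},\epsilon}(\mathcal{T}) := \max \left\{ n \;\middle|\; \mathbb{E}_{S \sim \mathcal{D}^{n}}\left[\mathrm{Error}_{S}(\mathcal{T}(S))\right] \le \epsilon \right\}$$

where D is the data distribution, ε is a small error tolerance, and Error_S(M) denotes the mean error of a model M on the training samples S.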

Here, the training procedure T is defined as any procedure that takes in a list of samples S and outputs a classifier T(S) mapping data to labels.

Informally, the hypothesis can be broken up into three regimes, depending on how the EMC compares to the number of training samples n:

  • Under-parameterized regime: If the EMC is sufficiently smaller than n, any perturbation of the training process T that increases its effective complexity will decrease the test error.
  • Critical regime: If the EMC is approximately equal to the number of training samples n, then a perturbation of the training process T that increases its effective complexity might decrease or increase the test error.
  • Over-parameterized regime: If the EMC is sufficiently larger than n, any perturbation of the training process T that increases its effective complexity will decrease the test error.

This is an extension of what was observed in the first paper, by Belkin et al. [1].
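As a toy sketch of what measuring EMC could look like in code (entirely my own construction, not the paper’s procedure; `train_procedure` and the synthetic data are placeholders):

```python
# Toy estimate of EMC for a fixed training procedure: the largest n at which
# the trained model still reaches train error <= eps. Purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def train_procedure(X, y):
    # "T": a fixed architecture with a fixed training budget.
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=200, random_state=0)
    return clf.fit(X, y)

def estimate_emc(eps=0.05, sizes=(50, 100, 200, 400, 800, 1600)):
    emc = 0
    for n in sizes:
        X, y = make_classification(n_samples=n, n_features=20, random_state=0)
        train_err = 1 - train_procedure(X, y).score(X, y)
        if train_err <= eps:
            emc = n          # largest n so far that is fit to ~zero train error
    return emc

print("estimated EMC:", estimate_emc())
```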

Effect of more label noise

The paper performs several experiments to study the test error of models of increasing size when trained to completion (for a fixed, large number of optimization steps). The critical region exhibits distinctly different test behavior around the interpolation point, and there is often a peak in test error that becomes more prominent in settings with label noise. This can be seen in the following figure (setting: ResNet18 trained on CIFAR-10):
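For concreteness, “label noise” here means randomly re-labeling a fraction of the training examples; a common implementation (a sketch, not necessarily the paper’s exact code) looks like this:

```python
# Sketch of "p% label noise": replace the labels of a random fraction of
# training examples with uniformly random classes.
import numpy as np

def add_label_noise(labels, noise_frac, num_classes, seed=0):
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_frac
    labels[flip] = rng.integers(0, num_classes, size=flip.sum())
    return labels

y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 10)
y_noisy = add_label_noise(y, noise_frac=0.15, num_classes=10)
print((y != y_noisy).mean())   # roughly 0.15 * (1 - 1/num_classes)
```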

Sometimes more data hurts?

The paper describes an experiment on a language-translation task with no added label noise, where adding more data actually hurts performance in a particular regime. This is shown in the figure below:

Increasing the number of samples shifts the curve downwards towards lower test error. However, since more samples require larger models to fit, increasing the number of samples also shifts the interpolation threshold (and peak in test error) to the right.

For intermediate model sizes (red arrows), these two effects combine, and we see that training on 4.5x more samples actually hurts test performance.

Training longer can reverse over-fitting?

The paper states that training for longer periods of time can actually reverse over-fitting!

The charts above show test and train error as a function of both model size and number of optimization steps. For a given number of optimization steps (fixed y-coordinate), test and train error exhibit model-size double descent. For a given model size (fixed x-coordinate), as training proceeds, test and train error decrease, increase, and decrease again; the paper calls this phenomenon epoch-wise double descent.

In general, the peak of test error appears systematically when models are just barely able to fit the train set.
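A simple way to look for epoch-wise double descent in your own experiments is to log train and test error after every epoch for a fixed model. The sketch below (a toy setup with scikit-learn’s MLPClassifier; whether a visible bump appears depends on the model, data, and noise) illustrates the bookkeeping:

```python
# Record train/test error after every epoch for one fixed-size model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           flip_y=0.15, random_state=0)     # some label noise
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

clf = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
classes = np.unique(y)
history = []
for epoch in range(200):
    if epoch == 0:
        clf.partial_fit(X_tr, y_tr, classes=classes)         # one pass = one "epoch"
    else:
        clf.partial_fit(X_tr, y_tr)
    history.append((epoch, 1 - clf.score(X_tr, y_tr), 1 - clf.score(X_te, y_te)))

# Inspect the history for a test-error bump that later recedes
# (it may or may not appear in this toy setting).
for epoch, tr, te in history[::20]:
    print(f"epoch={epoch:3d}  train err={tr:.3f}  test err={te:.3f}")
```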

Conclusion

Both papers present a very interesting idea and try to bridge the gap between the theory and practice of machine learning methods. The deep double descent paper further points out scenarios where the conventional approach might not give the best results; for example, the authors mention that, contrary to the general perception that “more data is better”, more data can actually lead to worse test accuracy. Researchers should understand that this only happens in a certain regime, and that increasing the effective model complexity further might actually lead to much better results.

Even though a full understanding of the mechanisms behind double descent remains an open question, I would encourage practitioners and researchers to keep this phenomenon in mind while building their models.
