Beyond Overfitting and Beyond Silicon: The double descent curve
When Freeman Dyson went to Chicago seeking Fermi's blessing for his results on the pseudoscalar theory of pions, he was let down. Fermi dismissed the numerical agreement with a famous quote from von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk" [Freeman Dyson — Fermi's rejection of our work]. Whether or not von Neumann knew that the fit can indeed be done with four complex parameters [4], the lesson remains: by sufficiently increasing the number of parameters, i.e. the model complexity, we can interpolate any kind of data. Just as miners once mistook pyrite for gold, data miners can mistake overfitting for learning.
The U-shaped curve
Avoiding over-fitting is a well-known prescription in Data Science [9], taught in every statistics course and in classic machine learning classes. Traditionally, the bias-variance trade-off is depicted as a U-shaped risk curve (Figure 2). Both the train and test errors are high when the model complexity is low; as the number of parameters increases, both errors decrease until the model reaches the coveted sweet spot. Beyond that point, the training error keeps decreasing while the test error starts to increase. Much like the Pillars of Hercules, the westernmost edge of the inhabited world in antiquity, the interpolation point, where the number of parameters equals the sample size, marked the boundary beyond which classical statistics saw only terminal over-fitting. As a result of this thinking, the number of parameters is usually very carefully tuned, or at least it used to be.
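The classical U-shape is easy to see in a toy experiment. The sketch below is our own minimal NumPy illustration (polynomial regression on noisy samples of a sine, not one of the experiments from the papers cited here): the training error falls as the degree grows, while the test error first improves and then blows up as the fit approaches interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth target function
f = lambda x: np.sin(np.pi * x)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 200)
y_train = f(x_train) + 0.3 * rng.standard_normal(x_train.size)
y_test = f(x_test)

def errors(degree):
    """Train/test MSE of a least-squares polynomial fit of a given degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (1, 5, 19):  # under-fit, sweet spot, interpolation (20 points)
    tr, te = errors(d)
    print(f"degree {d:2d}: train MSE {tr:.2e}, test MSE {te:.2e}")
```

At degree 19 the polynomial interpolates all 20 training points, so the training error is essentially zero while the test error explodes: exactly the right-hand wall of the U-curve.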
The double descent curve
After Dyson's cathartic experience and decades of classical statistics, the data science community has mastered the art of finding that sweet spot. Enter deep learning and its attendant success, where the classical prescriptions fail to explain the generalization properties of models containing millions of parameters (AlexNet in 2012 had 60M parameters to classify 1.2M examples; VGG-16 and VGG-19 both exceed 100M parameters [3]).
In other words, how is it possible that models with millions of parameters reach zero training error while performing extremely well on unseen data? Many explanations have been proposed (a nice summary can be found here), but the full picture is still unclear. What we observe is that, by increasing the model complexity beyond the interpolation point, entering the so-called over-parameterized regime, the generalization error starts to decrease again. Surprisingly, despite perfect training accuracy, one can often reach a lower test error than the minimum of the under-parameterized regime. Classical prescriptions cannot explain this phenomenon.
A first empirical account of this phenomenon was provided by Geiger et al. [8], who draw an analogy between training a neural network and the jamming of dense repulsive particles within a finite volume (Figure 3). The U-shaped curve becomes a double descent curve (Figure 4) reconciling the under- and over-parameterized regimes (Belkin et al. [2]). Although a rigorous mathematical framework is still missing, the phenomenon has been verified empirically [1] and studied analytically [5] for a simple two-layer model with random first-layer weights (random features [6]). It has subsequently been extended to modern deep learning models (ResNets and Transformers) and generalized in terms of effective model complexity [10].
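The peak at the interpolation point and the second descent can be reproduced in a few lines of NumPy. The sketch below is our own toy version of the "weak features" setting studied in [1] (a linear teacher with d features, fitted with the minimum-norm least-squares solution using only the first p of them), not the exact experiments of the papers:

```python
import numpy as np

def min_norm_test_mse(n_train, p, n_test=1000, n_trials=10, d=200, seed=0):
    """Average test MSE of the minimum-norm least-squares fit that uses
    only the first p of d true features (a toy 'weak features' setup)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        beta = rng.standard_normal(d) / np.sqrt(d)      # teacher weights
        X = rng.standard_normal((n_train, d))
        y = X @ beta + 0.1 * rng.standard_normal(n_train)
        X_test = rng.standard_normal((n_test, d))
        y_test = X_test @ beta
        # pinv gives the minimum-norm solution when p > n_train
        w = np.linalg.pinv(X[:, :p]) @ y
        errs.append(np.mean((X_test[:, :p] @ w - y_test) ** 2))
    return float(np.mean(errs))

n = 40  # number of training examples
for p in (10, n, 200):  # under-parameterized, interpolation point, over-parameterized
    print(f"p = {p:3d}: test MSE {min_norm_test_mse(n, p):.3f}")
```

The test error is moderate for p well below n, explodes near p = n where the design matrix becomes nearly singular, and descends again in the over-parameterized regime p > n: the double descent curve in miniature.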
Building larger models may seem an easy way to improve performance; however, beyond a certain size, resource utilization becomes a major issue. One way to take this new constraint into account, proposed by Canziani et al. [3], is to rank the ImageNet-winning models by weighing their test accuracy against their resource utilization. To learn more about how the availability of computational power shaped the past of AI and how it is now influencing its future, check our previous blog post entitled Faith no Moore.
Increasing complexity at the speed of light
At LightOn, our Optical Processing Unit (OPU) can generate hundreds of thousands of random features at the speed of light by exploiting the physics of an encoded laser beam passing through a random scattering medium [7]. Computing random features (an essential component of the model in [6]) with our optics-based hardware is easy, fast and energy-efficient.
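In simulation, this feature map can be sketched in a few lines of NumPy. This is a simplified model of the optics, assuming a complex Gaussian random matrix for the scattering medium and a squared-modulus nonlinearity for the intensity measured by the camera, in the spirit of [7] (it is not the lightonml API):

```python
import numpy as np

def simulated_opu_features(X, n_features, seed=0):
    """Simulate optical random features: project the input through a
    complex Gaussian random matrix (the scattering medium) and take the
    squared modulus (the camera measures light intensity)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = (rng.standard_normal((d, n_features))
         + 1j * rng.standard_normal((d, n_features))) / np.sqrt(2 * d)
    return np.abs(X @ W) ** 2

X = np.random.default_rng(1).standard_normal((5, 64))
phi = simulated_opu_features(X, n_features=1000)
print(phi.shape)  # prints (5, 1000)
```

On the physical device, the matrix multiplication happens in constant time as light propagates, which is what makes scaling to hundreds of thousands of features cheap.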
We recovered the double descent curve for the MNIST and CIFAR10 datasets using the OPU and our lightonml Python library (Figures 5 and 6). Both curves match the empirical observations of the double descent curve phenomenon.
We used random features followed by a RidgeClassifier from scikit-learn. We show results both for a simulation of our optical hardware (Synthetic OPU in the figures) and for computations performed physically with laser light (OPU).
The interpolation point lies at the number of training examples when using ridge regression.
These curves are reproducible using our hardware and the lightonml library on LightOn Cloud; you can check the code here, and if you are interested in trying our optical processing technology, do not hesitate to contact us!
The Author
Alessandro Cappelli, Machine Learning Engineer at LightOn AI Research.
References
[1] Belkin, Mikhail, Daniel Hsu, and Ji Xu. “Two models of double descent for weak features.” arXiv preprint arXiv:1903.07571 (2019).
[2] Belkin, Mikhail, et al. “Reconciling modern machine-learning practice and the classical bias–variance trade-off.” Proceedings of the National Academy of Sciences 116.32 (2019): 15849–15854.
[3] Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. “An analysis of deep neural network models for practical applications.” arXiv preprint arXiv:1605.07678 (2016).
[4] Mayer, Jürgen, Khaled Khairy, and Jonathon Howard. “Drawing an elephant with four complex parameters.” American Journal of Physics 78.6 (2010): 648–649.
[5] Mei, Song, and Andrea Montanari. “The generalization error of random features regression: Precise asymptotics and double descent curve.” arXiv preprint arXiv:1908.05355 (2019).
[6] Rahimi, Ali, and Benjamin Recht. “Random features for large-scale kernel machines.” Advances in neural information processing systems. 2008.
[7] Saade, Alaa, et al. “Random projections through multiple optical scattering: Approximating kernels at the speed of light.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.
[8] Geiger, Mario, et al. “Jamming transition as a paradigm to understand the loss landscape of deep neural networks.” Physical Review E 100.1 (2019): 012115.
[9] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
[10] Nakkiran, Preetum, et al. “Deep double descent: Where bigger models and more data hurt.” arXiv preprint arXiv:1912.02292 (2019).