Beyond Overfitting and Beyond Silicon: The double descent curve
When Freeman Dyson went to Chicago seeking Fermi's blessing for his results on the pseudoscalar theory of pions, he was let down. Fermi dismissed the numerical agreement with a famous quote from von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk" [Freeman Dyson — Fermi's rejection of our work]. Whether or not von Neumann knew that the fit can indeed be done with four complex parameters [4], the lesson remains: by sufficiently increasing the number of parameters, i.e. the model complexity, we can interpolate any kind of data. Just as miners once mistook pyrite for gold, data miners can mistake overfitting for learning.
The U-shaped curve
Avoiding over-fitting is a well-known prescription in Data Science [9], taught in every statistics course and in classic machine learning classes. Traditionally, the bias-variance trade-off is depicted as a U-shaped risk curve (Figure 2). Both the train and test errors are high when the model complexity is low; as the number of parameters increases, both errors decrease until the model reaches the coveted sweet spot. Beyond that point, the training error keeps decreasing while the test error starts to increase. Much like the Pillars of Hercules, the westernmost edge of the inhabited world in antiquity, the interpolation point, where the number of parameters equals the sample size, marked the boundary beyond which classical statistics saw only terminal over-fitting. As a result of this thinking, the number of parameters is usually very carefully tuned, or at least it used to be.
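The classical U-shape is easy to see in a toy experiment. The sketch below is our own minimal NumPy illustration (polynomial regression on noisy samples of a sine, not one of the experiments from the papers cited here): the training error falls as the degree grows, while the test error first improves and then blows up as the fit approaches interpolation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth target function
f = lambda x: np.sin(np.pi * x)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 200)
y_train = f(x_train) + 0.3 * rng.standard_normal(x_train.size)
y_test = f(x_test)

def errors(degree):
    """Train/test MSE of a least-squares polynomial fit of a given degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (1, 5, 19):  # under-fit, sweet spot, interpolation (20 points)
    tr, te = errors(d)
    print(f"degree {d:2d}: train MSE {tr:.2e}, test MSE {te:.2e}")
```

At degree 19 the polynomial interpolates all 20 training points, so the training error is essentially zero while the test error explodes: exactly the right-hand wall of the U-curve.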
The double descent curve
After Dyson's cathartic experience and decades of classical statistics, the data science community has mastered the art of finding that sweet spot. Enter deep learning and its attendant success, where the classical prescriptions fail to explain the generalization properties of models containing millions of parameters (AlexNet in 2012 had 60M parameters to classify 1.2M examples; VGG-16 and VGG-19 both exceed 100M parameters [3]).
In other words, how is it possible that models with millions of parameters reach zero training error while performing extremely well on unseen data? Many explanations have been proposed (a nice summary can be found here), but the full picture is still unclear. What we observe is that, by increasing the model complexity beyond the interpolation point, entering the so-called over-parameterized regime, the generalization error starts to decrease again. Surprisingly, despite perfect training accuracy, one can often reach a lower test error than the minimum of the under-parameterized regime. Classical prescriptions cannot explain this phenomenon.
A first empirical account of this phenomenon was provided by Geiger et al. [8], who draw an analogy between training a neural network and the jamming of dense repulsive particles within a finite volume (Figure 3). The U-shaped curve becomes a double descent curve (Figure 4) reconciling the under- and over-parameterized regimes (Belkin et al. [2]). Although a rigorous mathematical framework is still missing, the phenomenon has been verified empirically [1] and studied analytically [5] for a simple two-layer model with random first-layer weights (random features [6]). It has subsequently been extended to modern deep learning models (ResNets and Transformers) and generalized in terms of effective model complexity [10].
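The peak at the interpolation point and the second descent can be reproduced in a few lines of NumPy. The sketch below is our own toy version of the "weak features" setting studied in [1] (a linear teacher with d features, fitted with the minimum-norm least-squares solution using only the first p of them), not the exact experiments of the papers:

```python
import numpy as np

def min_norm_test_mse(n_train, p, n_test=1000, n_trials=10, d=200, seed=0):
    """Average test MSE of the minimum-norm least-squares fit that uses
    only the first p of d true features (a toy 'weak features' setup)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_trials):
        beta = rng.standard_normal(d) / np.sqrt(d)      # teacher weights
        X = rng.standard_normal((n_train, d))
        y = X @ beta + 0.1 * rng.standard_normal(n_train)
        X_test = rng.standard_normal((n_test, d))
        y_test = X_test @ beta
        # pinv gives the minimum-norm solution when p > n_train
        w = np.linalg.pinv(X[:, :p]) @ y
        errs.append(np.mean((X_test[:, :p] @ w - y_test) ** 2))
    return float(np.mean(errs))

n = 40  # number of training examples
for p in (10, n, 200):  # under-parameterized, interpolation point, over-parameterized
    print(f"p = {p:3d}: test MSE {min_norm_test_mse(n, p):.3f}")
```

The test error is moderate for p well below n, explodes near p = n where the design matrix becomes nearly singular, and descends again in the over-parameterized regime p > n: the double descent curve in miniature.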
Building larger models may seem an easy way to improve performance; however, beyond a certain size, resource utilization becomes a major issue. One way to take this new constraint into account, proposed by Canziani et al. [3], is to rank the ImageNet-winning models by weighing their test accuracy against their resource utilization. To learn more about how the availability of computational power shaped the past of AI and how it is now influencing its future, check our previous blog post entitled Faith no Moore.
Increasing complexity at the speed of light
At LightOn, our Optical Processing Unit (OPU) can generate hundreds of thousands of random features at the speed of light by exploiting the physics of an encoded laser beam passing through a random scattering medium [7]. Computing random features (an essential component of the model in [6]) with our optics-based hardware is easy, fast and energy-efficient.
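In simulation, this feature map can be sketched in a few lines of NumPy. This is a simplified model of the optics, assuming a complex Gaussian random matrix for the scattering medium and a squared-modulus nonlinearity for the intensity measured by the camera, in the spirit of [7] (it is not the lightonml API):

```python
import numpy as np

def simulated_opu_features(X, n_features, seed=0):
    """Simulate optical random features: project the input through a
    complex Gaussian random matrix (the scattering medium) and take the
    squared modulus (the camera measures light intensity)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = (rng.standard_normal((d, n_features))
         + 1j * rng.standard_normal((d, n_features))) / np.sqrt(2 * d)
    return np.abs(X @ W) ** 2

X = np.random.default_rng(1).standard_normal((5, 64))
phi = simulated_opu_features(X, n_features=1000)
print(phi.shape)  # prints (5, 1000)
```

On the physical device, the matrix multiplication happens in constant time as light propagates, which is what makes scaling to hundreds of thousands of features cheap.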
We recovered the double descent curve for the MNIST and CIFAR10 datasets using the OPU and our lightonml Python library (Figures 5 and 6). Both curves match the empirical observations of the double descent curve phenomenon.
We used random features followed by a RidgeClassifier from scikit-learn. We show results both for a simulation of our optical hardware (Synthetic OPU in the figures) and for computations performed physically with laser light (OPU).
The interpolation point lies at the number of training examples when using ridge regression.
These curves are reproducible using our hardware and the lightonml library on LightOn Cloud; you can check the code here, and if you are interested in trying our optical processing technology, do not hesitate to contact us!
The Author
Alessandro Cappelli, Machine Learning Engineer at LightOn AI Research.
References
[1] Belkin, Mikhail, Daniel Hsu, and Ji Xu. “Two models of double descent for weak features.” arXiv preprint arXiv:1903.07571 (2019).
[2] Belkin, Mikhail, et al. “Reconciling modern machine-learning practice and the classical bias–variance trade-off.” Proceedings of the National Academy of Sciences 116.32 (2019): 15849–15854.
[3] Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. “An analysis of deep neural network models for practical applications.” arXiv preprint arXiv:1605.07678 (2016).
[4] Mayer, Jürgen, Khaled Khairy, and Jonathon Howard. “Drawing an elephant with four complex parameters.” American Journal of Physics 78.6 (2010): 648–649.
[5] Mei, Song, and Andrea Montanari. “The generalization error of random features regression: Precise asymptotics and double descent curve.” arXiv preprint arXiv:1908.05355 (2019).
[6] Rahimi, Ali, and Benjamin Recht. “Random features for large-scale kernel machines.” Advances in neural information processing systems. 2008.
[7] Saade, Alaa, et al. “Random projections through multiple optical scattering: Approximating kernels at the speed of light.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.
[8] Geiger, Mario, et al. “Jamming transition as a paradigm to understand the loss landscape of deep neural networks.” Physical Review E 100.1 (2019): 012115.
[9] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
[10] Nakkiran, Preetum, et al. “Deep double descent: Where bigger models and more data hurt.” arXiv preprint arXiv:1912.02292 (2019).