How millions of parameters can avoid overfitting

Jean Czerlinski Whitmore Ortega
Jan 3, 2022


Both linear regression and deep learning can leverage a massive number of mis-specified features

Double descent of test error: Figure 1 from [2].

TL;DR The “impossible” phenomenon of minimizing test error with “too many” parameters does not just appear in deep neural nets; it also appears in linear regression. A recent paper sheds light on why, with a key role played by misspecified features.

UPDATE: There is now a video version of this post: https://youtu.be/bM6WJVyytEg

Deep learning has proven amazingly effective at tasks ranging from speech recognition to object detection in images. And yet, as one author bluntly put it, “These empirical results should not be possible” [1].

Why? Although deep neural nets are trained on what are called “big data” sets with millions of training examples, the neural nets have significantly more tunable parameters than examples, meaning they are “overparameterized.” According to statistical learning theory, such models should suffer from high test error, making inaccurate predictions on a new sample of data. The models should be like a student who memorized the answers to an exam but flunks when the exam turns out to have different questions. Even more intriguing, while this “impossible” phenomenon was first observed in deep neural nets, it has since been observed in far simpler models, including wide neural networks, kernel methods, and now even linear regression models.

Linear regression caught my eye: its simplicity made me feel I had a chance of understanding this mystery. So I dug into recent papers, with a focus on Dar et al.’s 2021 paper, “A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning” [2].

To give a flavor of their argument, let me construct an example with contrasting data sets. Suppose the goal is to predict a student’s university entrance exam score, and two data sets are available. The “scores” data set has scores from prior tests in key areas such as math, science, reasoning, and language. The “social media” data set has the logs from each student’s social media activities. A linear model using the “scores” data set is likely to minimize test error with a small number of these high-signal features, gaining no improvement in test error from overparameterization. In contrast, a linear model using the “social media” data set has a lot of low-signal data, so a few parameters cannot perform well. Instead, there is a benefit to adding more and more parameters so the model can extract a useful signal. Overparameterization can then improve even the test error.

I observe that in our digital world, we are getting more of the “social media” type of data sets, which are massive but include a lot of low-signal parameters. Perhaps that is one factor in why we are noticing the benefits of overparameterization now.

So why are the low-signal data sets favorable to overparameterization? I will begin by reviewing statistical learning theory, where the bias-variance trade-off led many modelers to fear having too many parameters. Next I will describe the empirically observed “double descent,” where a massive number of parameters can lead to a second reduction in test error beyond the bias-variance regime. Finally I will summarize the explanation of this phenomenon for linear regression, which hopefully will lead to some intuition for the “impossible” findings.

Bias-variance refresher: the role of number of parameters

The impact of the number of parameters in a model actually depends on whether you are fitting a data set or trying to make predictions on additional data, outside of your original training sample. More parameters always improve (or at least never worsen) the fit to the training data, just like more pixels on a camera improve the realism of the photo. With enough parameters, the model can interpolate the data, meaning the training error is zero. The point where this first happens is called the interpolation threshold, and it occurs when the number of parameters equals the number of examples, allowing the examples to be fit perfectly. You can add still more parameters, but they cannot reduce training error further because it is already zero.

However, if these models are used to predict a different sample of data, such as the test data, then the error typically increases as the interpolation threshold is approached. Plotting the error on the test data set relative to the number of parameters typically results in a U-shaped curve. To minimize test error, the optimal number of parameters lies between 0 and the interpolation threshold.
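Both behaviors are easy to see in a small simulation. Here is a minimal numpy sketch (a toy setup for illustration; the sample sizes, noise level, and random features are arbitrary choices) that fits least-squares models with an increasing number of features and prints training and test error:

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_test, p_max = 10, 500, 10
    X_train = rng.normal(size=(n, p_max))
    X_test = rng.normal(size=(n_test, p_max))
    y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.5, size=n)
    y_test = 2.0 * X_test[:, 0] + rng.normal(scale=0.5, size=n_test)

    for p in range(1, p_max + 1):
        # Least-squares fit using only the first p features
        beta, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
        train_mse = np.mean((y_train - X_train[:, :p] @ beta) ** 2)
        test_mse = np.mean((y_test - X_test[:, :p] @ beta) ** 2)
        print(f"p={p:2d}  train MSE={train_mse:6.3f}  test MSE={test_mse:8.3f}")

    # Training error can only shrink and hits (numerically) zero at p = n, the
    # interpolation threshold; test error typically balloons as p approaches n
    # because the fit starts chasing noise.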

Typical plot of error relative to the number of parameters fit in a model. The training error (dashed line) approaches zero as the interpolation threshold (dot dashed line) is approached. The test error (solid line) shows the classical U shape from the bias-variance trade-off. The number of parameters to minimize test error is indicated by the “sweet spot.” Figure by the author, based on figure 1a from [3].

Why? The bias-variance decomposition

The traditional theoretical approach to explaining this pattern of test error is to decompose the error into bias and sampling variance. Each extra parameter added to a model tends to decrease bias but increase the sampling-variance contribution to test error. As a result, there is a bias-variance trade-off [4]. Importantly, as additional parameters are added, the extra error from sampling variance tends to dominate, which is why people thought good performance with billions more parameters than examples should be “impossible.”
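For squared-error loss, the standard decomposition (as laid out in [4]) at a test point x is, writing f for the true function, f-hat for the fitted model, and σ² for the irreducible noise variance:

    \mathbb{E}\big[(y - \hat{f}(x))^2\big]
      = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{noise}}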

For more on why the bias-variance trade-off prefers fewer parameters, see my deeper dive into the bias-variance trade-off.

Now the amazing thing about this recent overparameterization work is that it seems to be saying, “Just fit a 100-dimensional deep neural net and your test error will come back down, possibly below that of the two- or three-dimensional model.” That is why it seems so impossible.

Regularization

There was, however, one workaround. Performant models could be created in a high-dimensional parameter space by constraining the model fitting with what is called regularization. Instead of simply minimizing the error (the model predictions’ discrepancies from the data), we minimize a loss function that combines the error with a penalty term for exploring “too much” of the available parameter space. For example, an ℓ1 penalty shrinks parameter values toward zero, driving many of them to exactly zero, so each additional non-zero parameter has to earn its place. In the sphere-location example, the ℓ1 penalty could keep the model weights in the fourth or fifth dimension at zero unless the data carried real signal in those dimensions beyond the original three. (Linear regression with an ℓ1 penalty is known as lasso regression, and with an ℓ2 penalty, it is known as ridge regression.)
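Concretely, for linear regression the two standard penalized objectives are (with λ ≥ 0 controlling the strength of the penalty):

    \hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
    \qquad
    \hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2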

Regularization has proven to be an effective (although somewhat ad hoc) tool to prevent error due to sampling variance from exploding as parameters are added. So it is consistent with the bias-variance trade-off.

The amazing thing about the big data results, though, is that low test error can be found in high dimensions even without explicit regularization. This seems to contradict the theory.

Double Descent

In 2018, Belkin and coauthors [3] suggested that the apparent contradiction of theory and observation could be resolved with a unified performance curve that they dubbed “double descent.” Their test error curve has an “underparameterized” regime to the left of the interpolation threshold. This regime has the traditional U-shaped bias-variance curve, which is the first descent. Then it adds an “overparameterized” regime to the right of the interpolation threshold. In the overparameterized regime, test error can decrease again with more parameters, a second descent. Below is a simplified diagram from Dar et al. [2].

Double descent of test error: Figure 1 from [2] with original caption. TOPML is an acronym for “theory of overparameterized ML.”

As I mentioned in the introduction, originally double descent was observed in deep neural networks. And much of the theory trying to understand the pattern leverages the special features of deep neural networks and the gradient descent method of fitting. But recent articles showed how and when linear regression also exhibits double descent — the phenomenon is not just for deep neural networks. Let’s consider these now. (That said, there are additional effects that happen specifically in a deep neural network rather than a linear regression.)

Double descent in linear regression

Linear regression means fitting a plane (or hyperplane) to a data set. Finding the plane that minimizes the error does not require machine learning techniques like gradient descent. In fact, the closed-form solution for the case where the number of parameters (columns in the design matrix) is less than the number of data points (rows in the design matrix) has been known for more than a century (under standard assumptions, of course). That solution breaks down in the overparameterized regime because the normal-equations matrix (X^T X) becomes singular when there are more columns than rows. And, in any case, nobody seemed to suspect that a useful model would be obtained there.
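For reference, here are the classical closed-form solution in the underparameterized regime and the minimum ℓ2-norm interpolating solution that papers such as [2] and [6] analyze in the overparameterized regime (written with the Moore–Penrose pseudoinverse of X):

    \hat{\beta}_{\text{OLS}} = (X^\top X)^{-1} X^\top y \quad (p < n),
    \qquad
    \hat{\beta}_{\text{min-norm}} = X^\top (X X^\top)^{-1} y = X^{+} y \quad (p > n,\ X \text{ full row rank})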

But it can be useful! And by working through many scenarios with different data distribution assumptions, Dar et al. derived necessary and sufficient conditions for double descent to occur in a linear regression. Below is a somewhat intuitive (but inevitably imprecise) summary of the conditions in section 3.3.3 of [2]:

  1. Low-dimensional signal structure in the data (where “signal” means the de-noised part of the data).
  2. Low effective dimension in the data, i.e., the data covariance has only a few large eigenvalues. (See [5] for a tutorial on effective dimension and its distinction from intrinsic dimension; a small computational sketch follows this list.)
  3. Alignment: the highest-signal directions of the data must line up with the directions of highest effective dimension (the eigenvectors with the largest eigenvalues).
  4. An abundance of low-signal (but non-zero) features in the model.
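To make condition 2 concrete, here is a minimal sketch (an illustration of one common way to summarize effective dimension, a participation-ratio-style count of how many directions carry most of the variance; [5] covers such measures in far more depth):

    import numpy as np

    def effective_dimension(X):
        """Participation-ratio-style effective dimension of the data covariance.

        Returns a value between 1 (all variance in one direction) and the
        number of features (variance spread evenly across all directions).
        """
        cov = np.cov(X, rowvar=False)
        eigvals = np.clip(np.linalg.eigvalsh(cov), 0, None)  # guard tiny negatives
        return eigvals.sum() ** 2 / (eigvals ** 2).sum()

    rng = np.random.default_rng(0)
    # 50 features, but almost all of the variance lives in the first 3 directions.
    scales = np.r_[np.full(3, 10.0), np.full(47, 0.3)]
    X = rng.normal(size=(1000, 50)) * scales
    print(effective_dimension(X))   # close to 3, far below the ambient 50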

If these are hard to parse, then focus on the fact that a model-dataset pair satisfying these conditions has mis-specified features. As the authors put it, “The amplified misspecification due to a poor selection of feature space actually promotes increased benefits from overparameterization (compared to underparameterized solutions, which are extremely misspecified).”

In other words, if a human builds a regression with wisely chosen features, then a small model may perform well, and there will likely be less benefit to overparameterization — possibly no benefit. But if you throw together a grab-bag of features from, say, some incidentally recorded internet logs, then overparameterization is more likely to be beneficial — in a way just because the underparameterized model was so bad. And of course most of our machine learning modeling in the internet age is in the latter category. This finding is reminiscent of the “strength of weak learners” in ensemble models and boosting.

Recall the constructed example I gave before, where predicting university entrance exam scores with the “scores” data set of other test scores did not benefit from more parameters (no double descent) but predicting with the “social media” data set would benefit from more parameters (double descent). The more parameters we add, the more chance of hitting a useful data source, like a homework help chat room, and so overparameterization helps. The “social media” data set is mis-specified.

Returning to the example of locating a floating sphere, suppose instead we are trying to locate a cell phone. In theory (if authorization were granted), we could use various telemetry sources such as logs from cell phone towers, nearby wifi hotspots, or bluetooth sensing, as well as social media posts and check-ins from the person using the phone, or photos from security cameras run through image recognition. We would have a more accurate location model with 100 such features than with 3, because each of these features individually is a weak signal. These features are not directly related to the X, Y, Z coordinates of the phone: they are mis-specified.

An earlier article by Hastie et al. (2019) also commented on the intuition behind double descent in misspecified linear regressions:

“In this section, we consider a misspecified model, in which the regression function is still linear, but we observe only a subset of the features. Such a setting provides another potential motivation for interpolation: in many problems, we do not know the form of the regression function, and we generate features in order to improve our approximation capacity. Increasing the number of features past the point of interpolation… can now decrease both bias and variance” [6].

The examples should make clear why we are noticing the benefits of overparameterization now: our digitized world gives us so many large, mis-specified data sets. We have an abundance of data with low but non-zero signal. Relative to that, we have a scarcity of human-curated data sets, whose features are likely to be well-specified.

Just tell me what to do

All this theory is nice, but some readers just want to know what to do.

  • If you have a data set of well-specified features, go ahead and use them and seek the few features that work well in the traditional underparameterized regime. No need to collect more features, as you are unlikely to benefit. (Even if you observe a second descent in test error, it is unlikely to be less than the minimal underparameterized test error.)
  • If you have a data set with a large number of low-signal features, use as many as you can and go deep in the overparameterized regime in search of double descent.

I have constructed an artificial example to compare these two cases explicitly. I took an existing well-specified data set and then corrupted it with noise and added extra parameters so that it became a mis-specified data set. I then fitted linear regressions and measured their test errors, resulting in the graph below.

Linear regressions fit to a well-specified data set (blue) have lower test error than linear regressions fit to a mis-specified data set (red). Only the mis-specified data set exhibits double descent. Image by the author from the code on my github.

Let me walk through that step by step. The lowest error is achieved by linear regressions fit on the well-specified data set (blue), using 4 of the 10 parameters. The mis-specified data set (red), in contrast, exhibits the classical double descent, with one low-error region to the left of the interpolation threshold (which is at p=10) and a second to the right, with lots of parameters. The mis-specified data set (red) achieves its lowest error with the most parameters, p=30. But, importantly, this is still not as low as the lowest error on the well-specified data set (blue). Full details of how I constructed these data sets and fitted linear regressions are on my github.
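If you want to tinker before opening the repo, below is a minimal sketch in the same spirit (a simplified toy construction, not the exact code on my github; the feature counts, noise levels, and the mixing step are illustrative choices). It compares minimum-norm least squares fits on directly observed, well-specified features against noisy mixtures of them, printing test error as the number of features grows:

    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, p_total = 15, 500, 30

    # "True" model: only the first 4 features carry signal.
    beta = np.r_[np.array([3.0, -2.0, 1.5, 1.0]), np.zeros(p_total - 4)]

    def draw(n):
        X = rng.normal(size=(n, p_total))
        y = X @ beta + rng.normal(scale=0.5, size=n)
        return X, y

    X_tr, y_tr = draw(n_train)
    X_te, y_te = draw(n_test)

    # Mis-specified setup: the modeler never sees X directly, only many noisy
    # mixtures of it (think "social media logs" rather than prior test scores).
    W = rng.normal(size=(p_total, p_total)) / np.sqrt(p_total)
    Z_tr = X_tr @ W + rng.normal(scale=1.0, size=X_tr.shape)
    Z_te = X_te @ W + rng.normal(scale=1.0, size=X_te.shape)

    def test_mse(F_tr, F_te, p):
        # Minimum-norm least squares on the first p features.
        b = np.linalg.pinv(F_tr[:, :p]) @ y_tr
        return np.mean((y_te - F_te[:, :p] @ b) ** 2)

    for p in [2, 4, 8, 14, 15, 16, 20, 30]:
        print(f"p={p:2d}  well-specified={test_mse(X_tr, X_te, p):8.2f}  "
              f"mis-specified={test_mse(Z_tr, Z_te, p):8.2f}")

How sharply the second descent shows up depends on the sample sizes and noise levels, so treat the printout as a starting point for experimentation rather than a replication of the figure above.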

The take-home: Well-specified data is best. But if you have a mis-specified data set, then use as many parameters as you can get. And if you don’t know what kind of data you have, then you will have to experiment.

Postscript: Drilling into misspecification

If you want to dig a bit more into the theory, Dar and coauthors show how bias and variance can each be further decomposed into a misspecification component and an in-class component. Below is an example they use to illustrate the behavior of these additional components:

Figure 4 extends Fig. 2(d) by illustrating the misspecification and in-sample components of the bias and variance versus the number of parameters p in the function class. Indeed, the misspecification bias decreases as the function class is more parameterized. In contrast, the in-class bias is zero in the underparameterized regime, and increases with p in the overparameterized regime. Yet, the reduction in the misspecification bias due to overparameterization is more significant than the corresponding increase in the in-sample bias; and this is in accordance with having the global minimum of test error in the overparameterized range. We can observe that both the in-class and misspecification variance terms peak around the interpolation threshold (i.e., p = n) due to the poor conditioning of the input feature matrix. Based on the structure of the true parameter vector (see Fig. 2(e)), the significant misspecification occurs for p < 20 and this is apparent for the misspecification components of both the bias and variance [2].

Bias-variance decomposition including mis-specification. Figure 4 from [2] with original caption.

Note that this behavior is specific to the assumptions in the example.

Dar and co-authors note that there is a “rich variety of generalization behaviors” in the overparameterized regime of linear regression [2]. My goal here was merely to show how the “impossible” double descent behavior could seem possible, even to our intuition, using a straightforward linear model.

Addendum

I think the insights here likely apply to the finding that deep learning does not beat other methods on “tabular data.” Tabular data has typically been curated and collected for some purpose, meaning many of its features are well-specified.

However, I have not yet followed up on the connection.

Acknowledgments

Thank you to Yehuda Dar for correcting several mistakes in prior versions. Any remaining mistakes (and there probably are a few) are my own.

Bibliography

[1] T. J. Sejnowski, “The unreasonable effectiveness of deep learning in artificial intelligence,” Proc. Natl. Acad. Sci., vol. 117, no. 48, pp. 30033–30038, Dec. 2020, doi: 10.1073/pnas.1907373117.

[2] Y. Dar, V. Muthukumar, and R. G. Baraniuk, “A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning,” arXiv:2109.02355, Sep. 2021. Accessed: Oct. 28, 2021. [Online]. Available: https://arxiv.org/abs/2109.02355

[3] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine learning practice and the bias-variance trade-off,” arXiv:1812.11118 [cs, stat], Sep. 2019. Accessed: Oct. 29, 2021. [Online]. Available: http://arxiv.org/abs/1812.11118

[4] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.

[5] M. Del Giudice, “Effective Dimensionality: A Tutorial,” Multivar. Behav. Res., Mar. 2020, doi: 10.1080/00273171.2020.1743631.

[6] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” arXiv:1903.08560 [cs, math, stat], Dec. 2020. Accessed: Jan. 01, 2022. [Online]. Available: http://arxiv.org/abs/1903.08560


Jean Czerlinski Whitmore Ortega

Ex-Google engineer modeling things and celebrating non-things: machine learning, incentives, behavior, ethics, physics.