Spotlight on the bias-variance trade-off

Jean Czerlinski Whitmore Ortega
6 min read · Sep 6, 2022


Why more data is not always better

U-shaped test error curve: a typical plot of error relative to the number of parameters fit in a model. Figure by the author, based on figure 1a from [1].

TL;DR The bias-variance trade-off explains why smaller models, those with fewer parameters to train, can predict better out of sample.

It is intuitive that more data is better. A model trained on more data should be more accurate than a model trained on less data, right? It turns out the situation is more complicated than that, because “more” can mean more training examples or more model parameters. It has been known for decades that the best number of parameters is not necessarily the largest number you can fit: when predicting out-of-sample, there is usually a “sweet spot” at a middling number of parameters. And this is not just an empirical observation. There is a theoretical explanation for it, based on decomposing error into bias and variance.

The role of the number of parameters

The impact of the number of parameters actually depends on whether you are fitting a single data set or trying to make predictions on additional data outside your original training sample. Adding parameters never hurts and generally improves the fit to the training data, much as more pixels on a camera improve the realism of a photo. With enough parameters, the model can interpolate the data, meaning the training error is zero. The point where this first happens is called the interpolation threshold, and it occurs when the number of parameters equals the number of examples, allowing every example to be fit perfectly. You can add still more parameters, but they cannot reduce the training error further because it is already zero.

However, if these models are used to predict a different sample of data, such as a test set, then the error typically increases as the interpolation threshold is approached. Plotting the test error against the number of parameters typically yields a U-shaped curve, so the number of parameters that minimizes test error lies somewhere between zero and the interpolation threshold.

Typical plot of error relative to the number of parameters fit in a model. The training error (dashed line) approaches zero as the interpolation threshold (dot dashed line) is approached. The test error (solid line) shows the classical U shape from the bias-variance trade-off. The number of parameters to minimize test error is indicated by the “sweet spot.” Figure by the author, based on figure 1a from [1].
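
To see the U shape concretely, here is a minimal sketch in Python (my own illustration using NumPy; the ground-truth function, noise level, and seed are arbitrary choices, not from the article). It fits polynomials of increasing degree to 20 noisy training points, so degree 19 (20 coefficients) is the interpolation threshold. Training error falls toward zero as that threshold is approached, while error on a fresh test sample typically falls and then rises.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 20
f = lambda x: np.sin(2 * np.pi * x)                 # assumed ground truth
x_train = np.linspace(0, 1, n_train)
y_train = f(x_train) + rng.normal(0, 0.3, n_train)  # noisy training sample
x_test = rng.uniform(0, 1, 500)
y_test = f(x_test) + rng.normal(0, 0.3, 500)        # fresh test sample

for degree in [1, 3, 5, 10, 15, 19]:                # degree 19 = 20 parameters = interpolation threshold
    # polyfit may warn that the fit is ill-conditioned near the threshold; that is expected
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"params={degree + 1:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.3g}")
```

The exact numbers depend on the noise and the seed, but the training error shrinks toward zero while the test error traces the U shape in the figure above.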

Why? The bias-variance decomposition

Why is there a U-shaped curve? The traditional theoretical approach to explaining this pattern of test error is to decompose the error into bias and sampling variance. Each extra parameter added to a model can reduce bias but tends to increase the sampling variance of the test error, so there is a bias-variance trade-off [2]. Importantly, as more and more parameters are added, the extra error from sampling variance tends to dominate, which is why people thought good performance with billions more parameters than examples should be “impossible.”

Let me dive deeper into the two components of error for those unfamiliar with this theory. A model’s error is the difference between its prediction and the true value. Consider the best model that can be fit in a given model class, that is, the model whose parameter weights bring its predictions as close as possible to the true values (the ground truth). We can then decompose the test error into two parts relative to this best model:

  • Bias is the difference between the predictions of the best model in a model class and the true values.

Of course, we normally do not know the best model in a model class; we can only attempt to estimate it by fitting models to data. That leads to the other component of error:

  • Sampling variance is the difference between a typical model fitted on an actual sample of data (which will vary for every sample) and the best model in the model class.

These two components, especially sampling variance, can be hard to grasp in the abstract, so here is a small simulation, followed by an analogy.
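
A minimal sketch of both components, assuming Python with NumPy (the query point, polynomial degrees, and noise level are my own arbitrary choices, not from the article). Each trial draws a fresh training sample, fits a polynomial, and records its prediction at a fixed point; the average prediction stands in for the best model in the class, its gap from the truth is the bias, and the spread around it is the sampling variance.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)       # assumed ground truth
x0 = 0.3                                  # fixed query point where we examine predictions
x_train = np.linspace(0, 1, 20)
noise, n_trials = 0.3, 2000

for degree in [1, 4, 10]:                 # underfit, roughly right, overfit
    preds = np.empty(n_trials)
    for t in range(n_trials):
        y_train = f(x_train) + rng.normal(0, noise, x_train.size)  # a fresh sample of data
        preds[t] = np.polyval(np.polyfit(x_train, y_train, degree), x0)
    bias = preds.mean() - f(x0)           # average fitted model vs. the truth
    variance = preds.var()                # spread of fitted models around their average
    error_sq = np.mean((preds - f(x0)) ** 2)
    print(f"degree={degree:2d}  bias²={bias**2:.4f}  variance={variance:.4f}  error²={error_sq:.4f}")
```

With these settings, the low-degree model shows large bias and small variance, the high-degree model the reverse, and in every row error² matches bias² plus variance (up to floating point), which previews the formula below.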

A spotlight analogy

Here is an analogy I like, in which the data dimensions and model dimensions are the same, making visualization a bit easier. Suppose we are trying to estimate the location of a sphere floating in the air, so its true location has three dimensions. Now suppose our model class were restricted to the two dimensions of the ground (both in the parameters to be estimated and in the output prediction), so any model’s prediction would be just a shadow of the sphere. The best model in this class is the shadow directly under the sphere, which is where the sphere’s shadow would fall if a light were mounted directly overhead. The “bias” component of the error is then the sphere’s height above this shadow. But in reality we do not know where to mount the light to get the best model. Instead, every sample of data can be seen as a light mounted at a different place on the ceiling, casting a different shadow on the floor, and that shadow is the model’s prediction of the sphere’s location. The spread of these shadows on the floor is then the sampling variance.

Test error decomposed into bias and sampling variance using a sphere of reality. Samples of data are spotlights casting shadows onto our model space. The shadow directly under the sphere is the best-fitting model in the model class. Image by the author using manual annotation and this code: https://jsfiddle.net/jeanimal/8b7txh40/75/

The total error of a fitted model is the vector sum of its bias and its sampling deviation. Applying the Pythagorean theorem then gives us this well-known formula:

Error² = bias² + variance

Notice that variance is already squared, since it is defined as the square of the standard deviation. (This formula also assumes that the sampling deviations are uncorrelated with the bias, which visually means they meet it at right angles.)
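
For reference, this is the textbook version of that formula in symbols (my own addition, not from the article). Here ŷ is the prediction of a model fitted on a random training sample, E averages over training samples, y is the noiseless true value, and bias is measured relative to the average prediction, which stands in for the best model in the class:

```latex
\mathrm{Error}^2
  = \mathbb{E}\big[(\hat{y} - y)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{y}] - y\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{y} - \mathbb{E}[\hat{y}]\big)^2\big]}_{\text{variance}}
```

In this centered version the cross term vanishes automatically when you expand the square; measuring bias against the best model in the class instead is where the no-correlation (right-angle) assumption comes in, and a noisy target would add an irreducible-noise term on top.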

But the formula is not as important as the intuition, and the effect of the number of parameters is not obvious from the formula alone.

So let’s return to the spotlight analogy. If we restricted the model class to just one dimension, say latitude only, then the bias would increase, because the best model would no longer be the shadow directly under the sphere but the closest point on the latitude line. The increased bias causes an increase in error. However, the range of possible shadows would be restricted to that line, so sampling variance would decrease, causing a decrease in error. So would the latitude restriction result in better models with lower overall error? That depends. The increased bias could be outweighed by the decreased sampling variance, particularly on noisy data sets that cast shadows all over the place.

Alternatively, consider adding a third dimension to the two-dimensional shadow models, letting them also estimate height off the “floor.” Then the bias is zero, because the best model can exactly match the location of the sphere. But will a fitted model do so, given the samples of data? The three-dimensional model has more sampling variance, simply because it has three dimensions to estimate instead of two. Whether the overall error is lower depends on how noisy the estimates are, given the spotlights shone by the data.

What if you added even more dimensions? Suppose a model of four or even five dimensions, which is hard to visualize. Bias will still be zero, since adding unnecessary dimensions has no effect on bias. But there will be many more possible fits to the data. In a noiseless world the sphere would never cast a shadow in the fourth or fifth dimension, but if the data set is noisy, the measurements in those dimensions may be non-zero, greatly increasing sampling variance.
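
Here is a stylized simulation of the analogy, assuming Python with NumPy (the sphere’s coordinates, the noise level, and the idea of “fitting” by simply reading off measured coordinates are my own simplifications, not the author’s code). The sphere truly lives in three dimensions; a d-dimensional model keeps the first d noisy measurements and predicts zero for the rest.

```python
import numpy as np

rng = np.random.default_rng(2)
truth = np.array([1.0, 2.0, 3.0, 0.0, 0.0])   # the sphere lives in 3-D; dims 4-5 are truly empty
sigma, n_trials = 1.0, 100_000

for d in range(1, 6):
    obs = truth + rng.normal(0.0, sigma, size=(n_trials, 5))  # each row: one noisy "spotlight"
    pred = np.zeros_like(obs)
    pred[:, :d] = obs[:, :d]                                  # a d-dimensional model keeps d coordinates
    sim_error = np.mean(np.sum((pred - truth) ** 2, axis=1))  # average squared distance from the sphere
    bias_sq = float(np.sum(truth[d:] ** 2))                   # best d-dimensional model vs. the truth
    variance = d * sigma ** 2                                 # one sigma^2 per estimated coordinate
    print(f"d={d}  bias²={bias_sq:.1f}  variance={variance:.1f}  "
          f"bias²+variance={bias_sq + variance:.1f}  simulated error={sim_error:.1f}")
```

With this noise level, the sweet spot is d = 3, the true dimensionality: extra dimensions add variance without reducing bias, and dropping a real dimension adds more bias than the variance it saves. With a much noisier spotlight, or a sphere hovering close to the floor, the smaller model can win, which is the trade-off in the latitude example above.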

Summary so far

Samples of data are spotlights on reality, and we interpret the shadows they cast onto our model space. Too few parameter dimensions, and even our best model with the best data is still very far from reality. Too many parameters, and the multi-faceted shadows from the spotlights will just confuse us.

Put in less metaphorical terms: to minimize test error when predicting new data, the ideal model uses exactly the parameters that are necessary and no more. Fewer parameters than necessary will underfit the data, leading to bias. More parameters than necessary will overfit, risking non-zero estimates for the unnecessary parameters and increasing the model’s sampling variance. Strive for a happy middle.

Beyond bias-variance

But deep learning models have millions of parameters and can still perform well. How is that possible, given the bias-variance trade-off? It turns out the bias-variance trade-off is incomplete: the classical analysis did not examine what happens to error when the number of parameters goes far beyond the interpolation threshold. See my post on escaping the bias-variance trade-off.

Bibliography

[1] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine learning practice and the bias-variance trade-off,” arXiv:1812.11118 [cs, stat], Sep. 2019. Accessed: Oct. 29, 2021. [Online]. Available: http://arxiv.org/abs/1812.11118

[2] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.


Jean Czerlinski Whitmore Ortega

Ex-Google engineer modeling things and celebrating non-things: machine learning, incentives, behavior, ethics, physics.