Bias Variance Tradeoff — Intuition and Examples

Jing Wang
Apr 26, 2022 · 6 min read


The bias-variance tradeoff is such an important concept in machine learning, and yet many posts get it wrong.

Let’s set it straight today.

Outline

  1. Noise and data generation
  2. Ideas behind the regression model
  3. Reducible and irreducible errors
  4. Model bias and variance
  5. The bias-variance tradeoff
  6. The splitting of data, and hyperparameter tuning
  7. The no-free-lunch theorem

1. Noise and data generation

Even if you understood the exact process by which the data were generated, repeated measurements would still vary. We treat this unexplained variation as noise. It is the aggregated contribution of many unknown and mostly independent sources, and is therefore approximately Gaussian distributed according to the central limit theorem. How mathematically convenient!

Let’s generate a sinusoidal curve, y = f(x), plus some additive Gaussian noise. Here we know the true relationship between the feature x and the corresponding output y: it is governed by a sine function.
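Here is a minimal sketch of that data-generating process. The sample size of 25 and a noise standard deviation of 0.5 (so the noise variance matches the irreducible error of 0.25 quoted below) are my assumptions, not necessarily what the original figures used.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The "true" relationship nature uses to generate the response.
    return np.sin(x)

def generate_data(n=25, noise_std=0.5):
    # Sample features uniformly, then add independent Gaussian noise to y.
    x = rng.uniform(0, 2 * np.pi, n)
    y = true_f(x) + rng.normal(0, noise_std, n)
    return x, y

x_train, y_train = generate_data()
```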

[Figure: noisy training samples scattered around the true sine curve]

2. Ideas behind the regression model

The goal is to find a mapping between the feature x and the response, y = \hat{f}_D(x), that best captures the patterns in the training data D. The first assumption to make, before any fitting, is the type of model (linear, polynomial, or neural network). A linear model is more constrained than a polynomial one in the sense that it is a special case of the polynomial model: set all the higher-order coefficients to zero, but not the other way around. For those used to thinking in Bayesian terms, the choice of linear versus nonlinear effectively imposes two different priors on the problem. Choose your prior wisely, because every step and everything that happens afterward will be decorating that prior belief.

Let’s try a linear and polynomial fit on the same training set.
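A sketch of the two fits, reusing x_train and y_train from the sketch above (the degree-15 polynomial is an assumed stand-in for the “overcomplicated” model, not necessarily the degree used in the original figure):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train = x_train.reshape(-1, 1)  # sklearn expects a 2-D feature matrix

# A heavily constrained model ...
linear_model = LinearRegression().fit(X_train, y_train)

# ... versus a very flexible one.
poly_model = make_pipeline(PolynomialFeatures(degree=15),
                           LinearRegression()).fit(X_train, y_train)
```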

[Figure: linear and polynomial fits to the same training set]

3. Reducible and irreducible error

The irreducible error, also known as the Bayes error, is the lowest possible prediction error one can achieve. In the example above, the best you can do is know the true relationship y = sin(x), and still you cannot overcome the inherently stochastic noise. Minimizing the reducible error is what we practice all the time in machine learning. Let’s use MSE as the evaluation metric; it can be written as the sum of the reducible and irreducible errors:
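For a single fitted model \hat{f}, averaging the squared error over the noise gives

MSE = E[(y - \hat{f}(x))^2] = (f(x) - \hat{f}(x))^2 + \sigma^2,

where the first term is the reducible error (it vanishes if we recover f exactly) and \sigma^2, the noise variance, is the irreducible error. In the running example: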

Linear MSE = 0.461
Poly MSE = 0.304
Irreducible = 0.250

4. Model bias and variance

We are talking about the model estimator’s bias and variance, which is very different from the bias and variance of a plain random variable.

The decomposition into model bias and variance is:
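E_D[(f(x) - \hat{f}_D(x))^2] = (f(x) - E_D[\hat{f}_D(x)])^2 + E_D[(\hat{f}_D(x) - E_D[\hat{f}_D(x)])^2] = Bias^2 + Variance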

Here E_D[ ] means the expectation over training data sets. The Bayes limit mentioned above can be reached if and only if we have exhaustive access to all the data (D) and pinpoint the exact data-generating process f(x). Given a partial observation (D1), we find a set of parameters that optimizes some objective under our prior knowledge (linear or nonlinear). Now imagine we are given (D2, D3, …): what would the model look like? Would the model predictions in those parallel universes be consistent?

Simulate it then.
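A sketch of those parallel universes, reusing generate_data from the first sketch (the 200 repeated draws and the degree-15 polynomial are assumed settings for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x_grid = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
truth = np.sin(x_grid.ravel())
linear_preds, poly_preds = [], []

for _ in range(200):
    x, y = generate_data()          # a fresh training set D_i
    X = x.reshape(-1, 1)
    linear_preds.append(LinearRegression().fit(X, y).predict(x_grid))
    poly_preds.append(make_pipeline(PolynomialFeatures(degree=15),
                                    LinearRegression()).fit(X, y).predict(x_grid))

linear_preds, poly_preds = np.array(linear_preds), np.array(poly_preds)

# Squared bias: how far the average prediction sits from the truth.
# Variance: how much the prediction wobbles from one universe to the next.
for name, preds in [("linear", linear_preds), ("poly", poly_preds)]:
    bias2 = ((preds.mean(axis=0) - truth) ** 2).mean()
    var = preds.var(axis=0).mean()
    print(f"{name:6s}  bias^2 = {bias2:.3f}   variance = {var:.3f}")
```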

[Figure: linear and polynomial fits across many simulated training sets]

It’s obvious that the dominant error in the linear case comes from the systematic deviation from the truth (large bias), while the dominant error in the polynomial case comes from unreliable predictions that are highly sensitive to the training data (large variance).

5. The bias-variance tradeoff

Our choice of model has to respect the nature of the data. A linear model is too simple to capture the full range of dependency. An overcomplicated model is highly sensitive to the training set and thus can hardly generalize to new data.

Let’s examine the bias and variance for models with varying complexity.
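A sketch of that sweep, reusing generate_data from the first sketch (the list of degrees and the 200 repeats are assumptions for illustration); typically the squared bias falls and the variance rises as the degree grows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x_grid = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
truth = np.sin(x_grid.ravel())

for degree in [1, 2, 3, 5, 9, 15]:
    preds = []
    for _ in range(200):
        x, y = generate_data()      # defined in the first sketch
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x.reshape(-1, 1), y).predict(x_grid))
    preds = np.array(preds)
    bias2 = ((preds.mean(axis=0) - truth) ** 2).mean()   # squared bias
    variance = preds.var(axis=0).mean()                  # model variance
    print(f"degree={degree:2d}  bias^2={bias2:.3f}  variance={variance:.3f}")
```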

6. The splitting of train, validation, and test set

A simple model has a large bias, while an over-complicated model has a large variance. The sweet spot is somewhere in the middle. Parameters that control model complexity belong to the hyperparameters. Hyperparameter is an umbrella term that covers almost everything tunable outside the model’s learned parameters (the regularization strength lambda, batch size, learning rate, number of layers in a neural network, etc.). How do we select appropriate hyperparameters? Validate it!

Here comes the full recipe.

  • The data: Split the data set into training, validation, and test sets, for instance 60%, 20%, and 20%.
  • Hyperparameters: Construct a range of models with different hyperparameters, e.g. different levels of complexity.
  • Optimization: Train each model on the training set.
  • Model selection: Evaluate the models and select the appropriate hyperparameters. This is done on the validation set, or with methods such as k-fold cross-validation, to obtain a fair comparison across models.
  • Performance: Judge the final model on the test set. This should be a fair evaluation, since the model has never seen the test data.
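Here is a minimal sketch of this recipe, reusing generate_data from the first sketch; the 60/20/20 split and the choice of polynomial degree as the only hyperparameter are my assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x, y = generate_data(n=200)
X = x.reshape(-1, 1)

# The data: 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Hyperparameters + optimization + model selection:
# pick the degree with the lowest validation error.
best_degree, best_val_mse = None, np.inf
for degree in range(1, 16):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_degree, best_val_mse = degree, val_mse

# Performance: one final estimate on the previously untouched test set.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final_model.fit(X_train, y_train)
print("best degree:", best_degree,
      "| test MSE:", mean_squared_error(y_test, final_model.predict(X_test)))
```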

Speaking in Bayesian terms, the validation set (also called the dev set or hold-out set; the naming is not well standardized) is for adjusting the prior belief (what type of model, and how complex it should be), while the training set is for finding the posterior by optimizing some objective with respect to the parameters. The test data deliver the final verdict on our choice of prior (a linear regressor, a tree, or a neural network model).

7. The no-free-lunch theorem — the limitation of machine learning algorithms forced to generalize to unseen data

People keep quoting the no-free-lunch theorem in various contexts. I recently made the connection with the overfitting/underfitting of ML models. The theorem states that any two optimization algorithms are equivalent when their performance is averaged across all possible problems (Wolpert and Macready, 1997). So if one model is superior to another, the claim should be limited to the specific data sets encountered so far. Inferring universal rules from a finite set of examples cannot be justified by logic alone.

Take home messages:

  • Data are finite and noisy.
  • All models are wrong.
  • Keep the tradeoff in check when designing a model to cope with the two issues above.
  • The tradeoff is about the model, not the data.
  • The validation set is for hyperparameter (model prior) tuning, and the training set is for parameter (model posterior) tuning.

Reference:

My favorite article on this topic: “Probabilistic machine learning and artificial intelligence” by Zoubin Ghahramani (Nature, 2015).
