Understanding the Bias-Variance Tradeoff

Emily Strong · Published in The Data Nerd
5 min read · Jun 30, 2022

The Bias-Variance Tradeoff is a concept that is easy to skip over when first starting out in machine learning, and yet it’s one of the most fundamental principles of the field (and is an interview-question favorite).

To understand this tradeoff, we first need to understand model error.

Every machine learning model has error. (If your model has zero error, that is usually a red flag for data leakage.) The error on test data has three components:

  1. Noise: The random noise intrinsic to the data set.
  2. Bias: Systematic errors due to the simplifying assumptions of the model.
  3. Variance: How much the predictions vary between models trained on different samples of the training data.

The noise of the data is model independent and will always contribute to the total error.

Bias, on the other hand, is directly related to the algorithm used, though all machine learning algorithms have some. This is easy to conceptualize for linear regression, which makes the fundamental assumption of a linear relationship between the features and the target variable. However, many features do not have a linear relationship with the target, and unless an appropriate transformation is applied (e.g. converting to a log scale), a linear regression will underfit those features. Underfitting is the failure to learn the relationship between features and target.
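As a quick illustration, here is a minimal sketch (synthetic data, scikit-learn; the log-shaped relationship is assumed purely for the example) of a linear model underfitting a nonlinear feature until the feature is transformed:

```python
# A minimal sketch of a linear model underfitting a nonlinear relationship
# until the feature is log-transformed. Synthetic data, scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=(500, 1))
y = np.log(x).ravel() + rng.normal(0, 0.1, size=500)  # log relationship plus noise

# Fit on the raw feature: the linear assumption is violated, so the model underfits.
raw_r2 = LinearRegression().fit(x, y).score(x, y)

# Fit on the log-transformed feature: the relationship is now genuinely linear.
log_r2 = LinearRegression().fit(np.log(x), y).score(np.log(x), y)

print(f"R^2 on raw feature: {raw_r2:.3f}")  # lower: the linear form misses the curvature
print(f"R^2 on log feature: {log_r2:.3f}")  # close to 1.0
```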

Other, more complex algorithms have simplifying assumptions as well. For a decision tree, these often come from hyperparameters such as a minimum number of data points per leaf or a maximum depth. These hyperparameters improve how well the model generalizes to new data, but they do so by simplifying the model. A random forest adds further simplifications to the trees within it through sampling: each tree is trained on a bootstrapped sample of the data, and each split point can only consider a sample of the features. With less data available to the individual trees and split points, we get a simpler model that generalizes better but has more bias. Additionally, any use of regularization simplifies a model by constraining its parameters, which increases the bias. A ridge regression thus has greater bias than an ordinary linear regression.
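In scikit-learn terms, these are the hyperparameters being described (parameter names are sklearn's; the specific values below are illustrative, not recommendations):

```python
# A sketch of the bias-increasing hyperparameters discussed above.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Constraining the tree simplifies it: each setting trades variance for bias.
tree = DecisionTreeClassifier(
    max_depth=5,          # cap how complex the learned boundary can be
    min_samples_leaf=20,  # every leaf must generalize over at least 20 points
)

# The forest adds sampling on top of the per-tree constraints.
forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # each tree sees a bootstrapped sample of the rows
    max_features="sqrt",   # each split considers only a subset of the features
    min_samples_leaf=20,
)
```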

Variance represents the interaction between the algorithm and the data in the error. Any algorithm trained on a different sample of the training data will learn a different model, whether the difference is in the values of its parameters or, for a non-parametric model, in the stored data itself. However, some algorithms are more sensitive to fluctuations in the data and tend to overfit, learning the noise of the data instead of generalizing.
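Variance can also be measured directly: train the same algorithm on bootstrapped resamples of the training data and see how much its predictions disagree. A minimal sketch, using synthetic data and an unconstrained decision tree as a typical high-variance learner:

```python
# Measure variance empirically: train the same algorithm on bootstrapped
# resamples and compute how much its predictions vary. Synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)
x_test = np.linspace(-3, 3, 50).reshape(-1, 1)

preds = []
for seed in range(30):
    idx = np.random.default_rng(seed).integers(0, len(X), len(X))  # bootstrap sample
    model = DecisionTreeRegressor()  # unconstrained tree: high variance
    model.fit(X[idx], y[idx])
    preds.append(model.predict(x_test))

# Average, over the test points, of the prediction variance across models.
print("mean prediction variance:", np.mean(np.var(preds, axis=0)))
```

Rerunning with a constrained tree (e.g. `max_depth=3`) should show a noticeably smaller variance, at the cost of more bias.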

In the graph below, we can see the concepts of overfitting and underfitting illustrated. In this example, we have two classes with a parabolic decision boundary between them. Some of the data points don’t perfectly align with this boundary, falling on the wrong side of it.

Overfitting and underfitting
Adapted from: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg

An optimal model that generalizes well will learn the parabola despite the slight variations in some of the data.

An overfit model, that is, one with high variance, will learn to map every single data point, resulting in a roughly parabolic squiggle. With a slightly different sample of data, the shape of this squiggly line would change, so when making predictions on the test data, points near the decision boundary will likely have a higher error rate due to the poor generalization.

An underfit model, one with high bias, will learn the relationship it is designed to assume regardless of the actual shape of the decision boundary, resulting in a straight line that poorly fits the data.
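We can reproduce all three behaviors numerically. The sketch below is a regression analogue of the figure (a synthetic parabola plus noise, scikit-learn): a degree-1 polynomial underfits, degree 2 matches the true shape, and degree 15 overfits, scoring well on the training data but worse on the held-out split:

```python
# Underfitting vs. overfitting with polynomial regression on a
# quadratic ("parabolic") ground truth. Synthetic data, scikit-learn.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(80, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for degree in (1, 2, 15):  # underfit, well-specified, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree:2d}: train R^2 = {model.score(X_tr, y_tr):.3f}, "
          f"test R^2 = {model.score(X_te, y_te):.3f}")
```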

A common metaphor for thinking about bias and variance is how well the test predictions hit the accuracy bullseye. A low bias, low variance model will be tightly clustered around the center with a high accuracy. Increasing the variance increases the spread of the predictions (that is, the same test data will have variations in their predictions from different models). Increasing the bias on the other hand shifts the clustering of the predictions off-center. When we have both high variance and high bias, the model performs quite poorly.

In the bullseye visualization below, we would say that the high variance low bias model is overfitting, while the high bias low variance model is underfitting, and the high variance high bias model is just plain bad.

Bias versus variance bullseye
Image credit: http://scott.fortmann-roe.com/docs/BiasVariance.html

The concepts of bias and variance come together to give us the Bias-Variance Tradeoff. This is the idea that as model complexity increases, so does the variance, while the bias decreases. We can plot this as the relation between model complexity and model error:

Diagram of model complexity vs error
Image credit: http://scott.fortmann-roe.com/docs/BiasVariance.html

The total error of the model is the sum of the variance, the square of the bias, and the error due to noise. At the extreme high and low ends of model complexity we have high total error due to the high variance or high bias. The goal of selecting the optimal model for a problem is to find one that minimizes the total error.
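For squared-error loss, this decomposition can be written formally, with $f$ the true function, $\hat{f}$ the model learned from a random training sample (the expectations are taken over those samples), and $\sigma^2$ the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Noise}}
```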

In real-world settings, high variance models are rarely used directly because of their tendency to overfit. Instead, we use them as the individual learners within boosting and bagging algorithms, which reduce the variance (and increase the bias) to give much lower total error.
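As a concrete sketch of the bagging case (synthetic classification data, scikit-learn; exact accuracies will vary), wrapping a high-variance decision tree in a bagging ensemble typically improves cross-validated accuracy over the single tree:

```python
# A high-variance learner used inside a bagging ensemble: averaging over
# bootstrapped trees reduces variance. Synthetic data, scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)  # high variance on its own
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print("single tree accuracy:", cross_val_score(single_tree, X, y).mean())
print("bagged trees accuracy:", cross_val_score(bagged, X, y).mean())
```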

High bias models like logistic or ridge regression, on the other hand, are common. Why is that? For a few reasons: first, their simplifying assumptions make them faster to train. Second, their predictions are easier to interpret and explain. Third, they are more robust to variance when there isn't much data available for initial training, particularly if the regularization strength is high. And finally, they are more robust to drift in the data over time. These considerations can make high bias models useful in practice, though they still need to have acceptable performance.

The Bias-Variance Tradeoff is thus a useful concept to factor into what algorithms you consider when tackling a problem.

The Bias-Variance Tradeoff and other key concepts for working with models in real-world settings are covered in my Machine Learning Flashcards: Modeling Core Concepts deck. Check it out on Etsy!
