
Bias-Variance Trade-Off 101

by Tess Leggio

Opex Analytics
6 min read · Aug 23, 2019


The Opex 101 series is an introduction to the tools, trends, and techniques that will help you make the most of your organization’s data. Intended for business leaders and practitioners alike, these posts can help guide your analytics journey by demystifying essential topics in data science and operations research.

Imagine that you’ve developed a model that predicts scores of baseball games. You’ve fed it every stat you could get your hands on from last season’s data, and the early results are promising: it has near-perfect performance on key evaluation metrics during model training! Excited by these early returns, you test it out on tonight’s slate of games to see how many winners you successfully predict… but you totally strike out (pun deeply intended).

How did this happen? To answer that, it’s helpful to understand the trade-off between the two main types of reducible model error: bias and variance.

Let’s Talk About Errors

Model error can actually be broken down into three distinct parts: irreducible error, error due to bias, and error due to variance.

Let’s tackle these one by one.

Irreducible Error

To fully understand bias and variance, we have to contrast them with irreducible error, or error resulting from noise in the dataset. This noise can arise from a missing input variable (e.g., perhaps your data lacked information on player injuries, which certainly impact baseball scores), measurement errors, or related issues intrinsic to the phenomenon being modeled. True to its name, this error unfortunately can’t be eliminated through good modeling practices.

Mathematically, the irreducible error (e) is the difference between the measured response (Y) and the value predicted by the true relationship (𝑓(𝑥)): e = Y − 𝑓(𝑥).

The remaining reducible error consists of errors due to bias and variance.

Bias

Bias is the inability of a model to learn enough about the relationship between the predictors X and the response Y that’s implied by the dataset. A biased model will, across multiple training sets, systematically overestimate the response in some regions and underestimate it in others. Models with high bias underfit the data, oversimplifying the relationship between the predictors and the response and producing high training error.

Let’s take a toy version of our baseball problem where the goal is to predict a team’s runs scored in a given game using their overall batting average. Suppose that a team’s number of runs scored increases with its batting average, but gradually plateaus. Fitting a linear model to estimate the curved “true” relationship between batting average and runs will result in a biased model, because the model oversimplifies the true relationship with a straight line. At some batting averages, the model will tend to underestimate a team’s runs, and at others, it will tend to overestimate them.
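To make this concrete, here’s a minimal sketch of the toy problem using NumPy and scikit-learn. The plateauing “true” relationship and all of its constants are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Invented "true" relationship: runs rise with batting average, then plateau.
def true_runs(avg):
    return 10 * (1 - np.exp(-15 * (avg - 0.2)))

avg = rng.uniform(0.2, 0.32, size=200)            # team batting averages
runs = true_runs(avg) + rng.normal(0, 0.5, 200)   # plus irreducible noise

linear = LinearRegression().fit(avg.reshape(-1, 1), runs)
pred = linear.predict(avg.reshape(-1, 1))

# A straight line through a curved relationship leaves systematic,
# region-dependent residuals: the signature of bias.
for lo, hi in [(0.20, 0.24), (0.24, 0.28), (0.28, 0.32)]:
    mask = (avg >= lo) & (avg < hi)
    print(f"avg in [{lo:.2f}, {hi:.2f}): mean residual = "
          f"{(runs - pred)[mask].mean():+.2f}")
```

The mean residual flips sign from one batting-average band to the next, which is exactly the overestimate-in-some-regions, underestimate-in-others pattern described above.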

Mathematically, the error due to bias is the difference between the expected value of the model’s prediction (E[f̂(𝑥)]) and the value predicted by the true relationship (𝑓(𝑥)): E[f̂(𝑥)] − 𝑓(𝑥).

Variance

By contrast, variance quantifies a model’s tendency to learn too much about the relationship that’s implied by the dataset. A high-variance model absorbs so much from its training data that it captures the randomness along with the signal; it fits the noise as well as the true relationship, a phenomenon known as overfitting. Overfit models unnecessarily complicate the relationship between the predictors and the response, and therefore tend to generalize poorly to other datasets, producing high test error (relative to the training error).

A good way to think about variance is that it represents a model’s lack of consistency across datasets. If you fit a high-variance model with different datasets, the errors may not tend to over- or underestimate the response on average, but the model’s error at a specific point may change drastically depending on the data used for training.
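You can watch this instability happen with a quick simulation. In the sketch below (synthetic data again), a deliberately flexible degree-9 polynomial stands in for a generic high-variance model:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_query = 0.5   # the point where we inspect the model's prediction
preds = []

# Fit the same flexible model on many independently drawn training sets.
for _ in range(200):
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(0, 0.3, 20)
    coeffs = np.polyfit(x, y, deg=9)        # high-variance model
    preds.append(np.polyval(coeffs, x_query))

preds = np.array(preds)
print(f"true value at x=0.5: {true_f(x_query):+.3f}")
print(f"mean prediction:     {preds.mean():+.3f}")
print(f"std of predictions:  {preds.std():.3f}")
```

On average the predictions hover near the true value (low bias), but any single trained model can land far from it; that spread across training sets is what the formula below measures.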

Given what we know about our baseball model, it seems likely that we overfit to last year’s results, which would explain why the model isn’t generalizing to this year’s outcomes.

Mathematically, the error due to variance is the expected squared difference between the model’s prediction from a single training dataset (f̂(𝑥)) and its average prediction across all training sets (E[f̂(𝑥)]): E[(f̂(𝑥) − E[f̂(𝑥)])²].
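Putting the three pieces together: for squared-error loss, the expected test error at a point decomposes exactly into these components (here σ² is the variance of the noise term e):

```latex
\mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```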

Making the Trade-Off

In an ideal scenario, we would be able to develop the perfect model using infinite training data, thereby eliminating all error due to bias and variance. In practice, however, we must make trade-offs between simplifying the modeled relationship (reducing variance, but potentially introducing bias) and trying to capture more of it (reducing bias, but potentially introducing variance). Finding the sweet spot helps us minimize the model’s total overall error.
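One rough way to find that sweet spot empirically is to sweep model complexity and watch training and test error diverge, as in the sketch below (synthetic data; polynomial degree stands in for complexity):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

# As complexity grows, training error keeps falling,
# but test error is U-shaped: it bottoms out at the sweet spot.
for deg in [1, 3, 5, 9, 15]:
    coeffs = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {deg:>2}: train MSE {mse_tr:.3f} | test MSE {mse_te:.3f}")
```

The degree that minimizes test error marks the best trade-off for this dataset: lower degrees are biased, higher degrees are high-variance.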

[Chart: bias, variance, and total error as a function of model complexity. Credit: Scott Fortmann-Roe]

This chart is a great way to visualize bias and variance in two dimensions. In higher dimensions, we diagnose bias and variance issues by comparing performance on the training and test datasets. For more information on training and test datasets, check out the video linked here.

Cheat Sheet

Bias and variance are ever-present problems for data scientists, but there are really only four possible combinations that can characterize a model. Here’s a quick reference to help you interpret your model’s performance and capture as much of the underlying relationship as you can.

LOW BIAS, HIGH VARIANCE

The model is overfitting the training data and therefore does not generalize well to unseen data.

Things to try: feature selection, dimensionality reduction, a simpler model, regularization, hyperparameter tuning, bagging and resampling techniques, training on a larger dataset
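As an illustration of the regularization option, here’s a minimal sketch contrasting an unregularized fit with scikit-learn’s Ridge; the penalty strength alpha is arbitrary here and would normally be tuned by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 30)

# Deliberately flexible features invite overfitting...
X = PolynomialFeatures(degree=9).fit_transform(x)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # ...which the penalty reins in

print("largest |coefficient|, unregularized:", np.abs(ols.coef_).max())
print("largest |coefficient|, ridge:        ", np.abs(ridge.coef_).max())
```

Shrinking the coefficients makes the fitted curve less sensitive to any one training set, which is precisely a reduction in variance.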

HIGH BIAS, LOW VARIANCE

The model is underfitting the data and does not capture enough of the true relationship between the predictors and the response.

Things to try: adding features, a more flexible model, hyperparameter tuning
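As an illustration, revisiting the toy runs-vs-batting-average problem from earlier: adding a squared term gives a linear model enough flexibility to bend toward the plateau (same invented “true” relationship as before):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
avg = rng.uniform(0.2, 0.32, (200, 1))
runs = 10 * (1 - np.exp(-15 * (avg - 0.2))).ravel() + rng.normal(0, 0.5, 200)

X_quad = PolynomialFeatures(degree=2).fit_transform(avg)  # adds an avg^2 term

line = LinearRegression().fit(avg, runs)
quad = LinearRegression().fit(X_quad, runs)

print("straight-line train MSE:", np.mean((line.predict(avg) - runs) ** 2))
print("quadratic train MSE:    ", np.mean((quad.predict(X_quad) - runs) ** 2))
```

The drop in training error comes from removing bias; the price, as always, is a little added variance.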

HIGH BIAS, HIGH VARIANCE

Yikes! Address the bias problem first, then move on to the variance problem if it’s still an issue.

Things to try: everything

LOW BIAS, LOW VARIANCE

Good job!

Putting It All Together

Understanding the trade-off between bias and variance can help you minimize overall error and train a model that generalizes appropriately. Identifying and reducing both bias and variance is vital to building a robust, high-performing model that solves your problem.

If you enjoyed this Opex 101 entry, check out our other posts on random forest models, Bayes’ Theorem, and multi-armed bandits.

If there’s a topic you’d like us to cover as part of Opex 101, let us know in the comments below!

_________________________________________________________________

If you liked this blog post, check out more of our work, follow us on social media (Twitter, LinkedIn, and Facebook), or join us for our free monthly Academy webinars.
