The Trade-Off that Plagues all of Machine Learning

Eashan Kaushik
NYU Data Science Review
Jan 25, 2022

Figure 1: Blue points represent training data, and red points represent unseen data for two variables x and y (Photo: Eashan Kaushik)

One of the first concepts I encountered when I started my journey in Data Science was the bias-variance trade-off. This trade-off is what makes supervised learning algorithms difficult to train, and you have likely come across the term at least once while developing Machine Learning models. In essence, you want to minimize both bias and variance. Sounds easy enough, right? It's a little more complicated than it sounds. This article will help you build an intuition for the trade-off between bias and variance while helping you better understand these terms. We will also see how bias and variance are affected by the regularization parameter, the amount of training data, and polynomial features (the degree of the polynomial). At the end of this blog, I will also give you some pointers that I keep in mind while developing a balanced model.

Introduction

Ideally, we want a model that accurately captures the trend in the training data but also generalizes well to test, or unseen, data. This is easier said than done. In reality, we see the following trends while training a model:

  1. When we try to accurately capture the trends in the training data (Figure 1-C), we are not able to generalize well to unseen data (Figure 1-F).
  2. When we develop a model to generalize well to unseen data (Figure 1-D), we end up not capturing the trends in the training data (Figure 1-A).

Consequently, the model represented by Figures 1-C and 1-F has a low bias but high variance, and the model represented by Figures 1-A and 1-D has a high bias but low variance. Simply put, when we try to minimize bias we end up increasing variance and when we try to minimize variance we end up increasing bias. This is what we call the bias-variance trade-off.

Let’s talk about bias and variance with respect to the complexity of the model.

  1. When we have a simple model (Fig 1-A and 1-D), we can see that the training error is high, as the model fails to capture the trends in the training data — high bias. However, the model will have a lower error rate on unseen data, as it is still able to generalize decently well — low variance. In Machine Learning jargon, we say that the model is underfitting the data.
  2. When we have a complex model (Fig 1-C and 1-F), we can see that training error is low as the model perfectly captures the trends in training data — low bias. However, the model will have a much higher error rate on unseen data, as it is not able to generalize well to unseen data — high variance. This is the case of the model overfitting the data.

More formal definitions of the bias and variance are as follows:

Bias: How much the average value of the estimate differs from the true function.

Variance: How much the estimate varies around its average.
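For squared error, these two quantities combine additively in the expected error on unseen data. Writing f(x) for the true function, f̂(x) for the model fit on a random training set, and σ² for the irreducible noise, the standard decomposition is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
+ \sigma^2
```

The expectation is over random draws of the training set, which is why bias and variance are properties of the learning procedure, not of any single fitted model.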

We can visualize the relationship as follows:

Figure 2: The red line represents the true function; the blue line is the model we have trained on data. The greater the length of the dashes, the higher the variance. (Photo: Sundeep Rangan, ECE, NYU; Yao Wang, ECE, NYU; Alyson K. Fletcher, Statistics/ECE, UCLA)

The next question that has probably popped into your mind is: what can we do about it? Sadly, not much. Converging to a model with both low bias and low variance is very challenging. What we can do is identify a model with fairly low bias and fairly low variance. The model described in Figures 1-B and 1-E is an example of a model with "optimal" bias and variance. In Figure 3 you can see that the optimal model has neither the lowest bias nor the lowest variance, but meets some sort of middle ground.

Figure 3: Optimal Model for Bias Variance Trade-Off. It should also be noted that the region to the left of optimal capacity is underfitting and the region to the right of optimal capacity is overfitting (Photo: Eashan Kaushik)

Relationship with the Degree of Model

Figure 4: Blue line represents training error, and the red line represents testing error or cross-validation error (Photo: Eashan Kaushik)

As the degree of the model increases, the model becomes more complex. As a result, it fits better to the training data and the training error decreases as the degree of the model increases.

However, as the complexity of the model increases, the error rate on unseen data first decreases and then increases. The optimal degree is the one that corresponds to the minimum of the unseen-error curve. Region (a) in the above graph (Figure 4) corresponds to high bias, and region (b) corresponds to high variance.

The optimal degree of the model can be found using cross-validation: fit a model for each candidate degree, score each one on held-out folds, and pick the degree with the lowest cross-validation error.
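One way to run that search, sketched here with scikit-learn (the synthetic cubic data and the degree range 1 through 9 are illustrative assumptions, not the article's original template):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy cubic trend
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.normal(scale=2.0, size=60)

# Score each candidate degree with 5-fold cross-validation
scores = {}
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    ).mean()
    scores[degree] = cv_mse

# The degree with the lowest cross-validation error is the "optimal" one
best_degree = min(scores, key=scores.get)
print(best_degree)
```

Note that the training error alone would keep falling as the degree grows; it is the cross-validation error that turns back up and reveals the optimum.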

Relationship with Regularization Parameter

A higher regularization parameter (lambda, λ) means a simpler model, and as a result the training error is high; as λ decreases, the model becomes more complex and the training error falls.

When the regularization parameter is low, the error rate on unseen data is high; it starts to decrease as λ increases. After a certain point, the error rate on unseen data starts to increase again. Try to find the high-bias and high-variance regions in the following graph.

Figure 5: The blue line represents training error, and the red line represents testing error or cross-validation error (Photo: Eashan Kaushik)

Region (a) in the above graph (Figure 5) corresponds to high variance, and region (b) corresponds to high bias.
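A sketch of that λ sweep using scikit-learn's Ridge regression (the synthetic sine data, degree-9 features, and λ grid are assumptions for illustration; note that scikit-learn calls λ `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic noisy sine data, split into seen and unseen portions
rng = np.random.RandomState(1)
X = np.sort(rng.uniform(-3, 3, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

train_err, test_err = [], []
for lam in [1e-4, 1e-2, 1.0, 1e2, 1e4]:  # low -> high regularization
    model = make_pipeline(
        PolynomialFeatures(9), StandardScaler(), Ridge(alpha=lam)
    )
    model.fit(X_tr, y_tr)
    train_err.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_err.append(mean_squared_error(y_te, model.predict(X_te)))

# Training error only grows as lambda grows; test error is U-shaped,
# high at both extremes (overfitting, then underfitting)
```

Plotting `train_err` and `test_err` against the λ grid reproduces the shape of Figure 5.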

Relationship with Amount of Training Data

The relationship of bias and variance with the amount of training data (m) is a bit harder to grasp. In this case, we look at three different graphs, also called learning curves.

Figure 6-A is the case where we have high bias in the model, Figure 6-B is the case where we have high variance in the model, and Figure 6-C is the case where we have a middle ground between bias and variance.

When the dataset is small, even a simple model can fit it fairly closely, so the training error starts low. As more data is added, both the training and test errors converge and plateau at a high value. This means that getting more data will not help us tackle high bias.

Figure 6(A): High Bias — blue line represents training error, and the red line represents testing error or cross-validation error (Photo: Eashan Kaushik)

On the other hand, if the model is overfitting, the training error curve will remain well below the testing error curve and may not have plateaued yet; a training curve that has not plateaued suggests that collecting more data will improve model performance.

Figure 6(B): High Variance: blue line represents training error, and the red line represents testing error or cross-validation error (Photo: Eashan Kaushik)

In the last case, we can see that the training and testing curves are close to each other and both have comparatively low error. This is the case of an optimal model.

Figure 6(C): Optimal Model: blue line represents training error, and the red line represents testing error or cross-validation error (Photo: Eashan Kaushik)

Learning curves are generally drawn to help us understand what really is happening with our model. We try to identify if the model is overfitting or underfitting the data.
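Curves like those in Figure 6 can be generated with scikit-learn's `learning_curve` helper (a minimal sketch; the synthetic dataset and the degree-4 model are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(2)
X = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Refit and score the model at five increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    make_pipeline(PolynomialFeatures(4), LinearRegression()),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)  # the blue curve
val_mse = -val_scores.mean(axis=1)      # the red curve
```

Plotting `train_mse` and `val_mse` against `sizes` gives the learning curve; whether the two curves converge high, stay far apart, or meet at a low error tells you which of the three cases above you are in.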

Conclusion

Now that you understand how to account for the bias-variance tradeoff to create an optimal model for your problem statement, I would like to give you a few brief pointers that I keep in mind when I am tackling high bias or high variance.

High Bias

  • Decrease the value of the regularization parameter.
  • Add more polynomial features (degree of the model can be selected using cross-validation).
  • Add more relevant features to your dataset (relevant features can be found using Lasso Regression).

High Variance

  • Increase the value of the regularization parameter.
  • Add more training data.
  • Remove irrelevant features (irrelevant features can be found using Lasso Regression).
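The Lasso tip in both lists can be sketched as follows (synthetic data; the α value is an assumption that would normally be tuned by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Five candidate features, but only the first two actually drive y
rng = np.random.RandomState(3)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty shrinks irrelevant coefficients toward exactly zero
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y)
print(lasso.coef_)  # near-zero weights flag irrelevant features
```

Features whose Lasso coefficients survive are candidates to keep (the high-bias case); features driven to zero are candidates to drop (the high-variance case).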

References:

[1] Andrew Ng, Machine Learning by Stanford University, Coursera

[2] Josh Starmer, Machine Learning Fundamentals: Bias and Variance, YouTube

[3] Sundeep Rangan, ECE, NYU; Yao Wang, ECE, NYU; Alyson K. Fletcher, Statistics, ECE, UCLA, NYU ECE GY-6143 Machine Learning, GitHub

[4] Jason Brownlee, Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning, Machine Learning Mastery
