MACHINE LEARNING

How Bias and Variance Affect a Machine Learning Model

Understand these two must-know topics when building an algorithm

Ismael Araujo
The Startup


Every day, 2.5 quintillion bytes of data are created; between 2016 and 2018, about 90% of all the data in the world was generated. This immense volume is the primary motivation for companies to invest in gathering data, which they can then turn into new and improved products. With so much data available, it's easy to miss the basics when building a predictive model. In this article, let's go back to some of the fundamentals of machine learning: what the bias and variance tradeoff is and how we can manage it.

Bias and Variance Tradeoff

In machine learning, bias is an algorithm's tendency to consistently learn the wrong thing by ignoring relevant information in the data. High bias results from the algorithm missing important connections between the features and the target variable. Picture throwing darts at a board: high bias means we consistently miss the target in the same direction. The model doesn't understand what it should be aiming for, and its predictions are systematically off-base.

Variance refers to an algorithm's sensitivity to small changes in the training set. High variance results from the algorithm fitting the random noise in the training data. Back to the dartboard: high variance means the throws are imprecise and scattered all over the board, like someone playing darts for the first time. A high-variance model is highly tuned to the data it has seen, so it can nail predictions for samples that closely resemble the training set, but it does poorly on examples that don't.

Both bias and variance are connected to the model’s complexity. Low complexity means high bias and low variance. Increased complexity means low bias and high variance.
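To make this relationship concrete, here is a minimal sketch (using NumPy and synthetic data invented for illustration) that fits polynomials of increasing degree to noisy samples of a sine curve. Training error keeps falling as complexity grows, while error on held-out points typically starts rising once the fit begins chasing noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy samples of a sine curve (made up for this sketch)
x_train = np.linspace(0, 3, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.2, size=20)
x_test = np.linspace(0.05, 2.95, 20)
y_test = np.sin(x_test) + rng.normal(0, 0.2, size=20)

train_err, test_err = {}, {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # more degree = more complexity
    train_err[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

# Training error can only go down as the model gets more complex,
# but test error eventually tells a different story.
print("train:", train_err)
print("test: ", test_err)
```

The degree-1 line is too rigid (high bias); the degree-12 polynomial bends to every noisy point (high variance); degree 3 sits in between.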

Underfitting and Overfitting

Underfitting happens when a model is too simple (high bias, low variance) and cannot capture the trend in the data, which results in a high total error. The model can't pick up what is going on in the training set and ignores essential pieces of information that would help it make accurate predictions. In the graphic below, the model doesn't capture the cluster at the bottom left of the plot (the circles); it only fits a straight line.

Overfitting is the opposite extreme. It happens when the model is very complex and fits too closely to a limited set of data. In this case, the model memorizes the examples it has seen, so it can make highly accurate predictions on the training set. However, when applied to a different data set, it will likely predict poorly because it didn't learn how to handle new patterns; it just memorized examples. This results in low bias and high variance.
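As a toy illustration of pure memorization (a deliberately silly "model" written for this article, not a real algorithm), consider a lookup table that stores every training example. It is perfect on the training set and useless on anything it has not seen:

```python
class MemorizingModel:
    """A 'model' that only memorizes: zero training error, no generalization."""

    def __init__(self):
        self.lookup = {}

    def fit(self, X, y):
        # "Training" is just storing every example verbatim.
        self.lookup = dict(zip(X, y))

    def predict(self, x):
        # Perfect on seen inputs, clueless on unseen ones.
        return self.lookup.get(x, None)


model = MemorizingModel()
model.fit([1.0, 2.0, 3.0], ["cheap", "cheap", "expensive"])

print(model.predict(2.0))  # "cheap" -- seen during training
print(model.predict(2.5))  # None -- even a tiny change breaks it
```

Real overfit models fail less dramatically than this, but the failure mode is the same: accuracy that comes from memory rather than from learned patterns.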

Finding the Optimal Tradeoff

Looking at the graphic below, we can see underfitting on the left, where we have high bias and low variance. On the right, we have low bias and high variance. The goal is to find a model that fits somewhere in the middle of the graphic, at the Optimum Model Complexity, where we achieve the Minimum Total Error. We need a model of medium complexity: one that can identify the patterns in the training data without memorizing every example. If you have low error on both the training and test sets, you have a good model. However, if the test error is high, look at the training error to understand whether you are overfitting or underfitting, and work from there to improve your model.
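That diagnostic can be written down as a small rule of thumb. The thresholds below are arbitrary placeholders for illustration, not standard values; in practice you would set them per problem:

```python
def diagnose(train_error, test_error, gap_tol=0.05, error_tol=0.10):
    """Rough bias/variance diagnosis from train vs. test error.

    Thresholds are illustrative only and should be tuned per problem.
    """
    if test_error <= error_tol:
        return "good fit"
    if test_error - train_error > gap_tol:
        # Low training error but much worse test error: high variance.
        return "overfitting"
    # Both errors high and close together: high bias.
    return "underfitting"


print(diagnose(0.02, 0.30))  # overfitting
print(diagnose(0.28, 0.30))  # underfitting
print(diagnose(0.04, 0.05))  # good fit
```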

Hyperparameter tuning

First, let's understand the difference between a parameter and a hyperparameter. A model parameter is a configuration variable internal to the model whose value is estimated from the data; parameters are what the model uses to make predictions. A hyperparameter is a configuration external to the model whose value cannot be estimated from the data; instead, it guides how the algorithm learns the parameter values.

For example, imagine that you are trying to predict whether a house costs more or less than $300K using a decision tree, based on features such as square footage, location, number of bedrooms, zip code, and whether it's close to the water. The split rules the tree learns from these features are its parameters. Now, an overly complex tree could overfit our model, so we can use hyperparameters to limit the number of features to consider and the maximum depth of the tree. In the graphic below, we can see that when we set the tree depth to 3, our model reaches the lowest test error.
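If scikit-learn is available (an assumption on my part; the article itself doesn't name a library), this kind of tuning can be sketched as follows, using synthetic data as a stand-in for the house example and trying a few values of the `max_depth` hyperparameter:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the house-price data described above.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Try a few values of the max_depth hyperparameter and keep the one
# with the best validation accuracy.
scores = {}
for depth in (1, 3, 5, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)  # learns the parameters (the split rules)
    scores[depth] = tree.score(X_val, y_val)

best_depth = max(scores, key=scores.get)
print(scores, "best depth:", best_depth)
```

The same loop generalizes to any hyperparameter; scikit-learn's `GridSearchCV` automates it with cross-validation instead of a single validation split.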

Regularization

Regularization is a technique used to reduce overfitting by discouraging overly complex models. The main goal is to allow the algorithm enough flexibility to learn the patterns in the data while preventing the model from becoming more complex than it needs to be and memorizing the training data. For example, we can use Ridge or Lasso regression, which add a penalty to the loss function to constrain the coefficients.
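As a minimal sketch of the Ridge idea, here it is implemented by hand with NumPy on made-up data (in practice you would use a library implementation). The L2 penalty visibly shrinks the learned coefficients toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with a few noisy features (invented for this sketch).
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(0, 0.5, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X^T X + alpha*I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_plain = ridge_fit(X, y, alpha=0.0)    # alpha=0 is ordinary least squares
w_ridge = ridge_fit(X, y, alpha=100.0)  # a heavy penalty shrinks the coefficients

# The penalty pulls the coefficient vector toward zero, constraining complexity.
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

Lasso uses an L1 penalty instead, which can push some coefficients exactly to zero and thereby drop features entirely.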

We went over some of the most important model-optimization concepts in one of the most interesting fields today. To learn more key concepts, I highly recommend the free Machine Learning Crash Course offered by Google.
