The What, Why and How of Bias-Variance Trade-off

Jaydeep Hardikar · Published in The Startup · Jun 28, 2019 · 10 min read

Building an effective Machine Learning model is all about striking the right balance between Bias (underfitting) and Variance (overfitting). But what are Bias and Variance? What do they mean intuitively? Let's take a step back, understand the terms Bias and Variance at a conceptual level, and then relate these concepts to Machine Learning.

What exactly are Bias and Variance?

Let's say that we want to predict the rent of a two-bedroom apartment in the Bay Area in California. Let's pretend to be a novice and suppose that we have no idea how expensive apartments in the Bay Area are. Being a novice, I make a very simplistic assumption that rent in the Bay Area will be similar to the average rent across California. I take the rents of 100 apartments across California and use their average to predict the rent of a two-bedroom apartment in the Bay Area. The problem is that, since I have taken 100 apartments from all over California, the sample may include apartments in expensive areas such as San Francisco, the South Bay and Los Angeles, moderately expensive areas such as the East Bay, and even cheaper localities such as Bakersfield. Since I made the very simplistic assumption that the rent of a two-bedroom apartment in the Bay Area will be similar to the average rent across California, my prediction will not be accurate. The assumption I made gave me overly generic results. This excess generalization is nothing but High Bias. Thus, when we make a very simplistic assumption, we introduce High Bias, which leads to generic results. This is nothing but Underfitting.

Conversely, if I try to be very specific and take the average rent of 100 apartments in one particular locality, say the East Bay, then I get results that represent rents in the East Bay well, but they will not represent rents in other parts of the Bay Area such as the South Bay or San Francisco. By being very specific, I am introducing High Variance into my model. If I have to predict the rent of an apartment in the East Bay, I can safely use the average I calculated, but if someone asks me to predict the rent of an apartment in San Francisco, I simply cannot rely on my model (the average rent of 100 East Bay apartments), because my data contained rents only from the East Bay. Thus my model fails if the query data point (test data) changes from the East Bay to San Francisco. This is nothing but Overfitting.

Why is there a trade-off between Bias and Variance?

From the above examples, it is clear that Bias means being more generic and Variance means being more specific, and being generic and being specific are exact opposites of each other. If I make a very simplistic assumption and take the average of rents across all of California, my model is too generic and cannot accurately predict rents in very expensive or very cheap areas. On the other hand, if I take the average of rents only from the East Bay, my model is too specific and cannot be applied to predict rents in other areas. The ideal approach is to strike a balance and take the average rent of two-bedroom apartments across the whole Bay Area.

How to strike a balance between Bias and Variance?

Most of us know that Regularization is the way to control Bias and Variance, but instead of jumping directly into the mathematics of Regularization, let's first understand what happens when we apply it. Regularization essentially reduces the complexity of the model, either by getting rid of complex features or by reducing their importance. Mathematically, the term complexity refers to features represented by higher-degree polynomial terms in the Regression equation. For example, if the price of a house is based on four features, Location (X1), Number of bedrooms (X2), Year of construction (X3) and Nearby school ranking (X4), then to predict the price of the house we may end up using some function similar to the one shown below.
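One illustrative form of such a function (the weights here are placeholders, with the 1.4 and 4.5 mentioned below attached to the higher-degree terms) is:

$$\text{Price} = W_0 + W_1 X_1 + W_2 X_2 + 1.4\, X_3^{2} + 4.5\, X_4^{3}$$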

When regularization is applied, the regularization term reduces the importance of features, especially the higher-degree polynomial terms built from X3 and X4. Mathematically, reducing the importance of the higher-degree terms is nothing but reducing their weights (1.4 and 4.5 in the function above). Reducing the importance of the higher-degree terms makes the curve relating a feature to the predicted variable smoother, thereby increasing generalization. This is depicted in the diagram below.

As shown in the graphs above, without regularization the model has tried to accommodate almost every data point in the training data, which means the model is being too specific. It also implies that the model has a very low training error, because almost all of the points fall on the fitted curve. However, if this model is tested on the test data, many points may not fall on the curve, because the model is not generic enough. Therefore, a small change in the input data (i.e. the difference between train and test data) will hurt the performance of the model. Thus, the model has high variance and will overfit (low training error and high test error).

With regularization, the model skips 2–3 points in the training data, but it is generic enough to accommodate the differences between the train and test data. Therefore, by using regularization, we have made the model more generic, which prevents overfitting.
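As a concrete (if synthetic) illustration, here is a minimal sketch using scikit-learn, which is not part of the original example: a degree-15 polynomial fit with and without an L2 penalty. The exact numbers will vary, but the unregularized fit typically shows a much larger gap between training and test error.

```python
# Minimal sketch: an over-flexible polynomial fit with and without L2 regularization.
# Synthetic data; numbers are illustrative only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)      # noisy target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# High-degree polynomial, no regularization: very low train error, high test error
plain = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), LinearRegression())
# Same features with an L2 penalty: slightly higher train error, much better test error
ridge = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), Ridge(alpha=1.0))

for name, model in [("no regularization", plain), ("ridge", ridge)]:
    model.fit(X_train, y_train)
    print(name,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train)),
          "test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```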

How does regularization reduce the weights?

Now that we have understood how regularization generalizes the model, let's see mathematically how regularization reduces the weights of the features.

To understand how regularization works, let’s take into consideration the optimization objective of Linear Regression.
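Using the notation defined below, this (L2-regularized) objective can be written as:

$$W^* = \arg\min_{W} \; \sum_{i=1}^{n} \big(Y_i - f(X_i)\big)^2 \; + \; \lambda \sum_{j=1}^{m} W_j^2$$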

Let's understand this equation in detail:

W* is the set of weights that we want to find as the result of the optimization problem. Here, the optimization problem is to find the weights W that minimize the objective above.

n is the total number of datapoints

f(Xi) is the predicted value of Yi

Yi is the actual (observed) value for the i-th data point

m is the total number of features

Wj is the weight of the j-th feature

λ is the hyperparameter to control the Bias-Variance tradeoff

From the previous discussion we understood that, to reduce the variance, we have to reduce the weights. Now, the question is: how does this equation, and especially the regularization term

$$\lambda \sum_{j=1}^{m} W_j^2$$

reduce the weights of the features? As the optimization equation shows, the objective is to minimize the sum of the squared error and the regularization term. Now, if the value of λ is very high, the optimization algorithm (Gradient Descent) has to reduce the values of the Wj further in order to keep the value of the product λ·ΣWj² small.

Thus, the greater the value of λ, the smaller the weights and the lower the importance of the features, which lowers the variance (and raises the bias). Conversely, the smaller the value of λ, the larger the weights and the greater the importance of the features, which raises the variance. Thus, λ acts as a hyperparameter to control the Bias-Variance trade-off.
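To see this effect of λ numerically, here is a minimal sketch using scikit-learn's Ridge (where λ is called alpha) on synthetic data; the specific values are illustrative, not part of the original derivation.

```python
# Minimal sketch: larger lambda (alpha) -> smaller learned weights.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>8}: mean |weight| = {np.abs(model.coef_).mean():.3f}")
# As alpha grows, the average magnitude of the weights shrinks,
# giving a simpler, more biased (less variant) model.
```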

Please note that the above example uses a Regression model, but regularization can be applied to Classification models as well.

L1 and L2 Regularization

In the above equation we have used the regularization term

$$\lambda \sum_{j=1}^{m} W_j^2$$

This is L2 regularization, or Ridge regularization. We can also use L1 regularization, or Lasso regularization, which adds the absolute values of the weights instead of their squares. With L1 regularization, the above equation becomes

$$W^* = \arg\min_{W} \; \sum_{i=1}^{n} \big(Y_i - f(X_i)\big)^2 \; + \; \lambda \sum_{j=1}^{m} |W_j|$$

Although the basic logic of reducing the weights is the same in both L1 and L2 regularization, there are many differences between them. I will not go into these differences in detail in this blog, but one major difference worth mentioning is that L1 regularization drives the weights of unimportant features to exactly 0, while L2 regularization makes them very small but not necessarily 0.
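A quick way to see this difference is to fit both on the same data. The sketch below (scikit-learn on synthetic data, purely for illustration) counts how many weights each method drives to exactly zero.

```python
# Minimal sketch: L1 (Lasso) zeroes out useless weights, L2 (Ridge) only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("weights set exactly to 0 by Lasso:", np.sum(lasso.coef_ == 0))
print("weights set exactly to 0 by Ridge:", np.sum(ridge.coef_ == 0))
```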

Is regularization the only way to control the Bias-Variance trade-off?

Regularization is certainly one of the most important levers for controlling the Bias-Variance trade-off. However, it is not the only one. Here are some other ways to control Bias and Variance.

Features — Having too many features may introduce high variance and result in overfitting. Although regularization helps reduce the importance of features, it is always a good idea to review feature importance using Exploratory Data Analysis (EDA) or domain knowledge. If a feature adds little value in predicting the dependent variable, it makes sense not to use it (a quick check of feature usefulness is sketched below). On the other hand, if the model is underfitting, it means the model is not learning enough to fit a sufficient number of points with the hyperplane it finds; in that case, more features should be added to the model.
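As a rough sketch of such a review, one possible approach (not prescribed in the original post) is to rank features by mutual information with the target using scikit-learn, here on a bundled example dataset.

```python
# Minimal sketch: rank features by mutual information with the target.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import mutual_info_regression

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

scores = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False))
# Features with very low scores are candidates to drop,
# after confirming with domain knowledge.
```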

Randomization — Introducing randomization into the data fed to a Machine Learning model is another technique to control the Bias-Variance trade-off. It is mainly used in Deep Neural Networks by controlling the value of the dropout rate. A Neural Network consists of a network of neurons (activation units) and learns iteratively. To randomize what the network sees, a fixed fraction of neurons is randomly dropped from the network in each iteration. The percentage of neurons dropped in each iteration is called the dropout rate. This dropping of neurons introduces randomization into the signal that traverses the network.

Since in each iteration any value can be eliminated, the network does not become overly influenced by any single feature and distributes its weights more evenly across features. A higher dropout rate introduces more randomization, resulting in lower variance and therefore less overfitting. However, too high a dropout rate results in underfitting.
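In code, dropout is usually just an extra layer. Here is a minimal sketch with Keras; the layer sizes and the rate of 0.3 are arbitrary choices for illustration.

```python
# Minimal sketch: Dropout layers in a Keras network.
# The rate is the fraction of activations randomly zeroed at each training step.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),   # higher rate -> more randomization -> lower variance
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```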

Apart from the Dropout strategy in Deep Neural Networks, randomization is also used in other algorithms such as Random Forest. In Random Forest, the dataset is sampled into k different datasets and each one is fed to one of k base learners (decision trees).

Number of Training Records — Increasing the number of training records generally helps reduce variance. However, it also depends on the quality of the data. For example, if the test data contains a particular type of data that is not present at all in the training data, then overfitting is bound to happen, because the model never learnt that type of data during training. In such a case, simply adding more training records, without adding the type of data that was actually missing from the training set, will not help reduce variance.

If the model has high bias, then adding data does not help beyond a certain point. As shown in the graph below, the training error remains almost constant after a certain point as we increase the number of training records.
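One way to generate such a curve yourself is with scikit-learn's learning_curve; the sketch below uses a plain linear model on a bundled dataset, chosen purely for illustration.

```python
# Minimal sketch: learning curves. For a high-bias model the training error
# flattens out quickly; more records beyond that point do not help much.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = load_diabetes(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_squared_error")

for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"n={n:>3}  train MSE={tr:.1f}  validation MSE={va:.1f}")
```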

Early Stopping — Early stopping is another way to avoid overfitting. It is used in Neural Networks and tree-based algorithms. As shown in the graph below, in Neural Networks the cross-validation error typically decreases as the number of epochs increases, reaches its minimum at a certain epoch, and then starts increasing in subsequent epochs while the training error keeps falling. In early stopping, we stop training at (or just after) the epoch with the minimum cross-validation error.
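In Keras, this is typically done with the EarlyStopping callback. A minimal sketch follows; monitoring validation loss with a patience of 5 epochs are arbitrary choices, and the model and training data are assumed to exist already.

```python
# Minimal sketch: stop training once validation loss has not improved
# for `patience` epochs, and restore the weights from the best epoch.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# Assuming a compiled `model` and training data X_train, y_train are defined:
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=200, callbacks=[early_stop])
```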

In Decision Trees, variance increases as the depth of the tree increases. To make sure we don't overfit, the growth of the tree is stopped at an optimum depth. This is typically achieved in two ways. The first is to stop growing the tree if the number of data points at a node (i.e. the number of data points satisfying the conditions leading to that node) falls below some threshold; having too few data points at a node indicates that the tree might be picking up noisy points or outliers. The second is to stop the growth of the tree at the depth that gives the minimum cross-validation error.
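Both ideas map directly to hyperparameters in scikit-learn's decision trees. In the sketch below, the threshold of 10 points per leaf and the candidate depths are arbitrary choices for illustration.

```python
# Minimal sketch: limit tree growth with min_samples_leaf, and pick the
# depth with the best cross-validation score.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(min_samples_leaf=10, random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 8, 10]},
    cv=5)
search.fit(X, y)
print("best depth:", search.best_params_["max_depth"])
```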

Choice of Machine Learning Algorithm — The Bias-Variance trade-off can be controlled using regularization and the other means above in all machine learning algorithms. However, some ML algorithms, such as Deep Neural Networks, tend to overfit more, because they are typically used in applications with a huge number of features (e.g. Computer Vision).

Bagging (Random Forest) and Boosting (Gradient Boosted Decision Trees) are algorithms that inherently reduce variance and bias respectively. In Random Forest, the base learners (Decision Trees) are grown deep, which makes each base learner prone to overfitting (i.e. high variance), but because of the randomization, the overall variance at the aggregate level reduces significantly. On the other hand, boosting algorithms such as Gradient Boosted Decision Trees (GBDT) use shallow base learners that are prone to underfitting (i.e. high bias), but they reduce the bias at the aggregate level by sequentially adding and training simple base learners (shallow, low-complexity decision trees).
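A minimal sketch of these two contrasting defaults, using scikit-learn estimators on a bundled dataset; the tree counts and depths are illustrative choices, not tuned values.

```python
# Minimal sketch: bagging with deep trees vs boosting with shallow trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many deep (high-variance) trees; averaging reduces the variance.
rf = RandomForestClassifier(n_estimators=300, max_depth=None, random_state=0)
# Boosting: many shallow (high-bias) trees added sequentially to reduce the bias.
gbdt = GradientBoostingClassifier(n_estimators=300, max_depth=3, random_state=0)

for name, model in [("random forest", rf), ("gbdt", gbdt)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```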

Okay, I think we have discussed many points related to the Bias-Variance trade-off. Before I end this blog, here is a quick summary of the Bias-Variance trade-off in some of the important Machine Learning algorithms.

