Bias and Variance in Machine Learning

Farheenshaukat · 11 min read · Feb 9, 2024


Machine learning is a branch of artificial intelligence that allows machines to analyze data and make predictions. When a model is not accurate, it makes prediction errors, and these errors are usually described in terms of bias and variance. Some amount of error is always present, since there is always a gap between a model's predictions and the actual values. The main aim of ML and data science practitioners is to reduce these errors in order to get more accurate results.

What is Bias?

While making predictions, a difference arises between the values predicted by the model and the actual (expected) values; this difference is known as bias error, or error due to bias. Bias reflects the inability of a machine learning algorithm, such as linear regression, to capture the true relationship between the data points. Every algorithm begins with some amount of bias, because bias arises from the simplifying assumptions a model makes to keep the target function easy to learn.

  • Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
  • High Bias: A high-bias model makes more assumptions and becomes unable to capture the important features of the dataset; it also performs poorly on new data.

Generally, linear algorithms have high bias, which makes them fast to learn: the simpler the algorithm, the more bias it is likely to introduce. Nonlinear algorithms, by contrast, often have low bias.

Examples of machine learning algorithms with low bias include decision trees, k-nearest neighbours, and support vector machines, while algorithms with high bias include linear regression, linear discriminant analysis, and logistic regression.

Ways to reduce high bias in Machine Learning:

  • Use a more complex model: One of the main causes of high bias is an overly simple model that cannot capture the complexity of the data. In such cases, we can make the model more complex, for example by increasing the number of hidden layers in a deep neural network, or by using a more expressive model such as polynomial regression for non-linear datasets, a CNN for image processing, or an RNN for sequence learning (see the sketch after this list).
  • Increase the number of features: Adding more features to the training data increases the complexity of the model and improves its ability to capture the underlying patterns in the data.
  • Reduce regularization of the model: Regularization techniques such as L1 or L2 regularization help prevent overfitting and improve generalization, but if the model has high bias, reducing the strength of the regularization, or removing it altogether, can improve its performance.
  • Increase the size of the training data: A larger training set gives the model more examples to learn from, which can help reduce bias.
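
Here is a minimal sketch of the first remedy, using synthetic data (the quadratic signal and noise level are illustrative assumptions): a plain linear fit underfits a curved signal, while adding polynomial features lets the model capture it and brings the bias down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)  # quadratic signal + noise

linear = LinearRegression().fit(X, y)                      # high-bias straight line
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)         # more complex model

print("linear MSE:", mean_squared_error(y, linear.predict(X)))  # large: misses the curve
print("poly   MSE:", mean_squared_error(y, poly.predict(X)))    # close to the noise floor
```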

What is a Variance Error?

In simple words, variance tells us how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, meaning the algorithm should capture the hidden mapping between input and output variables. Variance errors are classified as either low variance or high variance.

Low variance means there is only a small variation in the prediction of the target function as the training dataset changes, while high variance means a large variation in the prediction of the target function with changes in the training dataset.

A model with high variance learns the training data very closely and performs well on it, but does not generalize to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset, as the sketch below illustrates.
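
A minimal sketch of the high-variance symptom, on synthetic data (the sine signal and noise level are illustrative assumptions): an unconstrained decision tree scores almost perfectly on the training set but noticeably worse on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)  # grows until leaves are pure

print("train R^2:", tree.score(X_train, y_train))  # ~1.0: it has memorized the noise
print("test  R^2:", tree.score(X_test, y_test))    # much lower: poor generalization
```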

Examples of machine learning algorithms with low variance include linear regression, logistic regression, and linear discriminant analysis, while algorithms with high variance include decision trees, support vector machines, and k-nearest neighbours.

Ways to reduce high variance in Machine Learning:

  • Cross-validation: By splitting the data into training and testing sets multiple times, cross-validation can reveal whether a model is overfitting or underfitting, and it can be used to tune hyperparameters to reduce variance (see the sketch after this list).
  • Feature selection: Choosing only the relevant features decreases the model’s complexity and can reduce the variance error.
  • Regularization: L1 or L2 regularization can be used to reduce variance in machine learning models.
  • Ensemble methods: These combine multiple models to improve generalization. Bagging, boosting, and stacking are common ensemble methods that can reduce variance.
  • Simplifying the model: Reducing the complexity of the model, such as decreasing the number of parameters or layers in a neural network, can also help reduce variance and improve generalization.
  • Early stopping: Early stopping prevents overfitting by halting the training of a deep learning model once its performance on the validation set stops improving.
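
A minimal sketch of the cross-validation remedy, on synthetic data (the dataset and the candidate depths are illustrative assumptions): 5-fold cross-validation is used to pick a max_depth that tames the decision tree's variance instead of letting the tree grow without limit.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=300)

for depth in (2, 4, 8, None):  # None = fully grown tree, highest variance
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)  # R^2 on each held-out fold
    print(f"max_depth={depth}: mean CV R^2 = {scores.mean():.3f}")
```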

Different Combinations of Bias-Variance:

There are four possible combinations of bias and variance:

  1. Low-Bias, Low-Variance: The combination of low bias and low variance describes an ideal machine learning model, but it is not achievable in practice.
  2. Low-Bias, High-Variance: Predictions are inconsistent but accurate on average. This occurs when the model learns a large number of parameters, which leads to overfitting.
  3. High-Bias, Low-Variance: Predictions are consistent but inaccurate on average. This occurs when the model does not learn enough from the training dataset or uses too few parameters, which leads to underfitting.
  4. High-Bias, High-Variance: Predictions are both inconsistent and inaccurate on average.

Bias-Variance Trade-Off:

While building a machine learning model, it is important to manage bias and variance in order to avoid overfitting and underfitting. If the model is very simple, with few parameters, it tends to have low variance and high bias; if it has a large number of parameters, it tends to have high variance and low bias. A balance must therefore be struck between the two errors, and this balance is known as the bias-variance trade-off.

For accurate predictions, an algorithm needs both low variance and low bias. But the two cannot be minimized independently, because they are related to each other:

  • If we decrease the variance, it will increase the bias.
  • If we decrease the bias, it will increase the variance.

The bias-variance trade-off is a central issue in supervised learning. Ideally, we want a model that accurately captures the regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, both cannot be achieved at once: a high-variance algorithm may perform well on training data but overfit its noise, while a high-bias algorithm produces a model so simple that it may miss important regularities in the data. So, we need to find a sweet spot between bias and variance to build an optimal model. One way to see the trade-off concretely is to estimate bias and variance empirically, as sketched below.
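
The following sketch uses synthetic data (the true function, noise level, training size, and polynomial degrees are all illustrative assumptions): many models are fit on freshly sampled training sets, and the predictions at fixed test points are decomposed into squared bias and variance. A simple model should show high bias and low variance; a complex one, the reverse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_test = np.linspace(-3, 3, 50)
f_test = np.sin(x_test)  # the true, noise-free target function

def estimate(degree, n_rounds=200, n_train=40):
    """Fit `n_rounds` models on fresh noisy samples; decompose their error."""
    preds = np.empty((n_rounds, x_test.size))
    for i in range(n_rounds):
        x = rng.uniform(-3, 3, n_train)
        y = np.sin(x) + rng.normal(scale=0.3, size=n_train)  # fresh training set
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x[:, None], y)
        preds[i] = model.predict(x_test[:, None])
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)  # squared bias of mean prediction
    variance = np.mean(preds.var(axis=0))                # spread of predictions across fits
    return bias2, variance

for degree in (1, 12):  # simple vs. complex model
    b2, v = estimate(degree)
    print(f"degree {degree:2d}: bias^2 = {b2:.3f}, variance = {v:.3f}")
```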

Overfitting and Underfitting in Machine Learning:

Overfitting and underfitting are the two main problems that occur in machine learning and degrade the performance of a model.

The main goal of every machine learning model is to generalize well. Here, generalization refers to the ability of an ML model to produce suitable outputs for inputs it has not seen before: after training on a dataset, the model should produce reliable and accurate outputs on new data. Underfitting and overfitting are therefore the two conditions to check when judging whether a model generalizes well.

Overfitting:

Overfitting occurs when a machine learning model tries to fit every data point, or more data points than the data warrant. As a result, the model starts capturing the noise and inaccurate values in the dataset, which reduces its efficiency and accuracy. An overfitted model has low bias and high variance.

The risk of overfitting grows with the amount of training: the longer we train the model, the more likely it is to become overfitted.

Overfitting is the main problem that occurs in supervised learning.

Example: the concept of overfitting can be illustrated with a linear regression output in which the fitted curve bends to pass through every data point in the scatter plot. This may look efficient, but in reality it is not: the goal of a regression model is to find the best-fit line, and a curve that chases every point will generate prediction errors on new data.

Underfitting:

Underfitting occurs when a machine learning model is unable to capture the underlying trend of the data. One common cause: to avoid overfitting, the feeding of training data may be stopped at an early stage, so the model does not learn enough from the training data and fails to find the dominant trend.

In the case of underfitting, the model cannot learn enough from the training data, which reduces its accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: underfitting can be illustrated with a linear regression output in which the fitted line passes well away from the data points, failing to capture the pattern in the plot.

Good Fit in a Statistical Model:

Ideally, a model that makes predictions with zero error is said to have a good fit on the data. In practice, the good fit lies at a spot between overfitting and underfitting. To find it, we have to watch our model's performance over time as it learns from the training dataset.

As training progresses, the model keeps learning, and its error on the training and testing data keeps decreasing. If it trains for too long, however, it becomes prone to overfitting, picking up noise and unhelpful details, and its performance on unseen data starts to degrade. To get a good fit, we stop at the point just before the error on unseen data starts increasing. At that point, the model performs well on the training dataset as well as on the unseen testing dataset, as sketched below.
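
One way to locate that stopping point in practice is sketched below with gradient boosting on synthetic data (the dataset and model settings are illustrative assumptions): scikit-learn's staged_predict exposes the model's predictions after each boosting iteration, so we can find where the validation error bottoms out before it begins to rise.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  random_state=0).fit(X_train, y_train)

val_errors = [mean_squared_error(y_val, pred)
              for pred in model.staged_predict(X_val)]  # one entry per iteration
best_iter = int(np.argmin(val_errors))
print(f"validation error bottoms out at iteration {best_iter + 1}"
      f" of {len(val_errors)}")
```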

Regularization:

When it comes to training models, there are two major problems one can encounter: overfitting and underfitting.

  • Overfitting happens when the model performs well on the training set but not so well on unseen (test) data.
  • Underfitting happens when it neither performs well on the train set nor on the test set.

Regularization is applied specifically to avoid overfitting, especially when there is a large gap between train and test set performance. With regularization, the number of features used in training is kept constant, yet the magnitude of the coefficients (w) is reduced.

There are different ways of reducing model complexity and preventing overfitting in linear models; these include ridge and lasso regression.

Ridge Regression:

Ridge regression is a type of regularized regression model: a variation of the standard linear regression model that includes a regularization term in the cost function, whose purpose is to prevent overfitting. Ridge regression adds an L2 penalty to the linear equation, which is why it is also known as L2 regularization or the L2 norm.

The main aim of Ridge Regression is to reduce Overfitting.

How does Ridge Regression work?

Ridge regression works by adding a penalty term, called the regularization term, to the cost function of a linear regression model: Cost = Σ(yᵢ − ŷᵢ)² + λ Σ wⱼ². The regularization term prevents the model from overfitting by penalizing large coefficients. The regularization parameter λ determines how strongly the model is penalized: as λ increases, the coefficients are shrunk toward zero.

  • If λ = 0, the cost function of ridge regression is identical to the cost function of plain linear regression.
  • If adding the penalty term does not improve performance, the value of λ is adjusted repeatedly to search for the best setting.
  • Because the cost function changes with the hyperparameter, the best-fit line the model produces changes as well.

Relationship between λ and the slope:

As the hyperparameter λ increases, the slope θ shrinks toward zero, and vice versa, as the sketch below shows.
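
A minimal sketch with single-feature synthetic data (the true slope of 2.0 and the alpha values are illustrative assumptions); note that scikit-learn calls the penalty strength λ "alpha", and that alpha=0 recovers plain linear regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

print("OLS slope:", LinearRegression().fit(X, y).coef_[0])
for alpha in (0.0, 1.0, 10.0, 100.0):  # alpha=0 reduces to ordinary least squares
    slope = Ridge(alpha=alpha).fit(X, y).coef_[0]
    print(f"alpha={alpha:6.1f}  slope={slope:.3f}")  # slope shrinks as alpha grows
```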

Lasso Regression:

Lasso regression, commonly referred to as L1 regularization, is a method for preventing overfitting in linear regression models by including a penalty term in the cost function. In contrast to ridge regression, the penalty is the sum of the absolute values of the coefficients rather than the sum of the squared coefficients: Cost = Σ(yᵢ − ŷᵢ)² + λ Σ |wⱼ|.

  • It can be used to reduce the number of features, because in this case a slope can become exactly zero.
  • Here too, as the value of λ increases, the slope θ decreases.
  • But as λ keeps increasing, at some point the slope becomes 0 (and the corresponding feature is removed); increasing λ further leaves the slope stuck at 0, as the sketch below shows.
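
A minimal sketch with single-feature synthetic data (the true slope of 2.0 and the alpha values are illustrative assumptions): as alpha (λ) grows, the lasso slope shrinks, hits exactly zero, and then stays at zero no matter how much larger alpha gets.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 0.5, 1.5, 3.0, 10.0):
    slope = Lasso(alpha=alpha).fit(X, y).coef_[0]
    print(f"alpha={alpha:5.2f}  slope={slope:.3f}")  # reaches exactly 0 and stays there
```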

Difference between Ridge and Lasso Regression:

The main difference between ridge and lasso regression is the way they shrink the coefficients. Ridge regression reduces all the coefficients by a small amount, whereas lasso shrinks some coefficients much more than others and can therefore eliminate those features entirely, as the comparison sketch below shows.
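
A minimal side-by-side sketch on synthetic data (the two informative features and the penalty strengths are illustrative assumptions): with a comparable penalty, ridge shrinks every coefficient while lasso drives the irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 informative features

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # all six small but non-zero
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant coefficients exactly 0.0
```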

Conclusion:

Ridge and lasso regression are powerful techniques for regularizing linear regression models and preventing overfitting. Both add a penalty term to the cost function, but with different approaches: ridge regression shrinks the coefficients toward zero, while lasso regression encourages some of them to be exactly zero. Both techniques are easy to implement in Python using scikit-learn, making them accessible to a wide audience. By understanding and applying ridge and lasso regression, you can improve the performance of your linear regression models and make more accurate predictions on new data.

Thanks
