What Are Bias and Variance in Machine Learning?

Varun Sakhuja · Published in CodeX · 5 min read · Apr 7, 2022

Let's take a deep dive into Bias and Variance, along with Underfitting and Overfitting.

Machine Learning is a subset of Artificial Intelligence and is growing rapidly across different fields. A Machine Learning model learns from the data fed into it, which helps it make better predictions over time. ML models require diverse and massive amounts of data to make meaningful predictions.

Due to the enormous size of real-world data and its practical limitations, there will always be errors, which lead to a deviation between the predicted and the actual results. The main goal of Data Scientists is to minimize these errors to make more accurate predictions.

Machine Learning Errors:

There are two types of errors in ML:

#1 Reducible Error:

Bias and Variance are referred to as reducible errors since they can be tweaked and adjusted to a certain extent to improve the accuracy of the model.

#2 Irreducible Error:

There are some errors that will always be present, no matter what you do. For example, there may be unknown variables influencing the outcome, or noise in the measurements, whose effect cannot be altered or reduced.

Irreducible errors cannot be eliminated, and Data Scientists need to work around this limitation.
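
Putting the two together: for regression with squared error, the expected error of a trained model decomposes into exactly these pieces. This is standard theory rather than something stated above; here f is the true function, f̂ the learned model, and σ² the variance of the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first two terms are the reducible part that Data Scientists can tune; the last term is the floor no model can go below.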

Bias:

Bias refers to the difference between the model's average predicted value and the correct value it is trying to predict. High bias occurs when the model is oversimplified: the ML model is unable to identify the true relationship or the dominant pattern in the dataset.

Every model has some inbuilt bias, as it helps the model learn in a quicker and easier way. Too much bias, however, causes underfitting.

Linear algorithms, in general, have a high bias, which enables them to learn quickly, whereas nonlinear algorithms have a lower bias since they are more complex than linear models. Simply put, the simpler the algorithm, the more bias in the model. The sketch below shows this in action.
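
As a quick illustration, here is a minimal sketch (synthetic data, scikit-learn assumed — not from the original article) of a high-bias model: a straight line fitted to data generated from a sine curve. No amount of extra data will help, because the model family is too simple for the pattern:

```python
# High bias in action: a linear model cannot capture a sine-shaped trend.
# The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)  # nonlinear ground truth

model = LinearRegression().fit(X, y)   # an oversimplified (high-bias) model
print("R^2 on training data:", model.score(X, y))  # stays low: underfitting
```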

Following are the characteristics of a high-bias model

#1 Unable to capture the trends

#2 High Error Rate

#3 Underfitting

#4 Oversimplified/Overgeneralized model

Variance:

Variance measures how much the model's output would change if a different training dataset were used. In an ideal situation, the model's predictions should not differ much across training datasets. Variance comes into the picture when Data Scientists use complex models with multiple features.

High variance causes overfitting: the model captures more data points than required, along with noise. A model with low variance, by contrast, changes very little when trained on a different sample of the data.

A model with high variance performs well on the training dataset but fails to perform as per expectation when provided with unseen data.

Linear Regression and Logistic Regression models have low variance, whereas decision trees, support vector machines, and k-nearest neighbors have high variance built into them. The sketch below demonstrates this with a decision tree.
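
Here is a minimal sketch of that point (synthetic data, scikit-learn assumed): the same unpruned decision tree, trained on two different samples drawn from the same underlying process, produces noticeably different predictions. That disagreement between the two fitted models is variance:

```python
# High variance in action: two fully grown decision trees trained on
# different samples of the same data disagree heavily on new inputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def make_sample(n=100):
    """Draw a fresh training set from the same noisy sine process."""
    X = rng.uniform(0, 10, size=(n, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=n)
    return X, y

X_test = np.linspace(0, 10, 50).reshape(-1, 1)
preds = []
for _ in range(2):                      # two different training datasets
    X_train, y_train = make_sample()
    tree = DecisionTreeRegressor()      # fully grown tree: high variance
    preds.append(tree.fit(X_train, y_train).predict(X_test))

print("Mean |disagreement| between the two models:",
      np.abs(preds[0] - preds[1]).mean())
```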

Characteristics of a High Variance Model

#1 High complexity

#2 Fits the training data points too closely

#3 Overfitting

#4 Captures noise in the dataset

Underfitting and Overfitting:

Now that we have understood Bias and Variance, let's understand what Overfitting and Underfitting are all about.

Overfitting and underfitting are two problems that plague every Machine Learning model. The optimal Machine Learning model should be able to adapt to all unknown inputs and provide a reliable output each time.

Overfitting

Overfitting refers to a situation where the model fits the training data too closely. Metaphorically, think of a slim individual wearing loose, oversized clothes!

When a model is too complex for the data it is trained on, it starts covering more data points than required, and in the process, it starts integrating noise and inaccurate values as well.

The overfitted model has high variance and low bias. Supervised learning algorithms are especially prone to overfitting.


[Image: overfitting illustration — by Chabacano, Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=3610704]

What causes Overfitting:

Below are some of the reasons that cause overfitting:

#1 High Variance in the ML Model

#2 High Complexity of the model

#3 Using unclean and unstructured data

#4 Inadequate training dataset

How to rectify Overfitting:

#1 Train the model with adequate data

#2 Implement regularization techniques (see the sketch after this list)

#3 Apply K-fold cross-validation

#4 Remove redundant features

#5 Use ensembling techniques
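
To make fixes #2 and #3 concrete, here is a minimal sketch (synthetic data via scikit-learn's make_regression; the alpha value is an illustrative choice, not a recommendation): Ridge regularization scored with 5-fold cross-validation against a plain linear model:

```python
# Regularization (Ridge) + K-fold cross-validation on a noisy,
# many-featured dataset where an unregularized model tends to overfit.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

for name, model in [("unregularized", LinearRegression()),
                    ("ridge (alpha=10)", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```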

Underfitting:

Underfitting is a phenomenon that happens when the ML model is not able to identify the trends in the data.

Metaphorically, think of a well-built individual trying to fit into an undersized dress.

The model is unable to learn from the training data to make reliable and accurate predictions. It happens due to high bias and low variance.

[Image: underfitting illustration — by AAStein, Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=112896907]

What causes Underfitting:

#1 High bias and low variance in the model

#2 Simplistic model used for prediction

#3 Unclean data used for making predictions

#4 Inadequate size of the training dataset

How to rectify Underfitting:

#1 Make the model more complex (sketched after this list)

#2 Increase the number of features and the duration of training

#3 Eliminate noise from the dataset
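
To make fix #1 concrete, here is a minimal sketch (synthetic sine-shaped data, scikit-learn assumed): the same linear learner underfits on the raw input, then captures the trend once polynomial features make the model more complex:

```python
# Curing underfitting by increasing model complexity: polynomial features
# let a linear learner fit a nonlinear trend. Data is synthetic/illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

simple = LinearRegression().fit(X, y)
richer = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(X, y)

print("straight line R^2:", simple.score(X, y))   # low: underfits
print("degree-5 poly R^2:", richer.score(X, y))   # high: captures the trend
```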

Good Fit:

The ideal situation is when the predicted values match with the actual values in the dataset and record no errors. However, in real life, this is impossible to achieve. The optimal solution is to find a middle path that helps to obtain the desired output.

By continuously training the model, errors in the training dataset reduce over time, and initially the same happens with the test dataset. If you keep training, however, the model eventually starts capturing the noise as well, the test error begins to rise, and overfitting sets in.

[Image: good fit illustration — by AAStein, Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=113123580]

We need to be vigilant and watch for the pivotal point where the test error starts increasing; at that moment, we halt the training. A model stopped there is assumed to be a good fit and can make valid predictions.
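
Here is a minimal sketch of that idea, often called early stopping (the model, dataset, and patience threshold below are illustrative assumptions, not from the original article):

```python
# Early stopping: train one epoch at a time, watch the validation error,
# and halt once it has stopped improving for a few epochs.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(random_state=0)
best_err, epochs_without_improvement = np.inf, 0
for epoch in range(200):
    model.partial_fit(X_train, y_train)        # one more pass over the data
    val_err = mean_squared_error(y_val, model.predict(X_val))
    if val_err < best_err:
        best_err, epochs_without_improvement = val_err, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 5:    # the pivotal point: halt
            print(f"Stopping at epoch {epoch}")
            break
print(f"Best validation MSE: {best_err:.2f}")
```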
