Bias and Variance in Machine Learning

Pasquale Di Lorenzo
Dec 30, 2022


This article is part of the series:

Getting Started with Machine Learning: A Step-by-Step Guide

Bias

In machine learning, bias refers to the inability of a model to capture the true relationship between the input data and the output labels. For example, if we try to predict the height of cats from their weight using a linear regression model, the straight line will never capture the true relationship between weight and height, because it cannot curve the way the true relationship does. This lack of flexibility is called bias.

To illustrate this concept, let’s consider an example where we have a dataset containing the weight and height of a group of cats. If we plot this data on a graph, we might observe that lighter cats tend to be shorter, while heavier cats tend to be taller. However, after a certain weight, cats may not get any taller, but rather become more obese.

Given this data, we would like to predict the height of a cat based on its weight. Ideally, we would know the exact mathematical formula that describes the relationship between weight and height, but in this case, we don’t know the formula. So, we are going to use a machine learning method, such as linear regression, to approximate this relationship.

However, even though linear regression is a widely used method for prediction, it has a significant limitation when it comes to capturing complex relationships between variables. In this case, the straight line of the linear regression model will never be able to accurately replicate the curve in the true relationship between weight and height, no matter how well we fit it to the training data. This inability to capture the true relationship is called bias. High bias may cause underfitting.

To overcome this issue, we might consider using a more flexible model, such as a polynomial regression, which can fit curves to the data and better capture the true relationship between weight and height. However, it is important to find a balance between bias and variance, as a highly flexible model may overfit the training data and perform poorly on new, unseen data.
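As a rough sketch of this idea, the snippet below fits both a straight line and a degree-3 polynomial to the same curved weight–height relationship and compares their training errors. The dataset is synthetic and invented here purely for illustration (the log-shaped curve, the weight range, and the polynomial degree are all assumptions, not the author's actual setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Invented synthetic data: height rises with weight, then flattens for heavier cats.
rng = np.random.default_rng(0)
weight = rng.uniform(2.0, 8.0, size=100).reshape(-1, 1)                   # kg
height = 20 + 10 * np.log(weight.ravel()) + rng.normal(0, 1.0, size=100)  # cm

# High-bias model: a straight line cannot bend to follow the flattening curve.
linear = LinearRegression().fit(weight, height)

# More flexible model: a degree-3 polynomial can approximate the curve.
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(weight, height)

print("linear MSE:", mean_squared_error(height, linear.predict(weight)))
print("poly   MSE:", mean_squared_error(height, poly.predict(weight)))
```

You should typically see the polynomial reach a lower training error, reflecting its lower bias; whether it actually generalizes better depends on how much variance the extra flexibility introduces, which is the subject of the next section.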

Variance

Variance in machine learning refers to the degree to which a model's predictions change when it is trained on different samples of data. In other words, it measures how much the model's predictions scatter around their average or expected value.

If our model has high variance, its prediction of a given cat's height will vary significantly depending on the particular training data the model was fitted on. In contrast, if the model has low variance, its prediction for that same cat will be relatively consistent no matter which training sample was used.

One way to understand this concept is to imagine two models that both predict the height of a cat from its weight: Model A has high variance, while Model B has low variance. If we retrain each model on a different random sample of cats and then feed both runs the same weight, Model A's two predictions can differ substantially, because a high-variance model closely follows the quirks of whichever training data it happened to see. Model B's predictions, by contrast, stay nearly the same across samples. High variance may cause overfitting.
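A quick way to see this in code is to train both kinds of model on two different random samples of cats and ask each for a prediction at the same weight. As before, the data is synthetic and invented for illustration, and the degree-15 polynomial is just an arbitrary stand-in for a very flexible, high-variance model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def sample_cats(seed, n=30):
    """Draw a fresh random sample of (weight, height) pairs from the same underlying curve."""
    rng = np.random.default_rng(seed)
    weight = rng.uniform(2.0, 8.0, size=n).reshape(-1, 1)                   # kg
    height = 20 + 10 * np.log(weight.ravel()) + rng.normal(0, 1.0, size=n)  # cm
    return weight, height

query = np.array([[5.0]])  # ask both models about the same 5 kg cat each time

for seed in (1, 2):
    w, h = sample_cats(seed)
    model_b = LinearRegression().fit(w, h)  # low variance: rigid straight line
    model_a = make_pipeline(PolynomialFeatures(degree=15),
                            LinearRegression()).fit(w, h)  # high variance: very flexible
    print(f"sample {seed}: Model B (linear) -> {model_b.predict(query)[0]:.2f} cm, "
          f"Model A (degree-15 poly) -> {model_a.predict(query)[0]:.2f} cm")
```

Across the two samples, the straight line's prediction should barely move, while the high-degree polynomial's prediction typically shifts noticeably, because it chases the noise in whichever sample it saw.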

In general, a machine learning model with low variance is preferable, as it tends to be more stable and reliable. However, in some cases a more flexible model, which usually comes with higher variance, may be more accurate, particularly if it is able to capture more of the complexity and variability in the underlying data.

This article is part of the series:

Getting Started with Machine Learning: A Step-by-Step Guide


Pasquale Di Lorenzo

As a physicist and data engineer, I share insights on AI and personal growth to inspire others to reach their full potential.