Bias and Variance
Overview on Bias and Variance in Machine Learning
If you are familiar with Machine Learning, you may have heard about bias and variance. If not, don’t worry; we are going to explain them in a simple way, step by step.
Let’s use a reverse approach: we will start with a practical example and work through it until we reach the final definitions.
We are going to use the Longley Economic Regression dataset from Kaggle. It is a very small and simple dataset, which makes it well suited for understanding today’s topic.
Now let’s have a quick look at the dataset.
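As a minimal sketch of that first look (the file name ‘longley.csv’ and the exact column names are assumptions; adjust them to match your Kaggle download):

```python
import pandas as pd

# Load the dataset; the file name and column names are assumed to
# match the Kaggle download.
df = pd.read_csv("longley.csv")
print(df[["Population", "Employed"]].head())
```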
Note: As our goal is to discuss the concepts of bias and variance, not to solve a machine learning problem, we will consider only one feature, ‘Population’, and use it to predict the outcome, the employed percentage.
Here is what our dataset looks like when we plot ‘Employed’ vs ‘Population’.
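A quick plotting sketch, reusing the df loaded above:

```python
import matplotlib.pyplot as plt

# Scatter plot of the outcome against the single feature we use.
plt.scatter(df["Population"], df["Employed"])
plt.xlabel("Population")
plt.ylabel("Employed")
plt.show()
```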
First we should split our dataset into training and testing subsets.
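One common way to do this is scikit-learn’s train_test_split; the 80/20 split below is an arbitrary choice for illustration:

```python
from sklearn.model_selection import train_test_split

X = df["Population"].to_numpy()
y = df["Employed"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```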
Now we want to figure out the relationship between ‘Population’ and ‘Employed’; in other words, we want to build a regression model that helps us predict future employment given a certain population.
There are two broad families of models to consider: linear and non-linear.
Let’s explore them both and see the result.
(1) Assume Linear Relationship
1- Simple Linear Regression
The equation for this model is y = ax+b, where:
- y is ‘Employed’.
- x is ‘Population’.
- ‘a’ and ‘b’ are the slope and y-axis intercept respectively.
- ‘a’ and ‘b’ are the model parameters that need to be tuned to get the best result.
- ‘a’ and ‘b’ can take any value from -inf to +inf.
There are infinitely many lines that can be drawn in the x-y plane, but let’s look at just three of them.
In figure (1):
The regression line (the red line) cannot fit the data well; if we calculate the Sum Squared Error (SSE) for this model on the training data, we get 22.73.
The Sum Squared Error (SSE) is the sum of the squared differences between the true values and the predicted values.
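That definition translates directly into a few lines of Python, a small helper we will reuse below:

```python
import numpy as np

def sse(y_true, y_pred):
    # Sum of the squared differences between true and predicted values.
    return np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```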
Here we say that the model has High Bias, as it cannot do well even on the training data. We call this underfitting.
Let’s apply this model to the testing data:
The model does badly here as well: it cannot predict any point correctly, and the SSE is 25.32. When a model fails to fit the testing data, we say that it has High Variance.
Observation: The model has High Bias and High Variance.
Let’s try another line:
In figure (2):
When we calculate the Sum Squared Error again, we find it is 14.66.
That is much better than the previous line, but it is still high.
Try again:
In figure (3):
The line fits the data much better, and the SSE is 6.43.
This is the best line a linear model can produce for our data, and it gives the closest predictions.
I used the ‘numpy.polyfit’ function to get the slope and y-axis intercept of the line that best fits the data.
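As a sketch (the variable names X_train and y_train are carried over from the split above):

```python
import numpy as np

# Degree-1 fit returns [a, b] for y = a*x + b.
a, b = np.polyfit(X_train, y_train, deg=1)
print("Training SSE:", sse(y_train, a * X_train + b))
```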
Now let’s move to the testing data to see how our model (the third one) behaves on new data.
The model does poorly, as it cannot predict any point correctly; the SSE is 8.43.
2- Polynomial Regression
The equation of this model is a polynomial, and it can also contain other mathematical functions such as sine, cosine, tan, log, etc. It can be quite complex. Although the equation is a polynomial in x, it is still a linear model, because it is linear in its parameters.
Again, there are infinitely many regression curves that could be used to predict our outcome. Let’s try some of them.
(1) The Squiggly Line
Here we see that the model fits our training data perfectly, and the SSE is zero, so it has Low Bias.
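The article does not show how its squiggly line was produced; one way to get a curve that passes through every training point is to fit a polynomial whose degree is one less than the number of training points (a sketch):

```python
import numpy as np

# With degree = len(X_train) - 1, the polynomial can pass through every
# training point, driving the training SSE to (numerically) zero.
# numpy may warn about poor conditioning at such high degrees.
coeffs = np.polyfit(X_train, y_train, deg=len(X_train) - 1)
print("Training SSE:", sse(y_train, np.polyval(coeffs, X_train)))
```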
Let’s now see how this model will do when it deals with new data (testing data).
You can see that the performance is disastrous: the model cannot fit the testing data or even give acceptable predictions. We say that the model has very High Variance. The model is overfitting the training data.
Observation: The model has Low Bias and high Variance.
(2) Second-order model
The equation for this model is y = ax² + c.
Again, ‘a’ and ‘c’ are the model parameters that need to be tuned to get the best result, and they can take any value from -inf to +inf. Let’s pick values and see the result:
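One reasonable way to pick them is a least-squares fit (a sketch; note the design matrix has no linear term, matching the equation above):

```python
import numpy as np

# Least-squares fit of y = a*x**2 + c.
A = np.column_stack([X_train**2, np.ones_like(X_train)])
(a, c), *_ = np.linalg.lstsq(A, y_train, rcond=None)
print("Training SSE:", sse(y_train, a * X_train**2 + c))
```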
We see that our model does a good job on the training data; there is some error, but it is acceptable.
Let’s apply it now to the testing data:
It does well here too.
Observation: the model has Low Bias and Low Variance.
(3) Higher-order equations
Note that as the order increases, the model becomes more complex and it starts to overfit the data.
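A quick sweep over polynomial orders makes this visible: the training SSE keeps falling as the degree grows, while the testing SSE eventually starts rising (a sketch, reusing the helpers above):

```python
import numpy as np

# Compare training and testing SSE as the polynomial order grows.
for degree in range(1, 8):
    coeffs = np.polyfit(X_train, y_train, deg=degree)
    train_sse = sse(y_train, np.polyval(coeffs, X_train))
    test_sse = sse(y_test, np.polyval(coeffs, X_test))
    print(f"degree={degree}  train SSE={train_sse:.2f}  test SSE={test_sse:.2f}")
```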
(2) Assume Non-Linear Model
In this kind of model, the tuned parameters (a, b, c, …) enter the equation non-linearly. For example, the equation could be something like y = a·exp(b·x), where the parameter b appears inside the exponential.
As there are infinitely many such equations, let’s pick one and see the result:
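As a sketch, here is one arbitrary hand-picked model; the exponential form and the parameter values are assumptions for illustration, not fitted values. (If you wanted to actually tune the parameters of a non-linear model, scipy.optimize.curve_fit is a standard tool.)

```python
import numpy as np

# One hand-picked non-linear model: y = a * exp(b * x).
# The values of a and b are arbitrary illustrations, not fitted parameters.
def model(x, a=50.0, b=0.0025):
    return a * np.exp(b * x)

print("Training SSE:", sse(y_train, model(X_train)))
print("Testing SSE:", sse(y_test, model(X_test)))
```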
You can see that this model does poorly on the training data but does well on the testing data. We say that this model has High Bias and Low Variance.
It may seem a little weird that the model does a good job on the testing data while doing a really bad job on the training data. This can happen depending on the nature of the data (especially with a small testing set) and when the model is unable to capture the underlying pattern of the data.
Note: The balance between the Bias error and the Variance error is called the Bias-Variance Tradeoff.
After this example, we now have a clearer view of bias and variance and how they affect model performance.
Let’s now put things together and write down our conclusions and notes.
What is Bias?
- Simply put, bias is the difference between the model’s average predicted value and the expected/true value.
- The model makes certain assumptions about the data to make the target function simple, but those assumptions may not always be correct.
- Bias results from the assumptions built into the model. For example, suppose we use a linear model on data that has a trigonometric relationship. This model will have high bias, because we have made wrong assumptions about the data and the model is trained under those wrong assumptions.
- A high bias model makes more assumptions about the target function.
- High bias can cause an algorithm to miss the correct relationship between features and the target output (underfitting).
- The bias error is the error due to wrong/inaccurate assumptions that the learning algorithm makes during training.
- Zero bias may sound good, as it means the model perfectly fits the training data, but it usually means the model has learned too much from the training data. This is called overfitting, and such a model will not do a good job on new/testing data.
What is Variance?
- Variance arises when the model takes the fluctuations/noise in the data into account during training. By noise we mean small fluctuations, due to human error or natural phenomena, that result in somewhat wrong/inaccurate data.
- Variance is the error due to sensitivity to small fluctuations in the dataset.
- A high-variance model learns too much from the data, treating even the noise as something to learn from. As a result, it becomes very sensitive to any small fluctuation and overfits the training data; when it is then applied to new data, it is unable to predict the outcome correctly. (The sketch after this list illustrates this sensitivity.)
- High variance can cause an algorithm to model the random noise in the training data, rather than the intended outcome.
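To see this sensitivity concretely, we can fit the same high-order polynomial to two slightly perturbed copies of the training targets and compare the predictions (a sketch; the noise scale is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A high-variance model reacts strongly to tiny changes in the training data.
for trial in range(2):
    noisy_y = y_train + rng.normal(scale=0.1, size=y_train.shape)
    coeffs = np.polyfit(X_train, noisy_y, deg=len(X_train) - 1)
    print(f"trial {trial}: prediction at first test point =",
          np.polyval(coeffs, X_test[0]))
```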
Bias-Variance Tradeoff
- In general, decreasing bias (by making the model more complex) increases variance, and decreasing variance (by simplifying the model) increases bias.
- A model that exhibits low variance and high bias will underfit the target, while a model with high variance and low bias will overfit the target.
- Our goal is to reach a model with Low Bias and Low Variance.
In the above diagram, the center, i.e. the bull’s-eye, is the target that the model tries to predict correctly. As we move away from the bull’s-eye, the model makes more and more wrong predictions.
A model with low bias and high variance predicts points that are around the center, but far away from each other. A model with high bias and low variance is far away from the bull’s eye, but since the variance is low, the predicted points are closer to each other.
The challenge is to find the right balance between the bias and variance of the model.
I hope that you enjoyed this article. Next time, we are going to talk about how we can solve this problem using Regularization, Cross-Validation and many other methods.