Data Science fundamentals: explaining the bias-variance trade-off

Patrick Stewart
3 min read · Oct 10, 2021


As a data scientist, your aim when building a model is to optimise the prediction of a target variable from a number of indicators (features). So, how do we judge this optimisation? In machine learning, the goal is to estimate a function that minimises the mean squared error (MSE) between the predicted and the actual target values. This can be represented by the equation below:

MSE = (1/n) Σ_i (y_i − ŷ_i)²

Where:

· n is the number of predictions.

· y_i is the actual target value.

· ŷ_i is the predicted target value.
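As a minimal sketch (the example values below are made up purely for illustration), the MSE formula translates directly into NumPy:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between actual and predicted target values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Hypothetical actual and predicted target values
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

print(mse(y_true, y_pred))  # → 0.375
```

Each squared difference penalises large individual errors heavily, which is why MSE is so sensitive to outliers.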

Errors produced

For any machine learning model, there are two sources of error that contribute to the MSE and need to be considered.

1. Bias error — error introduced by the simplifying assumptions a model makes in order to make the target function easier to learn.

2. Variance error — the amount by which the estimate of the target function would change if different training data were used.

We can actually show this by decomposing the expected MSE into these two errors plus an irreducible term:

E[(y − ŷ)²] = Bias(ŷ)² + Var(ŷ) + σ²

where σ² is the irreducible error caused by noise in the data itself. However, the proof of this decomposition is not addressed in this article for simplicity.
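Although the proof is skipped here, the decomposition can be illustrated empirically. The sketch below (the quadratic target function, noise level, and evaluation point are all illustrative assumptions) repeatedly fits a deliberately simple straight-line model to freshly sampled training data, then checks that the mean squared error of its estimate at one point splits into bias² plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: x ** 2          # assumed true underlying function
x_train = np.linspace(-1, 1, 20)
x0 = 0.5                      # point at which we evaluate the estimate
trials = 2000

preds = np.empty(trials)
for i in range(trials):
    # A new training set each trial: same inputs, fresh noise
    y_train = f(x_train) + rng.normal(0, 0.3, size=x_train.size)
    coeffs = np.polyfit(x_train, y_train, deg=1)   # high-bias linear fit
    preds[i] = np.polyval(coeffs, x0)

bias_sq = (preds.mean() - f(x0)) ** 2   # squared bias of the estimate
variance = preds.var()                  # variance of the estimate
mse_at_x0 = np.mean((preds - f(x0)) ** 2)

print(bias_sq, variance, mse_at_x0)     # mse_at_x0 == bias_sq + variance
```

Because the predictions are compared against the noiseless f(x0), the irreducible σ² term does not appear here; evaluating against noisy test targets would add it back.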

Bias error

Simpler algorithms such as linear and logistic regression rely on a large number of simplifying assumptions, and so carry a high bias error. Their appeal lies in their speed, but they are often weaker predictors on high-dimensional problems because they lack the flexibility to capture complex relationships.

In comparison, more complex machine learning methods like support vector machines have greater flexibility in solving more complex problems and are typically better able to fit the data. Therefore, they have a lower bias error.

Variance error

Variance error is derived from the amount by which the estimate of the target function would change if different training data were used. More complex (typically nonlinear) machine learning algorithms have a higher variance: their flexibility lets them fit a specific dataset very closely, which means a new dataset has a greater impact on the estimated function. Less complex models do not suffer from this issue to the same extent.
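To make this concrete, the sketch below (the sine target function and noise level are assumptions chosen for illustration) compares how much a simple linear fit and a flexible degree-9 polynomial fit change at a single point as the training data is resampled:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(2 * np.pi * x)   # assumed true underlying function
x_train = np.linspace(0, 1, 20)
x0 = 0.5                              # evaluation point
trials = 500

preds_simple = np.empty(trials)
preds_flex = np.empty(trials)
for i in range(trials):
    # Fresh noisy training set each trial
    y = f(x_train) + rng.normal(0, 0.5, size=x_train.size)
    preds_simple[i] = np.polyval(np.polyfit(x_train, y, deg=1), x0)
    preds_flex[i] = np.polyval(np.polyfit(x_train, y, deg=9), x0)

# Variance of each model's estimate across training sets
print(preds_simple.var(), preds_flex.var())
```

The degree-9 predictions scatter far more widely across training sets than the linear ones, reflecting the higher variance error of the more flexible model.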

Bias-variance trade-off

So what can we say from this? Quite simply, as a data scientist you will always need to keep this problem in mind: decreasing variance will increase bias, and vice versa. The true bias and variance terms can never be known exactly, since we do not know the underlying function generating our data, but as a practising data scientist the trade-off should always be kept in mind.
