Bias-Variance Tradeoff

Rahul Sehrawat
3 min read · Oct 18, 2021


The words may be somewhat self-explanatory, but they can be confusing for people who are new to machine learning and data science. In this blog, I will try to explain everything you need to know about the bias-variance tradeoff. Let’s start with the definitions first and then dive into the concept.

What is Bias?

In machine learning, bias is the error that comes from a model making overly simple assumptions about the data. It tells you how far, on average, your predictions are from the actual values. In mathematical terms, it is the average of the differences between the actual values and the predicted values. A high-bias model will give low accuracy on both the train and test data.
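As a rough sketch, that average can be computed directly. The numbers below are made up purely for illustration:

```python
import numpy as np

# Made-up values, purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
y_pred = np.array([4.5, 4.8, 5.1, 5.4])   # predictions from an overly simple model

bias = np.mean(y_true - y_pred)  # average of (actual - predicted)
print(f"Bias: {bias:.2f}")       # a large magnitude means systematic error
```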

What is Variance?

Variance is the variability in the model’s predictions: how much the learned model changes when we change the training dataset. A high-variance model tends to learn everything from the training dataset, including its noise. So it will give good accuracy on the training dataset but has high error rates on the test data.
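One way to see variance in action is a small sketch like the one below (the noisy sine data is synthetic and purely illustrative): train the same flexible model on several bootstrap resamples and watch how much its prediction at a single point moves around.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic toy data: a noisy sine curve.
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

# Fit the same model class on bootstrap resamples of the training set
# and record its prediction at one query point.
x_query = np.array([[3.0]])
preds = []
for _ in range(20):
    idx = rng.integers(0, len(X), size=len(X))
    model = DecisionTreeRegressor()          # unpruned tree: high variance
    model.fit(X[idx], y[idx])
    preds.append(model.predict(x_query)[0])

# A large spread means the model is sensitive to the training data it sees.
print(f"Spread of predictions at x=3.0: std = {np.std(preds):.3f}")
```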

What is Bias-Variance Tradeoff?

The ultimate goal of any data scientist is to build a model that generalizes to data it hasn’t seen. You need a generalized model that has low bias and low variance: one that gives good accuracy on both the train and test datasets. We need to find the right balance, without overfitting or underfitting the data.

Typically, as bias decreases, variance increases. As model complexity grows, the model fits the training data more and more closely and tends to overfit it. We need the point where bias and variance balance out, so that the model neither underfits nor overfits the dataset.
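Here is a minimal sketch of that curve in code (again on synthetic data, illustrative only): sweep the model’s complexity (polynomial degree) and watch the training error keep falling while the test error bottoms out and then rises again.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic noisy data.
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep complexity: low degree underfits, high degree overfits.
for degree in [1, 3, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```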

Let’s understand this with an example. Suppose you need to prepare for an exam and you start preparing from sample papers. The sample papers would be your train data and the actual exam would be your test data. If you just memorize everything from the sample papers, you might get good accuracy on the training dataset but not score as well in the actual exam. That means the model is suffering from high variance, so you need to broaden your training data.

But if you study from multiple sources rather than memorizing the sample papers, there’s a higher chance of you scoring well in your exam. This is what a generalized model should be like: it should give similar results on both train and test datasets.

So how do we make a balanced model?

There are no fixed rules for making a generalized model, but there are certain things you can do to prevent your model from overfitting or underfitting.

To prevent overfitting

  1. Make sure you don’t have redundant features in your dataset.
  2. Use regularization (L1/L2); a sketch follows this list.
  3. Use ensemble methods (bagging/boosting).
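As a quick illustration of point 2, here is a hedged sketch on synthetic data comparing plain linear regression with L2-regularized Ridge regression when the model is deliberately over-complex:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A degree-12 polynomial tends to overfit; an L2 penalty (Ridge) reins it in.
for name, reg in [("plain", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=12), reg)
    model.fit(X_train, y_train)
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:5s}  test MSE = {test_err:.3f}")
```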

To prevent underfitting

  1. Make sure you have sufficient data.
  2. Make sure you have sufficient features.
  3. Remove outliers from the dataset.
  4. Increase model complexity (see the sketch after this list).
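For point 4, one illustrative sketch (synthetic data again) is to grow a decision tree’s depth: a depth-1 stump underfits a curved signal, while a modestly deeper tree captures it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# A depth-1 stump underfits; allowing more depth lets the model fit the curve.
for depth in [1, 2, 4, 6]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X, y)
    err = mean_squared_error(y, model.predict(X))
    print(f"max_depth={depth}  train MSE = {err:.3f}")
```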

Conclusion

In the end, I would like to add that there’s no such thing as a perfect model; it depends on your dataset. There’s a saying in data science: “garbage in, garbage out.” It means you can’t have a good model without a good dataset. So, in order to get a balanced model, you have to train it on quality data.

