A Brief Explanation of Overfitting, Underfitting, Variance, and Bias

Pavan Kunchala · Analytics Vidhya · Jan 12, 2021

I know this is a really basic concept, and many machine learning enthusiasts out there can put it in better words than I can. But basic as these ideas might be, there were many times (more times than I want to admit) when I was genuinely confused about the difference between variance and bias, or about what terms like high-variance or low-bias actually mean.

Bias and Variance (what exactly are they?)

Bias is the set of assumptions an ML algorithm makes in order to learn the representation underlying the given data. To put it simply, a model that makes very few assumptions about the data can fit the training data closely and is said to have low bias, while a model that imposes strong assumptions on the data (yes, as you have guessed) has high bias.

Some examples of low-bias algorithms are k-nearest neighbors and support vector machines, while algorithms such as logistic regression and naive Bayes are generally high-bias algorithms.
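To make the distinction concrete, here is a minimal sketch of my own (not from the original post; the two-moons dataset and hyperparameters are arbitrary choices) using scikit-learn. Logistic regression assumes a linear decision boundary, a strong assumption that gives it high bias on this curved data; 1-nearest-neighbor makes almost no assumptions, so it has low bias and fits the training data nearly perfectly.

```python
# A rough illustration of bias: a high-bias model (logistic regression)
# vs. a low-bias model (k-nearest neighbors) on nonlinear synthetic data.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset: two interleaving half-circles (not linearly separable).
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)

# Logistic regression assumes a straight-line boundary (strong assumption -> high bias).
high_bias = LogisticRegression().fit(X, y)
# 1-NN assumes almost nothing about the boundary's shape (low bias).
low_bias = KNeighborsClassifier(n_neighbors=1).fit(X, y)

print("logistic regression train accuracy:", high_bias.score(X, y))  # noticeably below 1.0
print("1-NN train accuracy:", low_bias.score(X, y))                  # at or near 1.0
```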

Variance in ML refers to how sensitive a model is to the particular training data it was given. A high-variance model captures not only the overall pattern in the data but also the noise and quirks of that specific sample, so retraining it on slightly different data can change it a lot; a low-variance model stays much more stable. SVM is an example of a high-variance algorithm, whereas naive Bayes is an example of a low-variance one.
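One way to see variance directly (again a sketch of my own, with arbitrary choices of dataset and query point) is to retrain the same model on different bootstrap resamples of the data and watch how much its prediction on a fixed point jumps around. I use 1-NN here as a classic high-variance learner next to low-variance naive Bayes:

```python
# A rough sketch of variance: retrain the same model on different random
# resamples of the data and watch how much its prediction moves around.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
query = np.array([[0.5, 0.25]])  # an arbitrary fixed point to classify

rng = np.random.default_rng(0)
for name, new_model in [("1-NN (high variance)", lambda: KNeighborsClassifier(n_neighbors=1)),
                        ("naive Bayes (low variance)", lambda: GaussianNB())]:
    preds = []
    for _ in range(20):
        idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap resample
        preds.append(new_model().fit(X[idx], y[idx]).predict(query)[0])
    # A high-variance model's prediction tends to flip between resamples;
    # a low-variance model's prediction stays stable.
    print(name, "predictions across 20 resamples:", preds)
```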

Overfitting and underfitting

When an ML model performs very well on the training data but poorly on the test or validation set, the phenomenon is referred to as overfitting. (It’s like a horse that can run at its maximum speed only on the track it practiced on, but never on tournament tracks, i.e., the real world.) There can be many reasons for this; here are a few common ones:

The model is very complex with respect to the data. A decision tree with many levels or a neural network with many layers are good examples of such complexity.

The data has lots of features but very few instances.

A model that is overfitting is also said to have very high variance. Regularization is the most widely used approach to prevent overfitting.
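As a rough illustration (my own sketch; the noisy sine dataset, degree-15 polynomial, and alpha=1.0 are arbitrary choices), here is an over-complex model that memorizes noise, and the same model with ridge (L2) regularization, which shrinks the gap between training and test scores:

```python
# Overfitting and regularization: a degree-15 polynomial fit with plain least
# squares memorizes noise; ridge (L2) regularization tames it.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)  # noisy sine wave
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X_train, y_train)
ridge = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(X_train, y_train)

# A large train/test gap signals overfitting; regularization shrinks it.
print("unregularized train/test R^2:", overfit.score(X_train, y_train), overfit.score(X_test, y_test))
print("ridge train/test R^2:       ", ridge.score(X_train, y_train), ridge.score(X_test, y_test))
```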

If a model fails miserably even on the training data, it is said to have high bias, and the model is underfitting. There can be many reasons for underfitting as well. The most common ones are:

The model is too simple to learn the underlying representation of the data given to it.

The features of the data have not been engineered well before being fed to the ML model (the sketch after this list shows this fix in action).
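Here is a minimal sketch of both points (my own example, with a made-up quadratic dataset): a plain line is too simple to fit even the training data, and engineering a squared feature lets the very same simple model capture the true curve.

```python
# Underfitting: a straight line cannot capture a quadratic relationship;
# engineering a squared feature lets the same simple model fit it well.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)  # quadratic data with noise

# Too simple: a line through a parabola scores poorly even on training data.
too_simple = LinearRegression().fit(X, y)
# Better features: add x^2 so a linear model can represent the true curve.
engineered = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("plain line train R^2:      ", too_simple.score(X, y))  # low -> underfitting
print("with x^2 feature train R^2:", engineered.score(X, y))  # close to 1.0
```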

From what we have learned above, we can conclude that an ML model that is overfitting is likely suffering from high variance, whereas an underfitting model is likely suffering from high bias.

The End! (at least for now)

PS: If you have any doubts, you can mail me here, contact me on my LinkedIn from here, and check out my other code (it has really cool stuff) on my GitHub from here.

I am also looking for freelancing opportunities in the fields of deep learning and computer vision. If you are willing to collaborate, mail me here (pavankunchalapk@gmail.com).

Have a wonderful day!
