In this blog I will explain the concept of bias and variance.
First, let's get clear about overfitting and underfitting.
A best fit line may pass through each and every point in the training data, yet fail to fit the testing data. This inability of the best fit line to fit the testing data while fitting the training data is called overfitting. In other words, overfitting is a scenario in which a model performs very well on the training set but poorly on the test set.
In underfitting, the error is high for both the training data and the testing data. In other words, underfitting is a scenario in which a model performs poorly on both the training set and the test set.
BIAS — Error of training data
VARIANCE — Error of testing data
Bias and variance in regression:
Consider three models with degree of polynomial = 1, 2, 3. The model with degree of polynomial 1 has a straight best fit line, the model with degree of polynomial 2 has a curved best fit line, and the model with degree of polynomial 3 has an even curvier line that tends to pass through more of the training points than the other two.
When the DOP (degree of polynomial) is 1, error is high for both the training and testing data.
When the DOP is 3, error is very low for the training data but high for the testing data.
When the DOP is 2, accuracy is high for both the training and testing data. This means both variance (testing error) and bias (training error) are low.
“A model with low bias and low variance is a good model”
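The degree-1/2/3 comparison above can be sketched with NumPy's `polyfit`. The data here is an assumption for illustration: a noisy quadratic, so degree 1 underfits and degree 2 fits well; with this much training data, degree 3 only overfits mildly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = 0.5 * x**2 - x + rng.normal(0, 0.5, x.size)  # quadratic ground truth + noise

# simple train/test split
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (1, 2, 3):
    tr, te = errors(d)
    print(f"degree={d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

Shrinking the training set or raising the degree further makes the overfitting regime (low training error, high testing error) much more visible.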
Bias and variance in classification:
In model 1, the testing error (variance) is high but the training error (bias) is low, so this condition is overfitting.
In model 2, both the testing error (variance) and the training error (bias) are high, so this condition is underfitting.
In model 3, both the testing error (variance) and the training error (bias) are low, so this is the best model among the three.
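The three classification regimes above can be reproduced with decision trees of different depths. The dataset and depth values here are assumptions, chosen to exaggerate each regime: an unpruned tree overfits, a single-split stump underfits, and a moderate depth sits in between.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "model 1 (overfit)":  DecisionTreeClassifier(max_depth=None, random_state=0),  # grown until pure
    "model 2 (underfit)": DecisionTreeClassifier(max_depth=1, random_state=0),     # a single split
    "model 3 (good)":     DecisionTreeClassifier(max_depth=4, random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name}: train acc={clf.score(X_tr, y_tr):.2f}  "
          f"test acc={clf.score(X_te, y_te):.2f}")
```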
Representation of bias and variance:
The above diagram plots the degree of polynomial against the error value. As the degree of polynomial increases, the training error keeps reducing, while the testing error reduces up to a certain point and then starts increasing again.
We should select the model which has both a low training error (bias) and a low testing error (variance). The above diagram shows how to select a generalized model which has low bias and variance.
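The selection rule above can be sketched as a sweep: fit every degree, record both errors, and pick the degree with the lowest testing error. The data here is an assumed noisy cubic with a deliberately small training set, so high degrees overfit visibly.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-2, 2, 15)
y_train = x_train**3 - 2 * x_train + rng.normal(0, 0.4, x_train.size)
x_test = rng.uniform(-2, 2, 100)
y_test = x_test**3 - 2 * x_test + rng.normal(0, 0.4, x_test.size)

results = {}  # degree -> (train MSE, test MSE)
for degree in range(1, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)

best = min(results, key=lambda d: results[d][1])  # lowest testing error
print("chosen degree:", best)
```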
Bias and variance in decision tree and random forest:
A decision tree will overfit the data if we keep splitting until the nodes cannot get any purer. A single fully grown decision tree therefore has low bias and high variance. When many such trees are combined, the high variance gets converted into a lower variance through bootstrap aggregation (bagging).
“The bias and variance tradeoff is done by opting for a random forest over a single decision tree”
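One way to see the bagging effect described above is to compare a single fully grown tree against a random forest (an ensemble of bootstrapped trees) on held-out data. The dataset and hyperparameters below are illustrative assumptions, not a recipe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# a single unpruned tree: low bias, high variance
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# bagging 200 such trees reduces the variance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print(f"tree:   train={tree.score(X_tr, y_tr):.2f}  test={tree.score(X_te, y_te):.2f}")
print(f"forest: train={forest.score(X_tr, y_tr):.2f}  test={forest.score(X_te, y_te):.2f}")
```

Both models drive the training error to near zero; the forest's advantage shows up on the test set, which is exactly the variance reduction the quote refers to.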