A Theory of Overfitting and Underfitting in Machine Learning

Bias–Variance Tradeoff

Crypto1 · Published in Analytics Vidhya · Mar 3, 2020

A brief introduction to bias, variance, and regularization: how to choose the regularization parameter, and the effects of overfitting and underfitting.

It has always been hard for me to understand what these terms represent, but it is actually quite simple. Whenever I looked them up, I came across only one definition: high bias causes under-fitting and high variance causes over-fitting. It is worth understanding bias and variance properly, since they are important concepts when setting the hyper-parameters of a model.

In machine learning, the bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa. The bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error, which prevent supervised learning algorithms from generalizing beyond their training set.

Figure: under-fit vs. generalized vs. over-fit (source: Coursera)

What is Bias?

Bias arises from the simplifying assumptions a model makes to make the target function easier to approximate. It is the difference between the model's predicted values and the true values, i.e., a measure of how far the predictions are from the original values.

High bias can cause an algorithm to miss the relevant relations between features and target outputs (under-fitting). Bias reflects the accuracy of our predictions: high bias means the predictions will be inaccurate.

How well does the model predict the training data?

HIGH BIAS: The model does a poor job of predicting the training data (low accuracy on the training data). Result: under-fitting.

LOW BIAS: The model does a good job of predicting the training data (high accuracy on the training data).

What is Variance?

Variance is the measure of dispersion in a data set. In other words, it measures how spread out a data set is.

Variance is the amount by which the estimate of the target function would change if different training data were used.

  • A high-variance model learns the noise and random fluctuations in the training data as if they were underlying patterns of the data.
  • That noise and those fluctuations are unique to the training set.
  • Thus the model fails when it sees new data, and performs poorly on the test data.

How sensitive is the learned model to the training data?

HIGH VARIANCE: Changing the training data can drastically change the learned model, and model performance is poor on test data. Result: over-fitting.

LOW VARIANCE: Changing the training data does not change the learned model much, and model performance is high on test data.
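As a rough illustration of this sensitivity, the sketch below fits the same model class on two independent training sets and measures how far apart the two fitted curves end up. It assumes numpy and scikit-learn are available; the sine target, noise level, sample size, and polynomial degrees are illustrative choices, not from the article.

```python
# Sketch: how much does the learned model change when the training data changes?
# Assumptions: numpy + scikit-learn; the sine target, noise level, sample size,
# and polynomial degrees are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100).reshape(-1, 1)

def sample_training_set(n=30):
    """Draw a fresh training set from the same underlying process."""
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, size=n)
    return x, y

for degree in (1, 12):
    preds = []
    for _ in range(2):  # two independent training sets, same model class
        x_train, y_train = sample_training_set()
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_train, y_train)
        preds.append(model.predict(x_grid))
    # A large gap between the two fitted curves indicates high variance
    gap = np.mean(np.abs(preds[0] - preds[1]))
    print(f"degree {degree:2d}: mean gap between the two fitted curves = {gap:.3f}")
```

Typically the low-degree fit barely moves between the two training sets (low variance), while the high-degree fit changes noticeably from one training set to the next (high variance).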

Figure: graphical illustration of bias and variance

From the bottom right (high bias, high variance) to the top left (low bias, low variance), model performance increases, i.e.,

  • accuracy increases
  • uncertainty decreases

Generally, models with very few parameters have low variance and high bias, while complex models with many parameters have high variance and low bias.

Figure: bias vs. variance
  • Bias vs. variance refers to the accuracy vs. consistency of the model.

UNDERFITTING:

  • Underfitting refers to a model that can neither model the training data nor generalize to new data.

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is one in which some parameters or terms that would appear in a correctly specified model are missing; it occurs, for example, when fitting a linear model to non-linear data. An under-fitted machine learning model is not a suitable model, and this is obvious from its poor performance on both the training data and the test data.

The following approaches can be used to tackle under-fitting.

  1. Increase the size or number of parameters in the ML model.
  2. Increase the complexity of the model, or change the model type (see the sketch after this list).
  3. Increase the training time until the cost function is minimised.
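As a rough sketch of points 1 and 2, assuming scikit-learn (the quadratic target and polynomial degrees are illustrative choices): a straight line under-fits a non-linear target, while a slightly more complex polynomial fits both the training and test data better.

```python
# Sketch: fixing under-fitting by increasing model complexity.
# Assumptions: scikit-learn; the quadratic target and degrees are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 0.5, size=200)  # non-linear target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for degree in (1, 3):  # degree 1 = straight line (under-fits); degree 3 = more flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```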

OVERFITTING:

  • Overfitting occurs when a model or machine learning algorithm captures the noise of the data.

A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data. An over-fitted model contains more parameters than can be justified by the data; it has unknowingly extracted some of the residual variation (noise) as if that variation represented underlying model structure.

Overfitting occurs when a model begins to memorize the training data rather than learning to generalize from a trend.
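A minimal sketch of this memorization effect, assuming scikit-learn (the degree-15 polynomial, the sine target, and the 20-point training set are illustrative choices): the training error collapses toward zero while the test error stays much larger.

```python
# Sketch: a model that "memorizes" its small training set.
# Assumptions: scikit-learn; degree, target function, and sample sizes are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 1, size=(20, 1))
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, size=20)
x_test = rng.uniform(0, 1, size=(200, 1))
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.2, size=200)

# Far more parameters than the 20 training points can justify
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(x_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(x_train)))  # close to zero
print("test  MSE:", mean_squared_error(y_test, model.predict(x_test)))    # noticeably larger
```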

The following approaches can be used to tackle over-fitting.

  1. Reduce the size or number of parameters in the ML model.
  2. Reduce the number of features, either manually or with a model selection algorithm (dimensionality reduction).
  3. Regularization

The bias–variance decomposition is a way of analyzing a learning algorithm’s expected generalization error with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the irreducible error, resulting from noise in the problem itself.

Let the variable we try to predict be y and the features be X.

y = f(X) + err, where err is the irreducible error term.

Taking f^(X) as the hypothesis (the model's estimate of f), the expected squared prediction error at a point X is

Err(X) = E[ (y − f^(X))² ]

and, by further decomposition,

Err(X) = ( E[f^(X)] − f(X) )² + E[ (f^(X) − E[f^(X)])² ] + σ²_err

Total Error = Bias² + Variance + Irreducible Error
Figure: learning curve
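A rough numerical check of this decomposition, assuming numpy and scikit-learn (the true function, noise level, model degree, and test point are illustrative assumptions): fit the same model on many independent training sets and estimate the bias, variance, and total error at a single point.

```python
# Sketch: checking Total Error ≈ Bias² + Variance + Irreducible Error at one point x0.
# Assumptions: the true function f, noise sigma, model degree, and x0 are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)   # true function f(X)
sigma = 0.3                           # std of the irreducible error term "err"
x0 = np.array([[0.35]])               # the test point

preds = []
for _ in range(500):                  # many independent training sets
    x = rng.uniform(0, 1, size=(30, 1))
    y = f(x).ravel() + rng.normal(0, sigma, size=30)
    model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
    model.fit(x, y)
    preds.append(model.predict(x0)[0])
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0).item()) ** 2
variance = preds.var()
# Expected squared error, estimated by pairing each fit with a fresh noisy label at x0
noisy_y = f(x0).item() + rng.normal(0, sigma, size=500)
total = np.mean((noisy_y - preds) ** 2)
print(f"Bias² + Variance + sigma² = {bias_sq + variance + sigma**2:.4f}")
print(f"Expected squared error    = {total:.4f}")
```

The two printed numbers should roughly agree, up to Monte Carlo noise from the finite number of simulated training sets.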

REGULARIZATION:

If we have too many features, the learned hypothesis may fit the training set very well,

J(theta) = sum( (h(x^(i)) − y^(i))² ) ≈ 0,

but fail to generalize to new examples, as stated above. Consider, for example, a model with degree 4,

h(x) = theta_0 + theta_1·x + theta_2·x² + theta_3·x³ + theta_4·x⁴

Regularizing the cost function (penalization) adds a penalty on the magnitude of the parameters:

J(theta) = sum( (h(x^(i)) − y^(i))² ) + λ · sum( theta_j² )

Reducing the number of features makes the model lose information, so regularization is used instead.

  • Regularization keeps all the features but reduces the magnitude of the parameters by penalizing the cost function, making some parameters very small. Model complexity therefore decreases and the model becomes less prone to overfitting (see the sketch below).
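A minimal numpy sketch of this penalized cost, matching the formula above (the linear hypothesis, variable names, and toy data are illustrative assumptions; many implementations additionally scale the cost by 1/2m, which does not change the minimizer):

```python
# Sketch: an L2-regularized (ridge-style) cost, matching the formula above.
# Assumptions: a linear hypothesis h(x) = X @ theta; names and toy data are illustrative.
import numpy as np

def regularized_cost(theta, X, y, lam):
    """sum of squared errors + lam * sum(theta_j^2);
    theta_0 (the intercept) is conventionally left unpenalized."""
    errors = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # shrink every parameter except theta_0
    return errors @ errors + penalty

# Toy usage: increasing lam makes large parameter values more expensive,
# which pushes the optimizer toward smaller (simpler) models.
X = np.c_[np.ones(5), np.arange(5.0)]        # first column of ones for the intercept
y = np.array([0.0, 1.1, 1.9, 3.2, 3.9])
theta = np.array([0.1, 1.0])
print(regularized_cost(theta, X, y, lam=0.0))
print(regularized_cost(theta, X, y, lam=10.0))
```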

Lambda (λ) is the regularization hyper-parameter and should be selected carefully: it acts as a control on the fitting parameters.

The Tradeoff

Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may represent their training set well but are at risk of overfitting noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit but may underfit their training data, failing to capture important regularities.

The hyperparameter λ controls this tradeoff by adjusting the weight of the penalty term. If λ is increased, model complexity will have a greater contribution to the cost. Because the minimum cost hypothesis is selected, this means that higher λ will bias the selection toward models with lower complexity.

Figure: train error, test error, and total error
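A minimal sketch of how this selection might look in practice, assuming scikit-learn, where Ridge's `alpha` plays the role of λ (the alpha grid, data, and polynomial degree are illustrative assumptions): too small a λ overfits, too large a λ underfits, and the validation error points to a value in between.

```python
# Sketch: using a held-out validation set to choose lambda (Ridge's `alpha` here).
# Assumptions: scikit-learn; the alpha grid, target function, and degree are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=60)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=4)

for alpha in (1e-4, 1e-2, 1.0, 100.0):  # the lambda candidates
    model = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=alpha))
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"lambda={alpha:>7}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```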

The cost function given by Russell & Norvig:

Cost(h) = EmpiricalLoss(h) + λ · Complexity(h)

CONCLUSION:

An optimal balance of bias and variance, obtained by tuning the hyper-parameter lambda (the regularization parameter), ensures the model neither over-fits nor under-fits.

Therefore understanding bias and variance is critical for understanding the behavior of prediction models.

Figure: flowchart for obtaining a generalized model

A well-generalized model always wins in the end. This is why, in competitions, people who top the public leaderboard sometimes do not top the private leaderboard, while a model that does not top the public leaderboard but generalizes well achieves high accuracy on the private data.

Bias — Variance.😀😉

HAPPY LEARNING
