Bias-Variance Trade-Off & Regularization
What is Bias?
If a machine learning model performs very badly on a set of data because it does not generalize to all your data points, the model is said to have high bias and to underfit.
- The error between the average model prediction and the ground truth
- The bias of the estimated function tells us the capacity of the underlying model to predict the values
What is Variance?
If a machine learning model tries to account for all, or almost all, points in a dataset and then performs poorly when run on other test datasets, it is said to have high variance and the model is said to overfit.
- The average variability in the model prediction for the given dataset
- The variance of the estimated function tells you how much the function can adjust to a change in the dataset (a small simulation of both quantities follows below)
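These two definitions can be made concrete with a small simulation. The sketch below is only illustrative (the sin ground truth, the noise level, and the deliberately simple degree-1 model are assumptions, not part of these notes): it refits the model on many noisy datasets and measures the bias² and variance of its prediction at a single point.

```python
# Hedged sketch: empirically estimating bias^2 and variance of a simple model.
# The ground-truth function, noise level, and degree-1 fit are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                      # assumed ground-truth function
x0 = 1.0                             # point at which we evaluate the model
n_datasets, n_points = 500, 20

preds = []
for _ in range(n_datasets):
    x = rng.uniform(0, np.pi, n_points)
    y = true_f(x) + rng.normal(0, 0.3, n_points)   # a fresh noisy dataset
    coeffs = np.polyfit(x, y, deg=1)               # high-bias, low-variance model
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2   # (average prediction - ground truth)^2
variance = preds.var()                       # spread of predictions across datasets
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```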
High Bias
- Overly simplified model
- Under-fitting
- High error on both test and train data
High Variance
- Overly complex model
- Over-fitting
- Low error on train data
- High error on test data
- Starts modeling the noise in the input
Bias Variance Trade-Off
- Increasing bias reduces variance and vice-versa
- Error = Bias² + Variance + irreducible error
- The best model is the one where the total error is minimized (see the sketch after this list).
- Compromise between bias and variance.
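A hedged sketch of this trade-off (the synthetic data and the chosen polynomial degrees are illustrative assumptions): as model complexity grows, the training error keeps falling while the test error eventually rises again.

```python
# Sketch: train/test error as model complexity (polynomial degree) increases.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, np.pi, (40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):            # high bias -> balanced -> high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```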
Regularization
The technique used in regression to tackle high variance is called regularization.
In regression, we try to minimize the error (the cost function); observe that the cost function depends on the coefficients.
In such a setup, the primary objective is simply to minimize the error; there is no restriction on how small or large the coefficients can become in achieving it. But in real life, we often need to achieve objectives with some restrictions imposed.
- For example, we want to minimize the cost function in linear regression, but with some constraints on the coefficient values, because very high coefficient values may be unreliable for both explanation and prediction, as they lead to overfitting.
- Hence, we add to the cost function a penalty on the sum of the squared coefficient values or the sum of the absolute coefficient values. If this sum is large, the cost function value increases, and such a solution cannot be optimal.
- The optimal solution will therefore be one that keeps the sum of the absolute (or squared) coefficient values small while still fitting the data well.
- The equations can be defined as:
  Cost = MSE + λ · Σ mⱼ²   (where MSE is the mean squared error of the predictions)
The above equation is known as Ridge Regression; if instead of mⱼ² we take the modulus |mⱼ| (penalty λ · Σ |mⱼ|), it is called Lasso Regression.
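Written out as code, the two cost functions look as follows. This is only a minimal NumPy sketch: the notes fix just the form (MSE plus λ times the sum of squared or absolute coefficients), while the variable names m, c, and lam are illustrative.

```python
# Sketch: Ridge and Lasso cost functions for a linear model y ≈ X·m + c.
import numpy as np

def ridge_cost(X, y, m, c, lam):
    mse = np.mean((y - (X @ m + c)) ** 2)
    return mse + lam * np.sum(m ** 2)        # L2 penalty: sum of squared coefficients

def lasso_cost(X, y, m, c, lam):
    mse = np.mean((y - (X @ m + c)) ** 2)
    return mse + lam * np.sum(np.abs(m))     # L1 penalty: sum of absolute coefficients
```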
Practically, the factor λ decides the extent of penalization. Observe that if λ=0, then there is no regularization (it’s the same as the original loss function).
Here, the original loss function is the Mean Squared Error.
If lambda is very high, the penalization on the coefficient values is so strong that they become very small.
In the case of Lasso regression, the coefficients of some variables can be driven exactly to 0; hence Lasso can also be used for feature selection.
In the case of Ridge regression, the coefficients can be shrunk close to zero but never exactly to zero (illustrated in the sketch below).
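The sketch below illustrates this difference (the synthetic data and alpha = 1.0 are arbitrary assumptions; scikit-learn calls the λ parameter alpha): Lasso drives several coefficients exactly to zero, while Ridge only shrinks them towards zero.

```python
# Sketch: Ridge shrinks coefficients, Lasso zeros some of them out.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 2))  # small but non-zero
print("lasso coefficients:", np.round(lasso.coef_, 2))  # several are exactly 0.0
```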
The whole idea of regularization is to reduce overfitting. The observation is that very large coefficient values (generally obtained with no regularization) may not generalize to new data and may lead to overfitting.
At the same time, coefficient values that are too small (obtained with very high values of lambda) may not capture the complete picture, and the model may then perform poorly on both the training and the test data. This is underfitting.
The lambda value needs to be chosen appropriately so that the problem of overfitting/underfitting is reduced; a common way to do this is shown in the sketch below.
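One common way to choose lambda is cross-validation over a grid of candidate values. The sketch below uses scikit-learn's RidgeCV and LassoCV (the synthetic data and the candidate grid are illustrative assumptions).

```python
# Sketch: picking lambda (alpha in scikit-learn) by cross-validation.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
alphas = np.logspace(-3, 3, 13)      # candidate lambda values

ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best alpha (ridge):", ridge.alpha_)
print("best alpha (lasso):", lasso.alpha_)
```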

