Why Regularization? A brief introduction to Ridge and Lasso regression
In regression, the standard linear model describes the relationship between a response Y and a set of predictors X1, X2, ..., Xn, and we fit this model using least squares.
However, this standard linear model is not perfect and can often be improved with alternative fitting procedures. One such technique is “shrinkage”.
It not only yields better prediction accuracy but can also improve model interpretability.
This approach reduces variance significantly by shrinking the estimated coefficients towards 0 relative to the least squares estimates; that is, we fit a model containing all the predictors using a method that regularizes, or shrinks, the coefficients towards 0. The two most popular shrinkage techniques are Ridge regression and Lasso regression.
1. Ridge Regression
In a simple linear model, the least squares method estimates the coefficients using the values that minimize the RSS (residual sum of squares).
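In the standard notation, with n observations and p predictors, the RSS can be written as

\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} .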
Ridge regression is quite similar to least squares, except that an additional term, called a shrinkage penalty, is added to the RSS. Just like least squares, Ridge regression looks for coefficient estimates that make the RSS small; specifically, it minimizes the quantity
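(written in the same notation as the RSS above)

\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2} \;=\; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2},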
where λ ≥ 0 is a tuning parameter that controls the relative impact of the two terms on the regression coefficient estimates. When λ = 0, the shrinkage penalty has no effect and the Ridge and least squares estimates are the same. But as λ → ∞, the impact of the shrinkage penalty grows and the Ridge regression coefficient estimates tend towards 0. For each value of λ, Ridge regression produces a different set of coefficient estimates, so selecting a good value of λ is critical.
Why use Ridge regression?
- When the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance, meaning that a small change in the training data can cause a large change in the least squares coefficient estimates. This is when Ridge regression works best: as the value of λ increases, the flexibility of the Ridge regression fit decreases, resulting in decreased variance and increased bias.
- Ridge regression also has a computational advantage: the computation required to find the coefficient estimates for all values of λ is almost the same as that of fitting a single model with least squares.
Drawback
Ridge regression helps reduce the variance, but it has a drawback: the penalty term shrinks the coefficients towards 0, yet none of them ever becomes exactly 0. This is not a problem for prediction accuracy, but when the number of predictors is large, model interpretability becomes a challenge.
2. Lasso Regression
In cases where the number of predictors is large, we might want to include only the important predictors. However, Ridge regression always generates a model containing all the predictors: increasing the value of λ reduces the magnitude of the estimated coefficients but never excludes any of them.
This is where Lasso regression works best. It performs variable selection by minimizing the quantity
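(in the same notation, the RSS plus an ℓ1 penalty on the coefficients)

\mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .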
The formulations of Ridge and Lasso regression are similar. Just like Ridge regression, Lasso shrinks the coefficient estimates towards 0. The difference is that Lasso can force some of the coefficient estimates to be exactly 0; that is, when the value of λ is sufficiently large, some coefficient estimates become exactly 0, making the model easier to interpret.
Which technique is better?
We now know that Lasso has an advantage over Ridge regression in terms of model interpretability, as the model involves only a subset of predictors.
But what about prediction accuracy?
Well, it depends. Lasso performs better when a small number of predictors have substantial coefficients and the rest are either very small or exactly 0, while Ridge tends to perform better when the response Y is a function of many predictors, all with coefficients of roughly similar size. Since the relationship between the response and the predictors is never known in advance, we use cross-validation to determine which approach is better for a particular dataset.
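As a rough sketch of what such a cross-validated comparison can look like in scikit-learn (which calls λ "alpha"); the synthetic dataset below is only a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data with 10 informative predictors out of 50, a setting
# where Lasso's variable selection is expected to help.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 50)  # candidate values of the tuning parameter λ
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X, y)

print("Ridge: best λ =", ridge.alpha_)
print("Lasso: best λ =", lasso.alpha_,
      "| non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
```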
Ridge and Lasso in Python
Dataset
We use the Boston Housing dataset, where MEDV (the median home value) is the response variable. The dataset has 506 rows and 14 columns, described below:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxide concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town
- LSTAT: percentage of the population with lower socioeconomic status
- MEDV: median value of owner-occupied homes in $1000s
A graphical representation of the correlation matrix helps us see the relationships between the variables.
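A minimal sketch of loading the data and plotting the correlation matrix, assuming a local copy of the dataset in a file called boston.csv (recent scikit-learn releases no longer bundle this dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load a local copy of the Boston Housing data (506 rows, 14 columns,
# response column MEDV); "boston.csv" is a placeholder path.
boston = pd.read_csv("boston.csv")

# Heatmap of the correlation matrix to see how the variables relate.
plt.figure(figsize=(10, 8))
sns.heatmap(boston.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```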
The Mean Squared Error (MSE) after fitting linear regression is 65.37.
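A sketch of a fit that produces a comparable number, continuing with the boston DataFrame loaded above (the exact MSE depends on the train/test split, which is an assumption here):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split the data; the split below is an assumption, so the exact MSE
# will differ somewhat from the 65.37 reported in the article.
X = boston.drop(columns="MEDV")
y = boston["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print("Linear regression MSE:", mean_squared_error(y_test, lr.predict(X_test)))
```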
Let’s try fitting Ridge regression to see if the MSE decreases.
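A sketch of the Ridge fit in scikit-learn (which passes λ as the alpha argument), reusing the split from the snippet above:

```python
from sklearn.linear_model import Ridge

# λ is called "alpha" in scikit-learn; the article reports results for λ = 50.
ridge = Ridge(alpha=50).fit(X_train, y_train)
print("Ridge regression MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
```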
Clearly, the MSE decreased from 65.37 to 58.9 when the value of λ is 50.
Now let’s try Lasso regression to see if it performs better than the other two.
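Again reusing the same split; the value of λ used for Lasso is not stated here, so the alpha below is only a placeholder:

```python
from sklearn.linear_model import Lasso

# alpha=1.0 is scikit-learn's default and stands in for the unspecified λ;
# counting non-zero coefficients shows Lasso's variable selection at work.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("Lasso regression MSE:", mean_squared_error(y_test, lasso.predict(X_test)))
print("Non-zero coefficients:", int((lasso.coef_ != 0).sum()))
```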
It performed a little better than linear regression, bringing the MSE down from 65.37 to 64.25, but overall Ridge regression gave the best results.
Summary
When the least squares estimates have excessively high variance, both Ridge and Lasso regression can reduce the variance significantly with only a small increase in bias, yielding better prediction accuracy.
However, since Lasso also performs variable selection, the resulting model is easier to interpret.
References
1. An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
2. https://www.youtube.com/watch?v=0yI0-r3Ly40