Types of regularization and when to use them.

Saaransh Menon
Published in Analytics Vidhya · 5 min read · Dec 16, 2020

This article explains three types of regularization and when and how to use them with Scikit-Learn.

Why use regularization?

First we need to understand why we should use regularization at all. Regularization is mainly used so that a model does not overfit the training data. Polynomial models are the most common ones in which regularization is useful, since their higher-degree features can cause the model to overfit. What regularization does is constrain the model's weights, which keeps the model from overfitting the data.

Now, the advantage of using regularization over other methods, such as reducing the number of features in the training data, is that removing features means losing valuable information during training. Regularization avoids this: instead of discarding features, it merely reduces the effect of the hypothesis parameters (θ).

Here I will be explaining 3 methods of regularization.

This is the dummy data that we will be working on. As we can see, it's pretty scattered, and a polynomial model would be best for this data.

Figure-1: Dummy Data
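For concreteness, data like this can be generated with a few lines of NumPy (a minimal sketch; the quadratic coefficients, noise level, and seed are illustrative assumptions, and the snippets below reuse this X and y):

import numpy as np

np.random.seed(42)

# 100 points of noisy quadratic data: y = 0.5·x² + x + 2 + Gaussian noise
m = 100
X = 6 * np.random.rand(m, 1) - 3        # one feature, values in [-3, 3]
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)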

Ridge Regression

Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression having a regularization term equal to:

$$\alpha \sum_{i=1}^{n} \theta_i^2$$

Ridge Regression regularization term.

This term, when added to the cost function, forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.

This term should only be added to the cost function while training; once the model is trained, its performance should be evaluated using the plain, unregularized MSE.

The hyperparameter α controls how much you want to regularize the model. If α = 0, the model is just a normal Linear Regression model without any regularization. If α is very large, then all the weights end up very close to 0 and the model is reduced to a flat line through the data's mean.

Ridge Regression cost function:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^2$$

Ridge Regression cost function

Here MSE refers to “Mean Squared Error”. Note that the sum starts at i = 1, so the bias term θ₀ is not regularized.

If we define w as the vector of the feature weights (θ₁ to θₙ), then the regularization term is based on the ℓ₂ norm of w, which is why Ridge Regression is said to use the ℓ₂ norm.

Closed-form solution of Ridge Regression:

$$\hat{\theta} = \left(\mathbf{X}^{\top}\mathbf{X} + \alpha\mathbf{A}\right)^{-1} \mathbf{X}^{\top}\mathbf{y}$$

Closed-form solution of Ridge Regression

Where A is the (n + 1) × (n + 1) identity matrix with a 0 in the top-left corner, corresponding to the bias term.
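As a sanity check, this closed-form solution can be computed directly with NumPy (a sketch using the m, X, and y generated earlier, so n = 1):

import numpy as np

alpha = 1.0
X_b = np.c_[np.ones((m, 1)), X]   # prepend the bias column x₀ = 1

# A: identity matrix with a 0 in the top-left so the bias term is not regularized
A = np.identity(X_b.shape[1])
A[0, 0] = 0

theta_ridge = np.linalg.inv(X_b.T @ X_b + alpha * A) @ X_b.T @ y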

Implementing Ridge Regression using Scikit-Learn:
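A minimal sketch (the α value and solver choice are illustrative; X and y are the dummy data from earlier):

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky")   # "cholesky" uses the closed-form solution
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])                      # prediction for a new instance x = 1.5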

Now, let us see different Ridge models trained on some linear dummy data. On the left, plain Ridge models are used, leading to linear predictions. On the right, the data is first expanded with polynomial features and then Ridge Regression is applied.

Figure-2: Ridge Regression with different α values.

From Figure-2 we can see that increasing α leads to flatter predictions, thus reducing the model's variance and increasing its bias.
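The right-hand plots can be reproduced with a pipeline that expands and scales the features before applying Ridge; here is a hedged sketch (the polynomial degree and α values are assumptions, not necessarily those used for Figure-2):

from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

for alpha in (0, 1e-5, 1):                      # candidate α values to compare
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=10, include_bias=False)),
        ("scaler", StandardScaler()),           # scale features before regularizing
        ("ridge", Ridge(alpha=alpha)),
    ])
    model.fit(X, y)

Scaling is important here: Ridge Regression is sensitive to the scale of the input features, which is why the pipeline standardizes them before fitting.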

Lasso Regression

Least Absolute Shrinkage and Selection Operator (Lasso) Regression is another regularized version of Linear Regression. Just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ₁ norm of the weight vector w.

Lasso Regression cost function:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert$$

Lasso Regression cost function

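Implementing Lasso Regression with Scikit-Learn follows the same pattern as Ridge (a minimal sketch; the α value is illustrative):

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])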

Let's check out some Lasso models trained on the dummy data:

Figure-3: Lasso Regression with different α values

An important characteristic of Lasso Regression is that it tends to eliminate the weights of the least important features. For example, in the right-hand plot of Figure-3 where α = 1, the model has removed almost all the features because the α value is too high, and it looks more like a linear model than a polynomial one. As the α value decreases, the model becomes more polynomial and fits the data better, but when α = 0 no regularization is applied and the model overfits the data, so the sweet spot here seems to be α = 1e-07. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., one with few non-zero feature weights).
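This sparsity is easy to verify: fit a Lasso model on polynomial-expanded features and inspect the learned coefficients (a sketch with illustrative degree and α; many weights come out exactly 0):

from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

model = Pipeline([
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.1)),
])
model.fit(X, y.ravel())

# Weights of the least important polynomial features are driven to exactly 0
print(model.named_steps["lasso"].coef_)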

Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. It is a mix of both Ridge and Lasso regularization, and you can control the mix using the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression.

Elastic Net cost function:

$$J(\theta) = \mathrm{MSE}(\theta) + r\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2$$

Elastic Net cost function

Implementing Elastic Net using Scikit-Learn:
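A minimal sketch (the α and mix ratio values are illustrative):

from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])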

l1_ratio corresponds to the mix ratio r.

So when to use which?

It is almost always preferable to have at least some regularization, so you should generally avoid plain Linear Regression. So which one should you use?

Ridge Regression is a good default if you want to keep all the features and prevent the weights from blowing up. But if you suspect that only a few features are actually useful, it is better to go with either Lasso Regression or Elastic Net, as they tend to reduce the weights of the less useful features to 0.

In general, Elastic Net is preferred over Lasso because Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Conclusion

In this short article we covered three different regularization methods, how they help prevent a model from overfitting, and when each type is useful.

So go ahead and try to implement them in your next model!
