Lasso and Ridge: Regularized Linear Regression

Dipanshu Prasad · Published in Analytics Vidhya · Sep 22, 2020

Linear Regression is the classic and simplest linear method for regression tasks. It lets us take raw data and use it for predictive and prescriptive analysis. Being simple, however, does not discredit its usefulness in any way. Linear Regression is still a powerful and widely deployed algorithm because of the advantages it provides:

  • It scales to fairly large datasets
  • It is very interpretable: it is easy to figure out why the model produced a particular output for a given input
  • Features do not need to be scaled before fitting a linear regression model to them
  • There are no hyperparameters to tune

However, as is often said, every great thing has a flip side, and this one is no exception.

The simplicity of the plain linear model leaves us with no control over its complexity. It pursues a single goal: finding the hyperplane that minimizes the Root Mean Squared Error (RMSE) on the training data.
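
Written out (in my own notation: w is the weight vector, b the intercept, and n the number of training samples), the plain least-squares objective is

$$\min_{w,\,b}\; \sum_{i=1}^{n} \Big(y_i - \big(w^\top x_i + b\big)\Big)^2$$

Minimizing this sum of squared errors is equivalent to minimizing the RMSE. Lasso and Ridge keep this term and simply add a penalty on the size of w.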

The plain linear model is prone to both underfitting and overfitting.

Too much randomness in the data, or a relationship between features and target that is more complex than a straight line, leads to underfitting. A possible solution is to engineer "new" polynomial features that help the model capture the trend in the data. You can read more about polynomial features at https://medium.com/analytics-vidhya/polynomial-regression-the-curves-of-a-linear-model-bef70876c998.

Too many features to learn from, combined with relatively few samples, often leads to overfitting. We have a few elegant solutions to this problem at our disposal in the form of Lasso and Ridge Regression. With these tools we intentionally decrease the model's complexity so that it focuses on generalizing well to new data rather than memorizing every data point, outliers included.

We need to be familiar with some terminology before we proceed.

Regularization Norms

There are two regularization norms that are usually studied and implemented: L1 regularization and L2 regularization. Both have their own strategy for reducing a model's complexity, built on the same idea of penalizing weights with large magnitudes.

L1 Norm: The penalty is the sum of the absolute values of the feature weights. It drives the model towards a sparse solution in which most features end up with a weight of exactly zero, and the optimization can have multiple solutions. Essentially, the L1 norm performs feature selection: it keeps only a few useful features for building the prediction model and completely ignores the rest.

L2 Norm: The penalty is the sum of the squared weights. It shrinks the magnitudes of the weights associated with all features, thereby reducing the effect of each feature on the predicted value. Because of the squared term, it is not preferred when dealing with outliers. It always has a unique solution and generally handles complex datasets better than the L1 norm.

Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) Regression keeps the fundamental idea of Linear Regression, choosing weights that make predictions reliable, but adds one more constraint to obey: it penalizes the L1 norm of the weights.
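
With the same notation as above (and p for the number of features; sklearn additionally scales the squared-error term by 1/2n, which does not change the idea), the Lasso objective is

$$\min_{w,\,b}\; \sum_{i=1}^{n} \Big(y_i - \big(w^\top x_i + b\big)\Big)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert w_j \rvert$$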

The second term, λ Σ|wⱼ|, is the L1 penalty used in Lasso Regression.

We see the introduction of a new parameter λ. The value of λ controls the degree of regularization: a higher value of λ means stronger regularization, a simpler model, and more weights pushed to exactly zero. When λ = 0, the Lasso model reduces to the plain linear model. λ is called alpha in sklearn's linear models, and its default value is 1.

Let’s watch Lasso Regression in action!

We create a synthetic dataset with five informative features using sklearn’s make_regression method. We then add five new features filled with random values to serve as “useless” features, which we expect to end up with very small coefficient values.

Then we use the dataset to train several linear models, as in the sketch below.
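
A minimal sketch of this experiment, assuming 100 samples, a fixed random seed, and the alpha values compared below:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso

# 5 informative features plus 5 purely random "useless" features
rng = np.random.RandomState(0)
X, y = make_regression(n_samples=100, n_features=5, n_informative=5,
                       noise=10.0, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 5))])

# Plain linear regression as a baseline
ols = LinearRegression().fit(X, y)
print("OLS:", np.round(ols.coef_, 2))

# Lasso with increasing regularization strength (alpha plays the role of λ)
for alpha in [0.1, 1, 10]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"Lasso (alpha={alpha}):", np.round(lasso.coef_, 2))
```

Printing the coefficients side by side for a few values of alpha is enough to see the trend described next.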

Now, let’s look at how the coefficients change with the value of alpha.

We see that the values of the coefficients vary with alpha. As alpha increases, the magnitudes of the coefficients are pushed closer to zero. In fact, at alpha=10, four of the five random features get a coefficient of exactly zero, just as we expected.

Ridge Regression

Ridge Regression also uses a modified version of the simple linear regression loss function, this time with the L2 norm as the regularizer.
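
With the same notation as before, the Ridge objective is

$$\min_{w,\,b}\; \sum_{i=1}^{n} \Big(y_i - \big(w^\top x_i + b\big)\Big)^2 \;+\; \lambda \sum_{j=1}^{p} w_j^2$$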

The second term, λ Σwⱼ², is the L2 penalty used in Ridge Regression.

As in Lasso, the parameter λ controls the amount of regularization. A higher value of λ means stronger regularization, a simpler model, and smaller weight magnitudes. At λ = 0, Ridge behaves exactly like the plain linear model. The default value of λ (alpha in sklearn) is 1.

Ridge regression therefore makes a trade-off between model simplicity and training set score.

Let’s again look at the effect of alpha on the values of the coefficients, as in the sketch below.
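
A minimal sketch of the same comparison with Ridge, regenerating the synthetic data under the same assumptions as the Lasso sketch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Same synthetic data as before: 5 informative + 5 random features
rng = np.random.RandomState(0)
X, y = make_regression(n_samples=100, n_features=5, n_informative=5,
                       noise=10.0, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 5))])

# Ridge with increasing regularization strength
for alpha in [0.1, 1, 10]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"Ridge (alpha={alpha}):", np.round(ridge.coef_, 2))
```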

We see a similar trend in how the coefficients relate to alpha. However, in Ridge regression the coefficients never actually become exactly zero.

ElasticNet Regression

ElasticNet Regression is a powerful algorithm that combines the strengths of both Lasso and Ridge regression. It can perform feature selection like Lasso and shrink coefficients towards zero like Ridge, because it has to obey both the L1 and the L2 penalties.

Like Lasso and Ridge, the degree of regularization is tuned with the alpha parameter. There is also a mixing parameter called l1_ratio that lets us tune the balance between L1 and L2 regularization (see the sketch after the list below):

  • l1_ratio lies in [0,1]
  • Its default value is 0.5
  • l1_ratio = 0 corresponds to pure Ridge regression
  • l1_ratio = 1 corresponds to pure Lasso regression
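
A minimal sketch of fitting ElasticNet on the same synthetic data, assuming the default l1_ratio of 0.5 and the same alpha values as before:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Same synthetic data: 5 informative + 5 random features
rng = np.random.RandomState(0)
X, y = make_regression(n_samples=100, n_features=5, n_informative=5,
                       noise=10.0, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 5))])

# alpha sets the overall strength; l1_ratio mixes L1 (1.0) and L2 (0.0) penalties
for alpha in [0.1, 1, 10]:
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X, y)
    print(f"ElasticNet (alpha={alpha}):", np.round(enet.coef_, 2))
```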

In this case we see the coefficients pushed towards zero more strongly than with Lasso or Ridge alone.

Leave a clap if you found it informative!
