The new kids on the block — Elastic net, LASSO and ridge

germayne · Published in eat-pred-love · Jun 22, 2017
The new kids

Elastic net, lasso and ridge are the new cool kids that joined your neighborhood. They sit at the intersection of statistics and machine learning, and are known as penalized regression models.

“ My personal bias is this: In this day and age, if you are building linear models without any regularization, you must be really special. Always regularize. It’s required.” — Owen Zhang, NYC data science academy

Penalized regression helps to deal with two major issues in regression: feature selection and multicollinearity. It helps to prevent overfitting and outputs a model that fits the training data less well than OLS, but generalizes better and is less sensitive to outliers. Elastic net is the general form of ridge and lasso; think of it as a combination of both. Recall that in ordinary regression we use OLS, minimizing the sum of squared differences between the actual and predicted values.

Assuming we have M features and N data points:

The original OLS
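In this notation, the OLS estimate is the one that minimizes the residual sum of squares (RSS):

```latex
\min_{\beta}\; \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j x_{ij} \Bigr)^2
```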

In contrast, the elastic net estimation procedure minimizes the sum of squares plus a penalty based on the size of the estimated coefficients. These penalties are known as the L1 / L2 norms.
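One common way to write the objective, assuming the glmnet-style parameterization in which lambda scales the overall penalty and alpha mixes the two norms (exact scaling conventions differ slightly between implementations):

```latex
\min_{\beta}\; \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j x_{ij} \Bigr)^2
+ \lambda \Bigl( \alpha \sum_{j=1}^{M} \lvert \beta_j \rvert + (1 - \alpha) \sum_{j=1}^{M} \beta_j^2 \Bigr)
```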

The hyperparameter lambda controls the weight of the penalty, while alpha controls the mix between the L1 and L2 norms. The optimal values of alpha and lambda are found with a grid search over the best cross-validation score (k-fold CV).
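As an illustration, here is a minimal sketch of such a grid search using scikit-learn's ElasticNetCV on a synthetic data set. Note that scikit-learn's naming is the reverse of the one used here: its `alpha` is the overall penalty weight (our lambda) and its `l1_ratio` is the mixing parameter (our alpha).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic data for illustration: 100 observations, 10 features.
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# 5-fold CV over a grid of mixing parameters (l1_ratio, the "alpha" in this post)
# and penalty strengths (alpha, the "lambda" in this post).
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], n_alphas=100, cv=5)
model.fit(X, y)

print("best mixing parameter (alpha in this post):", model.l1_ratio_)
print("best penalty weight (lambda in this post):", model.alpha_)
```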

Depending on which penalty is imposed, it is known as ridge (L2) or lasso (L1).

Ridge Regression

Residual sum of squares with additional penalty (L2)
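In the same notation as above, the ridge objective is:

```latex
\min_{\beta}\; \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j x_{ij} \Bigr)^2
+ \lambda \sum_{j=1}^{M} \beta_j^2
```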
  • Ridge helps to deal with the instability of the coefficient estimates of a linear model when the features are collinear.
  • Coefficients of unimportant features will shrink towards zero, but not exactly to zero.
  • Hyperparameter alpha = 0
  • Minimize the objective (RSS + L2 norm)
  • As lambda increases, RSS increases and model complexity decreases (see the sketch after this list).
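To make the shrinkage concrete, here is a small sketch using scikit-learn's Ridge (its `alpha` argument plays the role of lambda here; the data is synthetic):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# Increasing the penalty (lambda) shrinks the coefficients towards zero,
# but ridge never sets them exactly to zero.
for lam in [1.0, 10.0, 100.0, 1000.0]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>7}: coefficient L2 norm = {np.linalg.norm(coefs):.2f}")
```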

Lasso Regression

Residual sum of squares with additional penalty (L1)
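And the lasso objective in the same notation:

```latex
\min_{\beta}\; \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j x_{ij} \Bigr)^2
+ \lambda \sum_{j=1}^{M} \lvert \beta_j \rvert
```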
  • Lasso performs feature selection by forcing the coefficients of some variables to be exactly zero.
  • Hyperparameter alpha = 1
  • Minimize the objective (RSS + L1 norm)
  • As lambda increases, RSS increases and model complexity decreases
  • Note that lasso removes variables (giving them a coefficient of 0), so as lambda increases, expect higher sparsity (see the sketch after this list).
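A small sketch of that sparsity effect with scikit-learn's Lasso (again, its `alpha` argument corresponds to the lambda in this post; the data is synthetic):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 10 features are actually informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# As the penalty (lambda) increases, more coefficients are driven to exactly zero.
for lam in [0.1, 1.0, 10.0, 100.0]:
    coefs = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>6}: non-zero coefficients = {np.count_nonzero(coefs)}")
```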

Theoretically, if lambda = 0 then we are minimizing only the RSS, and this reduces to the regular regression that we know. However, as pointed out by Peter, the output of glmnet will still differ slightly because of numerical issues in the optimization.
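The same idea can be illustrated with scikit-learn (not the glmnet behaviour referred to above): with a near-zero penalty, the coordinate-descent solver recovers coefficients that are close to, but not bit-for-bit identical with, the OLS solution.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
# alpha is the penalty strength (the lambda in this post); near-zero rather than
# zero, because the coordinate-descent solver is not meant for an unpenalized fit.
enet = ElasticNet(alpha=1e-8, l1_ratio=0.5, max_iter=100_000).fit(X, y)

print("max coefficient difference:", np.max(np.abs(ols.coef_ - enet.coef_)))
```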

The best of both worlds?

Having both penalties together gives us elastic net regularization, introduced by Zou and Hastie in 2005.

Originally published at germayneng.github.io on June 22, 2017.
