Regularized Linear Models

Shivani Sinha
Published in Analytics Vidhya · May 3, 2020 · 4 min read

In machine learning, we often face the problem where a model performs well on training data but very poorly on test data. This happens when the model follows the training data too closely, i.e., it overfits the data.


Regularization is a technique to reduce overfitting. The term regularization means the act of bringing to uniformity.

Complex models can detect subtle patterns in the data, but if the data is noisy (contains irrelevant information) or the dataset is too small, the model will end up fitting patterns in the noise itself. When we use such a model for prediction, the results will be inaccurate and the error will be higher than expected.

In linear regression, the final output is a weighted sum of the feature variables, represented by the equation below.

y = w₁x₁ + w₂x₂ + w₃x₃ + … + wₙxₙ + w₀

Here the weights w₁, w₂, …, wₙ represent the importance of the features x₁, x₂, …, xₙ. A feature is of high importance if it has a large weight associated with it.

The error measure in linear regression is the mean squared error (MSE):

MSE = (1/m) · Σ (yᵢ − ŷᵢ)²

where m is the number of training instances, yᵢ is the actual value, and ŷᵢ is the predicted value.
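As a quick worked example, here is the MSE computed directly with NumPy on some made-up predictions (all the numbers are hypothetical):

```python
import numpy as np

# Hypothetical actual values and model predictions
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.175
```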

To improve the model, or to reduce the effect of noise on it, we need to reduce the weights associated with the noise: the smaller the weight associated with a noisy feature, the less it contributes to the predicted output.

For a linear model, regularization is achieved by constraining the weights of the model. To constrain the weights, we first need to understand how they are calculated. Weights are chosen to minimize the cost function; for linear regression, the cost function is the mean squared error. The weights are tweaked iteratively, the MSE is recalculated each time, and the set of weights with the minimum MSE is taken as the final model.
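The sketch below illustrates this idea with plain batch gradient descent on the MSE, using NumPy and made-up data (the data, learning rate, and iteration count are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: y = 4 + 3x plus some noise
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, 100)

# Add a column of ones so the first weight acts as the bias w0
X_b = np.c_[np.ones((100, 1)), X]

w = rng.random(2)   # random initial weights
eta = 0.1           # learning rate

# Repeatedly tweak the weights in the direction that lowers the MSE;
# the weights reached at the minimum are kept as the final model
for _ in range(1000):
    gradients = 2 / 100 * X_b.T @ (X_b @ w - y)
    w -= eta * gradients

print(w)  # should end up close to [4, 3]
```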

To regularize the model, a regularization term is added to the cost function.

Regularized Cost Function = MSE + regularization term

Here we will see three different regularization terms for constraining the weights of the model, and thus three different regularized linear regression algorithms:

  1. Ridge Regression
  2. Lasso Regression
  3. Elastic Net

Ridge Regression:

In ridge regression, the regularization term is the sum of the squares of the model's weights (the squared L2 norm of the weight vector). It forces the model to keep the weights as small as possible:

Regularized Cost Function = MSE + α · Σ wᵢ²

where the sum runs over w₁, …, wₙ (the bias term w₀ is not regularized).

It is important to note here that the regularization term is only added during training; once the model is trained, we evaluate it with the unregularized error measure, since we want the test metric to stay as close to the final objective as possible.

In the above equation, alpha is a hyperparameter: it controls how much we want to regularize our regression model. If we choose a very large alpha, the learning algorithm will keep the weights as small as possible, because large weights inflate the cost function; the result is a flat line passing through the mean of the data. If alpha is 0, ridge regression is nothing but plain linear regression. The best value of alpha is found through hyperparameter tuning, for example grid search with cross-validation.

Another important thing to note: whenever we apply this technique, we should scale the data first, as ridge regression is sensitive to the scale of the input features. This is true for most regularized models.
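A minimal sketch with scikit-learn, chaining StandardScaler and Ridge in a pipeline so that scaling always happens before the regularized fit (the data and the alpha value are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: the second feature is irrelevant
X = rng.random((100, 3))
y = X @ np.array([5.0, 0.0, 2.0]) + rng.normal(0, 0.1, 100)

# Scale first, then fit ridge; alpha controls the regularization strength
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

print(model.named_steps["ridge"].coef_)
```

Try a very small or very large alpha here to see the weights move between the plain linear-regression solution and heavily shrunken values.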

Lasso Regression:

Least absolute shrinkage and selection operator (lasso) regression uses the L1 norm for regularization, i.e., the sum of the absolute values of the weights:

Regularized Cost Function = MSE + α · Σ |wᵢ|

An important characteristic of lasso regression is that it tends to eliminate the least important features by shrinking their weights to exactly zero, which is why it is also used for feature selection.
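A minimal sketch of this effect, again with scikit-learn on made-up data where only the first two of five features matter (alpha is chosen arbitrarily):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((100, 5))

# Only the first two features actually influence y in this made-up data
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# The weights of the irrelevant features should be driven to exactly zero
print(lasso.coef_)
```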

Elastic Net:

Elastic net is a mix of ridge and lasso regression; how much of each depends on the mix ratio r. When r is set to zero, the penalty reduces to pure ridge, and when it is one, it becomes pure lasso:

Regularized Cost Function = MSE + r·α·Σ |wᵢ| + ((1 − r)/2)·α·Σ wᵢ²
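In scikit-learn, the mix ratio r corresponds to the l1_ratio parameter of ElasticNet. A minimal sketch on the same kind of made-up data (alpha and l1_ratio are arbitrary choices here):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

# l1_ratio plays the role of r: 0 gives a pure ridge (L2) penalty,
# 1 gives a pure lasso (L1) penalty, values in between mix the two
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)

print(enet.coef_)
```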

The important question is how to decide which of these to use. The answer: it is almost always preferable to have at least a little regularization, so ridge regression is a good default. If you suspect that only a few features are actually useful, prefer lasso or elastic net, since they tend to shrink the weights of the useless features to zero. Elastic net is generally preferred over lasso when the dataset has a large number of features, especially when there are more features than training instances or several features are strongly correlated, as lasso can behave erratically in those cases.

Thanks for reading!
