Regularization — a solution to overfitting

Yogesh Khurana
Yogesh Khurana’s Blogs
Dec 20, 2019

Fed up with Model Over-fitting? Regularization comes to the rescue…

When you are working on data to prepare a model, you are usually working with a variety of features/variables. Some variables may provide most of the information, while others may be irrelevant to the model.

These irrelevant variables create noise in the model. Noise means irrelevant information or randomness in the data. Here, you should know about the concept of bias and variance. If your model fits the training data so well that it negatively impacts its performance on new data, it means you have low bias on the training data, which results in high variance on new data. This problem is called over-fitting. So, we can say that over-fitting occurs when your model learns the noise in the data.

Regularization is used to solve the over-fitting problem of machine learning models. Ideally, a model should not give too much weight to any particular feature; the weights should be more evenly distributed. This can be achieved through regularization. Regularization adds a penalty on the parameters of the model to make it generalize better to new data.
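As a minimal sketch of this idea (using NumPy; the function name regularized_loss and the names weights and lam are placeholders chosen just for illustration, and the squared penalty here is only one possible choice), a penalized loss can look like this:

```python
import numpy as np

def regularized_loss(X, y, weights, lam):
    """Mean squared error on the training data plus a penalty on the weights."""
    predictions = X @ weights                  # model predictions
    mse = np.mean((y - predictions) ** 2)      # data-fit term
    penalty = lam * np.sum(weights ** 2)       # penalty term (squared magnitude, as in Ridge)
    return mse + penalty

# Example: the same weights give a larger loss once the penalty is switched on.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([2.0])
print(regularized_loss(X, y, w, lam=0.0))   # 0.0  (perfect fit, no penalty)
print(regularized_loss(X, y, w, lam=0.1))   # 0.4  (penalty = 0.1 * 2**2)
```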

There are two types of regularization:

  • Lasso Regression (L1 regularization)
  • Ridge Regression (L2 regularization)

Ridge regression is a regularization technique that adds a penalty to the error (cost) function. In the case of regression, the cost function is the sum of squared residuals. Ridge regression adds the “squared magnitude” of the coefficient as a penalty term to the loss function.

Ridge regression tries to find a new line that does not fit the training data as well as the least-squares line does. It deliberately increases the bias on the training data and, in return, reduces the variance when predicting on new data. In other words, by slightly compromising on the training-data fit, Ridge regression provides better long-term results.

So, how does Ridge regression work?

As already mentioned, Ridge regression adds a penalty to the loss function. Here the loss function is the sum of squared residuals, so the new loss function becomes:

New Cost function: the sum of squared residuals + λ * slope²

The first part of the formula is the same as in OLS (ordinary least squares); the second part is the penalty. slope² is the penalty added to the loss function, and λ determines how severe the penalty is. So now our goal is to minimize this new loss function instead of the old one. λ can take any value from 0 to positive infinity. λ = 0 means there is no penalty, and the Ridge regression line is identical to the least-squares line. As we start increasing λ, the slope of the Ridge regression line shrinks, because a smaller slope keeps the penalty term, and hence the new loss function, small.
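To see this in action, here is an illustrative sketch using scikit-learn’s Ridge (its alpha parameter plays the role of λ; the synthetic data is invented just for this example):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 1) * 10                  # one feature
y = 3 * X.ravel() + rng.randn(20) * 2     # underlying slope of about 3, plus noise

for lam in [0, 1, 10, 100]:
    model = Ridge(alpha=lam).fit(X, y)
    print(f"lambda = {lam:>3}: slope = {model.coef_[0]:.3f}")

# lambda = 0 reproduces the least-squares slope; larger lambdas shrink it towards zero.
```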

So what is the optimal value of λ? Here we can use cross-validation to try different values of λ and check which one gives the lowest variance when predicting on new data. One important point to remember: the Ridge regression penalty includes all the parameters except the intercept.
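One convenient way to do this (a sketch assuming scikit-learn is available; the grid of candidate values below is an arbitrary illustrative choice) is RidgeCV, which cross-validates over a list of λ values and keeps the best one:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(50, 3)                                      # three features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.randn(50) * 0.5

# Try several candidate lambdas (called alphas in scikit-learn) with 5-fold CV.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print("best lambda:", model.alpha_)
print("coefficients:", model.coef_)
```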

Lasso regression (Least Absolute Shrinkage and Selection Operator) is the same as Ridge regression except for two key differences:

  • Instead of slope², Lasso takes the absolute value of the slope, i.e. |slope|, as the penalty. New cost function: the sum of squared residuals + λ * |slope|
  • Lasso regression can shrink the coefficients of the less important features all the way to zero, thus removing some features altogether.

So we can say that Lasso regression works well when your data has many useless variables/features, and it therefore contributes towards feature selection, while Ridge regression works well when most of your variables are useful.
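The sketch below (synthetic data in which only the first two of six features matter; the alpha values are illustrative, not tuned) shows this difference: Lasso drives the useless coefficients to exactly zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 6)
# Only the first two features actually matter; the other four are pure noise.
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.randn(100) * 0.5

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))  # all non-zero, just shrunk
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # zeros for the noise features
```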

Some important points:

  • Regularization techniques can be applied to both regression and classification models.
  • Older methods like cross-validation and step-wise regression for handling over-fitting and performing feature selection work well with a small set of features, but for a huge set of features, Lasso and Ridge are good alternatives.
  • When we have a lot of data and the sum of squared residuals is small, we can be fairly confident that the least-squares line predicts accurately.
  • But when we have a very small dataset and a very small sum of squared residuals, there is a risk of over-fitting: the line tries to pass through every point, which reduces the bias but increases the variance when making predictions on new data.

In this article, I have covered the concept mainly from a theoretical point of view. I will include a more complete Python walkthrough in the next article.
