Understanding Regularization

Kemalcan Jimmerson · Published in The Startup · Jun 3, 2020

After working as a regional manager at the Scranton office for years, you are appointed to the Stamford office of the same organization. Since you were doing great in your previous position, you expect the same results in this new role too. It is basically the same job. Within a few weeks, however, you realize your prediction was wrong.

In this analogy, your success in your previous role represents the success of your predictions on the training dataset. You trained so well to establish yourself in that position that you know every detail of it. In other words, the R-Squared score of your model on the training data is very high. This gives you a lot of confidence.

As you can guess, your second role in the company represents your testing data. The poor performance in the new position is like a low R-Squared score of your model on the testing dataset. We have all had experience with models that score high on training data and low on testing data.

Why is your model not performing well on the testing data when it was doing well on the training data?

It is the same reason you failed in the new position: you developed your skills specifically and only for your old job's requirements, and you did not think about what else you might need in the future. You are an expert at what you did, but you have not yet developed the skills your new job demands.

For the same reason, your model fits the training data well but is not flexible enough for new data. In other words, you overfit to your old role; in prediction models, these are called overfitting models. Overfitting models learn the noise in the data rather than just the signal, so they have high variance. To understand regularization, you have to be familiar with the bias-variance trade-off. High variance means that our model does not predict well on new data, and it usually occurs when we create "too complex" models.

An overfitting model matches the training data "too closely". The plot is from https://github.com/timbook/RegularizationPlot
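To make this concrete, here is a minimal sketch (the synthetic dataset and the degree-15 polynomial are illustrative assumptions, not from the original post): a "too complex" model scores a very high R-Squared on its training split and a much lower one on the held-out split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A degree-15 polynomial is "too complex" for 45 training points,
# so it starts learning the noise instead of just the signal.
overfit_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit_model.fit(X_train, y_train)

print("Training R^2:", overfit_model.score(X_train, y_train))  # very high
print("Testing  R^2:", overfit_model.score(X_test, y_test))    # much lower
```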

High Variance? What can I do?

There are a few things you can do. One of these may help:

  • Gather more data.
  • Drop some features from your model.
  • Use Regularization.

Gathering more data can be an expensive solution, and your model will also need more computational power to process it. Even when it helps, it may not always be practical.

One solution might be regularization. Regularization means adding a small amount of bias to your model. This bias will decrease your training accuracy, but in return for that small amount of bias, you get a significant drop in variance.

Therefore, regularization is your tool for handling overfitting. (Think about being flexible rather than rigid at the workplace.) This small amount of added bias is called the penalty.

How do we use Regularization?

Since there are several articles available about the technicality of regularization, I am not going to talk about the details.

There are two common types of regularization.

  1. Ridge Regularization
  2. LASSO Regularization

The difference is how the penalty (manually added bias) is calculated.

A common loss function is Mean Squared Error (MSE). It represents the mean of the squared distances between the observed values and the predicted values (ŷ). The MSE score gives you an idea of how much error your model makes in its predictions.
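For instance, here is a quick sketch of MSE computed by hand and compared against Scikit-Learn's built-in helper (the toy numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # observed values
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # model predictions (ŷ)

# MSE = mean of the squared distances between y and ŷ
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both print 0.4375
```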

We apply the penalty to the coefficients and attach it to the MSE formula as shown below:
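In standard notation, the penalized cost functions take this form (MSE plus a penalty on the p coefficients β₁ … βₚ):

```latex
\text{Ridge cost} = \mathrm{MSE} + \lambda \sum_{j=1}^{p} \beta_j^{2}
\qquad\qquad
\text{Lasso cost} = \mathrm{MSE} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```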

Ridge regression and Lasso regression use different penalty calculations. Here the penalty coefficient (λ) defines how much you want to regularize your model. λ is a non-negative value; as it increases, the severity of the penalty increases as well. That is why choosing λ is crucial for regularization. There are ways to find the optimum λ, but we can discuss them in another post.

Ridge regression (also known as Tikhonov regularization) adds λ times the squared coefficients (the "slope squared") to the loss. Based on this, we can say that when λ equals 0, the ridge penalty is 0. This means there is no penalty, and our model becomes good old Linear Regression.
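A small sketch of that special case (the toy data is made up for illustration): with alpha (λ) set to 0, Ridge applies no penalty and recovers the ordinary linear regression coefficients, while a larger alpha shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# No penalty -> same coefficients as ordinary least squares.
# (Scikit-Learn's docs advise simply using LinearRegression in this case.)
print(Ridge(alpha=0).fit(X, y).coef_)
print(LinearRegression().fit(X, y).coef_)   # good old Linear Regression

print(Ridge(alpha=10.0).fit(X, y).coef_)    # coefficients shrink toward 0
```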

The major difference between Ridge and Lasso is that Lasso uses the sum of the absolute values of the β coefficients, while Ridge uses the sum of their squares. The distance measure that Lasso uses is called the "Manhattan distance"; it is also known as the l1 norm.

Ridge regression uses the "Euclidean" norm while LASSO regression uses the "Manhattan" norm. https://github.com/timbook/RegularizationPlot
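As a tiny numeric sketch (the β values and λ below are made up), here are the two penalty terms computed for the same set of coefficients:

```python
import numpy as np

beta = np.array([1.5, -0.8, 0.3])   # hypothetical model coefficients
lam = 0.5                           # hypothetical λ

ridge_penalty = lam * np.sum(beta ** 2)      # λ * Σ βj²  (squared l2 / "Euclidean"-style)
lasso_penalty = lam * np.sum(np.abs(beta))   # λ * Σ |βj| (l1 / "Manhattan")

print(ridge_penalty, lasso_penalty)  # 1.49 and 1.3
```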

An important note about using regularization: we have to standardize our data before we apply regularization methods. As we saw above, the penalty term consists of λ and the coefficients (β values). In order to penalize each coefficient fairly, the coefficients need to be on the same scale.
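One convenient way to do this is to chain the scaler and the regularized model in a pipeline. A brief sketch (the synthetic, mixed-scale data is an assumption for illustration; substitute your own X and y):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) * [1, 10, 100, 1000]          # features on very different scales
y = X[:, 0] + 0.01 * X[:, 2] + rng.normal(size=50)

# StandardScaler puts every feature on the same scale before the penalty is applied.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)
```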

Additionally, there is another regularization method called Elastic Net. We did not list Elastic Net with the others because it is a combination of the Ridge and Lasso penalty methods.
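In Scikit-Learn the blend is controlled by l1_ratio; the values below are arbitrary examples, and X_train / y_train stand in for your own data:

```python
from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 corresponds to the Lasso penalty, l1_ratio = 0.0 to the Ridge penalty.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # half l1, half l2 penalty
# enet.fit(X_train, y_train)                 # X_train / y_train: your own data
```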

We use regularization in both classification and regression models. Many of us use Scikit-Learn classes for creating machine learning models in Python. Please note that Scikit-Learn uses α (alpha) instead of the λ term. It does the same job as the λ parameter, so there is no λ term in Scikit-Learn's regularization classes.
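For example, the regularization strength shows up as the alpha argument (and, for classification, as C, its inverse):

```python
from sklearn.linear_model import Ridge, Lasso, LogisticRegression

ridge = Ridge(alpha=1.0)    # alpha plays the role of λ
lasso = Lasso(alpha=0.1)

# LogisticRegression is regularized by default; its knob is C,
# the inverse of the regularization strength (smaller C = stronger penalty).
logit = LogisticRegression(C=1.0, penalty="l2")
```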

