Ridge Regression (now with interactive graphs!!!)

Ishan Shishodiya
Published in ml-concepts.com · May 4, 2022 · 7 min read

So… Ridge Regression is a modified version of Linear Regression, which means that before learning about Ridge Regression you have to make sure you understand Linear Regression. If you don't, then click here. And if you don't know what Gradient Descent is, then click here.

It is an absolute must that you know both concepts before proceeding with this article. Take your time to learn these things; I'll wait for you here.

(Waiting)

(Still Waiting)

Learned it?

Cool. Let’s proceed then.

Oh yeah, one more thing. I made this Kaggle Notebook which has interactive code in it. If you want to see how you can code Ridge Regression in Python from scratch, click here. It's pretty good, not gonna lie. And don't worry, the content in it is not inferior to this article. But if you only want to learn the concepts behind Ridge Regression and how it works, then no problem, just follow this article (I won't know whether you skipped the notebook anyway, so your lives aren't under any kind of threat yet).

Why Ridge Regression?

As I mentioned above, Ridge Regression is just a modified version of Linear Regression. It’s something new…but not something entirely new. But why did Ridge Regression even come into existence and why did we all just accept it?

Well…the answer is pretty simple.

In Linear Regression you need a lot of data to make accurate predictions. But if you only have a small subset of the original data, then your predictions would be pretty whack (inaccurate). Ridge Regression helps with this by letting us make reasonably accurate predictions even when we have very limited data. Let's take an example of this.

Suppose you have two lists x and y.

x = [1, 2, 5, 6, 8, 9, 12, 14] and y = [3, 6, 8, 4, 9, 12, 9, 12].

If we plot a line of best fit for this data using Linear Regression with Gradient Descent (we discussed it in this article), it would look something like this,

Line of best fit using Linear Regression with Gradient Descent (Click here for an interactive chart) (Image 1)
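
If you'd like to see what that fit looks like in code rather than just on a chart, here's a rough sketch. This isn't the notebook's exact code, just a bare-bones gradient descent on the MSE cost written for this example:

```python
# A bare-bones sketch (not the notebook's exact code): fit a line to the data
# above with plain gradient descent on the MSE cost.
x = [1, 2, 5, 6, 8, 9, 12, 14]
y = [3, 6, 8, 4, 9, 12, 9, 12]

def fit_line(xs, ys, lr=0.01, epochs=5000):
    """Return (intercept b0, slope b1) found by gradient descent on the MSE."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of (1/n) * sum((y - (b0 + b1*x))**2) w.r.t. b0 and b1.
        grad_b0 = (-2 / n) * sum(yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys))
        grad_b1 = (-2 / n) * sum((yi - (b0 + b1 * xi)) * xi for xi, yi in zip(xs, ys))
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1

b0, b1 = fit_line(x, y)
print(f"Full-data fit: y = {b0:.2f} + {b1:.2f}x")
```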

But suppose we didn't have the whole data, only a subset of it, like the first two items from x and y. Because we have a lot less data here compared to the previous example, we can assume that the predictions from this model won't be very accurate.

Let’s plot the line of best fit we get from the model trained on this subset.

Subset line of best fit vs Original Line of best fit (Click here for an interactive chart) (Image 2)
Subset line of best fit on the Original Data (Click here for an interactive chart) (Image 3)

As you can see from the two graphs above… this new line of best fit, which was calculated using a small subset of the original data, is not very accurate. It strays off from the original data by a lot and is nowhere close to the original line of best fit.
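
Continuing the sketch from above, training on just those first two points looks like this:

```python
# Same sketch, but trained only on the first two items of x and y.
x_sub, y_sub = x[:2], y[:2]          # [1, 2] and [3, 6]
b0_sub, b1_sub = fit_line(x_sub, y_sub)
print(f"Subset fit:    y = {b0_sub:.2f} + {b1_sub:.2f}x")
# The subset line runs (almost exactly) through its two points, so its slope
# (about 3) is far steeper than the roughly 0.6 fitted on the full data.
```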

But if we plot this new line of best fit using Ridge Regression, we'll be able to prevent this. But how exactly does it prevent this, and what's the logic behind it? For this question, let me introduce three new terms: bias, variance, and the bias-variance tradeoff.

Bias-Variance Trade-Off

Having a model with both 0 bias and 0 variance is impossible. Bias is the error a model makes because its assumptions are too simple to fully capture the data it was trained on, while variance is how much its predictions change when it is trained on a different set of data. We can't get both down to zero, but we can still get the best model by making sure bias and variance are both as low as possible. This can be done by increasing or decreasing the bias/variance of a model and seeing how it affects its variance/bias.

Let’s take the example of overfitting.

In an overfitting model, we see that the model is very good at predicting the training data but very inaccurate at predicting the testing data. And the main reason for this is that the model got way too comfortable with the training data.

So to fix this, all we need to do is make the model a little less accurate at predicting the training data. That by itself would allow the model to adapt to new testing data.

Example of Overfitting (Image 4)

In the image above you can see that the green line fits the training data perfectly and even accounts for the outliers. But because it's way too specific and doesn't follow a particular trend, it's not very good at predicting new values. In contrast, the black line has a clear trend even though it misclassifies a few values. This trend itself makes it more adaptable to new data compared to the green line.

The green line has very low bias but high variance. The black line, on the other hand, has a higher bias than the green line, but accepting that extra bias makes it more adaptable to new data, which means it has a lower variance.

In the above example, we were able to decrease the variance by increasing the bias. This is what is called a bias-variance tradeoff.

Ridge Regression does the exact same thing. Even though the line it builds from the training data doesn't fit it as nicely as the line made using Linear Regression, that line would be better at adapting to newer data compared to the other one. We'll see this in more detail in just a bit.

Ridge Regression

So the only difference between Ridge Regression and Linear Regression is the Cost Function. If you remember Gradient Descent, then you can probably recall how important a role the Cost Function plays in making predictions.

If you remember, in Gradient Descent we used the Mean Squared Error (MSE) as the cost function.

MSE for Linear Regression (Image 5)
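
In the usual notation, the MSE cost for a line with intercept β0 and slope β1 over n data points (xi, yi) is:

```
J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2
```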

For the highest possible accuracy, we want to minimize the Cost Function, i.e. J(β0, β1) ≈ 0.

For Ridge Regression, we’ll change this formula a little.

MSE for Ridge Regression (Image 6)
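
Written out, the only change is one extra term tacked onto the end (the exact scaling of this penalty differs a bit between sources, but the idea is the same):

```
J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 + \lambda \beta_1^2
```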

Penalization

This extra term, λβ1², that has been added to the Cost Function for Gradient Descent is called the penalization term.

Here λ is called the penalization factor. If the value for lambda is set to a very large number like 100000, then the slope of the best fit line would be very close to 0. Not exactly zero, but very close to it.

This new term penalizes large slope values by giving them a high Cost Function value. This is done because large slope values can be a sign of overfitting. For a large slope, β1 would be a large number, meaning the whole term λβ1² would also be large, which would in turn blow up the Cost Function. In other words, for large values of β1 our Cost Function won't be minimized.
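
Here's how that penalty could look in the gradient descent sketch from earlier. This is just my own minimal version (the notebook may do it slightly differently), with the gradient of the penalty, 2λβ1, added to the slope update:

```python
# Ridge version of the earlier sketch: same gradient descent, but with the
# penalty lambda * b1**2 added to the cost. Only the slope is penalised; the
# intercept b0 is left alone, which is the usual convention.
def fit_ridge(xs, ys, lam=1.0, lr=0.005, epochs=5000):
    """Return (b0, b1) found by gradient descent on MSE + lam * b1**2."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_b0 = (-2 / n) * sum(yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys))
        grad_b1 = (-2 / n) * sum((yi - (b0 + b1 * xi)) * xi for xi, yi in zip(xs, ys))
        grad_b1 += 2 * lam * b1      # extra gradient coming from the penalty
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1

# The bigger lambda gets, the more the slope is squashed towards zero.
# (A huge lambda like the 100000 above would also need a much smaller
# learning rate for the updates to stay stable.)
for lam in (0, 1, 10, 100):
    _, slope = fit_ridge(x, y, lam=lam)
    print(f"lambda = {lam:>3}: slope = {slope:.3f}")
```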

Let’s use this new Cost Function for plotting a line of best fit on the subset of the data.

Ridge Regression — Ridge Regression line of best fit vs the line of best fit from the subset
Ridge Regression line of best fit vs the line of best fit from the subset (Click here for an interactive chart) (Image 7)

You can see that the new line we got using Ridge Regression is much different from the older one, even though both were trained on the same subset of the data. The new Cost Function seems to be doing something for sure. The two lines here are like your average pair of siblings. Even though both grew up in the same environment, the younger one did better than the older one.
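
To make that concrete with the sketch above, here's the ridge fit on the two-point subset next to the plain fit on the same subset. The λ value of 1 is just something I picked for this sketch; the article's charts may use a different value:

```python
# Ridge fit on the two-point subset, next to the plain (unpenalised) fit on
# the same subset. lam=1.0 is just a value picked for this sketch.
b0_r, b1_r = fit_ridge(x_sub, y_sub, lam=1.0)
print(f"Ridge on subset:        y = {b0_r:.2f} + {b1_r:.2f}x")
print(f"Plain fit on subset:    y = {b0_sub:.2f} + {b1_sub:.2f}x")
```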

Bad analogies and failed attempts to make a good joke aside, let's plot this line of best fit we got from Ridge Regression on the whole data and compare it to the line of best fit we got from the whole data.

Ridge Regression — Ridge Regression line of best fit vs the line of best fit for the whole data
Ridge Regression line of best fit vs the line of best fit for the whole data (Click here for interactive chart) (Image 8)

You can see that the line of best fit we got from applying Ridge Regression on the subset of the data and the line of best fit we got by applying Linear Regression on the whole data nearly overlap. That’s good news!
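
And just to check that with numbers, here's the ridge-on-subset line from the sketch next to the Linear Regression line fitted on the whole data. With the λ I picked, the two come out very close; your exact numbers will depend on the λ you use:

```python
# Ridge fit from the two-point subset vs the Linear Regression fit on the
# whole data - with the lambda picked above, the two lines nearly coincide.
print(f"Ridge on subset:        y = {b0_r:.2f} + {b1_r:.2f}x")
print(f"Linear fit, full data:  y = {b0:.2f} + {b1:.2f}x")
```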

So…that’s Ridge Regression in a nutshell. It isn’t difficult if you already understand Gradient Descent, but if you still had some difficulties don’t worry. Everyone learns at a different rate. Go through the notebook once more, I am sure you’ll understand it!

Sources -

  1. Kaggle Notebook — https://www.kaggle.com/code/slyofzero/ridge-regression-from-scratch
  2. YouTube video on Ridge Regression by StatQuest — https://youtu.be/Q81RR3yKn30
  3. Wikipedia — https://en.wikipedia.org/wiki/Ridge_regression
  4. Analytics Vidhya — https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/
  5. Towards Data Science — https://towardsdatascience.com/from-linear-regression-to-ridge-regression-the-lasso-and-the-elastic-net-4eaecaf5f7e6
