Ridge Regression: Step by step introduction with example

msoczi
9 min read · May 18, 2024


Ridge Regression — short introduction

Ridge regression is a variation of linear regression, specifically designed to address multicollinearity in the dataset. In linear regression, the goal is to find the best-fitting hyperplane that minimizes the sum of squared differences between the observed and predicted values. However, when there are highly correlated variables, linear regression may become unstable and provide unreliable estimates.

Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.

Ridge regression introduces a regularization term that penalizes large coefficients, helping to stabilize the model and prevent overfitting. This regularization term, also known as the L2 penalty, adds a constraint to the optimization process, influencing the model to choose smaller coefficients for the predictors. By striking a balance between fitting the data well and keeping the coefficients in check, ridge regression proves valuable in improving the robustness and performance of linear regression models, especially in situations with multicollinearity.

Linear Regression — for the good start

Let’s briefly recall what linear regression was about.

In linear regression, training the model essentially means finding appropriate values for the coefficients. This is done with the method of least squares: one seeks the values β0, β1, …, βp that minimize the Residual Sum of Squares

RSS = Σ (y_i - β0 - β1·x_i1 - … - βp·x_ip)²

where the sum runs over all n observations.

Ridge Regression — definition

Ridge regression is very similar to the method of least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In fact, it is the same quantity with one extra term, something we call a shrinkage penalty.

Before we explain what ridge regression is, let’s find out what the mysterious shrinkage penalty is all about.

Shrinkage penalty — aid in learning

The shrinkage penalty in ridge regression refers to the regularization term added to the least squares objective to prevent overfitting and address multicollinearity. As in linear regression, the goal is to minimize the sum of squared differences between observed and predicted values, but a penalty term proportional to the sum of the squared coefficients is added. This penalty is the squared ℓ2 (Euclidean) norm of the coefficient vector, scaled by a parameter λ:

RSS_Ridge = Σ (y_i - β0 - β1·x_i1 - … - βp·x_ip)² + λ·(β1² + β2² + … + βp²) = RSS + λ·Σ βj²

𝜆≥0 is called the tuning parameter of the method, which is chosen separately. The parameter 𝜆 controls how strongly the coefficients are shrunk toward 0. When 𝜆=0, the penalty has no effect, and ridge regression reduces to the ordinary least squares method. However, as 𝜆→∞ the impact of the penalty grows, and the estimates of the coefficients 𝛽𝑗 in ridge regression shrink towards zero.
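To see the first claim concretely, plugging λ = 0 into the closed-form solution derived later gives

β̂ = (X^T X + 0·I)^{-1} X^T y = (X^T X)^{-1} X^T y

which is exactly the ordinary least squares estimator.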

How to choose 𝜆?

How to determine which value of 𝜆 to use?
You might not like the answer. At the beginning, it’s not known.
The only way is to test many values, and that is typically how it is done in practice. Fortunately, there are many implementations that assist in selecting an appropriate λ, for example by cross-validation.
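For illustration (this snippet is an addition, not part of the original example, and uses randomly generated toy data), scikit-learn's RidgeCV can search a grid of candidate values with cross-validation; note that scikit-learn calls the shrinkage parameter alpha:

import numpy as np
from sklearn.linear_model import RidgeCV

# toy, randomly generated data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, -0.3, 0.0, 2.0]) + rng.normal(scale=0.5, size=50)

# candidate values of the shrinkage parameter to test
candidate_lambdas = [0.01, 0.1, 1.0, 2.0, 10.0, 100.0]

# RidgeCV keeps the value that performs best in (leave-one-out) cross-validation
model = RidgeCV(alphas=candidate_lambdas)
model.fit(X, y)
print(model.alpha_)  # the selected shrinkage parameter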

I would like to emphasize that this hyperparameter should not be ignored. Choosing the right 𝜆 is very important.

Why should you scale predictors?

It should also be noted that the shrinkage penalty is applied exclusively to the coefficients β1, …, βp; it does not affect the intercept term β0. We do not shrink the intercept — it represents the prediction of the mean value of the dependent variable when all predictors are equal to 0. Assuming that the variables have been centered to have a mean of zero before conducting ridge regression, the estimated intercept takes the form

β̂0 = ȳ = (y_1 + y_2 + … + y_n) / n

It should be emphasized that scaling the predictors matters. In linear regression, multiplying the predictor Xj by a constant c simply scales the estimated coefficient by 1/c, so the product Xj·βj remains unchanged. In ridge regression, however, because of the shrinkage penalty, rescaling the predictor Xj can significantly change the estimate of βj as well as the estimates for the other predictors. Therefore, before applying ridge regression, the predictors are standardized so that they are all on the same scale.
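A small numerical sketch of this effect (an addition to the original text, using randomly generated, purely illustrative data):

import numpy as np
from numpy.linalg import inv

def ridge_coef(X, y, lam):
    # closed-form ridge solution, no intercept (as in the derivation below)
    return inv(X.T @ X + lam * np.identity(X.shape[1])) @ X.T @ y

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=30)

# express the first predictor in different units (multiply by 100)
X_rescaled = X.copy()
X_rescaled[:, 0] *= 100.0

lam = 50.0
b = ridge_coef(X, y, lam)
b_rescaled = ridge_coef(X_rescaled, y, lam)

# in least squares, rescaling would give exactly b_rescaled[0] == b[0] / 100;
# with the ridge penalty this relationship breaks down
print(b[0], b_rescaled[0] * 100)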

Feature standardization is a preprocessing step in machine learning where the input features are transformed to have a mean of 0 and a standard deviation of 1. This is typically achieved by subtracting the mean of each feature from its values and then dividing by the standard deviation.

If the population mean and population standard deviation are known, a raw value x is converted into a standard score by

z = (x - μ) / σ

where μ is the population mean and σ is the population standard deviation.

Let’s try to understand this with an example:
If one of the variables is the price of an apartment (in hundreds of thousands) and the other is the number of rooms in that apartment (in units), it is difficult to compare the two quantities directly. After standardization, both variables take values on a similar scale (though somewhat abstract ones), while the shape of their distributions does not change.
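A minimal sketch of this in code (the apartment numbers below are made up purely for illustration):

import numpy as np

# hypothetical data: price in hundreds of thousands, number of rooms
price = np.array([3.1, 4.5, 2.8, 6.0, 5.2])
rooms = np.array([2.0, 3.0, 2.0, 5.0, 4.0])

# z-score standardization: subtract the mean, divide by the standard deviation
price_std = (price - price.mean()) / price.std()
rooms_std = (rooms - rooms.mean()) / rooms.std()

print(price_std.round(2), rooms_std.round(2))   # now on a comparable scale
print(price_std.mean().round(10), price_std.std())  # ~0 and 1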

How to estimate coefficients in ridge regression?

Just as in linear regression, where we minimized the RSS, in ridge regression we minimize the expression mentioned earlier, this time written in matrix form:

RSS_Ridge = (y - Xβ)^T (y - Xβ) + λ·β^T β

To minimize the RSS_Ridge expression, we set its derivative with respect to β equal to zero:

-2·X^T (y - Xβ) + 2λβ = 0, which rearranges to (X^T X + λI)·β = X^T y.

For λ > 0 the matrix X^T X + λI is positive definite, so it has full rank and is invertible. As a consequence:

β̂ = (X^T X + λI)^{-1} X^T y

Bias–variance tradeoff of the ridge estimator

The superiority of ridge regression compared to the method of least squares arises from the inherent trade-off between variance and bias. Ridge regression introduces a regularization parameter, denoted as 𝜆, which controls the extent of shrinkage applied to the regression coefficients. As the value of 𝜆 increases, the model’s flexibility in fitting the data diminishes. Consequently, this decrease in flexibility results in a simultaneous reduction in variance but an increase in bias.

Let’s notice:

  • When the number of predictors, 𝑝, is close to the number of observations, 𝑛, the method of least squares exhibits high variance — a small change in the training data can lead to a significant change in the estimated parameters.
  • When 𝑝>𝑛, the method of least squares stops working (the estimates are no longer unique), whereas ridge regression handles this situation well, as the sketch below illustrates.
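A quick numerical illustration of the second point (this snippet is an addition, with random, purely illustrative data): when p > n, the matrix X^T X is rank-deficient, so the least squares system has no unique solution, while X^T X + λI has full rank for any λ > 0.

import numpy as np

rng = np.random.default_rng(42)
n, p, lam = 4, 6, 2.0          # fewer observations than predictors
X = rng.normal(size=(n, p))

xtx = X.T @ X
print(np.linalg.matrix_rank(xtx))                          # at most n = 4 < p
print(np.linalg.matrix_rank(xtx + lam * np.identity(p)))   # full rank p = 6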

Example

Suppose we have a dataset with genetic markers (A, B, C, D, E) as predictors and a trait (T) as the response variable.
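The values below are the ones used in the code of the implementation section later in the post:

   A      B      C      D      E   |   T
  0.8    1.2    0.5   -0.7    1.0  |  3.2
  1.0    0.8   -0.4    0.5   -1.2  |  2.5
 -0.5    0.3    1.2    0.9   -0.1  |  1.8
  0.2   -0.9   -0.7    1.1    0.5  |  2.9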

At the beginning, we need to specify the value of the hyperparameter for our method, which is 𝜆.
Let 𝜆=2.

Taking the data from the example, the matrix X_RAW (one row per observation, one column per marker A–E) takes the form:

X_RAW =
[  0.8   1.2   0.5  -0.7   1.0 ]
[  1.0   0.8  -0.4   0.5  -1.2 ]
[ -0.5   0.3   1.2   0.9  -0.1 ]
[  0.2  -0.9  -0.7   1.1   0.5 ]

We remember that before building a ridge regression model we need to standardize the predictors so that they all have a mean of zero and a standard deviation of 1, using the simple z-score formula z = (x - μ) / σ applied to each column separately.

We obtain a matrix of data after standardizing the predictors.

Unfortunately, after standardization, the numbers may not be convenient for calculations. Usually, the computer handles the computations for us, but if we want to trace step by step how ridge regression works, we have to deal with inconvenient numbers.

To build a ridge regression model essentially means to find the coefficients β (by β we mean the vector of coefficients (β1, …, βp)). From our previous considerations, we already know the recipe for obtaining them:

β̂ = (X^T X + λI)^{-1} X^T y

So first we need to multiply the transposed matrix X^T by the matrix X, and then add the identity matrix I scaled by the hyperparameter λ (in our case λ = 2).

Notice that the results depend on the value of λ; if we change λ, we get different estimates. By computing the coefficients β for a range of λ values, we can observe how they change, and gradually shrink toward zero, as λ grows; the sketch below shows one way to do this.
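A minimal sketch of how such a coefficient path can be computed with the closed-form formula (an addition to the original text; it reuses the example data, which also appear in the implementation section below):

import numpy as np
from numpy.linalg import inv

X = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
              [1.0, 0.8, -0.4, 0.5, -1.2],
              [-0.5, 0.3, 1.2, 0.9, -0.1],
              [0.2, -0.9, -0.7, 1.1, 0.5]])
y = np.array([3.2, 2.5, 1.8, 2.9])
X_scale = (X - X.mean(axis=0)) / X.std(axis=0)

# ridge coefficients for a grid of lambda values
for lam in [0.01, 0.1, 1, 2, 10, 100]:
    coef = inv(X_scale.T @ X_scale + lam * np.identity(5)) @ X_scale.T @ y
    print(lam, np.round(coef, 3))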

We estimated the coefficients β1, …, βp, but what about the intercept term β0? We mentioned that, assuming the variables have been centered to have a mean of zero before conducting ridge regression, the estimated intercept is simply the mean of the response: β̂0 = ȳ.

In our case: β̂0 = (3.2 + 2.5 + 1.8 + 2.9) / 4 = 2.6.

Now, to calculate predictions, we apply the formula ŷ = β̂0 + Xβ̂, where X is the matrix of standardized predictors.

If we use the model to predict the values of the target variable for our observations, we obtain fitted values close to the actual ones; the exact numbers are computed in the implementation section below, where comparing the predictions with the actual values shows that the calculations lead to good results.

Implementation — Ridge Regression

Now, we will focus on implementing our ridge regression. We will do it step by step using the Python language. We won’t use classes to avoid obscuring what’s most important, which is a thorough understanding of how the algorithm works. Thanks to this, beginners will be able to execute the code line by line and understand it.

To start, let’s generate a dataset, exactly the same as in the example from the previous chapter. Before that, let’s also import the necessary libraries.
We also need to remember to define the hyperparameter of our algorithm, which is the shrinkage parameter 𝜆. Just like in the example, let’s assume 𝜆=2.

import numpy as np
import pandas as pd
from numpy.linalg import inv

LAMBDA = 2

X = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
              [1.0, 0.8, -0.4, 0.5, -1.2],
              [-0.5, 0.3, 1.2, 0.9, -0.1],
              [0.2, -0.9, -0.7, 1.1, 0.5]])

y = np.array([3.2, 2.5, 1.8, 2.9])

Of course, we must remember that before creating a ridge regression model, we need to standardize the predictors so that they all have a mean of zero and a standard deviation of 1.

X_scale = (X-X.mean(axis=0))/X.std(axis=0)

To build a ridge regression model essentially means to find the coefficients. From the theoretical part, we know how to obtain them; it is enough to use the formula

β̂ = (X^T X + λI)^{-1} X^T y

For clarity, we will break down this section into smaller steps so that each operation on matrices can be traced.
The function inv() comes from the numpy.linalg module. It calculates the inverse of a matrix. Meanwhile, the function np.identity(5) creates an identity matrix of size 5×5 (because we have 5 variables).

# X^T * X + LAMBDA * I
x1 = np.matmul(X_scale.T, X_scale) + LAMBDA*np.identity(5)
# Invert the obtained matrix - (X^T * X + LAMBDA * I)^{-1}
x1_inv = inv(x1)
# ( (X^T * X + LAMBDA * I)^{-1} ) * X^T
x2 = np.matmul(x1_inv, X_scale.T)
# ( ( (X^T * X + LAMBDA * I)^{-1} ) * X^T ) * y
coef = np.matmul(x2, y)
# Estimated coefficients
print(coef)

Printing coef gives the five estimated coefficients β̂1, …, β̂5, exactly the same as in the example from the previous section.

To calculate predictions, we apply the formula ŷ = β̂0 + Xβ̂.

From previous considerations we remember that β̂0 is simply the mean value of the target variable, so:

np.matmul(X_scale, coef) + y.mean()

This returns the predicted values

[3.04939794, 2.56669771, 2.01326671, 2.77063765]

which are close to the actual responses y = [3.2, 2.5, 1.8, 2.9].

Finally, here’s the entire code again:

import numpy as np
import pandas as pd
from numpy.linalg import inv

LAMBDA = 2 # shrinkage parameter

# Define dataset (X,y)
X = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
              [1.0, 0.8, -0.4, 0.5, -1.2],
              [-0.5, 0.3, 1.2, 0.9, -0.1],
              [0.2, -0.9, -0.7, 1.1, 0.5]])

y = np.array([3.2, 2.5, 1.8, 2.9])

# Scale predictors
X_scale = (X-X.mean(axis=0))/X.std(axis=0)

# RIDGE REGRESSION MODEL - coefficient estimation
# X^T * X + LAMBDA * I
x1 = np.matmul(X_scale.T, X_scale) + LAMBDA*np.identity(5)
# Invert the obtained matrix - (X^T * X + LAMBDA * I)^{-1}
x1_inv = inv(x1)
# ( (X^T * X + LAMBDA * I)^{-1} ) * X^T
x2 = np.matmul(x1_inv, X_scale.T)
# ( ( (X^T * X + LAMBDA * I)^{-1} ) * X^T ) * y
coef = np.matmul(x2, y)
# Estimated coefficients
print(coef)

# Predictions
print(np.matmul(X_scale, coef) + y.mean())
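As a quick sanity check (an addition, not part of the original post), the same coefficients and intercept can be reproduced with scikit-learn's Ridge, which likewise leaves the intercept unpenalized; remember that scikit-learn calls the shrinkage parameter alpha:

from sklearn.linear_model import Ridge

# alpha in scikit-learn plays the role of lambda in this post
model = Ridge(alpha=LAMBDA)
model.fit(X_scale, y)
print(model.coef_)       # should match coef computed above
print(model.intercept_)  # should equal y.mean(), since X_scale is centered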

If you want more, I invite you to explore my blog — https://datasciencedecoded.com/.

https://datasciencedecoded.com/posts/8_Ridge_Regression_for_Improved_Predictive_Models

Insta:📷 https://www.instagram.com/msoczi.py

If you are at the beginning of your journey with Machine Learning, I also encourage you to read the remaining posts.

  1. Simple Linear Regression
  2. Linear Regression
  3. Logistic Regression
  4. Decision Trees
  5. K-means
  6. Naive Bayes
  7. Ridge Regression
