Chawannut Prommin
4 min read · Dec 27, 2017

Two ways to construct Ridge regression.

I guess you already know that Ridge regression is a type of regression where we penalize the coefficients of the model using the L2 norm. As opposed to L1, L2 does not lead to sparsity; adding some bias to the model this way is mainly done to avoid over-fitting. Most resources introduce Ridge regression by adding an extra penalty term to the cost function. To build more intuition about Ridge regression, this post will go through two ways to construct it, one from the frequentist perspective and one from the Bayesian perspective.

1. Frequentist approach: start with normal regression.

In standard regression, we minimize the following function.
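Writing y for the response vector, X for the n × (p + 1) design matrix (intercept column included) and β for the coefficient vector, the objective is

\min_{\beta}\; (y - X\beta)^\top (y - X\beta)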

which leads to the following solution
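\hat{\beta} = (X^\top X)^{-1} X^\top y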

Now note that we will have a problem when X^\top X (the product of the first two terms) is not invertible. One way to fix this problem is to add extra rows to the design matrix X (and append matching zeros to the response vector y). If we write out the new design matrix in block form, this will be
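X_{\text{new}} = \begin{bmatrix} X \\ \sqrt{\lambda}\, I_{p+1} \end{bmatrix}, \qquad y_{\text{new}} = \begin{bmatrix} y \\ 0 \end{bmatrix}

where \lambda > 0 is the ridge penalty and I_{p+1} is the (p + 1) \times (p + 1) identity matrix.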

Intuitively speaking, we extend the original design matrix with a scaled identity matrix \sqrt{\lambda}\, I whose number of rows equals the total number of columns of the original design matrix (p is the total number of features in the original design matrix, and the first column is for the intercept; hence there are p + 1 columns in the original design matrix).

Here is an example
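Taking \lambda = 1 purely for illustration, with an intercept column and two features (so p + 1 = 3 columns), the augmented design matrix and response are

X_{\text{new}} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad y_{\text{new}} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \\ 0 \\ 0 \\ 0 \end{bmatrix}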


We are almost done here! Since we only changed the design matrix (and the response), the formula for the best betas is still the same. We just do the math and get
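X_{\text{new}}^\top X_{\text{new}} = X^\top X + \lambda I_{p+1}, \qquad X_{\text{new}}^\top y_{\text{new}} = X^\top y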

and then the final solution is
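\hat{\beta}_{\text{ridge}} = \left( X^\top X + \lambda I_{p+1} \right)^{-1} X^\top y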

So, by using this trick we manage to avoid the non-invertibility problem, since the term inside the inverse is now always invertible (to be precise, X^\top X + \lambda I is positive definite for \lambda > 0, and hence invertible).
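As a quick sanity check, here is a short NumPy sketch (the data is randomly generated and purely illustrative) showing that ordinary least squares on the augmented data reproduces the closed-form ridge solution:

import numpy as np

# Quick sanity check of the augmented-data trick.
# The data below is random and purely illustrative.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # intercept + p features
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + rng.normal(scale=0.5, size=n)
lam = 2.0  # an arbitrary penalty, just for the demo

# Closed-form ridge solution: (X'X + lam I)^{-1} X'y
ridge_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(p + 1), X.T @ y)

# Ordinary least squares on the augmented data:
# append sqrt(lam) * I rows to X and the corresponding zeros to y
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p + 1)])
y_aug = np.concatenate([y, np.zeros(p + 1)])
ridge_via_ols, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(ridge_closed_form, ridge_via_ols))  # True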

2. Bayesian approach.

One main difference between the frequentist and Bayesian views is that frequentists believe the “real” solution is fixed. That is, we want to estimate betas which are fixed and unknown... but wait! Why do we have standard errors in the regression model if the betas are fixed? This is because the “true” model is fixed, but the sample (the data) we use to find beta is random. As a result, each time we run the regression with a different sample (data), the estimated betas will be different.

Mathematically speaking, the “true” model is
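y = X\beta + \epsilon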

We treat beta as fixed and the error term as random. That is
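\epsilon \sim \mathcal{N}(0, \sigma^2 I)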

In the Bayesian approach, we make one more assumption on the coefficients to make them “random”. To do this, we assume they are normally distributed around zero with some prior variance \tau^2.
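\beta \sim \mathcal{N}(0, \tau^2 I)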

The goal here will be finding the posterior distribution of beta given the data (y). For mathematical convenience, we will work with the product of the likelihood and the prior instead (which is proportional to, and therefore determines, the posterior)
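p(\beta \mid y) \propto p(y \mid \beta)\, p(\beta)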

One nice way to derive this is to first consider the joint distribution
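\begin{bmatrix} \beta \\ y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \tau^2 I & \tau^2 X^\top \\ \tau^2 X & \tau^2 X X^\top + \sigma^2 I \end{bmatrix} \right)

(since \beta and \epsilon are independent, the cross-covariance is \mathrm{Cov}(\beta, y) = \tau^2 X^\top)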

and then use the conditional Gaussian distribution to find the posterior distribution.
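For a zero-mean joint Gaussian, the conditional mean is E[\beta \mid y] = \Sigma_{\beta y} \Sigma_{yy}^{-1} y, which here gives

E[\beta \mid y] = \tau^2 X^\top \left( \tau^2 X X^\top + \sigma^2 I \right)^{-1} y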

I won’t go into detail here, but doing it this way is a lot easier than the “completing the square” method, which is quite tedious. One thing to keep in mind is that the posterior is a distribution. So, in order to get a point estimate, we will use the expectation of the posterior distribution. After some math (a standard matrix identity to pull X^\top through the inverse), the expectation of the posterior distribution will be
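E[\beta \mid y] = \left( X^\top X + \lambda I \right)^{-1} X^\top y, \qquad \text{where } \lambda = \frac{\sigma^2}{\tau^2}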

which is exactly the same as the first approach!
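Again, as a quick numerical check (with random, purely illustrative data), the conditional-Gaussian posterior mean matches the ridge solution with \lambda = \sigma^2 / \tau^2:

import numpy as np

# Numerical check: the posterior mean under the Gaussian prior equals
# the ridge solution with lambda = sigma^2 / tau^2.
# Random, purely illustrative data.
rng = np.random.default_rng(1)
n, p = 40, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)
sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2

# E[beta | y] = tau^2 X' (tau^2 X X' + sigma^2 I)^{-1} y
posterior_mean = tau2 * X.T @ np.linalg.solve(tau2 * X @ X.T + sigma2 * np.eye(n), y)

# Ridge closed form with the matching lambda
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p + 1), X.T @ y)

print(np.allclose(posterior_mean, ridge))  # True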

So we just showed two non-standard ways to arrive at the ridge regression solution. Some might be wondering: can we do the same for Lasso regression (L1)? Yes we can, by using a Laplace distribution on the betas instead of a normal distribution (though the math will be more convoluted).
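Concretely, the Laplace prior has density p(\beta_j) \propto \exp(-|\beta_j| / b) on each coefficient for some scale b, and in that case it is the posterior mode (the MAP estimate), rather than the posterior mean, that recovers the L1 penalty.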

Hope this is useful!
