Intuition behind L1-L2 Regularization

Manas Mahanta
Analytics Vidhya


Regularization is the process of making the prediction function fit the training data less well, in the hope that it generalizes better to new data. That is the very generic definition of regularization. Today we will dig deep into two famous regularization techniques in machine learning, L1 and L2 regularization, with a purely mathematical eye.

Before digging into regularization techniques we will build up the base, and the first thing we can ponder upon is the hypothesis space. In order to capture the mapping from X to Y (f: X → Y) in ML, we generally come up with some hypotheses h1, …, hn where h ∈ H. Here, H is our hypothesis space or set. All these hypotheses or prediction functions have a degree of complexity associated with them. The degree of complexity can be measured by the number of features, the degree of a polynomial, or the depth of a decision tree.

Fig 1: Different measures of complexity

Given a hypothesis space F, consider all functions in F with complexity at most r:

Fr = { f ∈ F | Ω(f) ≤ r }

Fr is the subset of the hypothesis space F with complexity ≤ r. Here r is a hyperparameter that we can tune at the cross-validation stage to work out the best r for our prediction function.
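To make the complexity measure Ω concrete, here are the standard choices for a linear hypothesis, written out as a quick reference (these are the usual definitions, not taken from the figures):

```latex
% Standard choices of the complexity measure \Omega for a linear
% hypothesis f(x) = w^\top x with coefficient vector w:
\begin{align*}
  \Omega(f) &= \lVert w \rVert_2^2 = \sum_{j=1}^{d} w_j^2
    && \text{(L2 complexity, used by ridge regression)} \\
  \Omega(f) &= \lVert w \rVert_1 = \sum_{j=1}^{d} \lvert w_j \rvert
    && \text{(L1 complexity, used by lasso regression)}
\end{align*}
```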

Now we jump onto the first, constrained form of regularization, called Ivanov regularization. In the equation below we minimize the empirical risk, that is, our loss function, constrained to the part of the hypothesis space with complexity not greater than r.

Fig 2: Ivanov form of regularization
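Written out, the Ivanov (constrained) form is the following; this is the standard formulation and should match what Fig 2 shows, with ℓ a generic loss function:

```latex
% Ivanov (constrained) form: minimize the empirical risk subject to a
% hard budget r on the complexity of f.
\begin{align*}
  \min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big)
  \quad \text{subject to} \quad \Omega(f) \le r
\end{align*}
```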

The second, penalized form of regularization is called Tikhonov regularization, where we add the complexity of our prediction function, multiplied by a hyperparameter 𝝀, to our objective function.

Fig 3: Tikhonov form of regularization
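Again written out, the Tikhonov (penalized) form is the standard objective below, which should correspond to Fig 3:

```latex
% Tikhonov (penalized) form: the complexity term is added to the
% empirical risk, weighted by the hyperparameter lambda >= 0.
\begin{align*}
  \min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big)
  \;+\; \lambda \, \Omega(f)
\end{align*}
```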

We can see it intuitively: whenever we pick a hypothesis with high complexity, it pushes the objective function, whose primary purpose is to be minimized, higher. The magnitude of 𝝀 balances the trade-off between the complexity of the hypothesis we end up with and how well we fit the data, i.e. how small the empirical risk is. The main difference between the constrained and penalized forms of regularization is that in the former we are already constraining the hypothesis space, which makes it easier for the optimizer to minimize the loss, while in the latter we have to run through many iterations to get the objective both regularized and minimized. We can say the two forms are equivalent if the set of solutions we get by tuning r is the same as the set of solutions we get by tuning 𝝀.
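One standard way to see this equivalence (a sketch of the usual argument, not spelled out in the figures) is that the Tikhonov objective is exactly the Lagrangian of the Ivanov problem; for convex losses and a convex Ω, a constrained solution at some budget r is also a penalized solution at some 𝝀 ≥ 0, and vice versa:

```latex
% The Lagrangian of the Ivanov problem:
\begin{align*}
  L(f, \lambda) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big)
  \;+\; \lambda \big( \Omega(f) - r \big)
\end{align*}
% Since the term -\lambda r does not depend on f, minimizing L(f, \lambda)
% over f gives the same minimizer as the Tikhonov objective with that \lambda.
```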

Fig 4: Ridge Regression in Tikhonov form
Fig 5: Ridge Regression in Ivanov form

We can see the equations of ridge regression in both Tikhonov and Ivanov form, and the same applies to lasso regression.
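For reference, here are the two ridge regression formulations written out with the square loss; these are the standard forms and should correspond to Figs 4 and 5 (swapping ‖w‖₂² for ‖w‖₁ gives the lasso versions):

```latex
% Ridge regression with square loss and linear hypothesis f(x) = w^\top x.
\begin{align*}
  \text{Tikhonov form:} \quad
    & \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n}
      \big( w^\top x_i - y_i \big)^2 + \lambda \lVert w \rVert_2^2 \\
  \text{Ivanov form:} \quad
    & \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n}
      \big( w^\top x_i - y_i \big)^2
      \quad \text{subject to} \quad \lVert w \rVert_2^2 \le r
\end{align*}
```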

Fig 6: Regularization path for Ridge Regression

In the first part of Fig 6 we can see the path of the feature coefficients from more regularized to less regularized. The ratio between the constrained and unconstrained coefficients tells us how much regularization we are applying to the model, and the values on the x-axis are derived from this ratio as we take different values of r. Initially the coefficients are very small, owing to the tight constraint boundary where they are penalized heavily, and as we move further to the right the coefficients become effectively unconstrained and the empirical risk reaches its minimum.
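A path like the one in Fig 6 can be reproduced with a few lines of code. This is a minimal sketch assuming scikit-learn; the synthetic dataset and the grid of penalty strengths are illustrative choices, not taken from the article:

```python
# A minimal sketch of a ridge regularization path, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Sweep lambda (called `alpha` in scikit-learn) from heavy regularization
# to almost none and record the fitted coefficients at each step.
alphas = np.logspace(4, -4, 50)
coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])

# Ratio of the constrained coefficient norm to the (nearly) unconstrained
# one -- the quantity used on the x-axis of the path plot.
ratio = np.linalg.norm(coefs, axis=1) / np.linalg.norm(coefs[-1])

print(coefs[0].round(2))   # heavily regularized: coefficients shrunk toward 0
print(coefs[-1].round(2))  # lightly regularized: close to least squares
print(ratio[:5].round(3))  # small ratios at the heavily regularized end
```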

Fig 7: Regularization path for Lasso Regression

For Lasso Regression we can see that the coefficients of some features stay at 0 for a while when the model is regularized heavily. This is the USP of Lasso Regression: it provides sparse coefficient vectors, which can be used in a variety of ways, such as identifying the important features for a slower non-linear model, needing less memory to store features, and sometimes even giving better predictions. Why lasso regression gives sparse models is what we will touch on in the next section.
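A quick way to see this sparsity in practice is the sketch below, again assuming scikit-learn and a synthetic dataset in which only a handful of features are truly informative (the sizes and alpha values are illustrative, not from the article):

```python
# A minimal sketch showing how lasso zeroes out coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [10.0, 1.0, 0.01]:        # heavy -> light regularization
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0.0))
    print(f"alpha={alpha:<5} zero coefficients: {n_zero}/{X.shape[1]}")

# With heavy regularization many coefficients are exactly zero; as alpha
# shrinks, more features enter the model -- the behaviour seen in Fig 7.
```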

Why does lasso regression give sparse solutions?

Fig 8: Parameter space for L1 regularization with w1 and w2 as its axes

The constraint rhombus contains the coefficients with |w1| + |w2| ≤ r, while ŵ is the unconstrained minimizer, the point where the empirical risk is at its minimum. The contours of the empirical risk around ŵ are ellipses; as we grow them outward from ŵ, the first one to touch our constraint region does so either at a corner on an axis or somewhere along a side. That first point of contact gives us the solution for our coefficients: if it touches on an axis we get a sparse solution, and if it touches a side we do not. Now the question is, how likely is it that we get a sparse solution?
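To pin down what the picture is showing, the two-dimensional problem behind Fig 8 can be written as below (standard notation; the square loss is an assumption here, matching the ridge/lasso setting above):

```latex
% The L1-constrained (Ivanov) problem in two dimensions: the rhombus is
% the feasible set, and the ellipses are level sets of the empirical risk
% centred at the unconstrained minimizer \hat{w}.
\begin{align*}
  \min_{w \in \mathbb{R}^2} \; \frac{1}{n} \sum_{i=1}^{n}
    \big( w^\top x_i - y_i \big)^2
  \quad \text{subject to} \quad \lvert w_1 \rvert + \lvert w_2 \rvert \le r
\end{align*}
```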

Fig 9: Reason behind the sparsity of Lasso Regression

We can see from Fig 9 that whenever ŵ lies in the region between the red and green zones, the contours touch the constraint region on a side and give us a non-sparse solution, but whenever ŵ lies in either the red or the green zone, they touch the constraint region on an axis and give us a sparse solution. The same applies in all four quadrants of the parameter space. And the probability of ŵ lying in the region between 𝛂1 and 𝛂2 is much smaller than the probability of it lying in the region covered by 𝛂1 and 𝛂2 together.

Fig 10: Different forms of constraint space
Fig 11: Solutions are even sparser when q ≤ 1
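Figs 10 and 11 refer to the more general Lq constraint region; for reference, it is defined as below (q = 2 gives the ridge disk, q = 1 the lasso rhombus, and q < 1 a non-convex region whose sharper corners encourage even sparser solutions):

```latex
% General Lq constraint region for a coefficient vector w:
\begin{align*}
  \Big\{\, w \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^{d} \lvert w_j \rvert^{q} \le r \,\Big\}
\end{align*}
```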

Thus we have covered the different forms of regularization and the reason why lasso regression gives sparse solutions. Stay tuned for more in-depth articles on Machine Learning.

Courtesy: David Rosenberg (Bloomberg)
