MOST COMMON DATA SCIENCE QUESTION? (PART 1)

Nishesh Gogia · Published in Analytics Vidhya · 4 min read · Sep 6, 2020

WHY DOES L1 REGULARIZATION CREATE SPARSITY COMPARED TO L2 REGULARIZATION?

In machine learning we have two common types of regularization: L1 (lasso) and L2 (ridge). In this article we are going to look at the following points.

  1. WHAT IS SPARSITY?
  2. WHAT IS THE DIFFERENCE BETWEEN L1 AND L2 REGULARIZATION?
  3. WHY DOES L1 CREATE SPARSITY AND WHERE IS IT USEFUL?

I am assuming that you know what regularization is. If you do not, understand it as an extra term in your optimization problem that helps to reduce overfitting.

Below is a picture of the L2-regularized logistic regression optimization problem; I have clearly marked the regularization term there.
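
For reference, the L2-regularized logistic regression objective is usually written in a form like the following (standard textbook notation; λ is the regularization strength, so the exact symbols in the picture may differ):

  \min_{w} \; \sum_{i=1}^{n} \log\left(1 + e^{-y_i \, w^{T} x_i}\right) \; + \; \lambda \, \|w\|_2^2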

NOW, WHAT IS SPARSITY?

Sparsity, or a sparse matrix, means a matrix in which most of the values are "0". Understand it like this: you want to find the weight vector (w*) in logistic regression, so it will look like

w* = <w1, w2, w3, …, wd> if there are d dimensions.

Now, d dimensions means d features, so we have a problem with d features. w1 can be understood as the weight for feature 1, w2 as the weight for feature 2, and so on.

Let's say we get to know that some of the features are less important and do not contribute much to the target (y). If I use L2 regularization, the weight values for these features will be small; if I use L1 regularization, the weight values for these features will be exactly 0. It simply means L1 regularization gives a weight vector w* with most of the values as "0" (we are assuming only a few of the d features are important, and that is why L1 makes the rest 0).

So I think it is very clear that L1 regularization creates a sparse matrix (meaning most of the weight values in the matrix will be 0).
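
If you want to see this behaviour with actual numbers, here is a minimal sketch using scikit-learn; the synthetic dataset and the alpha values are illustrative assumptions I have added, not something from the article:

  # Compare how many coefficients L1 (Lasso) and L2 (Ridge) drive to exactly zero
  # on synthetic data where most features are useless.
  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso, Ridge

  # 100 features, but only 10 actually influence the target y.
  X, y = make_regression(n_samples=500, n_features=100,
                         n_informative=10, noise=5.0, random_state=0)

  lasso = Lasso(alpha=1.0).fit(X, y)   # L1 regularization
  ridge = Ridge(alpha=1.0).fit(X, y)   # L2 regularization

  print("Lasso zero weights:", np.sum(lasso.coef_ == 0))   # most weights are exactly 0
  print("Ridge zero weights:", np.sum(ridge.coef_ == 0))   # typically none are exactly 0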

WHAT IS THE DIFFERENCE BETWEEN L1 AND L2 REGULARIZATION?

I will go for L2 regularization first with a very simple approach, and I hope by the end you will automatically be able to answer why L1 regularization gives us a sparse matrix.

We know how an optimization problem generally looks:

(there is some loss function + there is some regularization term).

To make the computation simple, let's assume there is only one weight w1, as shown in picture 1; whatever we work out for w1 applies to all d weights in the same way. So let's take only w1² right now.

So we have a function f(w1) = w1², and we want to find the w1 value which gives us the minimum value of f(w1). Let's plot f(w1) vs w1; it forms a y = x² curve, as shown in picture 1, image 1.

And we already know that to find the minimum or maximum we need to take the derivative and put it into the gradient descent update rule. So let's find the derivative of f(w1) with respect to w1; it comes out to be 2·w1.

Let's plot the derivative.

Refer to picture 1, image 2.

Let's put the derivative into the gradient descent update:

w1(j+1) = w1(j) − r · (df/dw1)

After putting in the derivative value:

w1(j+1) = w1(j) − r · (2·w1(j)), where r is the step size. Let's take the step size to be 0.01 and let's take w1(0) = 0.05 (w1(0) is the first value, chosen randomly, where j = 0).

So w1(1) = 0.05 − 0.01 · (2 · 0.05) = 0.05 − 0.001.

w1(1) = 0.049. We can see there is only a minor change from w1(0) to w1(1); w1(0) was 0.05.

If you look at the derivative plot, you can see the slope is continuously reducing. As you keep iterating with L2, you will quickly notice that L2 regularization barely changes the value of w1 from one iteration to the next, and that is because of the same reason (THE SLOPE IS CONTINUOUSLY REDUCING IN L2 REGULARIZATION).

As the iteration number increases, w1(j) decreases, and as you come closer to w*, the derivative becomes smaller and smaller. This is the reason the chance that w1 becomes exactly 0 is small. Compared to L1, fewer features will be zero with L2.
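
To see the same numbers as in the example above, here is a small sketch of that update rule (a toy loop I have added for illustration, using the step size 0.01 and starting value 0.05 from the text):

  # Gradient descent on the L2 penalty w1**2 alone.
  # The derivative is 2*w1, so each step multiplies w1 by (1 - 2*r):
  # w1 shrinks geometrically, gets closer and closer to 0, but never reaches it.
  r = 0.01      # step size from the example
  w1 = 0.05     # starting value w1(0) from the example
  for j in range(5):
      w1 = w1 - r * (2 * w1)
      print(f"iteration {j + 1}: w1 = {w1:.6f}")
  # iteration 1: w1 = 0.049000
  # iteration 2: w1 = 0.048020  ... the change per step keeps getting smaller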

L1 REGULARIZATION

Now let's see what is happening in L1 regularization and why it gives a sparse matrix. The loss function is the same, so we are not bothered about it; but in the L1 regularization term there is no square, so when we take its derivative we get a constant, as shown in picture 1, images 3 and 4.

So in the derivative term there is no "w". That means when we apply gradient descent as in picture 2, the change from w1(j) to w1(j+1) stays the same fixed amount at every iteration, even when w1 is already very small.

As the iteration number increases, the derivative term stays constant: +1 when w1 is positive and −1 when w1 is negative. So L1 regularization keeps reducing w1 towards "0" by a constant amount each step.

So basically, because the slope is constant, getting w1 = 0 within a few iterations is very possible.
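
Here is the same toy loop for the L1 penalty (again an illustrative sketch I have added, not from the original article), which shows w1 hitting exactly 0 after a few iterations:

  # Gradient descent on the L1 penalty |w1| alone, same r and starting value.
  # The slope of |w1| is constant: +1 when w1 > 0 and -1 when w1 < 0,
  # so every iteration removes the same fixed amount r from w1.
  r = 0.01
  w1 = 0.05
  for j in range(10):
      grad = 1.0 if w1 > 0 else -1.0
      w1 = w1 - r * grad
      if abs(w1) < 1e-12:       # clean up tiny floating-point residue
          w1 = 0.0
      print(f"iteration {j + 1}: w1 = {w1:.2f}")
      if w1 == 0.0:
          break                 # w1 reaches exactly 0 after 5 iterations
  # Note: real L1 solvers use soft-thresholding, which lands on exactly 0 even
  # when w1 is not a whole multiple of r; plain (sub)gradient descent would
  # oscillate around 0 in that case.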

So that is it for this article. Thanks for reading.
