L1 and L2 Regularization.

Aditya .P
4 min read · Nov 11, 2018


Logistic Regression basic intuition :

Logistic regression is a classification technique. The assumption it makes is that the classes are almost or perfectly linearly separable. The task is to find the hyperplane that best separates the classes (positive class and negative class). Although it is a binary classifier at heart, it can be extended to problems with more than two classes (for example, with a one-vs-rest scheme).

Many machine learning engineers and data scientists use logistic regression as a baseline model.
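As a minimal baseline sketch (using scikit-learn on a synthetic dataset, not the data used later in this post):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, just to show the baseline workflow.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression()  # scikit-learn's default uses L2 regularization
clf.fit(x_train, y_train)
print("Baseline test accuracy:", clf.score(x_test, y_test))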

Here class label −1 represents the negative class and class label +1 represents the positive class (the ±1 encoding is what makes yi·w^T·xi a signed distance below), and the line separating the points is the best hyperplane, with w as its normal.

w* = argmax_w ∑(i=1 to n) yi·w^T·xi

w* is the normal of the best (optimal) hyperplane, the one that maximizes the sum of the yi·w^T·xi terms.

w^T means the transpose of w; w is the normal to the hyperplane we are dealing with, and its transpose w^T is a row vector.

Maximizing the raw sum of signed distances is easily thrown off by outliers, so we squash each signed distance with the sigmoid function and take logs, which turns the task into the following optimization problem:

w* = argmin_w ∑(i=1 to n) log(1 + exp(−zi)) → equation (1)

where zi = yi·w^T·xi is also known as the signed distance.

If we pick w such that all the training points are correctly classified and every zi tends to +∞, each loss term in equation (1) tends to zero, so that w would be chosen as the optimal w*.

But if every training point is correctly classified and each zi tends to infinity, we have an overfitting problem: the model does a perfect job on the training set but performs very badly on the test set (errors on the train data are almost zero while errors on the test data are very high). To overcome this problem we use regularization techniques.
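To make this concrete, here is a small numpy sketch (the z values are arbitrary illustrations) of how the per-point loss log(1 + exp(−zi)) behaves:

import numpy as np

def point_loss(z):
    # Per-point logistic loss log(1 + exp(-z)) for a signed distance z.
    return np.log1p(np.exp(-z))

# Confidently correct points (large positive z) contribute almost no loss,
# which is why driving every z_i to +infinity minimizes equation (1).
for z in [-5, -1, 0, 1, 5, 50]:
    print(f"z = {z:>3}: loss = {point_loss(z):.6f}")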

Regularization :

Regularization is a technique used to prevent overfitting. It adds a regularization term to equation (1) (the optimization problem) in order to keep the model from overfitting.

The regression model which uses L1 regularization is called Lasso Regression, and the model which uses L2 regularization is known as Ridge Regression.

Ridge Regression (L2 norm)

The L2-norm loss function is also known as least squares error (LSE).

w* = argmin_w ∑(i=1 to n) log(1 + exp(−zi)) + λ·∑j (wj)²

∑j (wj)² is the regularization term and ∑ log(1 + exp(−zi)) is the loss term; λ is a hyperparameter.

We add the regularization term (the squared magnitude of the weights) to the loss term to make sure the model does not overfit.

Here we minimize the loss term and the regularization term together. If the hyperparameter λ is 0, there is no regularization term and the model will overfit; if λ is very large, the penalty dominates and drives the weights toward zero, which leads to underfitting.
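As a small numpy sketch of this trade-off (the data and λ values below are made up for illustration), the full L2-regularized objective can be evaluated directly:

import numpy as np

def ridge_logistic_objective(w, X, y, lam):
    # Loss term: sum_i log(1 + exp(-z_i)) with z_i = y_i * w^T x_i,
    # assuming labels y_i in {-1, +1}.
    z = y * (X @ w)
    loss = np.sum(np.log1p(np.exp(-z)))
    # Regularization term: lam * sum_j w_j^2
    reg = lam * np.sum(w ** 2)
    return loss + reg

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
w = rng.normal(size=3)
print(ridge_logistic_objective(w, X, y, lam=0.0))   # pure loss, no penalty
print(ridge_logistic_objective(w, X, y, lam=10.0))  # penalty dominates for large lambda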

We can find the best value of the hyperparameter by using cross-validation.
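A sketch of that search with scikit-learn's GridSearchCV (the grid of C values here is an assumption, not from the original experiment):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search over C = 1/lambda; smaller C means stronger regularization.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(penalty='l2'), param_grid, cv=5)
search.fit(x_train, y_train)  # x_train, y_train assumed from your own split
print("Best C:", search.best_params_['C'])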

Lasso Regression (L1 norm)

The L1-norm loss function is also known as least absolute deviations (LAD) or least absolute errors (LAE).

In L1 regularization we use the L1 norm instead of the L2 norm:

w* = argmin_w ∑(i=1 to n) log(1 + exp(−zi)) + λ·||w||1

Here the L1-norm term also keeps the model from overfitting. The added advantage of L1 regularization is sparsity.

Sparsity:

A vector (w in this case) is said to be sparse when most of its components (the wi's in this case) are zero.

If we use L1 regularization in logistic regression, the weights of all the less important features become exactly zero. If we use L2 regularization, the wi values become small but not necessarily zero.
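To illustrate the difference (a sketch on synthetic data with only a few informative features, not the reviews dataset):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.1).fit(X, y)

print("L1 nonzero weights:", np.count_nonzero(l1.coef_))   # many exact zeros
print("L2 zero weights:", np.sum(l2.coef_ == 0))           # typically none
print("L2 small weights (|w| < 0.01):", np.sum(np.abs(l2.coef_) < 0.01))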

Below is code to check how sparsity increases as we increase the hyperparameter λ (equivalently, as we decrease C, since C = 1/λ) when the L1 regularizer is used.

In the code, the hyperparameter C is the inverse of the regularization strength; it must be a positive float.

Note :

I have already loaded the Amazon Fine Food Reviews dataset, done all the data preprocessing on it, and split the data into train (80%) and test (20%).

Now I fit a Logistic Regression model to the training data from the Amazon Fine Food Reviews dataset.

code :

from sklearn.linear_model import LogisticRegression
import numpy as np

# Note: the L1 penalty needs a solver that supports it (e.g. liblinear).
clf = LogisticRegression(C=1000, penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print("Non Zero weights:", np.count_nonzero(clf.coef_))

Non Zero weights: 50470

clf = LogisticRegression(C=100, penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print("Non Zero weights:", np.count_nonzero(clf.coef_))

Non Zero weights: 40864

clf = LogisticRegression(C=10, penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print("Non Zero weights:", np.count_nonzero(clf.coef_))

Non Zero weights: 27295

clf = LogisticRegression(C=1, penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print("Non Zero weights:", np.count_nonzero(clf.coef_))

Non Zero weights: 8228

clf = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print("Non Zero weights:", np.count_nonzero(clf.coef_))

Non Zero weights: 1769

clf = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print("Non Zero weights:", np.count_nonzero(clf.coef_))

Non Zero weights: 396

We can observe that as the value of λ increases (i.e., as C decreases), the sparsity also increases.

Observations :

If we use L1 regularization in logistic regression, the weights of all the less important features become exactly zero.

In L2 regularization, if the hyperparameter λ is 0 there is no regularization term and the model overfits; if λ is very large, the penalty dominates and leads to underfitting.

Both regularization methods serve almost the same purpose. The key difference is that Lasso shrinks the coefficients of the less important features to exactly zero, removing some features altogether. This makes it work well for feature selection when we have a huge number of features, as sketched below.
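A sketch of that feature-selection use with scikit-learn's SelectFromModel (the C value is illustrative):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Keep only the features whose L1-penalized coefficients are nonzero.
selector = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
x_train_reduced = selector.fit_transform(x_train, y_train)  # assumes your own train split
print("Features kept:", x_train_reduced.shape[1])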

References :

https://www.appliedaicourse.com

http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression
