Machine learning and Mathematics

Understanding L1 and L2 regularization with analytical and probabilistic views

Deriving L1 and L2 regularization via analytical and probabilistic approaches

Yuki Shizuya
Intuition

--

L1 and L2 regularization intuition referenced from here

When you learn about machine learning, you’re bound to come across L1 and L2 regularization. Many excellent blogs explain these concepts intuitively with visualizations, but only a few explain L1 and L2 regularization in detail from analytical and probabilistic points of view. So, I decided to write about both regularizations from both perspectives. In this blog, I will introduce L1 and L2 regularization with detailed mathematical derivations and visualizations to help you understand these concepts well.

Table of Contents

  1. The overview of regularization
  2. L1 regularization
  3. L2 regularization

1. The overview of regularization

Firstly, let’s recap the concept of regularization. To make it concrete, let’s assume we have a small two-dimensional dataset, shown below [1].

The visualization of the sample data

As you can see, this data is non-linear, so we can easily guess that simple linear regression won’t fit it well. In this case, let’s consider polynomial regression to express the non-linear data. To understand the importance of regularization, we use a degree-15 polynomial regression, meaning we use an overly complex function to predict the data.

I use the PolynomialFeatures and Ridge classes in the scikit-learn library to fit the data. The result is shown below.
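For reference, here is a minimal sketch of such a fit on synthetic data (not necessarily the exact code behind the figures; the cosine data, noise level, and alpha=1e-3 are assumptions for illustration, and LinearRegression stands in for the unregularized fit):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = np.sort(rng.rand(20, 1), axis=0)
y = np.cos(1.5 * np.pi * X).ravel() + 0.1 * rng.randn(20)

# degree-15 polynomial regression without regularization
plain = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
plain.fit(X, y)

# degree-15 polynomial regression with L2 regularization (Ridge)
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1e-3))
regularized.fit(X, y)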

(left) The predicted result without regularization (right) The predicted result with regularization from the author

The left figure shows the polynomial regression without regularization, and the right one shows the polynomial regression with regularization. The polynomial regression without regularization overfits the data because we used an overly complex function compared to the underlying data. On the other hand, the polynomial regression with regularization fits the data better while keeping the shape of the model less complex. Generally, we use regularization, as in the example above, to prevent the model from overfitting.

How can we do regularization? Theoretically, we add the regularization term to the objective function and optimize the parameters based on it, as shown below.

General regularization formulation
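In common notation, this objective can be written as follows (a standard form; the symbols in the original figure may differ):

\min_{w} \; L(w) + \lambda R(w)

where L(w) is the data-fitting loss (for example, the mean squared error), R(w) is the regularization term (for example, the l1-norm or the squared l2-norm of w), and 𝜆 ≥ 0 controls the strength of the penalty.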

The regularization term imposes a penalty that keeps the coefficient values from growing too large. Why do we penalize the coefficients? To understand it intuitively, let’s look at the coefficients of the previous example.

The coefficient values of each model

The upper row shows the coefficients of the polynomial regression without regularization, and the lower row shows the coefficients of the polynomial regression with (L2) regularization. The polynomial regression without regularization has much larger coefficient values. Intuitively, if the model has larger coefficient values, it can change its shape drastically. Thus, the model without regularization fits the given data more closely but does not generalize well. Meanwhile, the model with regularization can find parameters that fit the given data more generally because the coefficient values stay relatively small.

So far, we have seen the concept of regularization and why it is powerful. Now, let’s dive into the theoretical background behind it.

2. L1 regularization

L1 regularization [2] adds the absolute values of the coefficients, i.e., the l1-norm of the coefficients, as the regularization term. L1 regularization helps with feature selection, which means it can reduce the number of irrelevant independent variables. A regression model with L1 regularization is called Least Absolute Shrinkage and Selection Operator (Lasso) regression. The formula of L1 regularization is below.

The L1 regularization formula
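A standard way to write this objective is the following (scaling factors such as 1/2 or 1/(2n) vary between texts):

\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2 + \lambda \lVert w \rVert_1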

where w is the parameter vector. From now on, we will learn how to solve this problem.

2.1 Analytical derivation of L1 regularization

How can we optimize the L1 regularization formula? To solve it analytically, the formula can be viewed as a constrained optimization problem with Lagrange multipliers.

The optimization problem with L1 regularization

As you can see, the last equation is the same as the L1 regularization formula. Next, how do we optimize the parameters analytically based on the derived Lagrangian? Unfortunately, we generally cannot obtain a closed-form solution for L1 regularization because the regularization term is not differentiable everywhere. You can check this fact in the figure below. Assume we have a function of two parameters; its L1 regularization term can be depicted as:

The visualization of L1 norm function

When we calculate the derivative at the kink, the values from the right and left sides differ, so we cannot differentiate there (you can check more mathematical detail in [7]). There are a few exceptions: we can find a closed-form solution when the matrix X is orthonormal [3] or when there is only one parameter. However, such situations rarely arise in practical analysis.

Then, how do we find the parameters? The most common methods for solving the lasso problem are subgradient descent and coordinate descent. The scikit-learn implementation uses coordinate descent to optimize the Lasso problem, so let’s learn coordinate descent [4].

The idea of coordinate descent is straightforward. Assuming we have an n-dimensional function f, we minimize f by successively minimizing it along one coordinate at a time, cycling through the coordinates repeatedly. Let’s see the mathematical definition.

The formula of the coordinate descent
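One common way to write the iteration is the following (the notation in the figure may differ): at step k, for each coordinate i = 1, ..., n,

w_i^{(k)} = \arg\min_{w_i} \; f\left( w_1^{(k)}, \ldots, w_{i-1}^{(k)}, w_i, w_{i+1}^{(k-1)}, \ldots, w_n^{(k-1)} \right)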

As its name suggests, we update the parameters one at a time, each update using the most recent values of the others. This process continues until convergence or until the maximum number of iterations we set is reached. It is very simple, isn’t it? Now, let’s look at a concrete example of the Lasso formulation. We consider the Lasso problem with the mean squared error (MSE). So, the formula is as follows:

The formula for Lasso problem

Since we proceed with coordinate descent, we minimize this formula over each parameter while keeping the other parameters fixed.

The coordinate descent update formula for MSE term

The last formulation is a bit tricky (at least for me). When you consider each dimension of the term, you can see that only the i-th row of the transpose of X (that is, the i-th column of X) is relevant to the i-th gradient. To formulate the gradient for the i-th parameter, we need to reformulate the above formula.

The gradient of the coordinate descent update formula for MSE term

To obtain the parameter update equation, we need to decompose the Xβ term: we split it into the contribution of the i-th column and that of the other columns. As you can see, we can then derive the parameter-update formula. How about the L1 regularization term? We introduce soft-thresholding to handle it. Since we cannot differentiate the L1 term, we utilize subdifferentials and subgradients to approximate it. For the concept of subgradients, you can check the theorem in [5]. Using subgradients, we can derive:

The subgradients for the parameters

Substituting this into the Lasso problem, we get:

The gradient for the Lasso problem

We can reformulate the last equations to solve for 𝛽 as:

The parameter update formula
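Assuming the objective (1/2)·‖y − Xβ‖² + 𝜆‖β‖₁ (other scalings only change where 𝜆 appears), the i-th coordinate update is commonly written with the soft-thresholding operator:

\beta_i \leftarrow \frac{S(\rho_i, \lambda)}{x_i^{\top} x_i}, \qquad \rho_i = x_i^{\top} \Big( y - \sum_{j \neq i} x_j \beta_j \Big), \qquad S(\rho, \lambda) = \operatorname{sign}(\rho)\, \max\!\big( |\rho| - \lambda, \, 0 \big)

where x_i denotes the i-th column of X.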

We can organize the conditional branches compactly using the sign and max functions, as written above. Now, we repeatedly update our parameters based on this final formula until convergence. Let’s solve a concrete example. For visualization, we optimize a function with two parameters and no bias term.

The sample function

We use 𝜆 = 1 as the regularization strength and update the parameter values based on the equation we derived.

As you can see, the convergence is fast in this case. The coefficients’ values are almost the same as the scikit-learn implementation.

# coordinate descent from scratch
b0 = 1.00, b1 = 2.00

# scikit-learn
b0 = 1.11, b1 = 2.04

For practical use, you should rely on the scikit-learn implementation because it uses Cython to compute the coordinate descent, making it much faster than naive Python code. Later, I will share the code I used with you.
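In the meantime, here is a minimal from-scratch sketch of the idea (not the exact code behind the numbers above). It assumes the objective (1/2)·‖y − Xβ‖² + 𝜆‖β‖₁ with no intercept; to match scikit-learn's Lasso, which puts a 1/(2n) factor on the squared error, you would set 𝜆 = n × alpha.

import numpy as np

def soft_threshold(rho, lam):
    # S(rho, lam) = sign(rho) * max(|rho| - lam, 0)
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    for _ in range(n_iters):
        for i in range(n_features):
            # partial residual: remove the i-th feature's current contribution
            residual = y - X @ beta + X[:, i] * beta[i]
            rho = X[:, i] @ residual
            beta[i] = soft_threshold(rho, lam) / (X[:, i] @ X[:, i])
    return beta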

2.2 Probabilistic derivation of L1 regularization

Before diving into the probabilistic view of L1 regularization, we should know about MAP estimation. Let’s learn MAP estimation and how it differs from maximum likelihood estimation. If you are already familiar with it, you can skip the next section.

Prerequisite — Maximum likelihood estimation and MAP estimation

We assume that we have a multiple linear regression model and that the error between each observed data point and the predicted value follows a Gaussian distribution with mean 0 and standard deviation 𝜎. The likelihood, i.e., the probability density of the errors under the assumed distribution, can be derived as follows:

The assumed setting
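Concretely, under these assumptions the model and likelihood can be written as follows (a standard presentation; the notation in the figure may differ):

y_i = x_i^{\top} w + \varepsilon_i, \quad w \in \mathbb{R}^{p}, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

p(y \mid X, w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\left( y_i - x_i^{\top} w \right)^2}{2\sigma^2} \right)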

where the number of data points is n and the number of parameters is p. We want to find the parameters that maximize this probability density, which is called maximum likelihood estimation (MLE). We can write the MLE formulation as follows:

Maximum likelihood estimation formula

In general, we take the logarithm to turn the product into a sum. In frequentist statistics, we consider the parameters as constant but unknown values, so we find the parameters that maximize the likelihood using differentiation.

On the other hand, in the Bayesian view we can treat the parameters as random variables. Applying Bayes’ theorem, we can relate the posterior to the likelihood as follows:

Bayes’ theorem

The posterior probability is proportional to the likelihood times the prior probability. In this setting, we maximize the posterior probability rather than the likelihood as in MLE. Maximizing the posterior probability is called maximum a posteriori estimation, or MAP estimation. We can formulate it as follows:

The derivation of MAP estimation
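In symbols, a standard way to write this is:

\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \; p(w \mid y) = \arg\max_{w} \; p(y \mid w)\, p(w) = \arg\max_{w} \; \big[ \log p(y \mid w) + \log p(w) \big]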

When we apply it to our previous example and contrast it with MLE:

MAP estimation and MLE

Notice that MAP estimation needs a prior probability, and we can use any probability distribution for it. Now, back to L1 regularization.

Probabilistic derivation of L1 regularization using Laplace prior

When we choose the Laplace distribution as the prior, the MAP estimation becomes the L1 regularization formula. Let’s derive it! The Laplace distribution is a probability distribution whose density has the shape below.

Laplace distribution
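Its density, with location 𝜇 and scale b (symbols assumed here; the figure may use different ones), is:

p(w_j) = \frac{1}{2b} \exp\!\left( -\frac{|w_j - \mu|}{b} \right)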

Notice that the exponent of the exponential function resembles the L1 regularization term. Now, we substitute a Laplace prior with mean 0 for the prior probability in the MAP estimation.

MAP estimation using Laplace prior
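Sketching the result (up to additive constants, and with the symbols assumed above), the negative log-posterior becomes:

-\log p(w \mid y) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2 + \frac{1}{b} \sum_{j=1}^{p} |w_j| + \mathrm{const}

so minimizing it is equivalent to minimizing the squared error plus an L1 penalty whose strength 𝜆 is proportional to 𝜎²/b.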

As you can see, the last formula is the same as L1 regularization! Intuitively, the Laplace distribution is much more sharply peaked than the Gaussian distribution, and its shape resembles the L1 regularization term from the analytical derivation section, as shown below.

(Left) L1 regularization term (right) Laplace distribution

So far, we have seen the derivation of L1 regularization from both analytical and probabilistic perspectives. Next, let’s learn L2 regularization.

3. L2 regularization

L2 regularization adds the squared values of the coefficients, i.e., the squared l2-norm of the coefficients, as the regularization term. L2 regularization promotes smaller coefficients. A regression model with L2 regularization is called Ridge regression. The formula of L2 regularization is below.

L2 regularization formula
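A standard way to write this objective is the following (as before, scaling conventions vary between texts):

\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2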

3.1 Analytical derivation of L2 regularization

Like L1 regularization, we view the L2 regularization problem as a constrained optimization problem with Lagrange multipliers.

The optimization problem with L2 regularization

The last equation is the same as the L2 regularization formula. In contrast to L1 regularization, this formula is differentiable. So, we don’t need to introduce any new concepts; we just differentiate!

The derivation of the parameters with differentiation
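Setting the gradient of ‖y − Xw‖² + 𝜆‖w‖² to zero gives the well-known closed form:

\hat{w} = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y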

Now, we have a closed-form solution for L2 regularization. Let’s implement it and compare the results with scikit-learn’s Ridge.

# sample data
import numpy as np
from sklearn.linear_model import Ridge

X = np.random.randn(100, 2)
beta = np.array([2, 3]).reshape(1, 2)
Y = X @ beta.T + np.random.normal(size=(100, 1))  # linear signal plus Gaussian noise

# closed-form solution: (X^T X + lam * I)^(-1) X^T Y
lam = 1.0
inv_mat = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1]))

ridge_coef = inv_mat @ X.T @ Y
print(ridge_coef.reshape(-1))
# example output (varies with the random data): [1.998, 2.937]

ridge = Ridge(alpha=1.0)
ridge.fit(X, Y)
print(ridge.coef_.reshape(-1))
# example output (varies with the random data): [1.979, 2.973]

You can see we obtain almost the same results.

3.2 Probabilistic derivation of L2 regularization

We consider MAP estimation again. When we derived L1 regularization, we used the Laplace distribution as the prior. In the L2 regularization case, we use a Gaussian distribution with mean 0 as the prior.

Gaussian distribution
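Its zero-mean density, with variance 𝜏² (a symbol assumed here), is:

p(w_j) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\!\left( -\frac{w_j^2}{2\tau^2} \right)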

Notice that the exponent of the exponential function resembles the L2 regularization term. Now, we substitute a Gaussian prior with mean 0 for the prior probability in the MAP estimation.

MAP estimation using Gaussian prior
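Sketching the result as before (up to additive constants, with the symbols assumed above):

-\log p(w \mid y) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^2 + \frac{1}{2\tau^2} \sum_{j=1}^{p} w_j^2 + \mathrm{const}

so minimizing it is equivalent to minimizing the squared error plus an L2 penalty whose strength 𝜆 is proportional to 𝜎²/𝜏².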

As you can see, the last formula is the same as L2 regularization. Intuitively, the Gaussian distribution has a gentler curve than the Laplace distribution, and it is likewise similar in shape to the L2 regularization term.

(Left) L2 regularization term (right) Gaussian distribution

Finally, I will share with you the code used in this blog.

In this blog, I introduced detailed derivations of L1 and L2 regularization from analytical and probabilistic viewpoints. I hope it helps you understand the mathematical background of regularization. Thank you for reading.

References

[1] Underfitting vs. Overfitting, scikit-learn documentation

[2] Manfredi, V., Lecture 12: Regularization, Wesleyan University lecture

[3] https://stats.stackexchange.com/questions/17781/derivation-of-closed-form-lasso-solution

[4] Tibshirani, R., Coordinate Descent, Carnegie Mellon University lecture

[5] Giba, L., Subgradient Descent Explained, Step by Step, MLC

[6] Kang, B., A probabilistic interpretation of regularization, Bounded Rationality

[7] https://math.dartmouth.edu/opencalc2/cole/lecture21.pdf
