Why lasso (L1 penalized regression) works

Scott Sauers
9 min read · Apr 14, 2024


We’re going to find out:

  • How standard regression and lasso regression are set up
  • The math behind lasso’s shrinkage
  • The visual intuition for lasso’s shrinkage

Lasso (Least Absolute Shrinkage and Selection Operator) is a method used in machine learning and statistics to select and estimate the coefficients of the variables in a linear regression model. It helps prevent overfitting by shrinking the coefficients toward zero and setting some of them exactly to zero.

Suppose we have a dataset with 𝑁 observations, where each observation has 𝑝 predictor variables (also known as features or independent variables) denoted as 𝑥ᵢ = (𝑥ᵢ₁, …, 𝑥ᵢₚ)ᵀ and a corresponding response variable (also known as the dependent variable or target variable) denoted as 𝑦ᵢ.

Quick note on notation for beginners:

When we say 𝑥ᵢ, you can imagine a spreadsheet where 𝑖 corresponds to the 𝑖-th row of the spreadsheet, which contains the values of the predictor variables (like columns) for the 𝑖-th observation. For example, if there are three predictor variables, we can look in row number 𝑖 and find 𝑥ᵢ = (𝑥ᵢ₁, 𝑥ᵢ₂, 𝑥ᵢ₃). 𝑦ᵢ is just the final variable we want to predict. The ᵀ (transpose) just means that although we’re writing it horizontally, we can do math with it as a column.

Together, 𝑥ᵢ = (𝑥ᵢ₁, …, 𝑥ᵢₚ)ᵀ denotes a row containing the observed values of the 𝑝 predictor variables for the 𝑖-th observation.

In any regression, our goal is to find a linear relationship between the predictor variables and the response variable.

In a standard linear regression model, the relationship between the predictor variables and the target variable is:

𝑦ᵢ = 𝛼 + ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ + 𝜀ᵢ

where:

  • 𝑗 is the index for the predictor variables (so 𝑗 will range from 1 to 𝑝).
  • 𝑦ᵢ is the target variable for the 𝑖-th observation
  • 𝛼 is the intercept term
  • 𝛽ⱼ is the coefficient for the 𝑗-th predictor variable
  • 𝑥ᵢⱼ is the value of the 𝑗-th predictor variable for the 𝑖-th observation
  • 𝜀ᵢ is the error term for the 𝑖-th observation; difference between the actual value of 𝑦ᵢ and the predicted value
  • ∑ⱼ is the sum over the predictor variables (index 𝑗), so we are adding up the products of 𝛽ⱼ and 𝑥ᵢⱼ for all 𝑗 from 1 to 𝑝 (the number of predictors)
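
To make the notation concrete, here is a tiny NumPy sketch of the prediction part of the model for one observation. The values of 𝛼, 𝛽ⱼ, and 𝑥ᵢ below are made up purely for illustration:

```python
import numpy as np

# A tiny sketch of the model for one observation:
#   y_i = alpha + sum_j beta_j * x_ij + epsilon_i
# The numbers below (alpha, beta, x_i) are made up for illustration.
alpha = 1.0                        # intercept
beta = np.array([0.5, -2.0, 3.0])  # one coefficient per predictor (p = 3)
x_i = np.array([2.0, 1.0, 0.5])    # the i-th row of the spreadsheet: p predictor values

# The predicted value is the intercept plus the sum over j of beta_j * x_ij.
y_hat_i = alpha + np.dot(beta, x_i)
print(y_hat_i)  # 1.0 + (0.5*2.0 - 2.0*1.0 + 3.0*0.5) = 1.5
```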

For standard regression, we find the values of 𝛼 and 𝛽ⱼ that minimize the sum (over observations 𝑖 = 1 to 𝑁) of squared errors:

∑ᵢ (𝑦ᵢ − 𝛼 − ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ)²

∑ⱼ 𝛽ⱼ𝑥ᵢⱼ is our estimator based on our predictor variables and 𝛽 coefficients, and 𝛼 is the intercept. Subtracting these from the true value, 𝑦ᵢ, gives us the error.

In other words,

𝑦ᵢ = 𝛼 + ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ + 𝜀ᵢ

𝑦ᵢ − (𝛼 + ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ) = 𝜀ᵢ

𝑦ᵢ − 𝛼 − ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ = 𝜀ᵢ

So to get the sum of squared error, we square the error terms and sum over all observations (all i).
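
Here is what that looks like as a small NumPy sketch, with a made-up dataset of 𝑁 = 3 observations and 𝑝 = 2 predictors:

```python
import numpy as np

# Sum of squared errors over all N observations for given alpha and beta.
# X, y, alpha, and beta below are made up for illustration.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])          # N = 3 observations, p = 2 predictors
y = np.array([4.0, 3.0, 5.0])
alpha = 0.5
beta = np.array([1.0, 0.5])

residuals = y - (alpha + X @ beta)  # epsilon_i for each observation i
sse = np.sum(residuals ** 2)        # the quantity standard regression minimizes
print(sse)
```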

Minimizing the error works well enough for simple regression, but with many predictors, we might overfit to the data.

So, how might we prevent this? One possibility is to rein in the coefficients. We could define some value 𝑡 and say that the sum of the magnitudes of all coefficients is not allowed to be larger than 𝑡. We can define this by putting a constraint on 𝛽ⱼ:

∑ⱼ |𝛽ⱼ| ≤ 𝑡

where 𝑡 (which will be some positive number) is a tuning parameter that controls the amount of “shrinkage” applied to the coefficients.

This is pretty restrictive! We’re saying that even if the best fit in terms of minimizing error has coefficients which are very high, we’re going to use small coefficients.

Let’s consider the case where there are two predictor variables, so we will have two 𝛽 coefficients to estimate in order to predict a target variable.

We run a standard regression, and the coefficients come out to be 10 for both variables:

However, if we’re using lasso, we’ll impose a constraint on the absolute sum of the size of these coefficients. If we choose t = 4:

4 ≥ (|β₁| + |β₂|), so the closest we can get to our regression coefficient is β₁ = β₂ = 2.

Our final estimate of the coefficients would be 2 and 2, as indicated by the green dot. We are also “allowed” to put all of the weight into one coefficient and have β₁ = 0 and β₂ = 4 (the small black dot below). We don’t choose to do this, though, since it is further away from the red dot than the green dot is.
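
A quick numerical check (using the coordinates from the example above) confirms that the green dot really is closer:

```python
import numpy as np

# Distance from the unconstrained estimate (10, 10) to the two candidates
# on the t = 4 diamond: the green dot (2, 2) and the black dot (0, 4).
red_dot = np.array([10.0, 10.0])
green_dot = np.array([2.0, 2.0])
black_dot = np.array([0.0, 4.0])

print(np.linalg.norm(red_dot - green_dot))  # about 11.31
print(np.linalg.norm(red_dot - black_dot))  # about 11.66, farther away, so we prefer (2, 2)
```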

What if we plot all possible “allowed” β combinations by lasso, assuming that the standard regression result will be outside of our constraint?

What if we set t = 8? Then, the diamond is wider and we get closer to the estimate of the coefficients which minimize squared error (the red dot). With lasso, in this example, we might select the place on the diamond that gets us closest to the red dot.

Really, we are selecting the black dot which will minimize squared error most (and it might not always be the closest to the red dot, if minimizing squared error on the diamond leads to different relative coefficient weights than at the red dot). Here’s a nice visualization from this blog:

Above, the squared error loss function (contour) is symmetric (concentric circles). We select the pink dot when we do lasso, and we select the black dot when we do standard regression.

You’ll notice that the vertices of the diamond stick out more, and in three of the above examples, the point on the diamond that minimizes squared error is a vertex. At the vertex, one coefficient is as large as it can be (within the diamond constraint) and the other is zero. For t = 4, this could mean we set the first coefficient to 4 and the second to 0. This moves us a distance of 4 units away from the origin on the graph. If we selected 2 and 2, we would move a distance of only the hypotenuse: √(2² + 2²) = 2.828…, which is less than 4.

This means for a great number of possible loss functions, it is beneficial to select a vertex of the diamond because it sticks out more from the origin than anywhere else on the diamond. This allows us to minimize squared error most at the vertex, unless the area with the best minimization is close to the origin or equally favors two variables. (By the way, in these examples, we have been choosing locations on the diamond, but we are allowed to choose locations in the diamond if it’s beneficial, too.)
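
Here is a brute-force sketch of that idea. For a few made-up “red dot” locations (and circular loss contours, so the best allowed point is simply the closest one), we search a grid over the t = 4 diamond:

```python
import numpy as np

# Brute-force search over the diamond |b1| + |b2| <= 4 for the point closest
# to each "red dot" (the unconstrained estimate).  The red-dot locations are
# made up for illustration.
t = 4.0
grid = np.linspace(-t, t, 801)
b1, b2 = np.meshgrid(grid, grid)
inside = np.abs(b1) + np.abs(b2) <= t + 1e-9   # allowed region, tiny float tolerance

for red_dot in [(10.0, 10.0), (9.0, 3.0), (3.0, 9.0)]:
    dist = np.hypot(b1 - red_dot[0], b2 - red_dot[1])
    dist[~inside] = np.inf                     # rule out points outside the diamond
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    print(red_dot, "->", (b1[i, j], b2[i, j]))

# (10, 10) -> (2, 2): both coefficients survive.
# (9, 3)   -> (4, 0): a vertex; the second coefficient is exactly zero.
# (3, 9)   -> (0, 4): a vertex; the first coefficient is exactly zero.
```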

If we have three variables, the constraint region will be an octahedron with even more vertices:

Above: t ≥ |β₁| + |β₂| + |β₃|.

Putting all of this together, the lasso estimate (𝛼̂, 𝛽̂) is obtained by solving:

(𝛼̂, 𝛽̂) = argmin ∑ᵢ (𝑦ᵢ − 𝛼 − ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ)², subject to the constraint ∑ⱼ |𝛽ⱼ| ≤ 𝑡.

Here, 𝛼̂ is the estimated intercept, and 𝛽̂ = (𝛽̂₁, …, 𝛽̂ₚ)ᵀ are the estimated coefficients for the predictor variables.

This means we select the values for 𝛼̂ and 𝛽̂ that minimize the function, but we’re forcing the sum of the absolute values of the 𝛽 coefficients to be at most 𝑡.
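
As a sketch only, one way to solve this hard-constraint problem numerically is to hand it to a general-purpose constrained optimizer such as scipy.optimize.minimize. The data below are made up, and real lasso implementations use specialized algorithms rather than this approach:

```python
import numpy as np
from scipy.optimize import minimize

# Hard-constraint sketch: minimize the sum of squared errors subject to
# sum_j |beta_j| <= t.  X and y are made-up data; a general-purpose solver
# is used here only for illustration.
rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = 2.0 + X @ np.array([3.0, -1.5, 0.0]) + rng.normal(scale=0.5, size=N)

t = 2.0  # the constraint budget

def sse(params):
    alpha, beta = params[0], params[1:]
    return np.sum((y - alpha - X @ beta) ** 2)

# SLSQP expects inequality constraints in the form fun(params) >= 0.
constraint = {"type": "ineq", "fun": lambda params: t - np.sum(np.abs(params[1:]))}
result = minimize(sse, x0=np.zeros(p + 1), method="SLSQP", constraints=[constraint])

alpha_hat, beta_hat = result.x[0], result.x[1:]
print(alpha_hat, beta_hat, np.sum(np.abs(beta_hat)))  # the L1 norm ends up near t
```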

The lasso method assumes that the predictor variables are standardized, meaning that for each predictor variable 𝑗, the mean ∑ᵢ 𝑥ᵢⱼ / 𝑁 = 0 and the variance ∑ᵢ 𝑥ᵢⱼ² / 𝑁 = 1. (Mean of each predictor is zero, with unit variance.) This is so the scale of the predictor variables does not affect the lasso solution.
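
A minimal sketch of that standardization step, assuming X is an 𝑁-by-𝑝 NumPy array of predictors:

```python
import numpy as np

# Standardize each predictor column to mean 0 and variance 1, matching the
# formulas above (population standard deviation, ddof=0).
def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# Small check on made-up data: column means come out ~0, column variances ~1.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Z = standardize(X)
print(Z.mean(axis=0))          # approximately [0, 0]
print((Z ** 2).mean(axis=0))   # approximately [1, 1]
```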

You’ll notice that we haven’t talked much about the tuning parameter 𝑡, which controls the amount of shrinkage applied to the coefficients.

When 𝑡 ≥ 𝑡₀ = ∑ⱼ |𝛽̂ⱼ|, where the 𝛽̂ⱼ are the ordinary least squares estimates, no shrinkage is applied (𝑡 is large enough to contain the sum of the coefficient magnitudes), and the lasso solution is equivalent to the ordinary least squares solution. As 𝑡 decreases from 𝑡₀, the coefficients start shrinking towards zero. As we saw, some coefficients may become exactly zero, so lasso effectively performs variable selection at the same time as regularization. A smaller value of 𝑡 results in a simpler model with fewer non-zero coefficients, while a larger value of 𝑡 allows for more complex models (at risk of overfitting).

Implementing lasso is a bit different, because we have to select 𝑡 in a smart way. This post explains how lasso is different in practice:

The math used to implement regularization does not correspond to pictures commonly used to explain regularization. . . . [The math and the visualizations] show[] how we regularize models conceptually, with hard constraints, not how we actually implement regularization, with soft constraints!

The way the math has been set up so far is the figure on the left below. It’s a “hard” constraint in the sense that we are forcing the coefficients to be in the blue region. However, in implementation, it’s more of a suggestion.

A suggestion seems useful, because what if the least-squares estimate is very far away from our region? If that’s the case, perhaps our t was too restrictive and we may underfit. But with a hard constraint, this information isn’t taken into account (during a single optimization fit, at least).

How might we formalize this idea of a “suggestion” instead of a rule? The same post discusses this too (I recommend checking it out).

To implement a loss function with a hard constraint, we could use a gradient descent minimization approach as usual. The problem is that, at each step moving us closer to the loss function minimum, we’d need an IF-statement asking if [we] had exceeded the constraint. Certainly that would work, but it would definitely slow down the computation. . . .

An alternative approach is to convert the constraint into just another term of the loss function, but at the cost of making it a soft constraint, not a hard constraint.

One approach is to tell the function to minimize both the squared error and the sum of predictor variable coefficients at the same time.

We want to find the α̂ and 𝛽̂ which minimize the squared error plus the sum of the coefficient magnitudes:

(𝛼̂, 𝛽̂) = argmin ∑ᵢ (𝑦ᵢ − 𝛼 − ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ)² + ∑ⱼ |𝛽ⱼ|.

An issue with the above implementation is that we lose the ability to control how much the model cares about reducing the coefficient magnitudes. When we used 𝑡, we could try out different values to see what worked best. We should include a way to strengthen or weaken the importance of minimizing the coefficient sizes. We can multiply the coefficient magnitude term by a tunable hyperparameter, λ, so that when the hyperparameter is small, the model cares less about minimizing the coefficient sizes:

(𝛼̂, 𝛽̂) = argmin ∑ᵢ (𝑦ᵢ − 𝛼 − ∑ⱼ 𝛽ⱼ𝑥ᵢⱼ)² + λ ∑ⱼ |𝛽ⱼ|.
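
As a sketch, we can hand this penalized objective directly to a general-purpose optimizer (again with made-up data; this only illustrates the objective, not how lasso is solved in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Soft-constraint sketch: the penalty is just another term of the loss.
# Made-up data; a derivative-free method is used because the absolute value
# term is not differentiable at zero.
rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))
y = 2.0 + X @ np.array([3.0, -1.5, 0.0]) + rng.normal(scale=0.5, size=N)

lam = 10.0  # the tunable hyperparameter lambda

def penalized_loss(params):
    alpha, beta = params[0], params[1:]
    return np.sum((y - alpha - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

result = minimize(penalized_loss, x0=np.zeros(p + 1), method="Powell")
print(result.x)  # [alpha_hat, beta_1_hat, ..., beta_p_hat]
```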

That’s it! We’ve now implemented lasso. We can choose the value of λ that works best for our application or use cross-validation to select the value of λ that performs best.
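
In practice we’d reach for a dedicated implementation. One option is scikit-learn’s LassoCV, which chooses its penalty strength by cross-validation; note that scikit-learn scales its objective by 1/(2𝑁), so its alpha is proportional to, but not numerically the same as, the λ written above. The data here are made up:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Made-up data: five predictors, two of which have no effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 1.0 + X @ np.array([2.0, 0.0, -3.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LassoCV(cv=5).fit(X, y)
print(model.alpha_)      # penalty strength chosen by cross-validation
print(model.intercept_)  # estimated intercept (alpha in our notation)
print(model.coef_)       # estimated betas; some may be exactly zero
```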

Depending on λ, the estimated coefficient values will lie between the origin and the minimum squared error estimate. As λ approaches infinity, the coefficients will approach zero, and as λ approaches zero, the coefficients approach the minimum squared error estimate. The figure from this post illustrates this:
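
As a quick numerical sketch of the same effect (made-up data again, with scikit-learn’s alpha playing the role of λ up to scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Fit the same made-up data at several penalty strengths and watch the
# coefficients shrink from roughly their least squares values (tiny penalty)
# toward exactly zero (large penalty).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 0.1, 0.5, 1.0, 5.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coef, 3))
```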

I hope this gives some mathematical intuition behind how lasso works and why it sets the coefficients of predictor variables to zero.

(You’ll notice that we glossed over why the soft and hard constraints are equivalent. To understand why, read this article. In short, it’s because the sum of the absolute values of the lasso coefficients increases continuously from zero up to the sum of the absolute values of the least squares estimates as λ decreases from infinity to zero, so for any hard constraint 𝑡 there must be some λ along the way at which the soft-constraint solution is equivalent to the hard-constraint one.)
