Visualizing regularization and the L1 and L2 norms
Why does minimizing the norms induce regularization?
If you’ve taken an introductory Machine Learning class, you’ve certainly come across the issue of overfitting and been introduced to the concepts of regularization and norms. I often see these discussed purely by looking at the formulas, so I figured I’d try to give some better insight into why exactly minimizing the norm induces regularization — and how L1 and L2 differ from each other — using some visual examples.
Prerequisite knowledge
- Linear regression
- Gradient descent
- Some understanding of overfitting and regularization
Topics covered
- Why does minimizing the norm induce regularization?
- What’s the difference between the L1 norm and the L2 norm?
Recap of regularization
Using the example of linear regression, our loss is given by the Mean Squared Error (MSE):
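$$\text{MSE}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \mathbf{w}^\top\mathbf{x}_i\bigr)^2$$

where, in the usual notation, $\mathbf{x}_i$ are the inputs, $y_i$ the targets, $\mathbf{w}$ the weights, and $n$ the number of training examples,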
and our goal is to minimize this loss:
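$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \; \text{MSE}(\mathbf{w})$$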
To prevent overfitting, we want to add a bias towards less complex functions. That is, given two functions that can fit our data reasonably well, we prefer the simpler one. We do this by adding a regularization term, typically either the L1 norm or the squared L2 norm:
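For a weight vector $\mathbf{w} = (w_1, \dots, w_m)$ these are

$$\|\mathbf{w}\|_1 = \sum_{j=1}^{m} |w_j| \qquad \text{and} \qquad \|\mathbf{w}\|_2^2 = \sum_{j=1}^{m} w_j^2.$$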
So, for example, by adding the squared L2 norm to the loss and minimizing, we obtain Ridge Regression:
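$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \; \Bigl(\text{MSE}(\mathbf{w}) + \lambda\,\|\mathbf{w}\|_2^2\Bigr)$$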
where λ is the regularization coefficient, which determines how much regularization we want.
Why does minimizing the norm induce regularization?
Minimizing the norm encourages the function to be less “complex”. Mathematically, we can see that both the L1 and L2 norms are measures of the magnitude of the weights: the sum of the absolute values in the case of the L1 norm, and the sum of squared values for the L2 norm. So larger weights give a larger norm. This means that, simply put, minimizing the norm encourages the weights to be small, which in turn gives “simpler” functions.
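For example, for a (made-up) weight vector w = (4, −2, 0.5), the L1 norm is |4| + |−2| + |0.5| = 6.5 and the squared L2 norm is 16 + 4 + 0.25 = 20.25; shrinking any of the weights shrinks both norms.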
Let’s visualize this with an example. Let’s assume that we get some data that looks like this:
What function should we pick to fit this data? There are many options; here are three examples:
Here we have a 2nd-degree polynomial fit and two different 8th-degree polynomials, given by the following equations:
The first two (which are “simpler” functions) will most likely generalize better to new data, while the third one (a more complex function) is clearly overfitting the training data. How is this complexity reflected in the norm?
As we can see, line [c] has a mean squared error of 0, but its norms are quite high. Lines [a] and [b], instead, have a slightly higher MSE but their norms are much lower:
- Line [a] has lower norms because it has significantly fewer parameters compared to [c]
- Line [b] has lower norms because, despite having the same number of parameters as [c], they are all much smaller
From this we can conclude that by adding the L1 or L2 norm to our minimization objective, we can encourage simpler functions with lower weights, which will have a regularization effect and help our model to better generalize on new data.
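To make this concrete, here is a minimal sketch of the same kind of comparison on made-up data (the exact numbers won’t match the table above, but the pattern should be similar): the higher-degree fit chases the training points, and its coefficient norms blow up.

```python
import numpy as np

# Made-up data: a noisy quadratic, standing in for the dataset in the plots above.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 15)
y = 0.5 * x**2 - x + 1 + rng.normal(scale=1.0, size=x.shape)

for degree in (2, 8):
    w = np.polyfit(x, y, deg=degree)            # unregularized least-squares fit
    mse = np.mean((np.polyval(w, x) - y) ** 2)  # training MSE
    l1 = np.sum(np.abs(w))                      # L1 norm of the coefficients
    l2_sq = np.sum(w**2)                        # squared L2 norm
    print(f"degree {degree}: MSE={mse:.3f}  L1={l1:.2f}  L2^2={l2_sq:.2f}")
```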
What’s the difference between the L1 norm and the L2 norm?
We’ve already seen that to reduce the complexity of a function we can either drop some weights entirely (setting them to zero), or make all weights as small as possible, which brings us to the difference between L1 and L2.
To understand how they operate differently, let’s have a look at how they change depending on the value of the weights.
On the left we have a plot of the L1 and L2 norm for a given weight w. On the right, we have the corresponding graph for the slope of the norms. As we can see, both increase for increasing absolute values of w. However, while the L1 norm increases at a constant rate, the (squared) L2 norm increases quadratically.
This is important because, as we know, when doing gradient descent we update our weights based on the derivative of the loss function. So if we’ve included a norm in our loss function, the derivative of the norm will determine how the weights get updated.
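Concretely, for a single weight w, a gradient descent step on the regularized loss looks like

$$w \;\leftarrow\; w - \eta\left(\frac{\partial\,\text{MSE}}{\partial w} + \lambda\,\frac{\partial R(w)}{\partial w}\right), \qquad \frac{\partial\,|w|}{\partial w} = \operatorname{sign}(w), \qquad \frac{\partial\,w^2}{\partial w} = 2w,$$

where $\eta$ is the learning rate and $R(w)$ is the penalty, either $|w|$ or $w^2$ (strictly speaking, $\operatorname{sign}(w)$ is a subgradient, since $|w|$ is not differentiable at 0).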
We can see that with the L2 norm, as w gets smaller, so does the slope of the norm, meaning that the updates also become smaller and smaller. When the weights are close to 0 the updates become so small as to be almost negligible, so it’s unlikely that the weights will ever reach exactly 0.
On the other hand, with the L1 norm the slope is constant. This means that as w gets smaller the updates don’t change, so we keep getting the same “reward” for making the weights smaller. Therefore, the L1 norm is much more likely to reduce some weights to 0.
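Here is a toy numerical sketch of just the penalty term’s effect on a single weight (the data-fit gradient is left out, and the learning rate and λ are made-up values). The L2 update shrinks the weight geometrically, while the L1 update walks it down to exactly 0:

```python
eta, lam = 0.1, 1.0          # made-up learning rate and regularization coefficient
w_l2 = w_l1 = 2.0            # same starting weight for both penalties

for _ in range(60):
    w_l2 -= eta * lam * 2 * w_l2                       # L2 step: proportional to w
    grad = 1.0 if w_l1 > 0 else -1.0 if w_l1 < 0 else 0.0
    w_l1 -= eta * lam * grad                           # L1 step: constant size
    if abs(w_l1) < eta * lam:                          # snap to 0 instead of oscillating
        w_l1 = 0.0

print(w_l2, w_l1)  # w_l2 is tiny but nonzero; w_l1 is exactly 0.0
```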
To recap:
- The L1 norm will drive some weights to 0, inducing sparsity in the weights. This can be beneficial for memory efficiency or when feature selection is needed (i.e. we want to keep only a subset of the features).
- The L2 norm instead will reduce all weights but not drive them all the way to 0. This is less memory efficient, but can be useful if we want or need to retain all parameters. A quick sketch of both behaviors follows below.
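As a sanity check, here is a sketch using scikit-learn on made-up data where only 3 of 20 features actually matter (Lasso is the L1-penalized counterpart of the Ridge regression above; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]                 # only the first 3 features are informative
y = X @ true_w + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)            # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)            # squared L2 penalty

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # most of the 20 end up exactly 0
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none are exactly 0
```

Running this, Lasso typically zeroes out the uninformative coefficients entirely, while Ridge keeps every coefficient small but nonzero — exactly the sparsity difference described above.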