L1 & L2 Regularization, plus Dropout!

Juan Vera
10 min read · Jun 11, 2024


Personal notes and practical implementations from scratch.

Don’t let this become the surface of your loss function… regularize!

Regularization, a blessing for its existence, is a technique that aims to mitigate overfitting on a dataset, and variance across differing datasets, by penalizing complexity: modifying your loss function through the addition of a penalty term, as is the case in L1 / L2 regularization.

Complexity, meaning the degree to which the parameters, θ, of your neural network optimize to overfit your dataset.

(Here, I’ll be referring to θ as any given parameter, a weight or a bias.)

As for the case of Dropout, rather than adding a penalty to the loss function, it instead “eliminates” or “drops” a random set of parameters during each forward pass or training step.

Of course, there are many different methods of regularizing θ values, but 3 of them have stood out to me as the foundational means of ensuring that a model doesn’t overfit on our training dataset.

  • L1 Regularization (though not used as often)
  • L2 regularization
  • Dropout

These methods are key to include in a model to ensure generalizability: increasing accuracy on unseen data, and with it, performance during inference / production.

You don’t want your model to underfit (high bias), otherwise you won’t be able to make reliable predictions on a dataset.

You don’t want your model to overfit (high variance), otherwise, you won’t be able to generalize to unseen data.

You want your model to be just right in terms of fitting a more general function approximation to your dataset¹.

For reference, I’ll be defining some of the more general notation here, just in case I forget to do so in the sections below.

l = the index of a given layer.
n = the index of a given neuron at the lth layer
ŷ = output prediction of the neural network
y = ground truth label of a given sample
m = total number of samples in the dataset

L1 Regularization

So as mentioned earlier, L1 regularization involves the addition of a penalty term to the loss function, which here we’ll define as L(ŷ, y), where ŷ is the predicted output of the model and y is the one-hot encoding of the ground-truth label.

Cross-entropy loss:

L(ŷ, y) = −Σᵢ yᵢ log(ŷᵢ), summed over the output classes.
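
As a quick reference, here’s a minimal NumPy version of this loss; the shapes and names are illustrative assumptions:

```python
import numpy as np

def cross_entropy(y_hat, y):
    # y_hat: (m, classes) softmax probabilities; y: (m, classes) one-hot labels
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(y_hat + eps)) / y.shape[0]  # averaged over the m samples
```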

The penalty term that’s added to the loss function is computed as the Manhattan distance (or taxicab distance) from the origin.

Typically, you’d define the Manhattan distance between a vector x and a vector y as:

d(x, y) = Σᵢ |xᵢ − yᵢ|

But in the case of the penalty term, we’ll be computing the Manhattan norm: the magnitude of a vector, measured from the origin of the L1 space, ||x||₁ = Σᵢ |xᵢ|.

This is also known as the L1 norm.

This L1 norm is multiplied by a hyperparameter, our regularization term lambda (ƛ), and then added to the loss function as an additional penalty.

Say for example, we want to regularize our parameters w in the second hidden layer of a neural network.

We’d compute it as:

||W₂||₁ = Σₙ |wₙ|

Where n indexes a given weight of layer l (here, l = 2), i.e. the sum of the absolute values of every entry of W₂.


The penalty term is the norm scaled by lambda:

ƛ||W₂||₁

And it’s added to the averaged loss:

J(ŷ, y) = (1/m) Σᵢ L(ŷᵢ, yᵢ) + ƛ||W₂||₁

Where J is the cost, which is the loss averaged across all m samples of the batch (or mini-batch) fed into the neural network.

L is averaged over the input samples per batch / mini-batch, as it’s more computationally efficient, yet still effective in terms of learning.

Then, the derivative of the penalty term, ƛ||W||₁, is taken and added as a penalty to the gradients of a given parameter, θ.

Given that we’re taking the derivative of a vector of absolute values, θ, the derivative then turns out to be sgn(θ).

Where the sgn() function is equal to:

1 if θ > 0
0 if θ = 0
-1 if θ < 0
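
In NumPy, np.sign implements exactly this piecewise function, and it returns 0 at 0, which matters later when we hit the undefined point of the absolute value’s derivative. A quick check:

```python
import numpy as np

theta = np.array([-2.5, 0.0, 3.1])
print(np.sign(theta))  # [-1.  0.  1.]
```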

Then the derivative of the penalty term turns out to be:

∂(ƛ||θ||₁) / ∂θ = ƛ · sgn(θ)

So then, this new penalty term is added to the gradient of the cost function, J(ŷ , y), with respect to the given parameter, θ.

As in the earlier example, if we were computing the regularized gradient of the loss with respect to our weights, W₂, in the second layer of our neural network, we’d compute it as:

∂J(reg)/∂W₂ = ∂J/∂W₂ + ƛ · sgn(W₂)

The “reg” gradient on the left being our new, regularized gradient.

Then afterward, just as is done in gradient descent, we can apply our typical weight update:

W₂ := W₂ − α · ∂J(reg)/∂W₂

Where α (alpha) is the learning rate.

For intuition, what this entire process does is add a penalty term to the loss, based on the Manhattan norm of a weight matrix, W.

This process then also adds another penalty term to the gradient of the loss with respect to W.

Both of these penalty terms grow as the Manhattan norm (the magnitude) of W increases, thereby increasing the loss and making the gradient steeper for larger values of W.

This, in a sense, “punishes” large values of W and enables the model to avoid them through the weight update.

In code, here’s how you might apply it from scratch (using only NumPy) to a neural network:
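
A minimal sketch, assuming the averaged loss and raw gradients have already been computed by your own forward / backward pass (the dict layout and the names w1, w2, lmbda, alpha are illustrative):

```python
import numpy as np

def apply_l1(loss, grads, weights, lmbda):
    # Add ƛ||W||₁ to the loss and ƛ·sgn(W) to each weight gradient
    for name, w in weights.items():
        loss += lmbda * np.sum(np.abs(w))   # penalty: ƛ * (Manhattan norm of W)
        grads[name] += lmbda * np.sign(w)   # gradient penalty: ƛ * sgn(W)
    return loss, grads

# Usage inside a training step (names are illustrative):
# loss, grads = apply_l1(loss, grads, {'w1': w1, 'w2': w2}, lmbda=1e-4)
# w1 -= alpha * grads['w1']
# w2 -= alpha * grads['w2']
```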

Here, regularization is being applied to the first and second layers. You might only need to regularize a single layer, depending on its complexity.

Note:

Instead of using np.sum(np.abs(w)), you can make use of np.linalg.norm(w.ravel(), ord = 1). (Careful: on a 2-D array, np.linalg.norm(w, ord = 1) computes the induced matrix 1-norm, the maximum absolute column sum, not the entrywise sum.)

L2 Regularization

L2 regularization works in a very similar way as L1 regularization with the only difference being in the order of the norm being calculated for a given parameter, θ.

While L1 regularization makes use of the Manhattan norm (the L1 norm), L2 regularization computes the squared Euclidean norm; for a matrix of weights, the Euclidean norm is known as the Frobenius norm.

Essentially, this norm is the square root of the sum of all parameters θ squared, squared.

Yes, 2 “squares”. Not a typo.

It’s really a mouthful to say (or type), lol.

You can really just dumb this down to the summation of all squared θ values.

So this squared Frobenius norm is computed as:

||W||₂² = (√(Σₙ θₙ²))² = Σₙ θₙ²

So going back to our example, where we compute the norm for our weights, W₂, at the second layer of a neural network:

||W₂||₂² = Σₙ (wₙ)²

The 2 next to W indicates the 2nd layer; the superscript 2 indicates the square; the subscript 2 on the norm indicates the 2nd-order norm, aka the L2 norm.

Then, just as prior, we multiply the squared Frobenius norm by ƛ to get the penalty term, and add the penalty term to the cost function:

J(ŷ, y) = (1/m) Σᵢ L(ŷᵢ, yᵢ) + ƛ||W₂||₂²

Again, where m is the total number of samples in the dataset.


Now, just as was done in L1 regularization, we then add the derivative of the penalty term, ƛ||W₂||₂², to the gradient of the loss with respect to a given parameter, θ.

In our example, we’ve computed the penalty term based on the weight matrix of the second layer of our neural network.

So, we can add the derivative of ƛ||W₂||₂² to the gradient of the cost J(ŷ, y) with respect to our weight matrix W₂:

∂J(reg)/∂W₂ = ∂J/∂W₂ + 2ƛW₂

Then we take the regularized gradient and use it in the update rule:

W₂ := W₂ − α · ∂J(reg)/∂W₂

For intuition, again just like L1 regularization, this is essentially adding a penalty term to the loss and to the gradient.

The larger a given weight is, the larger both penalty terms, ƛ||W₂||₂² and 2ƛW₂, will be, in turn increasing the loss and steepening the gradient, which punishes larger values of W₂.

Which ultimately regularizes the model weights and helps the neural network avoid large values of W₂.

In code, here’s how you might apply L2 regularization to your model from scratch:
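
Again, a minimal sketch under the same assumptions as the L1 helper (precomputed loss and gradients; illustrative names):

```python
import numpy as np

def apply_l2(loss, grads, weights, lmbda):
    # Add ƛ||W||₂² to the loss and 2ƛW to each weight gradient
    for name, w in weights.items():
        loss += lmbda * np.sum(np.square(w))   # penalty: ƛ * squared Frobenius norm
        grads[name] += 2 * lmbda * w           # gradient penalty: 2ƛW
    return loss, grads

# Usage mirrors the L1 helper above:
# loss, grads = apply_l2(loss, grads, {'w1': w1, 'w2': w2}, lmbda=1e-4)
```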

Again, here L2 regularization is being applied to both the 1st and 2nd layers of a neural network. You should only apply it to the layers that truly need it.

Again, you can leverage NumPy and use np.linalg.norm(w2) ** 2 instead of np.sum(np.square(w2)). (The default ord for a matrix is 'fro', the Frobenius norm; careful, ord = 2 on a 2-D array computes the spectral norm instead.)

L2 regularization is the more commonly used form of regularization over L1, as its derivative is smooth and continuous, unlike that of L1, which is undefined at 0 (the derivative of the absolute value being x / |x|).

Though the derivative of the L1 penalty term is undefined at 0, np.sign (or np.where) is able to easily overcome this in practice by returning 0 there. In theory, that discontinuity is where L1 can break down.

Dropout

For this section, I’ll be including some notes I took on the original paper on Dropout, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Srivastava et al.

Paper Notes:

  • Combining models nearly always improves generalization performance, given that they were trained on different samples of a dataset and with varied architectures (ensemble methods)
  • Though, doing this with large neural networks and taking the average of the outputs can be computationally expensive.
  • It’s also a daunting task to train multiple architectures, find the optimal hyperparameters for each, and collect a good amount of training data.
  • Dropout aims to address these issues by preventing overfitting and combining variations of neural network architectures in a more efficient manner.
  • Essentially, dropout removes a set of neurons at a given iteration or epoch and keeps a set based on a keep-probability, p. The probability of dropping a neuron is 1 − p.
  • The result of applying dropout is then a thinner neural network, comprised of neurons “that survived dropout”
  • There is a collection of 2ⁿ possible thinned networks (given the random probability and the binary choice of including each neuron), where n is the total number of neurons in the overall network.
  • It isn’t feasible to average the outputs of all 2ⁿ thinned networks (an exponential count with limited computational power), so instead you can multiply the weights by the scaling factor p during testing, or divide the activations by p during training.
  • Then you can, in a sense, more easily combine the thinned networks at test time as a single model, ultimately improving the likelihood of generalization.

Given a model of:

  1. z = wy + b
  2. y = f(z)

The model, with the dropout operation becomes:

  1. r ~ Bernoulli(p), where r is a vector of binary values (0 or 1) drawn from a Bernoulli distribution, with probability p of a value being 1
  2. ỹ = r * y, where ỹ are the inputs with dropout applied. So here, ỹ is converted into a sparse input vector
  3. z = wỹ + b
  4. y = f(z)

This applies dropout purely to the inputs y.

  • During testing, you must then multiply the weights by the probability p to scale them, aligning with the expected output of the 2ⁿ thinned networks.
  • Or, during training, you can scale the activations by dividing y (or a, depending on the layer) by p. This is called ‘inverted dropout’. A quick sketch of both options follows.
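
Here’s a quick sketch contrasting the two options; the arrays and the value of p are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                              # keep-probability
a = rng.random((4, 3))               # hypothetical activations
w = rng.normal(size=(3, 3))          # hypothetical weights
mask = rng.random(a.shape) < p       # r ~ Bernoulli(p)

# Option 1 (the paper's scheme): plain dropout during training...
a_train = a * mask
# ...then scale the learned weights by p at test time:
w_test = w * p

# Option 2 (inverted dropout): divide by p during training,
# so the weights need no scaling at test time:
a_train_inverted = (a * mask) / p
```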

Explaining Dropout

So in essence, dropout is a regularization technique that eliminates / ignores a random set of neurons, effectively thinning the network to reduce overfitting and improve generalizability.

It aims to reduce co-dependence / co-adaptation amongst neurons.

Some neurons tend to depend on other neurons to do the ‘hard work’ of contributing to the final prediction: some carry higher values than others, while the rest never reach values high enough to meaningfully contribute to the overall output.

When you drop out neurons randomly, the ‘lazy’ neurons must start learning and begin to reduce their reliance on the ‘hard-working’ neurons.

In dropout regularization, dropping neurons out means zeroing a set of neurons based on a probability 1 − p, and normalizing the neurons that aren’t eliminated as a / p, where a are the inputs to a given layer.

Dropout is typically implemented per layer, with the probability p of keeping a neuron’s output during training. This probability is often lower for layers with a higher number of weights to reduce overfitting and higher for layers with a lower number of weights where overfitting is less of a concern.

So mathematically, where the activations are a, dropout looks like:

a′ = (r ⊙ a) / p, with r ~ Bernoulli(p)

Where a′ are the ‘dropped out’ activations, a are the original activations, r is the binary mask, and p is the probability of keeping a neuron.

So in code, dropout can be implemented in the forward pass as:
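
A minimal sketch of inverted dropout in the forward pass; the function name, the training flag, and the mask caching are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, p, training=True):
    # Inverted dropout on activations `a` with keep-probability p
    if not training:
        return a, None                   # identity at test time, no scaling needed
    mask = rng.random(a.shape) < p       # r ~ Bernoulli(p)
    return (a * mask) / p, mask          # zero dropped units, rescale survivors

# Usage (illustrative): a2_dropped, mask2 = dropout_forward(a2, p=0.8)
# During backprop, multiply the upstream gradient by the same mask / p.
```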

It’s typically best not to use dropout if your neural network doesn’t overfit.

But if your neural network isn’t overfitting, then to maximize performance you can increase the capacity of your model until it does overfit, and then use dropout, building a model of larger capacity whilst still reducing the variance and improving generalizability.

If by any chance you found any unaddressed typos in the article, let me know and I’ll patch them up!

X | Newsletter | Email

[1] Interestingly, neural networks are universal function approximators, making them extremely promising for correctly fitting to data.

The key then, is to introduce model generalizability, regularization being a means to do so.
