Deep Learning (Part 35)-Regularization

Coursesteach


📚 Chapter 6: Practical Aspects of Deep Learning

Introduction

If you suspect that your neural network is overfitting your data, indicating a high variance problem, one of the first things you should try is regularization. Another way to address high variance is to get more training data; however, additional training data isn't always available, and it can be costly to acquire.

Regularization helps prevent overfitting in logistic regression by adding a penalty term to the cost function. This term controls the size of the coefficients, leading to simpler models that generalize better.

Sections

Regularization in Logistic Regression
Why do you regularize just the parameter w?
L1 and L2 Regularization
Role of Lambda in Regularization
L2 Regularization for Neural Networks

Section 1- Regularization in Logistic Regression

However, incorporating regularization can frequently aid in preventing overfitting or in reducing the variance within your network. Regularization is a technique used to prevent overfitting in logistic regression models by adding a penalty term to the cost function. This penalty term controls the size of the coefficients, leading to simpler models that are less likely to overfit and more likely to generalize well on unseen data.

Figure 1

In logistic regression, the goal is to minimize the cost function J(w, b), which is defined as the average, over your m training examples, of the cross-entropy losses of the individual predictions:

J(w, b) = (1/m) Σᵢ L(ŷ(i), y(i))

Recall that in logistic regression, w and b are the parameters, where w is an n_x-dimensional vector and b is a scalar. To incorporate regularization into logistic regression, you introduce the regularization parameter lambda (λ) and add λ/2m times the squared norm of w to the cost function:

J(w, b) = (1/m) Σᵢ L(ŷ(i), y(i)) + (λ/2m) ‖w‖₂²

Here the squared norm of w is just the sum from j = 1 to n_x of w_j², which can also be written as wᵀw; it is the squared Euclidean norm of the parameter vector w. This is referred to as L2 regularization because it uses the Euclidean norm, also called the L2 norm, of the parameter vector w.
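To make this concrete, here is a minimal NumPy sketch of the regularized cost described above; the function name, variable names, and array shapes are illustrative assumptions, not code from the course.

```python
import numpy as np

def l2_regularized_cost(w, b, X, y, lambd):
    """Cross-entropy cost of logistic regression plus an L2 penalty.

    X: (n_x, m) inputs, y: (1, m) labels, w: (n_x, 1) weights, b: scalar bias,
    lambd: the regularization parameter lambda.
    """
    m = X.shape[1]
    z = np.dot(w.T, X) + b                    # linear part
    a = 1.0 / (1.0 + np.exp(-z))              # sigmoid activation, i.e. y_hat
    cross_entropy = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))   # (lambda / 2m) * ||w||_2^2
    return cross_entropy + l2_penalty
```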

Section 2- Why do you regularize just the parameter w?

Why not include something here about b as well? In practice, you could, but it's usually just omitted. The parameter w is typically a high-dimensional vector; particularly in a high variance problem, w may have a very large number of parameters, making it hard to fit them all, whereas b is just a single number.

Almost all the parameters are in ‘w’ rather than ‘b’. Adding this last term typically doesn’t make a significant difference because ‘b’ is only one parameter among many. In practice, it’s often not included, but it’s optional if you choose to do so.

Regularizing only w (the weights) is common because w is typically high-dimensional, whereas b (the bias) is just a single parameter. Regularizing w is what effectively controls the complexity of the model.

Section 3- L1 and L2 Regularization

Def: L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), is a technique used in machine learning and statistical modeling to prevent overfitting and enhance model generalization by adding a penalty term to the loss function.

Sparsity: One of the key characteristics of L1 regularization is that it can produce sparse solutions, meaning that it can drive some coefficients to exactly zero. This is useful for feature selection because it effectively reduces the number of features in the model.

Overfitting Prevention: By adding a penalty for large coefficients, L1 regularization discourages the model from fitting the noise in the training data, thus helping to prevent overfitting.

L2 regularization is the most commonly used type of regularization. You may have also heard some people mention L1 regularization.

With L1 regularization, you instead add a term that is lambda divided by m times the sum of the absolute values of the weights, i.e. the L1 norm of the parameter vector w, written ‖w‖₁ (hence the little subscript 1). Whether you put m or 2m in the denominator is just a scaling constant.

If you use L1 regularization, then w will end up being sparse, meaning that the w vector will have a lot of zeros in it. Some people say this can help with compressing the model, because when a set of parameters is zero, you need less memory to store the model.

In practice, I’ve found that L1 regularization, which is intended to make models sparse, offers minimal help. Therefore, it seems to be less frequently used, especially not for model compression. Conversely, L2 regularization is much more commonly employed during network training.
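For reference, here is a small sketch of the two penalty terms side by side (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def l1_penalty(w, lambd, m):
    # (lambda / m) * ||w||_1 : sum of absolute values; tends to drive weights to exactly zero
    return (lambd / m) * np.sum(np.abs(w))

def l2_penalty(w, lambd, m):
    # (lambda / 2m) * ||w||_2^2 : sum of squares; shrinks weights smoothly toward zero
    return (lambd / (2 * m)) * np.sum(np.square(w))
```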

Section 4- Role of Lambda in Regularization

Lambda is known as the regularization parameter. Typically, it is determined using your development set or through hold-out cross-validation. You experiment with various values to find the optimal balance between performance on your training set and maintaining a small norm of your parameters, which aids in preventing overfitting.

Thus, lambda represents an additional hyperparameter that may require adjustment.
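As a rough illustration of that tuning loop, here is a self-contained NumPy sketch that trains L2-regularized logistic regression for several candidate values of lambda and keeps the one that performs best on a held-out dev set. The synthetic data, candidate values, and helper function are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary-classification data, purely for illustration.
X_train, X_dev = rng.normal(size=(20, 200)), rng.normal(size=(20, 50))
y_train = (X_train[0] + 0.1 * rng.normal(size=200) > 0).astype(float).reshape(1, -1)
y_dev = (X_dev[0] > 0).astype(float).reshape(1, -1)

def train_and_score(lambd, lr=0.1, iters=500):
    """Train L2-regularized logistic regression by gradient descent; return dev accuracy."""
    n_x, m = X_train.shape
    w, b = np.zeros((n_x, 1)), 0.0
    for _ in range(iters):
        a = 1.0 / (1.0 + np.exp(-(w.T @ X_train + b)))        # predictions on the training set
        dw = X_train @ (a - y_train).T / m + (lambd / m) * w   # gradient plus the L2 term
        db = np.mean(a - y_train)
        w, b = w - lr * dw, b - lr * db
    dev_pred = (1.0 / (1.0 + np.exp(-(w.T @ X_dev + b))) > 0.5).astype(float)
    return np.mean(dev_pred == y_dev)

# Keep the lambda that gives the best dev-set accuracy.
best_lambda = max([0.0, 0.01, 0.1, 1.0, 10.0], key=train_and_score)
```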

Section 5- L2 Regularization for Neural Networks

In a neural network, you have a cost function that is a function of all of your parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in your neural network. The cost function is the sum of the losses over your m training examples, and to add regularization you add λ/2m times the sum, over all of your weight matrices w[l], of their squared norms:

J(w[1], b[1], …, w[L], b[L]) = (1/m) Σᵢ L(ŷ(i), y(i)) + (λ/2m) Σₗ ‖w[l]‖_F²

Here the squared norm of a matrix is defined as the sum over i and j of each element of that matrix squared:

‖w[l]‖_F² = Σᵢ Σⱼ (w_ij[l])²

where i runs from 1 to n[l] and j runs from 1 to n[l−1], because w[l] is an n[l] × n[l−1] dimensional matrix.
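As a small sketch, this regularization term can be computed like so in NumPy, assuming (for illustration only) that the weight matrices are stored in a dictionary under the keys "W1", …, "WL":

```python
import numpy as np

def l2_regularization_cost(parameters, lambd, m, L):
    """(lambda / 2m) times the sum of squared Frobenius norms of W1..WL."""
    frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return (lambd / (2 * m)) * frobenius_sum
```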

In neural networks, L2 regularization is applied to the weights to prevent overfitting. It is also known as weight decay because it shrinks the weights during training.

Figure 2

Here, n[l−1] is the number of units in the previous layer (the inputs to layer l), and n[l] is the number of units in layer l; these are the two dimensions of the weight matrix w[l].

This matrix norm is known as the Frobenius norm, indicated by an ‘F’ in the subscript. For intricate reasons related to linear algebra, it is not referred to as the L2 norm of a matrix. Rather, it is conventionally called the Frobenius norm. While it might seem more intuitive to name it the L2 norm, traditional conventions dictate otherwise for reasons that are not essential for understanding its application.

It just means the sum of the squares of the elements of a matrix. So how do you implement gradient descent with this? Previously, you would compute dw[l] using backprop, where backprop gives you the partial derivative of J with respect to w[l], and then you would update w[l] as w[l] minus the learning rate times dw[l]. That was before we added the extra regularization term to the objective. Now that the regularization term is part of the objective, you take dw[l] and add to it λ/m times w[l]:

dw[l] = (term from backprop) + (λ/m) w[l]
w[l] := w[l] − α · dw[l]

and then you compute the update the same as before. It turns out that this new dw[l] is still a correct definition of the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end.
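A minimal sketch of this per-layer update, assuming the backprop gradients are already available (the function and variable names are illustrative):

```python
def update_layer_with_l2(W, b, dW_backprop, db_backprop, alpha, lambd, m):
    """One gradient-descent step for layer l with L2 regularization."""
    dW = dW_backprop + (lambd / m) * W   # add the lambda/m * W[l] term to the backprop gradient
    db = db_backprop                     # the bias is typically not regularized
    W = W - alpha * dW                   # usual update with learning rate alpha
    b = b - alpha * db
    return W, b
```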

Indeed, L2 regularization is often referred to as weight decay because it effectively shrinks the weights during the training process.

If you take this definition of dw[l] and plug it into the update, you see that w[l] gets updated as w[l] minus the learning rate α times the term from backprop plus λ/m times w[l]. Distributing the minus sign, this is

w[l] := w[l] − α(λ/m) w[l] − α · (term from backprop)
      = (1 − αλ/m) w[l] − α · (term from backprop)

So whatever the matrix w[l] is, you are going to make it a little bit smaller: it is as if you take the matrix w[l] and multiply it by 1 − αλ/m, a number slightly less than 1, before applying the ordinary gradient step. This is why L2 regularization is also called weight decay. It is just like ordinary gradient descent, where you update w by subtracting α times the gradient you got from backprop, except that you also multiply w by this factor slightly less than 1.
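The equivalence of the two forms of the update can be checked numerically with a short self-contained snippet (the random matrices and constants are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
W, dW_backprop = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
alpha, lambd, m = 0.1, 0.7, 100

# Form 1: gradient descent on the regularized objective.
W_update_1 = W - alpha * (dW_backprop + (lambd / m) * W)

# Form 2: first decay W by the factor (1 - alpha * lambda / m), then apply the usual backprop step.
W_update_2 = (1 - alpha * lambd / m) * W - alpha * dW_backprop

assert np.allclose(W_update_1, W_update_2)   # the two updates are identical
```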

Please follow and 👏 clap for Coursesteach to see the latest updates on this story.

🚀 Elevate Your Data Skills with Coursesteach! 🚀

Ready to dive into Python, Machine Learning, Data Science, Statistics, Linear Algebra, Computer Vision, and Research? Coursesteach has you covered!

🔍 Python, 🤖 ML, 📊 Stats, ➕ Linear Algebra, 👁️‍🗨️ Computer Vision, 🔬 Research — all in one place!

Don’t Miss Out on This Exclusive Opportunity to Enhance Your Skill Set! Enroll Today 🌟 at

Neural Networks and Deep Learning course

Improving Deep Neural Network course

🔍 Explore cutting-edge tools and Python libraries, access insightful slides and source code, and tap into a wealth of free online courses from top universities and organizations. Connect with like-minded individuals on Reddit, Facebook, and beyond, and stay updated with our YouTube channel and GitHub repository. Don’t wait — enroll now and unleash your Deep Learning potential!

Stay tuned for our upcoming articles, where we will explore specific topics related to Deep Learning in more detail!

Remember, learning is a continuous process. So keep learning, keep creating, and keep sharing with others!💻✌️

📚GitHub Repository

📝Notebook

Ready to dive into data science and AI but unsure how to start? I’m here to help! Offering personalized research supervision and long-term mentoring. Let’s chat on Skype: themushtaq48 or email me at mushtaqmsit@gmail.com. Let’s kickstart your journey together!

Contribution: We would love your help in making the Coursesteach community even better! If you want to contribute to some courses, or if you have any suggestions for improving any Coursesteach content, feel free to reach out and follow.

Together, let’s make this the best AI learning Community! 🚀

👉WhatsApp

👉 Facebook

👉Github

👉LinkedIn

👉Youtube

👉Twitter

Source

1-Improving Deep Neural Network
