Machine Learning (Part 31) - Cost Function and Regularization

Coursesteach
Apr 23, 2024

📚 Chapter 6: Regularization

Introduction

In this blog, I’d like to convey the main intuitions behind how regularization works, and to write down the cost function we use when applying regularization. With the hand-drawn examples on these slides, I think I can convey part of the intuition. But an even better way to see for yourself how regularization works is to implement it and watch it work. If you do the appropriate exercises after this, you’ll get the chance to see regularization in action for yourself.

Sections

Intuition
Regularization
Regularization parameter is set to be very large

Section 1- Intuition

So, here is the intuition. In the previous blog, we saw that if we fit a quadratic function to this data, it gives us a pretty good fit. Whereas if we fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but overfits the data and does not generalize well to new examples.

Machine Learning — Andrew

Consider the following: suppose we were to penalize the parameters theta 3 and theta 4 and make them really small. Here’s what I mean. Here is our optimization objective, where we minimize the usual squared-error cost function. Let’s say I take this objective, modify it, and add to it plus 1000 times theta 3 squared, plus 1000 times theta 4 squared, where 1000 is just some huge number. Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and theta 4 are small, right? Because otherwise, if you have a thousand times theta 3 squared, this new cost function is going to be big. So when we minimize this new function, we are going to end up with theta 3 close to 0 and theta 4 close to 0, as if we were getting rid of those two terms. And if theta 3 and theta 4 are close to 0, then we are left with a quadratic function plus, maybe, tiny contributions from the small theta 3 and theta 4 terms. So we end up with essentially a quadratic function, which is good, because it is a much better hypothesis. In this particular example, we looked at the effect of penalizing two of the parameter values for being large.
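
To make this concrete, here is a minimal Python sketch of the modified objective (the function name and feature layout are my own illustration, not code from the course). The design matrix is assumed to hold the polynomial features [1, x, x², x³, x⁴], so theta has entries theta 0 through theta 4:

```python
import numpy as np

def penalized_cost(theta, X, y):
    """Squared-error cost plus huge penalties on theta_3 and theta_4.

    X is an m x 5 design matrix with columns [1, x, x^2, x^3, x^4],
    so theta holds the parameters theta_0 .. theta_4.
    """
    m = len(y)
    residuals = X @ theta - y
    squared_error = np.sum(residuals ** 2) / (2 * m)
    # Penalize theta_3 and theta_4 so heavily that the minimizer drives
    # them toward zero, leaving an essentially quadratic hypothesis.
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2
    return squared_error + penalty
```

Minimizing this function forces theta 3 and theta 4 toward zero, because any other choice makes the 1000-times terms dominate the cost.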

Section 2- Regularization

More generally, here is the idea behind regularization: having small values for the parameters usually corresponds to having a simpler hypothesis. In our last example, we penalized just theta 3 and theta 4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. More broadly, if we penalize all of the parameters, we can think of that as also trying to give us a simpler hypothesis, because shrinking the parameters plays a role similar to setting the high-order terms close to zero, which in that example gave us a quadratic function. It is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions, which are therefore less prone to overfitting. I realize the reasoning for why having all the parameters be small corresponds to a simpler hypothesis may not be entirely clear to you right now, and it is hard to explain unless you implement it and see it for yourself. But I hope the example of making theta 3 and theta 4 small, and how that gave us a simpler hypothesis, at least offers some intuition as to why this might be true.
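
As a quick numerical illustration (my own sketch, with made-up coefficient values), you can check that shrinking theta 3 and theta 4 makes a degree-4 hypothesis visibly smoother. Here smoothness is crudely measured by how much the curve’s second differences vary:

```python
import numpy as np

x = np.linspace(0, 2, 200)

def h(theta, x):
    """Degree-4 polynomial hypothesis h_theta(x) = sum_j theta_j * x^j."""
    return sum(t * x ** j for j, t in enumerate(theta))

wiggly = h([1.0, 1.0, 1.0, 4.0, -2.0], x)      # large theta_3, theta_4
smooth = h([1.0, 1.0, 1.0, 0.01, -0.005], x)   # shrunk theta_3, theta_4

# Crude smoothness measure: the spread of the second differences
# (a discrete second derivative). A smaller spread means a smoother curve.
for name, yhat in [("wiggly", wiggly), ("smooth", smooth)]:
    curvature = np.diff(yhat, n=2)
    print(name, curvature.max() - curvature.min())
```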

Machine Learning — Andrew

Let’s look at a specific example. For housing price prediction, we may have the hundred features we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And unlike the polynomial example, we don’t know in advance that theta 3 and theta 4 are the high-order polynomial terms. If we just have a set of a hundred features, it is hard to pick in advance which ones are less likely to be relevant. So we have a hundred, or a hundred and one, parameters, and we don’t know which ones to try to shrink. So in regularization, what we’re going to do is take our cost function (here, the cost function for linear regression) and modify it to shrink all of the parameters, because we don’t know which one or two to shrink. I am going to add an extra regularization term at the end to shrink every single parameter: this term shrinks all of my parameters theta 1, theta 2, theta 3, up to theta 100. By the way, by convention the summation starts from one, so we do not actually penalize theta 0 for being large. That is just the convention: the sum runs from j equals one through n, rather than from zero through n. In practice, whether or not you include theta 0 makes very little difference to the results, but by convention we usually regularize only theta 1 through theta 100. So let’s write down our regularized optimization objective, our regularized cost function.

Here it is: J of theta, where the term on the right is the regularization term, and lambda is called the regularization parameter. What lambda does is control a trade-off between two different goals. The first goal, captured by the first term of the objective, is that we would like to fit the training set well.

Machine Learning — Andrew

And the second goal is that we want to keep the parameters small, and that is captured by the second term, the regularization term. What the regularization parameter lambda does is control the trade-off between these two goals: between fitting the training set well, and keeping the parameters small, and therefore keeping the hypothesis relatively simple to avoid overfitting.
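
As a sketch of what this looks like in code (my own illustration, following the lecture’s convention of placing the regularization term inside the 1/(2m) factor and starting the sum at j = 1):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost:
    J(theta) = (1 / 2m) * [ sum((X@theta - y)^2) + lam * sum_{j>=1} theta_j^2 ]
    """
    m = len(y)
    residuals = X @ theta - y                 # goal 1: fit the training set
    fit_term = np.sum(residuals ** 2)
    reg_term = lam * np.sum(theta[1:] ** 2)   # goal 2: keep parameters small
    return (fit_term + reg_term) / (2 * m)    # theta_0 is not penalized
```

A larger lam puts more weight on the second goal; lam = 0 recovers the ordinary squared-error cost.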

For our housing price prediction example, whereas previously a very high-order polynomial may have given us a very wiggly or curvy function, if you still fit a high-order polynomial with all the polynomial features in there, but use this regularized objective instead, then what you get out is a curve that isn’t quite a quadratic function, but is much smoother and much simpler, maybe like the magenta line, which gives a much better hypothesis for this data.

Once again, I realize it can be a bit difficult to see why shrinking the parameters has this effect, but if you implement regularization yourself, you will be able to see the effect firsthand.

Section 3- Regularization parameter is set to be very large

In regularized linear regression, if the regularization parameter lambda is set to be very large, then we end up penalizing the parameters theta 1, theta 2, theta 3, and theta 4 very heavily. That is, if our hypothesis is the one shown at the bottom of the slide, and we penalize theta 1 through theta 4 very heavily, then all of these parameters end up close to zero, right?

Theta 1 will be close to zero; theta 2 will be close to zero; theta 3 and theta 4 will end up close to zero. And if we do that, it is as if we are getting rid of those terms in the hypothesis, so we are left with a hypothesis that says housing prices are simply equal to theta 0. That is akin to fitting a flat horizontal straight line to the data, and it is an example of underfitting: this straight line fails to fit the training set well; it doesn’t go anywhere near most of the training examples. Another way of saying this is that the hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta 0, and despite the clear data to the contrary, it fits a flat horizontal line. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda. When we talk about model selection later in this course, we will discuss a variety of ways of automatically choosing lambda. So that is the idea behind regularization and the cost function we use in order to apply it. In the next two blogs, let’s take these ideas and apply them to linear regression and to logistic regression, so that we can get them to avoid overfitting.
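
To see the underfitting effect firsthand, here is a small self-contained experiment (my own sketch; it uses the closed-form regularized normal equation as one standard way to solve regularized linear regression). With a huge lambda, every parameter except theta 0 is driven to essentially zero, and the fit collapses to a flat horizontal line:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1 + 3 * x + rng.normal(0, 0.1, size=x.shape)   # roughly linear data

# Design matrix with intercept, linear, and quadratic features.
X = np.column_stack([np.ones_like(x), x, x ** 2])

def ridge_fit(X, y, lam):
    """Closed-form regularized normal equation; theta_0 is not penalized."""
    L = lam * np.eye(X.shape[1])
    L[0, 0] = 0.0                 # by convention, skip the intercept
    return np.linalg.solve(X.T @ X + L, X.T @ y)

print(ridge_fit(X, y, lam=0.0))   # ordinary fit: recovers slope ~3
print(ridge_fit(X, y, lam=1e6))   # huge lambda: theta_1, theta_2 ~ 0 (flat line)
```

With lam = 1e6, the fitted theta 0 lands near the mean of y, which is exactly the flat horizontal line described above.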

Key Points:

  1. Intuition Behind Regularization:
  • Penalizing specific parameters, such as theta 3 and theta 4, to be very small leads to simpler hypotheses.
  • Smaller parameter values correspond to smoother functions, reducing the risk of overfitting.
  2. Regularization Process:
  • Regularization aims to keep parameter values small by adding a regularization term to the cost function.
  • The regularization parameter lambda balances the trade-off between fitting the training data well and keeping parameters small.
  • By penalizing all parameters, regularization promotes simpler hypotheses that generalize better.
  3. Application in Housing Price Prediction:
  • Regularization helps avoid overfitting in housing price prediction by producing smoother curves instead of overly complex ones.
  • Careful selection of the regularization parameter lambda is crucial for effective regularization.
  4. Effect of a Large Regularization Parameter:
  • Setting the regularization parameter lambda to be very large results in heavy penalization of all parameters.
  • This can lead to underfitting, where the hypothesis becomes overly simplistic and fails to fit the training data well.
  5. Choosing the Regularization Parameter:
  • Methods for automatically selecting the regularization parameter lambda will be discussed later in the course.

Please follow and 👏 clap for Coursesteach to see the latest updates on this story.

🚀 Elevate Your Data Skills with Coursesteach! 🚀

Ready to dive into Python, Machine Learning, Data Science, Statistics, Linear Algebra, Computer Vision, and Research? Coursesteach has you covered!

🔍 Python, 🤖 ML, 📊 Stats, ➕ Linear Algebra, 👁️‍🗨️ Computer Vision, 🔬 Research — all in one place!

Enroll now for top-tier content and kickstart your data journey!

Understanding of Machine Learning Course!

Stay tuned for our upcoming articles, where we will explore specific topics related to Machine Learning in more detail. We research end to end!

Remember, learning is a continuous process. So keep learning, keep creating, and keep sharing with others! 💻✌️

Note: if you are a Machine Learning expert and have some good suggestions to improve this blog, please share them in the comments and contribute.

If you want more updates about Machine Learning and would like to contribute, then follow and enroll in the following:

👉Course: Machine Learning (ML)

👉📚GitHub Repository

👉 📝Notebook

Ready to dive into data science and AI but unsure how to start? I’m here to help! Offering personalized research supervision and long-term mentoring. Let’s chat on Skype: themushtaq48 or email me at mushtaqmsit@gmail.com. Let’s kickstart your journey together!

Contribution: We would love your help in making the Coursesteach community even better! If you want to contribute to some courses, or if you have any suggestions for improving any Coursesteach content, feel free to contact and follow us.

Together, let’s make this the best AI learning Community! 🚀

👉WhatsApp

👉 Facebook

👉Github

👉LinkedIn

👉Youtube

👉Twitter

Source

In this blog post, we’ve gathered information from different places, such as machine learning courses, research papers, technical blogs, official documents, and YouTube videos. We’ve given credit to each source under the related pictures, and we’ve included links to those sources.

[1] Machine Learning — Andrew Ng (Coursera)
