Gradient Descent Algorithm

Gradient descent is used all over the place in machine learning, not just for linear regression but also for training some of the most advanced neural network models, also called deep learning models.

Sakshi Jain
11 min read · Jun 7, 2023

You have the cost function J(w, b) right here that you want to minimize. In the example we’ve seen so far, this is the cost function for linear regression, but it turns out that gradient descent is an algorithm you can use to try to minimize any function, not just a cost function for linear regression. In fact, gradient descent applies to more general functions, including cost functions for models with more than two parameters. For instance, if you have a cost function J as a function of w_1, w_2, up to w_n and b, your objective is to minimize J over the parameters w_1 through w_n and b. In other words, you want to pick values for w_1 through w_n and b that give you the smallest possible value of J.
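To make this concrete, here is a minimal sketch in Python of the squared error cost for linear regression with a single feature. The function name `compute_cost` and the 1/(2m) scaling are common conventions rather than anything fixed by this article, and the same idea extends to parameters w_1 through w_n.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for linear regression with one feature.

    x: (m,) array of inputs, y: (m,) array of targets,
    w, b: scalar parameters of the model f(x) = w*x + b.
    """
    m = x.shape[0]
    errors = (w * x + b) - y          # prediction minus target, per example
    return np.sum(errors ** 2) / (2 * m)
```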

It turns out that gradient descent is an algorithm you can apply to try to minimize this cost function J as well. What you’re going to do is start with some initial guesses for w and b. In linear regression, it won’t matter too much what the initial values are, so a common choice is to set them both to 0. For example, you can set w to 0 and b to 0 as the initial guess. With the gradient descent algorithm, you’ll keep on changing the parameters w and b a bit every time to try to reduce the cost J(w, b), until hopefully J settles at or near a minimum.
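In outline, the whole procedure is just a loop. Here is a rough sketch, where `compute_gradients` is an assumed helper standing in for the derivative computation discussed later, and the iteration cap and tolerance are arbitrary illustrative choices:

```python
w, b = 0.0, 0.0                  # common initial guess: set both to zero
alpha = 0.01                     # learning rate (more on this below)

for _ in range(10_000):          # cap the number of steps
    dj_dw, dj_db = compute_gradients(x, y, w, b)   # assumed helper
    w, b = w - alpha * dj_dw, b - alpha * dj_db    # adjust both a bit
    if abs(dj_dw) < 1e-8 and abs(dj_db) < 1e-8:    # near a minimum,
        break                                      # gradients are ~0
```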

Dependency on w and b

One thing I should note is that for some functions J that don’t have a bow shape or a hammock shape, there can be more than one possible minimum.

Let’s take a look at an example of a more complex surface plot J to see what the gradient is doing. This function is not a squared error cost function. For linear regression with the squared error cost function, you always end up with a bow shape or a hammock shape, but this is the type of cost function you might get if you’re training a neural network model. Notice the axes: w and b are on the bottom axes. For different values of w and b, you get different points on this surface, J(w, b), where the height of the surface at a given point is the value of the cost function. Now, let’s imagine that this surface plot is actually a view of a slightly hilly outdoor park or a golf course, where the high points are hills and the low points are valleys. I’d like you to imagine, if you will, that you are physically standing at a point on one of these hills. If it helps you to relax, imagine that there’s lots of really nice green grass and butterflies and flowers; it’s a nice hill. Your goal is to start up here and get to the bottom of one of these valleys as efficiently as possible.

What the gradient descent algorithm does is this: you’re going to spin around 360 degrees, look around, and ask yourself, if I were to take a tiny little baby step in one direction, and I want to go downhill as quickly as possible toward one of these valleys, what direction do I choose to take that baby step? Well, if you want to walk down this hill as efficiently as possible, it turns out that if you’re standing at this point on the hill and you look around, you will notice that the best direction to take your next step downhill is roughly that direction. Mathematically, this is the direction of steepest descent. It means that a tiny little baby step in this direction takes you downhill faster than a tiny little baby step in any other direction.
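To tie the metaphor back to the math: the direction of steepest descent is the direction of the negative gradient. Here is a small numerical check on a made-up hilly surface; the function `f`, the chosen point, and the sampled directions are all illustrative assumptions.

```python
import numpy as np

def f(w, b):
    """A toy non-convex 'hilly' surface, just for illustration."""
    return np.sin(3 * w) * np.cos(3 * b) + 0.1 * (w ** 2 + b ** 2)

def grad_f(w, b, eps=1e-6):
    """Numerical gradient of f at (w, b)."""
    dw = (f(w + eps, b) - f(w - eps, b)) / (2 * eps)
    db = (f(w, b + eps) - f(w, b - eps)) / (2 * eps)
    return np.array([dw, db])

point = np.array([0.5, -0.3])
g = grad_f(*point)
step = 1e-3

# Try baby steps in 8 compass directions; none drops f faster
# than the step along the negative gradient.
for theta in np.linspace(0, 2 * np.pi, 8, endpoint=False):
    d = np.array([np.cos(theta), np.sin(theta)])
    print(f"angle {theta:4.2f}: change in f = {f(*(point + step * d)) - f(*point):+.7f}")

d_best = -g / np.linalg.norm(g)
print(f"steepest descent: change in f = {f(*(point + step * d_best)) - f(*point):+.7f}")
```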

Multiple steps towards local minima

After taking this first step, you’re now at this point on the hill over here. Now let’s repeat the process. Standing at this new point, you’re going to again spin around 360 degrees and ask yourself, in what direction will I take the next little baby step in order to move downhill? If you do that and take another step, you end up moving a bit in that direction, and you can keep going. From this new point, you can again look around and decide what direction would take you downhill most quickly. Take another step, another step, and so on, until you find yourself at the bottom of this valley, at this local minimum, right here. What you just did was go through multiple steps of gradient descent. It turns out gradient descent has an interesting property. Remember that you can choose a starting point on the surface by choosing starting values for the parameters w and b. Now, imagine you try gradient descent again, but this time you choose a different starting point by choosing parameters that place your starting point just a couple of steps to the right over here. If you then repeat the gradient descent process, which means you look around, take a little step in the direction of steepest descent, and so end up here, then again look around, take another step, and so on. If you were to run gradient descent this second time, starting just a couple of steps to the right of where we did it the first time, you would end up in a totally different valley: this different minimum over here on the right.

Gradient Descent for the second time

The bottoms of both the first and the second valleys are called local minima, because if you start going down the first valley, gradient descent won’t lead you to the second valley, and the same is true if you start going down the second valley: you stay in that second minimum and never find your way into the first local minimum.
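You can see this property in a few lines of Python. Below, the same update rule is run from two nearby starting points on a made-up one-dimensional cost with two valleys; the function, starting points, and learning rate are illustrative assumptions, not a regression cost.

```python
import numpy as np

def dJ_dw(w):
    """Derivative of the toy cost J(w) = sin(3*w) + 0.1 * w**2."""
    return 3 * np.cos(3 * w) + 0.2 * w

def descend(w, alpha=0.01, steps=2000):
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

print(descend(w=0.0))   # settles in one valley (near w = -0.5)
print(descend(w=1.0))   # starts a bit to the right, settles in another (near w = 1.5)
```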

Implementing Gradient Descent

Let’s take a look at how you can implement the gradient descent algorithm. On each step, the parameter w is updated to the old value of w minus alpha times the term d/dw of the cost function J(w, b). What this expression is saying is: you update your parameter w by taking its current value and adjusting it a small amount, which is the expression on the right, minus alpha times this term. The symbol here is the Greek letter alpha. In this equation, alpha is also called the learning rate. The learning rate is usually a small positive number between 0 and 1, and it might be, say, 0.01. What alpha does is basically control how big a step you take downhill. If alpha is very large, that corresponds to a very aggressive gradient descent procedure where you’re trying to take huge steps downhill. The derivative term that I drew a magenta box around is telling you in which direction you want to take your baby step. In combination with the learning rate alpha, it also determines the size of the steps you take downhill.

The learning rate alpha and the derivative term

Remember your model has two parameters, not just w but also b. You also have an assignment operation to update the parameter b that looks very similar: b is assigned the old value of b minus the learning rate alpha times the slightly different derivative term d/db of J(w, b). Remember the graph of the surface plot where you’re taking baby steps until you get to the bottom of the valley? Well, for the gradient descent algorithm, you’re going to repeat these two update steps until the algorithm converges. By converges, I mean that you reach a point at a local minimum where the parameters w and b no longer change much with each additional step you take. Now, there’s one more subtle detail about how to correctly implement gradient descent. You’re going to update two parameters, w and b, and one important detail is that you want to simultaneously update w and b, meaning you want to update both parameters at the same time. What I mean by that is that you’re going to update w from the old w to a new w, and you’re also updating b from its old value to a new value of b. The way to implement this is to compute the right-hand sides, computing this expression for both w and b, and then at the same time update w and b to the new values. Let’s take a look at what this means. Here’s the correct way to implement gradient descent, which does a simultaneous update. This sets a variable temp_w equal to that expression, which is w minus that term here. It also sets another variable temp_b to b minus that term. You compute both right-hand sides, both updates, and store them into the variables temp_w and temp_b. Then you copy the value of temp_w into w, and you also copy the value of temp_b into b. One thing you may notice is that this value of w is from before w gets updated: the pre-update w is what goes into the derivative term over here.
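Here is what that correct, simultaneous update looks like in Python for one step, using the standard squared error derivatives for one-feature linear regression; the function name and array shapes are illustrative choices.

```python
import numpy as np

def gradient_step(x, y, w, b, alpha):
    """One simultaneous gradient descent step for f(x) = w*x + b
    with the squared error cost J(w, b)."""
    m = x.shape[0]
    errors = (w * x + b) - y
    dj_dw = np.sum(errors * x) / m   # d/dw of J(w, b)
    dj_db = np.sum(errors) / m       # d/db of J(w, b)

    temp_w = w - alpha * dj_dw       # both right-hand sides use the
    temp_b = b - alpha * dj_db       # same pre-update w and b...
    return temp_w, temp_b            # ...then both are updated together
```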

Correct and Incorrect method

In contrast, here is an incorrect implementation of gradient descent that does not do a simultaneous update. In this incorrect implementation, we compute temp_w the same as before, so far that’s okay. Now here’s where things start to differ: we then update w with the value in temp_w before calculating the new value for the other parameter, b. Next, we calculate temp_b as b minus that term here, and finally, we update b with the value in temp_b. The difference between the right-hand side and the left-hand side implementations is that if you look over here, this w has already been updated to its new value, and this updated w is what actually goes into the derivative of the cost function J(w, b). It means that this term here on the right is not the same as this term over here that you see on the left. That also means the temp_b term on the right is not quite the same as the temp_b term on the left, and thus the updated value for b on the right is not the same as the updated value for b on the left. In the way that gradient descent is implemented in code, it turns out to be more natural to implement it the correct way, with simultaneous updates. When you hear someone talk about gradient descent, they always mean gradient descent where you perform a simultaneous update of the parameters. If, however, you were to implement a non-simultaneous update, it turns out it would probably work more or less anyway. But that isn’t the correct way to implement gradient descent; it’s actually some other algorithm with different properties. I would advise you to stick to the correct simultaneous update and not use the incorrect version on the right. That’s gradient descent.
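And for contrast, here is the same step written the incorrect, non-simultaneous way; the math is identical, only the ordering is wrong (same illustrative setup as the sketch above).

```python
import numpy as np

def gradient_step_wrong(x, y, w, b, alpha):
    """Incorrect: w is overwritten before b's derivative is computed."""
    m = x.shape[0]
    temp_w = w - alpha * np.sum(((w * x + b) - y) * x) / m
    w = temp_w                       # w updated too early, so the
    errors = (w * x + b) - y         # errors below use the *new* w,
    temp_b = b - alpha * np.sum(errors) / m   # not the pre-update one
    b = temp_b
    return w, b
```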

The choice of the learning rate alpha will have a huge impact on the efficiency of your implementation of gradient descent. And if alpha, the learning rate, is chosen poorly, gradient descent may not even work at all.

W is updated to be W minus the learning rate alpha times the derivative term. To learn more about what the learning rate alpha is doing, let’s see what could happen if alpha is either too small or too large. For the case where the learning rate is too small: here’s a graph where the horizontal axis is W and the vertical axis is the cost J, and here’s the graph of the function J of W. Let’s start gradient descent at this point here. If the learning rate is too small, then what happens is that you multiply your derivative term by some really, really small number alpha, like 0.0000001. And so you end up taking a very small baby step like that. Then from this point, you take another tiny, tiny little baby step. But because the learning rate is so small, the second step is also just minuscule. The outcome of this process is that you do end up decreasing the cost J, but incredibly slowly.
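To make the too-small case concrete, here is a tiny experiment on the simplest possible cost, J(w) = w², whose derivative is 2w; the helper `run` and the alpha values are illustrative assumptions.

```python
def run(alpha, steps=20, w=5.0):
    """Run gradient descent on J(w) = w**2, which has its minimum at w = 0."""
    for _ in range(steps):
        w = w - alpha * 2 * w        # derivative of w**2 is 2*w
    return w

print(run(alpha=0.0000001))  # too small: w barely moves from 5.0
print(run(alpha=0.1))        # a reasonable alpha for comparison: w ends near 0
```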

Gradient Descent if Alpha is too small

What happens if the learning rate is too large? Here’s another graph of the cost function, and let’s say we start gradient descent with W at this value here, so it’s actually already pretty close to the minimum, and the next update moves W to the right. But if the learning rate is too large, then you update W with a giant step, all the way over here, to this point here on the function J. So you move from this point on the left all the way to this point on the right, and now the cost has actually gotten worse: it started out at this value here, and after one step it actually increased to this value here. Now the derivative at this new point says to decrease W, but when the learning rate is too big, you may take a huge step going from here all the way out here. So now you’ve gotten to this point here, and again, if the learning rate is too big, you take another huge step and way overshoot the minimum again. So now you’re at this point on the right, and one more time you do another update and end up all the way over here. As you may notice, you’re actually getting further and further away from the minimum. So if the learning rate is too large, then gradient descent may overshoot and may never reach the minimum. Another way to say that is that gradient descent may fail to converge, and may even diverge.
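Reusing the `run` helper from the sketch above, an oversized alpha shows exactly this overshooting behavior:

```python
print(run(alpha=1.1))            # each step multiplies w by (1 - 2.2) = -1.2,
                                 # so w flips sign and grows: overshooting
print(run(alpha=1.1, steps=40))  # more steps take you even further away:
                                 # gradient descent diverges
```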

Gradient descent can reach a local minimum even with a fixed learning rate alpha, because as you approach a local minimum, the derivative gets smaller, so the update steps automatically get smaller even though alpha stays fixed.

