GRADIENT DESCENT

Akshita Guru
4 min read · Mar 19, 2024


Welcome back! As we have already covered, the cost function is a key component of supervised learning; check out that article here if you haven’t already. In this article, we are going to talk about gradient descent.

Gradient descent is used all over the place in machine learning, not just for linear regression (if you don’t know about linear regression, you can check here), but also for training some of the most advanced neural network models, also called deep learning models.

To make this discussion of gradient descent more general, it turns out that gradient descent applies to more general functions, including cost functions for models with more than two parameters. For instance, if you have a cost function J of several parameters, your objective is to minimize J over those parameters. In other words, you want to pick the parameter values that give you the smallest possible value of J. It turns out that gradient descent is an algorithm you can apply to try to minimize this cost function J as well.
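For concreteness, here is a minimal Python sketch of one such cost function: the squared error cost J(w, b) for a simple linear model, as discussed in the earlier article. The function name and the tiny dataset are just illustrative, not taken from the article.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for a linear model f(x) = w*x + b."""
    m = x.shape[0]                          # number of training examples
    predictions = w * x + b                 # model predictions f_wb(x)
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / (2 * m)   # J(w, b) = (1/2m) * sum of squared errors

# Evaluate the cost for one candidate choice of (w, b)
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])
print(compute_cost(x_train, y_train, w=1.5, b=0.0))
```

Gradient descent's job is to find the values of w and b that make a function like this as small as possible.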

What you’re going to do is just start off with some initial guesses for the parameters. In linear regression, it won’t matter too much what the initial values are, so a common choice is to set them both to 0.

With the gradient descent algorithm, you keep changing the parameters a little bit every time to try to reduce the cost J, until hopefully J settles at or near a minimum.

One thing I should note is that for some functions J, there can be more than one possible minimum.

Let’s look at an example of a more complex surface plot of J to see what gradient descent is doing. This function is not a squared error cost function; for linear regression with the squared error cost function, you always end up with a bowl shape or a hammock shape. But this is the type of cost function you might get when training a neural network model. Notice the axes: the parameters w and b are on the bottom axes. For different values of w and b, you get different points on this surface, J of w, b, where the height of the surface at a point is the value of the cost function.

What the gradient descent algorithm does is this: imagine you are standing on that surface. You spin around 360 degrees, look around, and ask yourself, if I were to take a tiny little step in some direction and I want to go downhill as quickly as possible toward one of these valleys, which direction should it be? Mathematically, the best direction to take your next step downhill is the direction of steepest descent.
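In symbols, this is a standard fact from calculus rather than anything special to this example: the direction of steepest descent at a point is the direction of the negative gradient of J, so each gradient descent step moves against the gradient.

```latex
-\nabla J(w, b) \;=\; -\left( \frac{\partial J}{\partial w}, \; \frac{\partial J}{\partial b} \right)
```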

Take another step, another step, and so on, until you find yourself at the bottom of this valley, at this local minimum.

Remember that you can choose a starting point on the surface by choosing starting values for the parameters w and b.

Now let’s look at how you can implement the gradient descent algorithm. Let me write down the gradient descent update, starting with the parameter w.
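Written out in the usual notation, the update for w is:

```latex
w := w - \alpha \, \frac{\partial}{\partial w} J(w, b)
```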

Notice that we’re assigning w a new value using this equal sign, so in this context, the equal sign is the assignment operator, not an assertion that the two sides are equal.

In this equation, Alpha is called the learning rate. The learning rate is usually a small positive number between 0 and 1; it might be, say, 0.01. What Alpha does is control how big a step you take downhill. The derivative term, in combination with the learning rate Alpha, determines the size of the step you take downhill.

Remember that your model has two parameters, not just w but also b. You also have an assignment operation that updates the parameter b and looks very similar: b is assigned the old value of b minus the learning rate Alpha times the slightly different derivative term, d/db of J of w, b.
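In the same notation, the update for b is:

```latex
b := b - \alpha \, \frac{\partial}{\partial b} J(w, b)
```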

One important detail is that for gradient descent, you want to simultaneously update w and b, meaning you update both parameters at the same time. What I mean by that is that you’re going to update w from its old value to a new value, and you’re also updating b from its old value to a new value. The way to implement this is to first compute both right-hand sides using the old values of w and b, and only then assign the new values to w and b.

Here’s the correct way to implement gradient descent, which does a simultaneous update.
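Below is a minimal Python sketch of such a simultaneous update, assuming a linear model f(x) = w*x + b with the squared error cost; the gradient formulas follow from that cost, and the function names and toy data are just for illustration.

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared error cost J(w, b) with respect to w and b."""
    m = x.shape[0]
    errors = (w * x + b) - y            # f_wb(x) - y for every training example
    dj_dw = (errors * x).sum() / m      # dJ/dw
    dj_db = errors.sum() / m            # dJ/db
    return dj_dw, dj_db

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, num_iters=1000):
    """Run gradient descent with a simultaneous update of w and b."""
    for _ in range(num_iters):
        # Compute both right-hand sides using the *old* values of w and b...
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        # ...and only then assign the new values, so the update is simultaneous.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])   # roughly y = 2x + 1
w, b = gradient_descent(x_train, y_train, alpha=0.05, num_iters=5000)
print(w, b)                                # should approach w ≈ 2, b ≈ 1
```

An incorrect, non-simultaneous version would update w first and then reuse the already-updated w when computing the derivative for b; the sketch above avoids that by computing both derivatives before either assignment.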

I hope this overview was interesting and gave you a basic understanding of gradient descent. Thank you for reading, and keep checking back for upcoming related topics.

You can connect with me on the following:

Linkedin | GitHub | Medium | email : akshitaguru16@gmail.com
