Mehmet Toprak
5 min read · May 29, 2020

Gradient Descent

In this blog, I will talk about gradient descent. Let’s say that we have some data, and here’s a scatter plot of the data.

For simplicity, I have generated data where z is twice x. Say we want to find the value of w that would generate a line that best fits this data. To do that, we define a cost or loss function. One common choice is the function J, where we take the difference between each z value and the product of w and the corresponding x, square that difference, and sum the squared differences across all values of z and x; in other words, J(w) = Σ(zᵢ − w·xᵢ)². The best value of w is then the one that results in the minimum value of this cost or loss function.
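To make that concrete, here is a minimal sketch of such data and cost function in Python. The actual data behind the scatter plot isn’t shown in this post, so the x values below (and the name J for the cost) are my own assumptions; the only things carried over from the text are that z is twice x and that the cost sums the squared differences.

```python
import numpy as np

# Assumed toy data: a few x values, with z exactly twice x
x = np.array([0.25, 0.50, 0.75])
z = 2 * x

def J(w):
    """Cost: sum of squared differences between z and w * x."""
    return np.sum((z - w * x) ** 2)
```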

Let’s take a look at what this cost function looks like.
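The original plot isn’t reproduced here, but you can trace the curve yourself with a quick sketch like the following (same assumed data as above; matplotlib is only needed for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.25, 0.50, 0.75])   # same assumed toy data as above
z = 2 * x

def J(w):
    return np.sum((z - w * x) ** 2)

ws = np.linspace(0, 4, 401)        # candidate weights from 0 to 4
costs = [J(w) for w in ws]

plt.plot(ws, costs)                # a parabola in w
plt.xlabel("w")
plt.ylabel("J(w)")
plt.show()

print(ws[np.argmin(costs)])        # -> 2.0, the weight at the bottom of the curve
```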

What makes this loss or cost function attractive is that it is a parabola, with one global minimum or one unique solution. For the given data, the value of w that minimizes this cost function is w = 2, meaning z = 2x, which would result in a line that fits the points perfectly. This is a very simplified example; in real-world datasets, the target variable z would depend on more than one variable, and we can’t simply plot the cost function and visually determine the best values of the weights.

So how do we determine the best value of w? The best value of w, or of the w’s in case you have many weights to optimize, is determined through an algorithm called gradient descent. Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find the minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. What does that mean?

We start at a random initial value of w. Let’s call it w0, and say it’s equal to 0.2, and we start taking steps towards the minimum at w = 2 (the green dot in the plot). To determine which direction to move in, we compute the gradient of the loss function at the current value of w, which is 0.2. The gradient is given by the slope of the tangent at w = 0.2, and the magnitude of the step is controlled by a parameter called the learning rate: the larger the learning rate, the bigger the step we take; the smaller the learning rate, the smaller the step. We then take the step and move to w1, where w1 is computed as w0 minus the learning rate times the gradient at w0. This is the first iteration of the algorithm. At w1, we repeat the same process: compute the gradient at w1 and use the same learning rate to control the magnitude of the step towards the minimum. We keep repeating this until we hit the minimum, or a value of the cost function within a very small predefined threshold of the minimum.

When choosing the learning rate, we have to be careful: a large learning rate means big steps, and we may overshoot and miss the minimum, while a small learning rate means very small steps, and the algorithm may take a long time to find the minimum.
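Here is a minimal sketch of that update loop in Python, again with my assumed toy data. Note that a learning rate of 0.4 happens to suit this particular data scale; with larger x values the same rate would overshoot, which is exactly the caution above.

```python
import numpy as np

x = np.array([0.25, 0.50, 0.75])   # assumed toy data, z = 2x
z = 2 * x

def J(w):
    return np.sum((z - w * x) ** 2)

def dJ_dw(w):
    # derivative of the sum of squared errors with respect to w
    return -2 * np.sum(x * (z - w * x))

w = 0.2            # initial value w0, as in the text
lr = 0.4           # learning rate (suits this data scale)
tol = 1e-6         # stop once the cost barely changes

prev_cost = J(w)
for step in range(1000):
    w = w - lr * dJ_dw(w)          # w_{k+1} = w_k - lr * gradient at w_k
    cost = J(w)
    if abs(prev_cost - cost) < tol:
        break
    prev_cost = cost

print(w)   # ≈ 2.0
```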

Now let’s see how each iteration, with a learning rate of 0.4, affects how well the resulting line fits the data.

We initialize w to 0, which means z equals 0. This is a horizontal line, so the cost is high and the line, as you can see, is a bad fit. After the 1st iteration, w moves closer to 2, and because the slope is very steep at w = 0, the new value of w results in a big drop in the loss function. The resulting line fits better than the initial one, but there is still room for improvement.

After the 2nd iteration, w continues moving toward w = 2. Because the slope is not as steep as before, the step is not as big, but the cost function still drops in value and the resulting line moves closer to the ideal best-fit line. The same observation holds for the 3rd and 4th iterations.

After 4 iterations, we are almost at w = 2, and the resulting line fits the scatterplot almost perfectly. With each iteration, the weight is updated in a way that is proportional to the negative of the gradient of the function at the current point. Therefore, if you initialize the weight to a value to the right of the minimum, the positive gradient there will move w to the left, towards the minimum.
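For reference, here is roughly what that walkthrough looks like as numbers, using the same assumed toy data, a learning rate of 0.4, and w initialized to 0. The exact values depend on the data, so treat them as illustrative.

```python
import numpy as np

x = np.array([0.25, 0.50, 0.75])   # same assumed toy data, z = 2x
z = 2 * x

w, lr = 0.0, 0.4                   # initialize w to 0, learning rate 0.4
for i in range(1, 5):
    grad = -2 * np.sum(x * (z - w * x))    # dJ/dw at the current w
    w -= lr * grad                          # gradient descent update
    cost = np.sum((z - w * x) ** 2)
    print(f"iteration {i}: w = {w:.3f}, cost = {cost:.4f}")
# With this data, w climbs from 0 toward 2 in a few big-then-smaller steps.
```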
