Why do we move opposite to the gradient in Gradient Descent?

Anjana Yadav · Published in Analytics Vidhya · Aug 31, 2019

Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point. In other words, we move in the direction opposite to the gradient. Have you ever wondered why only the opposite direction? Why not in the direction of the gradient?

Here we will mathematically derive why we move in the direction opposite to the gradient.

First, let us understand gradients. The gradient is a vector of partial derivatives of a function with respect to its variables. Suppose our function is J(w, b) = w² + b². Then the partial derivatives of J(w, b) with respect to w and b are:
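∂J/∂w = 2w and ∂J/∂b = 2b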

The gradient is thus a vector of these partial derivatives as follows:
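∇J(w, b) = [∂J/∂w, ∂J/∂b] = [2w, 2b]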

Now let's try to find the optimal direction of movement for reaching the minima. We can derive this using the Taylor series. A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function's derivatives at a single point:

f(x) = f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)² + (f‴(a)/3!)(x − a)³ + …

If J(w) is the current cost value, and we perform a small update η*u to the weights w, then the new loss will be J(w + η*u). Here η is the learning rate hyperparameter and u is the direction of the update. The Taylor series for this function looks like:
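J(w + η*u) = J(w) + η * uᵀ ∇J(w) + (η²/2!) * uᵀ ∇²J(w) u + …

where ∇²J(w) is the Hessian, the matrix of second derivatives.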

Since η is very small, the terms containing η² and higher powers will be very close to zero, so we can ignore them. The above equation thus becomes:
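J(w + η*u) ≈ J(w) + η * uᵀ ∇J(w)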

The new loss J(w + η*u) should ideally be less than the previous loss J(w). Thus the difference between them should be less than zero, i.e.
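J(w + η*u) − J(w) = η * uᵀ ∇J(w) < 0

Since η is positive, the sign of the loss difference is decided entirely by the dot product uᵀ ∇J(w).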

We know that the dot product of two vectors equals the product of their magnitudes and the cosine of the angle between them, that is, a·b = ||a|| ||b|| cos(θ). Using this for the above equation we get:
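cos(β) = (uᵀ ∇J(w)) / (||u|| * ||∇J(w)||)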

where β is the angle between u and the gradient. We know that the value of cos(β) lies between −1 and 1. Thus the above expression is also bounded:
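−1 ≤ (uᵀ ∇J(w)) / (||u|| * ||∇J(w)||) ≤ 1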

Setting the denominator equal to k, we get k = || u || * || ∇J(w) ||, or:
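−k ≤ uᵀ ∇J(w) ≤ k

Equivalently, since uᵀ ∇J(w) = k * cos(β), the loss difference is η * uᵀ ∇J(w) = η * k * cos(β).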

We want the loss difference to be as negative as possible. Since k and η are positive, the loss difference will be negative only when cos(β) is negative, and k * cos(β) attains its minimum value, −k, when β is 180 degrees. That is, the angle between u and the gradient should be 180 degrees. This means we should move at 180 degrees to the gradient, i.e. opposite to the gradient, for the maximum loss reduction.
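We can also verify this numerically. Below is a minimal sketch in plain Python with NumPy (the helper names J and grad_J are my own, implementing the example function J(w, b) = w² + b² from above) that compares the loss change for a unit step taken along the gradient, perpendicular to it, and opposite to it:

import numpy as np

def J(v):
    # The example cost from the article: J(w, b) = w^2 + b^2
    w, b = v
    return w**2 + b**2

def grad_J(v):
    # Gradient derived above: [2w, 2b]
    w, b = v
    return np.array([2 * w, 2 * b])

eta = 0.1                       # learning rate η
v = np.array([3.0, 4.0])        # current point (w, b)
g = grad_J(v)
g_unit = g / np.linalg.norm(g)  # unit vector along the gradient

# Unit step directions u at β = 0°, 90°, and 180° to the gradient
directions = {
    "along gradient    (β = 0°)  ": g_unit,
    "perpendicular     (β = 90°) ": np.array([-g_unit[1], g_unit[0]]),
    "opposite gradient (β = 180°)": -g_unit,
}

print(f"current loss J(w) = {J(v):.4f}")
for label, u in directions.items():
    # Loss difference J(w + η*u) - J(w); most negative at β = 180°
    diff = J(v + eta * u) - J(v)
    print(f"{label}: loss change = {diff:+.4f}")

Running this from the point (3, 4), the step opposite to the gradient gives the largest decrease in loss, as derived; the tiny positive change for the perpendicular step comes from the second-order η² term that we ignored above.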

I hope you have got the answer to the question we started with. Thanks a lot for reading.
