Why is the direction of steepest descent always opposite to the gradient of the loss function?

Shikhar Goswami
Published in Analytics Vidhya · Jun 23, 2020

We have all heard about the gradient descent algorithm and how it is used to update parameters so that the loss function decreases at each iteration. The steepest descent is the direction in which the loss function decreases the most. But do you know why the direction of steepest descent is always opposite to the gradient of the loss function? Or why we call the algorithm ‘gradient descent’? Let’s find out!

w(t) = w(t-1) − α∇L

In this update equation, -∇L is that ‘opposite’ direction. But why?
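As a concrete illustration, here is a minimal sketch of one such update step in Python. It assumes a toy quadratic loss; the names `loss`, `grad`, `alpha`, and the starting point are purely illustrative, not part of the original derivation.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss: L(w) = ||w||^2
    return np.sum(w ** 2)

def grad(w):
    # Gradient of the toy loss: ∇L(w) = 2w
    return 2 * w

alpha = 0.1                 # learning rate
w = np.array([3.0, -2.0])   # w(t-1)

# One gradient descent step: w(t) = w(t-1) - alpha * ∇L
w = w - alpha * grad(w)
print(loss(w))              # the loss is smaller after the step
```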

It’s all in the Taylor series!

Let’s consider a small change in w. The updated weight becomes w(new) = w + ηΔw. (Remember, w is a vector, so Δw is the direction of the change and η is its magnitude.)

Now, let Δw = u and transpose(u) = v. The loss at the new weights will be L(w + ηu).

Since this is a small change in w, we can write the new loss using a Taylor expansion as follows:

L(w + ηu) = L(w) + η·v∇L(w) + (η²/2!)·v∇²L(w)u + …

Now, η is very small, so we can ignore the terms containing η², η³ and higher powers. Our equation becomes,

L(w + ηu) − L(w) ≈ η·v∇L(w)
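A quick numeric check of this first-order approximation (again a sketch, reusing the toy quadratic loss from above; the direction u and the step size `eta` are arbitrary):

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)          # L(w) = ||w||^2

def grad(w):
    return 2 * w                   # ∇L(w) = 2w

w = np.array([3.0, -2.0])
u = np.array([1.0, 1.0])           # an arbitrary direction Δw
eta = 1e-3                         # small step size

actual = loss(w + eta * u) - loss(w)       # L(w + ηu) - L(w)
first_order = eta * np.dot(u, grad(w))     # η · uᵀ∇L(w)
print(actual, first_order)                 # nearly equal for small η
```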

As we know, our goal is to minimize the loss function L at each step. Therefore,

L(w + ηu) − L(w) < 0, i.e., the new loss is less than the old loss. From the above equation, we can say v·∇L(w) < 0.

v·∇L(w) is a dot product. Let β be the angle between v and ∇L(w).

cos(β) = v·∇L(w) / (|v||∇L(w)|). Let |v||∇L(w)| = k for simplicity.

Therefore, cos(β) = v·∇L(w)/k, where −1 ≤ cos(β) ≤ 1.

Now, we want v·∇L(w) to be as negative as possible (we want the new loss to be as much smaller than the old loss as possible). Therefore, we want cos(β) to be as low as possible. The lowest value cos(β) can take is −1, in which case β = 180 degrees.
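The sketch below compares a few candidate unit directions against the negative-gradient direction, again using the toy quadratic loss from earlier (the candidate directions are arbitrary choices). It shows that the direction opposite the gradient has cos(β) = −1 and produces the largest drop in loss:

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)

def grad(w):
    return 2 * w

w = np.array([3.0, -2.0])
g = grad(w)
eta = 1e-2

# A few arbitrary unit directions plus the negative-gradient direction
candidates = {
    "along gradient":    g / np.linalg.norm(g),
    "arbitrary":         np.array([1.0, 0.0]),
    "opposite gradient": -g / np.linalg.norm(g),
}

for name, u in candidates.items():
    cos_beta = np.dot(u, g) / (np.linalg.norm(u) * np.linalg.norm(g))
    drop = loss(w + eta * u) - loss(w)
    print(f"{name:18s} cos(β) = {cos_beta:+.2f}  ΔL = {drop:+.5f}")

# The opposite-gradient direction gives cos(β) = -1 and the most negative ΔL.
```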

Therefore, in order to decrease the loss function the most, the algorithm always steers in the direction opposite to the gradient of the loss function, ∇L.

(Image: gradient descent illustration — https://ml-cheatsheet.readthedocs.io/en/latest/_images/gradient_descent_demystified)

Note: The concept of this article is based on video lectures from the course CS7015: Deep Learning, taught on NPTEL Online.

Resources:

  1. Taylor Series
  2. Gradient
