Why Do We Move Our Weights in the Opposite Direction of the Gradient?

In this post, I will give some intuition about why we move our weights in the opposite direction of the gradient when finding the minimum of a function using gradient descent.

Let's consider a loss function L(θ). The update rule for θ in gradient descent is given by the equation below:

θ_new = θ_old − α.∇ L(θ_old)
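To make the update rule concrete, here is a minimal sketch in Python. The toy loss L(θ) = θ², its gradient, and the values of α and the starting θ are illustrative assumptions, not part of the derivation:

```python
# Minimal gradient descent sketch for an assumed toy loss L(theta) = theta^2.
def loss(theta):
    return theta ** 2

def grad(theta):
    # dL/dtheta for the toy loss above
    return 2 * theta

theta = 5.0   # assumed starting point
alpha = 0.1   # assumed learning rate
for step in range(5):
    theta = theta - alpha * grad(theta)   # move opposite to the gradient
    print(step, theta, loss(theta))       # loss shrinks every step
```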

I will now discuss why there is a minus sign in this update rule. Let's consider a change in θ, i.e. Δθ, and a learning rate α. The loss after one update is then L(θ + α.Δθ). From the Taylor series expansion we can write L(θ + α.Δθ) as below:

L(θ + α.Δθ) = L(θ) + α.∇ L(θ).Δθ + (α^{2}/2!).Δθᵀ.∇^{2} L(θ).Δθ + (terms with higher powers of α)

For small α we can neglect all terms with α^{n} where n ≥ 2, so the equation becomes:

L(θ + α.Δθ) ≈ L(θ) + α.∇ L(θ).Δθ
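As a quick numerical check of this first-order approximation, here is a small Python sketch. The toy loss, the point θ, the direction Δθ, and the values of α are all assumptions chosen for illustration:

```python
import numpy as np

# Assumed toy loss L(theta) = theta_0^2 + 3*theta_1^2 and its gradient.
def loss(theta):
    return theta[0] ** 2 + 3 * theta[1] ** 2

def grad(theta):
    return np.array([2 * theta[0], 6 * theta[1]])

theta = np.array([1.0, -2.0])   # assumed point
dtheta = np.array([0.5, 1.0])   # assumed change in theta
for alpha in [0.1, 0.01, 0.001]:
    exact = loss(theta + alpha * dtheta)
    first_order = loss(theta) + alpha * grad(theta).dot(dtheta)
    # The gap between the exact loss and the first-order estimate
    # shrinks rapidly as alpha gets smaller.
    print(alpha, exact, first_order)
```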

In gradient descent we want to reduce the loss in every iteration, i.e. the present loss should be less than the previous loss. So from the above equation:

L(θ + α.Δθ) − L(θ) < 0 ⇒ α.∇ L(θ).Δθ < 0

α is positive, so:

∇ L(θ).Δθ < 0

The quantity ∇ L(θ).Δθ above is the dot product between ∇ L(θ) and Δθ. Assume the angle between these two vectors is γ; then cos(γ) is:

cos(γ) = (∇ L(θ).Δθ) / (||∇ L(θ)||.||Δθ||)

Let's assume ||∇ L(θ)||.||Δθ|| = p, so:

∇ L(θ).Δθ = p.cos(γ)

The range of the cos function is [−1, 1], so:

−1 ≤ cos(γ) ≤ 1 ⇒ −p ≤ p.cos(γ) = ∇ L(θ).Δθ ≤ p

We are looking for a change in θ, i.e. Δθ, such that ∇ L(θ).Δθ < 0 ⇒ p.cos(γ) < 0. This is negative when γ is in (90°, 270°), and cos(γ) is most negative when γ = 180° (cos(180°) = −1), i.e. when Δθ points exactly opposite to ∇ L(θ). So we move in the direction opposite to the gradient to make the present loss less than the previous loss, and moving exactly opposite gives the largest decrease for a small step.
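The sketch below illustrates this numerically: among unit-length directions Δθ at different angles γ to the gradient, the loss drops the most when γ = 180°. The toy loss, the point θ, and the step size are assumptions for illustration:

```python
import numpy as np

# Assumed 2D toy loss and its gradient.
def loss(theta):
    return theta[0] ** 2 + 3 * theta[1] ** 2

def grad(theta):
    return np.array([2 * theta[0], 6 * theta[1]])

theta = np.array([1.0, 1.0])    # assumed point
g = grad(theta)
g_dir = g / np.linalg.norm(g)   # unit vector along the gradient
alpha = 0.05                    # assumed small step size

for gamma_deg in [0, 45, 90, 135, 180]:
    gamma = np.deg2rad(gamma_deg)
    # Rotate the gradient direction by gamma to get a unit step direction
    # that makes angle gamma with the gradient.
    rot = np.array([[np.cos(gamma), -np.sin(gamma)],
                    [np.sin(gamma),  np.cos(gamma)]])
    dtheta = rot.dot(g_dir)
    change = loss(theta + alpha * dtheta) - loss(theta)
    # The change in loss is most negative at gamma = 180 degrees.
    print(gamma_deg, f"{change:.4f}")
```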

References:

  1. Applied AI Course
  2. CS7015: Deep Learning — 2018 by IIT Madras