How Does Stochastic Gradient Descent Find the Global Minimum?

Anurag Mukherjee
Published in The Startup
2 min read · Jul 11, 2020

Stochastic Gradient Descent is known for the randomness it introduces while iterating towards the best set of parameters for its training set. But does that randomness only buy us speed in finding a good set of parameters?

Actually, the mathematics reveals one more advantage of SGD: it effectively smooths a continuously differentiable, complicated cost function that has several local minima.

Though convex-optimisation proofs treat the mathematics rigorously, here we shall build an intuitive feel for why and where this happens.

Modern neural networks work at a level of abstraction that makes their loss functions highly non-convex, with every kind of uncomfortable feature: peaks, saddle points and local minima. Finding the global minimum is the real concern here, since that is what takes us closest to the optimal parameters.

True (batch) gradient descent follows this update rule: x(t+1) = x(t) - η ∇f(x(t))

As you can see, each step is taken along the negative of the loss function's gradient (-η ∇f(x(t))), so the iterate has a high chance of settling into one of those saddle points or local minima.
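
To see this concretely, here is a minimal sketch in Python on a toy 1-D loss with a shallow local minimum and a deeper global minimum. The function, learning rate and starting point are illustrative choices of mine, not from the article:

```python
# Toy 1-D loss: shallow local minimum near x ≈ 1.4,
# deeper global minimum near x ≈ -1.7 (illustrative choice).
def f(x):
    return 0.1 * x**4 - 0.5 * x**2 + 0.3 * x

def grad_f(x):
    return 0.4 * x**3 - 1.0 * x + 0.3

eta = 0.05   # learning rate η (assumed)
x = 2.0      # start inside the basin of the shallow local minimum

# True (batch) gradient descent: x(t+1) = x(t) - η ∇f(x(t))
for _ in range(500):
    x = x - eta * grad_f(x)

# The iterate settles into the shallow local minimum near x ≈ 1.4
print(f"batch gradient descent ends at x = {x:.3f}, f(x) = {f(x):.3f}")
```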

In SGD, the descent is not truly along the gradient (which is why the 'Gradient' in SGD is something of a misnomer here) but along a perturbed direction: x(t+1) = x(t) - η (∇f(x(t)) + w(t)), where w(t) is a sample drawn from some random noise distribution W.

The additional w(t) term makes the algorithm take a noticeably different course from the one dictated by the loss function's gradient alone.
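
Here is the same toy problem with the perturbed update. I assume w(t) is drawn from a zero-mean Gaussian; the Gaussian form and its scale are my assumptions, the article only requires w(t) to come from some random distribution:

```python
import random

# Same toy loss and gradient as in the previous sketch.
def f(x):
    return 0.1 * x**4 - 0.5 * x**2 + 0.3 * x

def grad_f(x):
    return 0.4 * x**3 - 1.0 * x + 0.3

eta   = 0.05   # learning rate η (assumed)
sigma = 3.0    # scale of the noise w(t) (assumed, not from the article)
x = 2.0        # same starting point as before

# Perturbed update: x(t+1) = x(t) - η (∇f(x(t)) + w(t)),  w(t) ~ N(0, σ²)
for _ in range(2000):
    w = random.gauss(0.0, sigma)
    x = x - eta * (grad_f(x) + w)

print(f"noisy descent ends at x = {x:.3f}, f(x) = {f(x):.3f}")
# Unlike the plain-gradient run, this typically (though not always) ends up
# near the deeper global minimum around x ≈ -1.7.
```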

Fig 1: SGD escapes the local minimum x0

Here -η v(t) = -η (∇f(x(t)) + w(t)). The red line, which follows the gradient of f, takes you towards the local minimum x0. The intuition behind w(t) is to inject some noise, i.e. a sample from a purely random distribution W.

This effectively convolves the loss function into a smoother one: the shallower valleys in the landscape (the local minima here) become insignificant, so the net gradient points towards the global minimum most of the time.

So we see that the grey line along -η v(t) moves away from the local minimum altogether. The noisiness in SGD is thus precisely what helps it escape the trouble spots in the loss landscape.
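
One way to check the "convolution" claim numerically is to estimate the gradient of the smoothed loss E_w[f(x + w)] by Monte-Carlo and evaluate it at the local minimum of the toy loss used above. The Gaussian kernel and its width are again my assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def grad_f(x):
    # Gradient of the same toy loss as above
    # (local minimum near x ≈ 1.4, global minimum near x ≈ -1.7).
    return 0.4 * x**3 - 1.0 * x + 0.3

def smoothed_grad(x, sigma=0.5, n_samples=200_000):
    # Monte-Carlo estimate of the gradient of the convolved loss
    # E_w[f(x + w)] with w ~ N(0, σ²); σ is an illustrative assumption.
    w = rng.normal(0.0, sigma, size=n_samples)
    return np.mean(grad_f(x + w))

x_local = 1.4
print(f"true gradient at the local minimum:     {grad_f(x_local):+.3f}")      # ≈ 0
print(f"smoothed gradient at the local minimum: {smoothed_grad(x_local):+.3f}")  # clearly positive
# A positive smoothed gradient means the step -η * (smoothed gradient) moves left,
# out of the shallow valley and towards the global minimum near x ≈ -1.7.
```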

So here are some rules of thumb for when to use SGD and when to avoid it:

  1. If the data-set is huge, in terms of samples as well as features, use SGD right away (see the sketch after this list).
  2. If there are not many features, prefer full (batch) gradient descent, since it gives a better chance of reaching the optimal parameters.
  3. If the data-set is not large enough, use anything but SGD, as it will badly over-fit the model.
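
As a rough sketch of rule 1, mini-batch SGD on a large synthetic least-squares problem gets many cheap, noisy updates out of each pass over the data. The data sizes, batch size and learning rate below are illustrative, not prescriptive:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem (sizes are illustrative).
n_samples, n_features = 100_000, 50
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

def sgd(batch_size=32, lr=0.01, epochs=5):
    # Mini-batch SGD: each update uses the gradient of a small random batch,
    # so one pass over the data yields n_samples / batch_size cheap, noisy updates,
    # instead of the single expensive update of full-batch gradient descent.
    w = np.zeros(n_features)
    for _ in range(epochs):
        idx = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            b = idx[start:start + batch_size]
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w_hat = sgd()
# Should land close to the 0.1² label-noise floor of this synthetic problem.
print("mean squared error:", np.mean((X @ w_hat - y) ** 2))
```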

So, I hope this gave you a better intuition into the dynamics behind the famed Stochastic Gradient Descent and its usefulness: it is computationally efficient and, at the same time, gives you a much better chance of reaching the global minimum while training neural networks.
