Gradient Descent in deep learning: a mountain perspective

Fabio Moretti
Quantyca
Jan 27, 2020

This is my first post on Medium (please, be kind 😅), so let me introduce myself.

I’m Fabio: after a degree in Particle Physics, I now work for Quantyca (if you are curious, I suggest you visit my company’s website, or our LinkedIn page, to discover what kind of stuff we do).

I would like to share with you some thoughts about Gradient Descent in machine learning that came to me during my work on the Quantyca Analytics Team and my last hike up my home mountain.

“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it”

- Eliezer Yudkowsky

Let’s try to avoid this danger by looking at the very first working mechanism of Artificial Intelligence. To a large extent, Deep Learning is about solving optimization problems, and very massive ones by the way. A Neural Network is a (very complicated) function with a ton of parameters: from a mathematical perspective, this function represents the solution to a problem.

Training a neural network means minimising this function (called the loss function). The value of the loss function tells us how far our network’s performance is from perfection on a given dataset.
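To make this concrete, here is a minimal sketch of a loss function, using mean squared error (just one of many possible choices; the toy predictions and targets are made up for illustration):

```python
import numpy as np

# Mean squared error: one common loss function. It measures how far
# predictions are from targets: zero means perfection on this dataset,
# larger values mean worse performance.
def mse_loss(predictions, targets):
    return np.mean((predictions - targets) ** 2)

targets = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])  # close to the targets -> small loss
bad = np.array([0.1, 0.9, 0.2])   # far from the targets -> large loss
print(mse_loss(good, targets))    # ~0.02
print(mse_loss(bad, targets))     # ~0.75
```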

Loss Function 🌄

Presolana Sud by Ago76 — Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17459010

Last Sunday I was on top of the Presolana, a very beautiful group of mountains near my hometown. The trail was rocky, exposed and not so easy, so during the descent I followed this “Presolana-descent” strategy:

  1. 👀 Evaluated all the possible paths around my position
  2. Rejected the ones going up and all the ones without handholds. These paths would have cost me more energy (kinetic against potential) and made my task even more difficult
  3. 📉 Took the path that I thought had the steepest downhill slope while still being safe

Thanks to these steps, I was able to reach home safely.

Well, the loss function in deep learning supports the same strategy: it maps decisions and possible paths to their associated costs (computational costs this time, not energetic ones).

Let’s now focus on the nice loss function contours in the picture (well, in this case “nice” means two or few parameters; in the real world a loss function has thousands of parameters and the contour plot looks more like a canyon landscape).

An ideal loss function contour

The x and y axes plot the values of the two weights. The z axis represents the value of the loss function for each pair of weights. The goal of neural network training is to find the particular pair of weights for which the loss is minimum (in other words, to reach the minimum of the loss function).
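To get a feeling for such a surface, here is a sketch with a made-up, bowl-shaped two-weight loss standing in for a real network’s (the function and the grid are purely illustrative):

```python
import numpy as np

# A toy loss over two weights (w1, w2): a smooth bowl whose contour
# plot would look like the idealized picture above.
def loss(w1, w2):
    return (w1 - 2.0) ** 2 + (w2 + 1.0) ** 2

# Evaluate the loss on a grid of (w1, w2) pairs: reading the z axis
# of the contour plot.
w1, w2 = np.meshgrid(np.linspace(-4, 4, 201), np.linspace(-4, 4, 201))
z = loss(w1, w2)

# The training goal: the pair of weights where the loss is minimum.
i = np.unravel_index(np.argmin(z), z.shape)
print(w1[i], w2[i], z[i])  # close to (2, -1), loss near 0
```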

In the beginning the weights are randomly chosen (so your neural network is probably behaving like a drunk version of yourself). Such a situation corresponds to the top of the Presolana (a point on the contour) where many descent paths can be chosen: the network is performing badly and consequently the loss is high.

We need a way to navigate to where the loss function has its minimum (the home after the Presolana climb). This way is called Gradient Descent, and it follows our downhill strategy.

Gradient Descent 📉

When we choose the initial values of our weights, we are at a certain point in the loss landscape. Like in the first step of our Presolana-descent algorithm, we check out all possible directions in the xy plane. Once the direction that brings the steepest decline in the value of the loss function is chosen, we move along it. This direction is exactly opposite to the direction of the gradient (the multi-dimensional relative of the 1-D derivative). The gradient gives us the direction of steepest ascent, so we descend against it: this is how the algorithm gets its name.
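As a sketch of this “look around, then step against the gradient” idea (with a toy bowl-shaped loss and a finite-difference gradient as illustrative stand-ins for a real network and backpropagation):

```python
import numpy as np

def loss(w):  # toy bowl with its minimum at (2, -1)
    return (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2

# Estimate the gradient numerically with central finite differences:
# the "evaluate all directions around the position" step.
def numerical_gradient(loss, w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        bump = np.zeros_like(w)
        bump[i] = eps
        grad[i] = (loss(w + bump) - loss(w - bump)) / (2 * eps)
    return grad

w = np.array([0.0, 0.0])
g = numerical_gradient(loss, w)
print(g)                           # ~[-4, 2]: the steepest ASCENT direction
print(loss(w), loss(w - 0.1 * g))  # stepping AGAINST it lowers the loss
```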

Now, once we have the direction we want to move in, we must decide the size of the step to take. The size of this step is called the learning rate, and we must choose it carefully to ensure we can get down to the minimum (a toy comparison follows the list below):

  • 🏇 Too fast: we might overshoot the minimum and keep bouncing along the ridges of the “valley”
  • 🐌 Too slow: training might turn out to be too long to be sustainable. Slow learning rates also make the algorithm more prone to getting stuck in a local minimum (maybe we’ll see this danger and how to avoid it in another post)
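A toy 1-D experiment (the quadratic “valley” and the specific rates here are made up for illustration) shows both failure modes:

```python
def loss(w):   # a 1-D valley with its minimum at w = 0
    return w ** 2

def grad(w):   # its derivative
    return 2 * w

# Same start, same number of steps, three learning rates.
for alpha, label in [(1.1, "too fast"), (0.001, "too slow"), (0.1, "reasonable")]:
    w = 3.0
    for _ in range(50):
        w = w - alpha * grad(w)
    print(f"{label:10s} alpha={alpha}: final loss = {loss(w):.6g}")
# too fast:   diverges, bouncing ever higher between the valley walls
# too slow:   after 50 steps the loss has barely moved from 9
# reasonable: the loss lands very close to 0
```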

Now we recompute the gradient at whatever position we end up at and repeat the process (just like on the Presolana descent).

While the direction of the gradient tells us which way the steepest ascent lies, its magnitude tells us how steep the ascent (or descent) is. So, at the minimum you would expect the gradient to be almost zero, because there the contour is almost flat.

In a real situation we might never exactly reach the minimum, but we keep oscillating in a flat region close to it. Because of the flatness of this region, the loss during these oscillations is almost the lowest we can achieve, and it doesn’t change much. Often we stop the bouncing when the loss values haven’t improved for a prefixed number of iterations. When this happens, we say our training has converged.
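One simple way to code this stopping criterion is a “patience” counter: stop when the loss hasn’t improved for a prefixed number of iterations (the hyperparameter names and the toy 1-D loss here are illustrative, not from any particular library):

```python
def train(w, grad, loss, alpha=0.1, patience=10, tol=1e-8, max_iters=10_000):
    best = loss(w)
    stale = 0                       # iterations without improvement
    for i in range(max_iters):
        w = w - alpha * grad(w)     # one gradient descent step
        current = loss(w)
        if current < best - tol:    # meaningful improvement: reset counter
            best, stale = current, 0
        else:
            stale += 1
        if stale >= patience:       # no progress for `patience` steps:
            return w, i             # training has converged
    return w, max_iters

w, iters = train(3.0, grad=lambda w: 2 * w, loss=lambda w: w ** 2)
print(w, iters)  # w very close to 0 after a few dozen iterations
```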

Some (not so) boring formulas 🔬

Now let’s try to briefly formalize the process we have seen. The basic transformation that describes the update rule of gradient descent is:

ω ← ω − α ∇ω Lₘ(ω)

This transformation is performed at every iteration, where:

  • ω is the weights vector (which lies in the xy plane of our picture)
  • Lₘ(ω) is the loss function
  • α is the learning rate
  • ∇ω is the gradient with respect to the weights

From the weights vector we subtract the gradient of the loss function with respect to the weights, multiplied by the learning rate. The gradient is a vector giving the direction in which the loss function has the steepest ascent (the minus sign is there because we move in the direction opposite to the gradient).

The same update rule is applied to every weight of the network simultaneously. Since the update is performed individually for each weight, each weight receives the projection of the gradient vector along its own direction, i.e. the partial derivative of the loss with respect to that weight.

Before subtracting, we multiply the gradient vector by the learning rate: mathematically, this operation represents the step we discussed earlier. Please note that even if we keep α constant, the size of the step changes with the value of the gradient, i.e. with the steepness of the loss contour. As we approach a minimum the gradient approaches zero, so we take smaller and smaller steps towards it. This is useful: a step size that is too large may cause the algorithm to overshoot a minimum and bounce between the ridges around it.
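Putting it all together, here is the full update rule in code, on the same illustrative two-weight bowl used above (the hand-written analytic gradient plays the role that backpropagation would in a real network):

```python
import numpy as np

def L(w):       # toy loss, minimum at (2, -1)
    return (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2

def grad_L(w):  # its gradient: one partial derivative per weight
    return np.array([2 * (w[0] - 2.0), 2 * (w[1] + 1.0)])

alpha = 0.1
w = np.random.randn(2)      # random initial weights (the "drunk" network)
for step in range(100):
    # The update rule, applied to all weights simultaneously:
    #   w <- w - alpha * grad_L(w)
    # Near the minimum the gradient shrinks, so even with a constant
    # alpha the effective step gets smaller and smaller.
    w = w - alpha * grad_L(w)

print(w)  # very close to the minimum (2, -1)
```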

References

Hope you enjoyed the read! Here are some references you can use to dig deeper into Neural Networks, Gradient Descent and their practical applications:

  • David E. Rumelhart and James L. McClelland, “Learning Internal Representations by Error Propagation”, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, MIT Press, 1987
  • W.A. Gardner, “Learning characteristics of stochastic-gradient-descent algorithms: A general study, analysis, and critique”, Signal Processing, Volume 6, Issue 2, 1984
  • Sebastian Ruder, “An overview of gradient descent optimization algorithms”, arXiv:1609.04747, 2016
  • Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford and Nando de Freitas, “Learning to learn by gradient descent by gradient descent”, arXiv:1606.04474, 2016
  • Rohan Kshirsagar, “Life is gradient descent”, hackernoon.com
