Gradient Descent and Stochastic Gradient Descent Algorithms for Neural Networks

Roan Gylberth
Konvergen.AI
May 1, 2018

Everyone who has ever trained Neural Networks has, chances are, stumbled upon the Gradient Descent algorithm or one of its variations. These algorithms are used to find the parameters that minimize the value of the loss function of a Neural Network. The way they work is quite simple: at every iteration 𝑡, we compute the gradient estimate by averaging the gradients over all pairs of examples in the training set 𝒮
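In symbols (writing (xᵢ, yᵢ) for the i-th training pair, a labeling assumed here for readability), the gradient estimate is

    ĝ𝑡 = (1/n) ∑ᵢ₌₁ⁿ ∇θ𝓛(xᵢ, yᵢ; θ𝑡)        (1)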

where n is the total number of examples in the training set, 𝓛 is the loss function we wish to minimize, and ∇θ𝓛 is the gradient, the vector of partial derivatives of 𝓛 with respect to θ𝑡. In neural networks, we can compute this gradient with the Back-propagation algorithm. After computing the gradient estimate, we update the parameters along the opposite direction of the gradient
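Written out, with ĝ𝑡 denoting the gradient estimate, the update rule is

    θ𝑡+1 = θ𝑡 − η ĝ𝑡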

where η is the learning rate, a value that controls how much we allow the parameters to follow the opposite direction of the gradient estimate. For an intuition and a technical explanation of this algorithm, you can refer to Michael Nielsen’s great treatment of it, which you can find in this post.

We can see from equation (1) that to compute the gradient estimate in gradient descent, we have to average over all n examples. This becomes a problem when our training set grows to millions or billions of examples: every iteration takes very long. Stochastic Gradient Descent avoids this computational problem. At every iteration, we sample a mini-batch 𝕄 of n′ examples from the training set. These examples are drawn uniformly from the training set, and the mini-batch size n′ is fixed and much smaller than n. To update the parameters, we compute the gradient estimate from the pairs of examples in the mini-batch 𝕄 rather than from the whole training set 𝒮, so the gradient estimate computation becomes
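With the same pair labeling as before (xᵢ, yᵢ for the i-th example, a notation assumed here), the mini-batch gradient estimate is

    ĝ𝑡 = (1/n′) ∑ ∇θ𝓛(xᵢ, yᵢ; θ𝑡),  summing over the n′ pairs (xᵢ, yᵢ) in 𝕄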

After computing the gradient estimate, we again update the parameters along the opposite direction of the gradient
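Explicitly, with ĝ𝑡 now the mini-batch gradient estimate:

    θ𝑡+1 = θ𝑡 − η ĝ𝑡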

As you can see, the update rule is the same as in gradient descent; the difference lies only in the gradient estimate used to update the parameters.
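To make the difference concrete, here is a minimal pure-Python sketch on a toy one-parameter least-squares problem (the data, learning rate, batch size, and step counts are made up for illustration); it contrasts the full-batch update of gradient descent with the mini-batch update of stochastic gradient descent:

```python
import random

# Toy one-parameter least-squares problem: fit w so that w*x ≈ y.
# Data, learning rate, and step counts are illustrative only.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [3.0 * x for x in xs]  # true weight is 3, no noise

def grad(w, pairs):
    # Average gradient of the squared error over the given pairs:
    # d/dw (1/m) * sum((w*x - y)^2) = (1/m) * sum(2 * (w*x - y) * x)
    return sum(2.0 * (w * x - y) * x for x, y in pairs) / len(pairs)

def batch_gd(w, eta=0.05, steps=100):
    # Gradient descent: every step averages over the full training set.
    data = list(zip(xs, ys))
    for _ in range(steps):
        w -= eta * grad(w, data)
    return w

def sgd(w, eta=0.05, steps=100, batch_size=2, seed=0):
    # Stochastic (mini-batch) gradient descent: every step averages
    # over a small mini-batch drawn uniformly from the training set.
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    for _ in range(steps):
        minibatch = rng.sample(data, batch_size)
        w -= eta * grad(w, minibatch)
    return w

w_gd = batch_gd(0.0)
w_sgd = sgd(0.0)  # noisier steps, but it also approaches the true weight
```

Note that each gradient-descent step touches all n examples, while each stochastic step touches only n′ of them, which is the source of the computational saving discussed below.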

The advantage of stochastic gradient descent is that the gradient estimate is computed only from the examples in the mini-batch, so the size of the training set does not affect the computation time per iteration. This is a good thing: we can grow our training set without worrying about the computational cost, and a larger training set allows us to use more complex models with a lower chance of over-fitting. Interestingly, the expected value of the gradient in stochastic gradient descent is equal to the gradient in batch gradient descent, which is easiest to show for a mini-batch of size 1 (n′=1). The gradient computation evaluated at a single example x, chosen uniformly at iteration 𝑡, becomes
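With n′=1 and the sampled example written as the pair (x, y):

    ĝ𝑡 = ∇θ𝓛(x, y; θ𝑡)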

Since we sample the examples uniformly over the n examples in the training set, we can compute the expectation as
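Each pair (xᵢ, yᵢ) is drawn with probability 1/n, so

    𝔼[ĝ𝑡] = ∑ᵢ₌₁ⁿ (1/n) ∇θ𝓛(xᵢ, yᵢ; θ𝑡) = (1/n) ∑ᵢ₌₁ⁿ ∇θ𝓛(xᵢ, yᵢ; θ𝑡)

which is exactly the full-batch gradient estimate of equation (1).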

In stochastic gradient descent, the gradient estimate can be viewed as the gradient estimate over all examples in the training set plus noise introduced by the randomly sampled mini-batch. You might ask, “Why do we follow noisy gradients in stochastic gradient descent if gradient descent offers gradients without noise?” This is an interesting question, and the answer lies at the beginning of this post. You may have noticed that I have always used the phrase “gradient estimate”. This is because we don’t know all possible pairs of data, so we can’t compute the “true gradient”; we can only approximate it from the pairs of examples in our training set. In the gradient descent algorithm, you compute the gradient estimate from all examples in the training set, which can cause the estimate to be biased towards the training set. This points to another advantage of stochastic gradient descent: a small mini-batch size can act like a regularizer on the gradient estimate.

Please note that the term Stochastic Gradient Descent is usually reserved for the online case, i.e., using one example at each iteration. In this post, I use the term Stochastic Gradient Descent for Mini-batch Gradient Descent, with the online case occurring when n′=1. Of course, Stochastic Gradient Descent has some limitations that have led to modifications of the algorithm, mainly to improve its convergence rate. Some notable algorithms widely used to train Neural Networks are Momentum and Nesterov, which are the topics of the next post. There are other algorithms, e.g., AdaGrad, RMSProp, Adam, etc., that will hopefully become the subject of future posts.

References

  1. I. Goodfellow, Y. Bengio, and A. Courville, “Deep Learning”, MIT Press (2016)
  2. L. Bottou, “Online Learning and Stochastic Approximations” (1998)
