Variants of Gradient Descent

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent

1. Batch Gradient Descent


Advantages:

  • A fixed learning rate can be used during training, without worrying about learning-rate decay.
  • It is a deterministic algorithm: the same data and initialization always produce the same sequence of updates.
  • It is computationally efficient per epoch, since all examples can be processed in a single vectorized pass, and it produces a stable error gradient and stable convergence.
  • It is guaranteed to converge to the global minimum for convex error surfaces, and to a local minimum for non-convex surfaces.


Disadvantages:

  • Since it uses the whole data set for every single update, computation becomes very slow when the data set is large.
  • Redundant examples that contribute little to the update are still re-evaluated at every step, which is time-consuming and wasteful.
  • It requires the whole training data set to be in memory and available to the algorithm.
  • On non-convex surfaces it may converge to a local minimum, which is not necessarily the global minimum.
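As a minimal sketch of the idea (the toy data, function name, and hyper-parameters here are invented for illustration, not taken from the article), batch gradient descent for least-squares linear regression can be written in a few lines of NumPy. Note that the gradient is computed from the *entire* data set before each update:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    """One update per epoch, computed from the ENTIRE data set."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # exact gradient of the mean squared error over ALL examples
        grad = 2.0 / len(X) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

# toy data: y = 3*x, with a bias column of ones appended to X
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x

w = batch_gradient_descent(X, y)
print(w)  # w is close to [3, 0]: a smooth, deterministic descent
```

Because every step uses the exact gradient, re-running this sketch always produces the same trajectory, which is the determinism and stability described above.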
Batch gradient descent: cost reduction w.r.t. the number of iterations. Source: [3]

2. Stochastic Gradient Descent


Advantages:

  • Frequent updates mean the model improves after every single example.
  • These frequent updates can, however, produce noisy gradients, which may cause the error to increase at times rather than decrease.
  • Allows the use of large data sets, since only one example needs to be held in memory and processed at a time.


Disadvantages:

  • The frequent updates make each full pass over the data more computationally expensive.
  • Due to its stochastic (random) nature, the algorithm converges less regularly than batch gradient descent.
  • It may also end up in a local minimum rather than the global minimum.
  • The variance of the updates is large, because the objective function fluctuates heavily when evaluated on a single example at each step.
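For contrast with the batch version, here is a stochastic gradient descent sketch on the same kind of toy regression problem (again, the data and names are illustrative assumptions, not the article's own code). The only structural change is that each update comes from a single, randomly chosen example:

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.05, epochs=50, seed=0):
    """One update per EXAMPLE, visited in a random order each epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # noisy gradient estimate from ONE example only
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# toy data: y = 3*x, with a bias column of ones appended to X
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x

w = stochastic_gradient_descent(X, y)
print(w)  # w is close to [3, 0], but the path there is noisy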
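For contrast with the batch version, here is a stochastic gradient descent sketch on the same kind of toy regression problem (again, the data and names are illustrative assumptions, not the article's own code). The only structural change is that each update comes from a single, randomly chosen example:

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.05, epochs=50, seed=0):
    """One update per EXAMPLE, visited in a random order each epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # noisy gradient estimate from ONE example only
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# toy data: y = 3*x, with a bias column of ones appended to X
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x

w = stochastic_gradient_descent(X, y)
print(w)  # w is close to [3, 0], but the path there is noisy
```

The single-example gradient is a high-variance estimate of the true gradient, which is exactly why the cost curve for SGD fluctuates rather than decreasing monotonically.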
Cost function with respect to the number of iterations. Source: [3]

3. Mini-Batch Gradient Descent


Advantages:

  • It is less erratic than stochastic gradient descent, because each gradient is averaged over a small batch of examples.
  • It leads to more stable updates and therefore more stable convergence.
  • It gets closer to the minimum than stochastic gradient descent.
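Completing the picture, a mini-batch sketch (same illustrative toy problem and invented hyper-parameters as above) simply averages the gradient over small slices of a reshuffled data set, trading off between the two previous extremes:

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.1, epochs=100,
                               batch_size=8, seed=0):
    """One update per small BATCH: cheaper than full batch, less noisy than SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # gradient averaged over the mini-batch
            grad = 2.0 / len(b) * X[b].T @ (X[b] @ w - y[b])
            w -= lr * grad
    return w

# toy data: y = 3*x, with a bias column of ones appended to X
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x

w = minibatch_gradient_descent(X, y)
print(w)  # w is close to [3, 0], with a smoother path than pure SGD
```

Averaging over a batch shrinks the variance of each gradient estimate roughly in proportion to the batch size, which is why the updates are more stable than SGD's while each step remains far cheaper than a full-batch pass.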
Cost function with respect to the number of iterations. Source: [3]
Gradient descent variants’ trajectory towards the minimum. Source: [6]







Department of Computer Science and Engineering, Mody University, Lakshmangarh.

Sunil kumar Jangir
