Difference between Batch Gradient Descent and Stochastic Gradient Descent

Hemakannanms
Sep 10, 2020

What is the Gradient Descent Method?

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).

It is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
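
To make the idea concrete, here is a minimal sketch in Python of gradient descent minimizing a simple one-variable function, f(x) = (x − 3)². The starting point, learning rate, and number of steps are illustrative choices, not values from any particular problem.

```python
# A minimal sketch of gradient descent minimizing f(x) = (x - 3)^2.
# The starting point, learning rate, and step count are illustrative choices.

def gradient_descent(lr=0.1, steps=50):
    x = 0.0                     # arbitrary initial value of the parameter
    for _ in range(steps):
        grad = 2 * (x - 3)      # derivative f'(x) of (x - 3)^2
        x -= lr * grad          # step in the direction opposite to the gradient
    return x

print(gradient_descent())       # prints a value very close to 3, the minimizer
```

Each step moves x a little in the direction opposite to the derivative, so x gradually approaches 3, the point where f is smallest.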

1. Batch Gradient Descent Method

This is a type of gradient descent that processes all the training examples for each iteration (each parameter update). If the number of training examples is large, batch gradient descent becomes computationally very expensive and is therefore not preferred; instead, we prefer to use stochastic gradient descent or mini-batch gradient descent.

How does it work?

As we need to calculate the gradient on the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don’t fit in memory. After initializing the parameters (say θ1 = θ2 = … = θn = 0) with arbitrary values, we repeatedly update them using the gradient of the cost function. For the usual mean-squared-error cost of linear regression, the update for each parameter θ_j is:

θ_j := θ_j − α · (1/m) · Σ_{i=1..m} ( h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · x_j⁽ⁱ⁾

where ‘m’ is the number of training examples and α is the learning rate.

  • If you have, say, 10k records, every single update requires reading all of them from disk, because they may not all fit in memory.
  • After computing the sum over all training examples in one iteration, we take one step (one parameter update).
  • Then we repeat this for every step.
  • This means it can take a long time to converge.
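
A small, hedged sketch of this procedure in Python (NumPy) is shown below; the function name, the synthetic data, alpha, and n_iters are illustrative assumptions, not part of the original article.

```python
import numpy as np

# A hedged sketch of batch gradient descent for linear regression with a
# mean-squared-error cost. Every parameter update uses ALL m training examples.

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)                 # theta1 = theta2 = ... = thetan = 0
    for _ in range(n_iters):
        preds = X @ theta               # h_theta(x) for every training example
        grad = X.T @ (preds - y) / m    # gradient summed over the whole batch
        theta -= alpha * grad           # one parameter update per full pass
    return theta

# Tiny usage example: a bias column of ones plus one feature, with y = 2 * x.
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0)
print(batch_gradient_descent(X, y))     # approaches [0, 2] for this data
```

Note that every iteration of the outer loop touches the entire dataset before the parameters move at all, which is exactly why the method becomes slow for large training sets.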

So, we will prefer to use other methods.

2. Stochastic Gradient Descent Method

This is a type of gradient descent that processes one training example per iteration, so the parameters are updated after each single example is processed. This makes every individual update much faster than in batch gradient descent. But when the number of training examples is large, processing only one example per update means the number of iterations becomes very large, which can be additional overhead for the system.

How does it work?

The first step of the algorithm is to shuffle (randomize) the whole training set. Then, to update the parameters, we use only one training example in every iteration to compute the gradient of the cost function.

As it uses only one training example in every iteration, this method scales better to larger datasets. In stochastic gradient descent, the individual updates are noisier and less accurate, but the computation is faster.
After initializing the parameters (say θ1 = θ2 = … = θn = 0) with arbitrary values, we calculate the gradient of the cost function for a single example and update every parameter θ_j as:

for i = 1 to m:
    θ_j := θ_j − α · ( h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · x_j⁽ⁱ⁾

where ‘m’ is the number of training examples.

  • Pick the first training example and update the parameters using only this example.
  • Then pick the second training example and update the parameters again, and so on for all ‘m’ examples.
  • Repeat such passes over the (shuffled) training set.
  • Continue until we get close to the global minimum.

Stochastic gradient descent never actually converges the way batch gradient descent does, but ends up wandering around some region close to the global minimum.
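
Below is a hedged Python sketch of this procedure for the same linear-regression cost; alpha, n_epochs, and the synthetic data are illustrative choices, not values from the article.

```python
import numpy as np

# A hedged sketch of stochastic gradient descent: the training set is shuffled,
# and the parameters are updated after EVERY single example.

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=100):
    m, n = X.shape
    theta = np.zeros(n)                            # theta1 = ... = thetan = 0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):               # randomize the training set
            pred = X[i] @ theta                    # prediction for ONE example
            theta -= alpha * (pred - y[i]) * X[i]  # update from that example only
    return theta

# Same tiny synthetic data as before: y = 2 * x with a bias column of ones.
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0)
print(stochastic_gradient_descent(X, y))           # settles near [0, 2]
```

Because each update is based on a single example, the path of the parameters is noisy, which is the “wandering” behavior described above.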

Simply put, we can summarize the differences in the table below:

| | Batch Gradient Descent | Stochastic Gradient Descent |
| --- | --- | --- |
| Examples used per update | All ‘m’ training examples | One training example |
| Cost per update | Expensive for large datasets (full pass over the data) | Cheap (a single example) |
| Convergence | Smooth; moves steadily toward the minimum | Noisy; wanders around a region near the minimum |
| Preferred when | The dataset is small enough to process in full | The dataset is large |