The difference between Batch Gradient Descent and Stochastic Gradient Descent
[WARNING: TOO EASY!]
Let’s start with the simplest example, which is Linear Regression.
In Machine Learning, the cost function is always the first thing we look at.
Below are the linear regression cost function and its derivative; m indicates the number of training data points.
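In standard notation (assuming the usual mean-squared-error formulation with hypothesis h_θ(x) = θᵀx), they are:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}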
Now, let’s review the Gradient Descent algorithm.
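In standard notation (α is the learning rate, and all θ_j are updated simultaneously), it is usually written as:

repeat until convergence:
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \quad \text{for every } j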
What does the algorithm above say?
It says that Gradient Descent boils down to two steps. First, figure out the gradient of the cost function J. Then, update the parameters by subtracting the gradient, multiplied by a learning rate, from the current values of the parameters, theta. In code, the loop looks like the sketch below.
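Here is a minimal Python sketch of that loop (the names gradient_descent and grad_J are mine, not from the article; grad_J stands in for whatever computes the gradient of J):

def gradient_descent(theta, grad_J, alpha=0.01, num_iters=1000):
    for _ in range(num_iters):
        gradient = grad_J(theta)          # Step 1: gradient of the cost J
        theta = theta - alpha * gradient  # Step 2: update the parameters
    return theta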
The first step in calculating the gradient of the cost function is to sum the error term over every sample. If we have 3 million samples, we either loop through all 3 million of them or vectorize the sum with a dot product.
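Here is a minimal NumPy sketch of that batch gradient (the names are mine; it assumes X is an (m, n) matrix of samples, y an (m,) vector of targets, and theta an (n,) parameter vector). The sum over all m samples collapses into a single dot product, X.T @ errors, so no explicit 3-million-iteration loop is needed:

import numpy as np

def batch_gradient(theta, X, y):
    # h_theta(x^(i)) - y^(i) for every sample at once
    errors = X @ theta - y
    # Sum over all m samples via a dot product, then average
    return (X.T @ errors) / X.shape[0]

# Toy usage, plugged into the loop sketched above: fit y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(np.zeros(1),
                         lambda t: batch_gradient(t, X, y),
                         alpha=0.05, num_iters=200)
print(theta)  # converges toward [2.0]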

