Stochastic Gradient Descent (SGD)


Gajendra
3 min read · Feb 28, 2023

Stochastic Gradient Descent differs from Gradient Descent in when it updates the parameters. In SGD the parameters are updated after each sample, while in GD they are updated only after the entire dataset has been processed (i.e., once per epoch).

In Gradient Descent, the weights and biases are updated at the end of each epoch, while in SGD the weights and biases are updated after each randomly selected sample within an epoch is processed.
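To make the difference concrete, here is a minimal sketch, assuming a single-feature linear model with weight w and bias b, a squared-error loss, and a hypothetical gradient() helper (none of this code is from the original post):

```python
import random

# Hypothetical helper: gradient of the squared error for one sample (x, y)
# with respect to w and b, for the linear model y_pred = w * x + b.
def gradient(w, b, x, y):
    error = (w * x + b) - y
    return 2 * error * x, 2 * error          # (dLoss/dw, dLoss/db)

def gd_epoch(w, b, data, lr):
    # Gradient Descent: average the gradients over the WHOLE dataset,
    # then update the parameters once per epoch.
    gw = sum(gradient(w, b, x, y)[0] for x, y in data) / len(data)
    gb = sum(gradient(w, b, x, y)[1] for x, y in data) / len(data)
    return w - lr * gw, b - lr * gb

def sgd_epoch(w, b, data, lr):
    # Stochastic Gradient Descent: update after EVERY randomly picked sample.
    samples = list(data)
    random.shuffle(samples)
    for x, y in samples:
        gw, gb = gradient(w, b, x, y)
        w, b = w - lr * gw, b - lr * gb
    return w, b
```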

Example

Let's take an example of predicting a house price.

The ML model takes the house features as input and predicts the price.
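As a concrete stand-in for the model, here is a minimal sketch assuming a single-feature linear model that predicts the price from the house size (the feature choice and the toy numbers are purely illustrative):

```python
# Toy dataset: (size in square feet, price in $1000s) -- made-up numbers.
data = [(1000, 200), (1500, 300), (2000, 410), (2500, 500)]

def predict(w, b, size):
    # Simple linear model: predicted price = w * size + b
    return w * size + b
```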

Step 1: While running the 1st epoch, pick a random sample and calculate the predicted value and the error.

Note: In GD we calculate the MSE over the entire dataset before updating the weights and biases, since GD considers every sample before making an update.
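Continuing the toy example above, Step 1 for a single randomly chosen sample might look like this (the initial values of w and b are just assumed starting points):

```python
import random

w, b = 0.0, 0.0                        # assumed starting weight and bias

size, price = random.choice(data)      # pick one random sample
predicted = predict(w, b, size)        # predicted price for this sample
error = (predicted - price) ** 2       # squared error for this single sample
```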

Step 2: Calculate new weights and biases

2.1: Calculate Slope

The slope is the derivative of the cost function w.r.t. the weight or bias (see the code sketch after step 2.3).

2.2: Calculate Step

The step is the slope scaled by the learning rate; it tells gradient descent how far, and in which direction, to move toward a local minimum.

2.3: Calculate new weights and bias

Once we have the slope and the step, we calculate the new values of the weights and bias by subtracting the step from the current values.
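Continuing with the sample picked in Step 1, steps 2.1 through 2.3 might look like this in code (squared-error loss; the learning rate is a value we choose, here purely illustrative):

```python
lr = 1e-7          # learning rate (illustrative; depends on the feature scale)

# 2.1 Slope: derivative of the squared error w.r.t. the weight and the bias
slope_w = 2 * (predicted - price) * size
slope_b = 2 * (predicted - price)

# 2.2 Step: slope scaled by the learning rate; its sign gives the direction
step_w = lr * slope_w
step_b = lr * slope_b

# 2.3 New weight and bias: move against the slope
w = w - step_w
b = b - step_b
```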

Step 3: Backpropagation now updates the weights and biases with these new values.

Step 4: Now pick another random sample and calculate the error.

Repeat the process. Once we have reached the end of the dataset, the 1st epoch is complete.

Step 5: Repeat the above process for the remaining Epochs.

At the end of training, the cost function (the error) should be close to its minimum. This is how SGD minimizes the cost function.
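To see this on the toy example, we can compare the mean squared error over the dataset before and after running the sgd_epoch loop sketched earlier (all numbers are illustrative, not from the original post):

```python
def mse(w, b, data):
    # Mean squared error over the whole dataset
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

w, b = 0.0, 0.0
print("MSE before training:", mse(w, b, data))
for _ in range(100):                       # 100 epochs of SGD
    w, b = sgd_epoch(w, b, data, lr=1e-7)
print("MSE after training:", mse(w, b, data))
```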

Mini Batch Gradient Descent

Mini-batch gradient descent is similar to SGD, but instead of a single random sample we take a small batch of samples for each update.

For example, if we have 20 training samples, we can use 5 samples per forward pass to calculate the cumulative error and then adjust the weights and biases, giving 4 updates per epoch.
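A minimal sketch of the mini-batch variant under the same toy setup (batch size 5 to match the 20-sample example; a sketch, not a definitive implementation):

```python
import random

def minibatch_epoch(w, b, data, lr, batch_size=5):
    samples = list(data)
    random.shuffle(samples)
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        # Average the gradients over the batch, then update once per batch.
        gw = sum(2 * ((w * x + b) - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * ((w * x + b) - y) for x, y in batch) / len(batch)
        w, b = w - lr * gw, b - lr * gb
    return w, b
```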

I hope this article provides you with a basic understanding of Stochastic Gradient Descent.

If you have any questions or if you find anything misrepresented please let me know.

Thanks!

