Deep Learning Course — Lesson 6: Optimization Algorithms — Gradient Descent Variants

Machine Learning in Plain English
7 min read · May 27, 2023

When we train a neural network, our goal is to find the parameters (weights and biases) that minimize the loss function. This process is known as optimization. One of the most commonly used methods for optimization in training neural networks is called Gradient Descent. It’s an iterative approach that involves taking steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point in hopes of finding the minimum of the function.
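
To make this concrete, here is a minimal sketch (not from the original lesson) of gradient descent minimizing the simple function f(x) = (x - 3)^2, whose derivative is 2(x - 3); the starting point and learning rate are arbitrary illustrative choices:

```python
# Minimal gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.
def grad_f(x):
    return 2.0 * (x - 3.0)  # derivative of (x - 3)**2

x = 0.0              # initial guess for the parameter
learning_rate = 0.1  # step-size hyperparameter

for step in range(100):
    x = x - learning_rate * grad_f(x)  # step against the gradient

print(round(x, 4))  # approximately 3.0, the minimizer of f
```

Each iteration moves x a little further downhill, and the step shrinks as the gradient approaches zero near the minimum.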

Gradient Descent Variants

There are three main types of gradient descent, which differ in how much data we use to compute the gradient of the objective function:

Batch Gradient Descent

Batch Gradient Descent, also known simply as Gradient Descent, is a method used to find the minimum of a function. In the context of machine learning, this function is often a loss or cost function that measures how well the model fits the training data.

Here’s a step-by-step explanation of how Batch Gradient Descent works:

1. Initialization: We start by initializing the model parameters, often randomly. These parameters are the weights and biases in a neural network.

2. Compute Gradient: Next, we calculate the gradient of the loss function. The gradient is a vector that points in the direction of the steepest increase of the function. It is calculated by taking the derivative of the loss function with respect to each parameter. Because we’re using the entire dataset for this calculation, the resulting gradient is an average of all the individual gradients for each data point.

3. Update Parameters: We then update the parameters in the opposite direction of the gradient. This is because we want to minimize the loss function, so we need to go in the direction of steepest decrease. The size of the step we take is determined by the learning rate, a hyperparameter that we set beforehand. The formula for updating each parameter is: parameter = parameter - learning_rate * gradient.

4. Iterate: We repeat steps 2 and 3 for a set number of iterations, or until the parameters stop changing significantly (a minimal code sketch of these steps follows below).
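
Putting the four steps together, here is a minimal NumPy sketch of Batch Gradient Descent applied to linear regression with a mean-squared-error loss; the toy data and hyperparameters are made-up placeholders for illustration, not values from the lesson:

```python
import numpy as np

# Toy dataset for illustration: 100 samples, 3 features, made-up ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# 1. Initialization: random weights, zero bias.
w = rng.normal(size=3)
b = 0.0
learning_rate = 0.1

for iteration in range(500):
    # 2. Compute the gradient of the mean-squared-error loss over the ENTIRE dataset.
    error = X @ w + b - y
    grad_w = 2.0 * X.T @ error / len(X)  # average gradient w.r.t. the weights
    grad_b = 2.0 * error.mean()          # average gradient w.r.t. the bias

    # 3. Update parameters: parameter = parameter - learning_rate * gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 4. Iterate: the loop repeats steps 2 and 3 a fixed number of times.
```

Note that every update touches all 100 samples; with millions of samples, each single step would become expensive.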

The primary advantage of Batch Gradient Descent is its simplicity and, with a suitably small learning rate, guaranteed convergence to the global minimum if the loss function is convex, or to a local minimum if it is not. The primary disadvantage is that it can be slow and computationally intensive, particularly for large datasets, because it uses all the training data to compute the gradient at every step.

Additionally, because it follows the smooth, averaged gradient of the whole dataset, it can settle into shallow, non-optimal local minima; the noisier updates produced by smaller batches of data can help escape such minima and find deeper ones.

In practice, while Batch Gradient Descent is an important method to understand, Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent are more commonly used due to their better efficiency and performance, especially on larger datasets.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variation of the Gradient Descent algorithm that calculates the gradient and updates the model parameters using a single sample at each iteration, as opposed to the entire dataset in Batch Gradient Descent. Here’s a detailed breakdown of SGD:

  1. Initialization: Like in Batch Gradient Descent, we start by initializing the model parameters, typically randomly.
  2. Compute Gradient: Next, instead of calculating the average gradient using the entire dataset, we compute the gradient using only a single randomly chosen data point. This gradient computation is less accurate than the average gradient, but it is also far less computationally intensive.
  3. Update Parameters: We then update the parameters in the direction opposite to the computed gradient. The formula for updating each parameter is the same as in Batch Gradient Descent: parameter = parameter - learning_rate * gradient.
  4. Iterate: We repeat steps 2 and 3 for a set number of iterations, or until the parameters stop changing significantly (a minimal code sketch of these steps follows below).
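
Here is a minimal sketch of SGD, reusing the toy linear-regression setup from the batch example above; the only substantive change is that each update uses a single randomly chosen sample. The data and hyperparameters are again illustrative placeholders:

```python
import numpy as np

# Same toy linear-regression setup as in the batch sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# 1. Initialization.
w = rng.normal(size=3)
b = 0.0
learning_rate = 0.01

for iteration in range(5000):
    # 2. Compute the gradient from a SINGLE randomly chosen data point.
    i = rng.integers(len(X))
    error = X[i] @ w + b - y[i]
    grad_w = 2.0 * error * X[i]
    grad_b = 2.0 * error

    # 3. Update parameters with the noisy single-sample gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # 4. Iterate: individual steps are noisy, but on average they head downhill.
```

Each update is very cheap, but because it is based on one sample, consecutive steps can point in quite different directions.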

The advantage of SGD over Batch Gradient Descent is that it is much faster and can be used on large datasets that would not fit in memory all at once. In addition, because SGD’s updates are noisy (due to using only a single data point), it can jump out of shallow local minima, making it less prone to getting stuck in non-optimal solutions.

However, this same noisiness can also be a disadvantage, as it can cause the algorithm to bounce around the optimal solution, making it hard for SGD to settle down to the exact minimum. It might keep overshooting due to the random nature of the single sample chosen, which can lead to slower convergence.

To alleviate this issue and to get the benefits of both Batch Gradient Descent and SGD, another variant is often used: Mini-Batch Gradient Descent, which calculates the gradient and updates parameters using a small batch of data points rather than a single one.

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a variation of Gradient Descent that strikes a balance between Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset (as in Batch Gradient Descent) or a single sample (as in Stochastic Gradient Descent) to compute the gradient of the cost function, Mini-Batch Gradient Descent uses a subset of the data. Here’s a detailed breakdown:

  1. Initialization: As with the other methods, we start by initializing the model parameters, typically randomly.
  2. Compute Gradient: Next, instead of calculating the gradient using either the entire dataset or a single randomly chosen data point, we compute the gradient using a small batch of data points. The size of this batch, known as the “mini-batch size”, is a hyperparameter that we can tune. It is usually chosen to be a power of 2, such as 32, 64, or 128, because such sizes tend to map well onto the memory architecture of modern hardware.
  3. Update Parameters: We then update the parameters in the direction opposite to the computed gradient. The formula for updating each parameter is the same as in other gradient descent methods: parameter = parameter - learning_rate * gradient.
  4. Iterate: We repeat steps 2 and 3, working through the dataset mini-batch by mini-batch for a set number of passes, or “epochs”. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters; the data is typically shuffled at the start of each epoch so that the mini-batches differ from one epoch to the next (a minimal code sketch of this loop follows below).
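
The following sketch again reuses the toy linear-regression setup from the earlier examples; the mini-batch size of 32 and the other hyperparameters are illustrative choices, not values from the lesson:

```python
import numpy as np

# Same toy linear-regression setup as in the earlier sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# 1. Initialization.
w = rng.normal(size=3)
b = 0.0
learning_rate = 0.05
batch_size = 32          # the mini-batch size hyperparameter

for epoch in range(50):                       # 4. Iterate over epochs
    order = rng.permutation(len(X))           # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]

        # 2. Compute the average gradient over the mini-batch.
        error = Xb @ w + b - yb
        grad_w = 2.0 * Xb.T @ error / len(Xb)
        grad_b = 2.0 * error.mean()

        # 3. Update parameters with the mini-batch gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
```

Each gradient computation here is a small vectorized matrix operation, which is exactly what makes mini-batches efficient on modern hardware.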

Mini-Batch Gradient Descent offers a compromise between the efficiency of Stochastic Gradient Descent and the computational rigour of Batch Gradient Descent. By using mini-batches, we can exploit the vectorized hardware optimizations of modern CPUs and GPUs more effectively than SGD can, while still avoiding the need to process the whole dataset at once as in Batch Gradient Descent, which makes the method applicable to large datasets.

Mini-Batch Gradient Descent tends to converge faster than Batch Gradient Descent as updates to the model are made more frequently. In terms of convergence behavior, it tends to be less noisy than Stochastic Gradient Descent, making it easier to converge to the optimal solution, but is still noisy enough to have a good chance of escaping from shallow local minima.

Let’s test your understanding

  1. Which variant of Gradient Descent uses the entire dataset to compute the gradient of the cost function at each iteration?
    a. Stochastic Gradient Descent
    b. Mini-Batch Gradient Descent
    c. Batch Gradient Descent
    d. None of the above
  2. What is a primary disadvantage of Batch Gradient Descent?
    a. It doesn’t converge to the optimal solution
    b. It’s too fast
    c. It can be slow and computationally intensive, especially for large datasets
    d. None of the above
  3. In Stochastic Gradient Descent, how many data points are used to compute the gradient and update the parameters at each iteration?
    a. All data points
    b. A small batch of data points
    c. A single randomly chosen data point
    d. None of the above
  4. Why is the Mini-Batch Gradient Descent method often preferred in practice over the other two variants?
    a. It strikes a balance between the efficiency of SGD and the computational rigour of Batch Gradient Descent
    b. It doesn’t need to use vectorized hardware optimization
    c. It uses the whole dataset at once
    d. None of the above
  5. What does one “epoch” mean in the context of Mini-Batch Gradient Descent?
    a. Going through the entire dataset once
    b. Updating the model parameters once
    c. Completing one step in the direction of the gradient
    d. None of the above
  6. What is the primary advantage of Stochastic Gradient Descent (SGD) over Batch Gradient Descent?
    a. SGD is less computationally intensive and faster
    b. SGD is slower and more computationally intensive
    c. SGD uses more data to compute the gradient
    d. SGD uses less data to compute the gradient
  7. Which of the following methods is best suited for large datasets?
    a. Batch Gradient Descent
    b. Stochastic Gradient Descent (SGD)
    c. Mini-Batch Gradient Descent
    d. None of the above
  8. Why is it important to shuffle the dataset at each epoch while using Mini-Batch Gradient Descent?
    a. To ensure that each data point has the same chance of being selected
    b. To prevent the model from learning the order of the data
    c. Shuffling the data provides better accuracy
    d. Both a and b
  9. In Batch Gradient Descent, the size of the steps taken in the direction of the negative gradient is determined by __________.
    a. the loss function
    b. the learning rate
    c. the model parameters
    d. the dataset size
  10. Which of the following statements about Mini-Batch Gradient Descent is FALSE?
    a. Mini-Batch Gradient Descent tends to converge faster than Batch Gradient Descent.
    b. Mini-Batch Gradient Descent is computationally more efficient than Stochastic Gradient Descent.
    c. Mini-Batch Gradient Descent uses the entire dataset to compute the gradient.
    d. Mini-Batch Gradient Descent is less noisy than Stochastic Gradient Descent.
