Demystifying Deep Learning Optimizers: Exploring Gradient Descent Algorithms (Part 1)

A Comprehensive Guide to Stochastic, Batch, and Mini-batch Gradient Descent

Hemant Rattey
Nerd For Tech
8 min readJul 16, 2023

--

Gradient Descent Intuition

Optimization algorithms are computational methods used to find the most optimal solution to any given problem. In the field of machine learning and deep learning, optimization algorithms are used to minimize some cost function based on the model parameters to perform optimally on the given task, such as regression and classification.

In this multi-part series, we will delve into various optimization techniques commonly used in deep learning and shed light on their inner workings and practical implications. For this article, we will look at Gradient Descent.

Now, let’s understand the core idea behind gradient descent and its role in the optimization process of deep learning models.

Gradient Descent

Gradient descent is the most basic and easily understandable optimization algorithm. It is commonly used in traditional ML models such as Linear Regression and Logistic Regression. But, for this series, we will focus on the working of Gradient Descent for Neural Networks.

The core idea behind Gradient Descent is to iteratively update model parameters until we reach a global minimum value of the cost function using gradients of the cost function w.r.t the model parameters. This is done by calculating the gradient of the cost function which represents the direction of the steepest ascent or descent. In the context of optimizing a deep learning model, the gradients are passed through the layer using back-propagation.

The gradient descent algorithm in neural networks functions in the following manner:

  • Initialize the model’s parameters (weights and biases) randomly: The process begins by assigning initial random values to the weights and biases of the neural network. These parameters determine how the inputs are transformed as they pass through the network’s layers.
  • Perform forward propagation: During this step, the input data is fed into the neural network, and computations are performed layer by layer. Each layer applies a linear transformation (dot product between the inputs and weights) followed by an activation function, which introduces non-linearity into the model. The output of each layer serves as the input for the subsequent layer until the final output or prediction is obtained.
  • Calculate the loss function: After obtaining the predicted output, it is compared with the actual output using a suitable loss function. The choice of the loss function depends on the specific task at hand, such as mean squared error for regression or cross-entropy for classification. The loss function quantifies the discrepancy between the predicted and actual outputs.
  • Compute the gradients: Backpropagation is used to calculate the gradients of the loss function with respect to each parameter in the neural network. The gradients represent the direction and magnitude of the steepest ascent or descent of the loss function. The process involves recursively applying the chain rule to compute the partial derivatives of the loss with respect to each parameter in each layer, starting from the final layer and propagating the gradients backward.
  • Update the parameters: The gradients obtained in the previous step indicate the direction in which the parameters need to be adjusted to minimize the loss function. The parameters are updated iteratively by taking steps in the opposite direction of the gradients, scaled by a learning rate hyperparameter. The learning rate determines the step size taken in each iteration and is crucial for balancing convergence speed and stability. This is what the update looks like:
  • Repeat until stopping criteria met: The forward propagation, loss calculation, gradient computation, and parameter update steps are repeated for a predefined number of iterations or until a convergence criterion is reached. Convergence is typically determined by monitoring the change in the loss function or the parameters over successive iterations.

However, in real-world scenarios, the size of the dataset can be enormous, making it computationally expensive to compute gradients over the entire dataset. Therefore, gradient descent can be further categorized into different algorithms based on the size of the data used during each parameter update: stochastic gradient descent (SGD), batch gradient descent, and mini-batch gradient descent.

Stochastic Gradient Descent

In Stochastic Gradient Descent (SGD), the updates to the parameters are done after every training example. So, after each training sample, the loss is calculated and corresponding weights are updated. For example, if you have a training dataset of size 500 samples, then in each iteration updates will be made to the weights i.e. 500 times. And this is usually done for multiple epochs. So if you run the training for 5 epochs, it would lead to 2500 updates to the model parameters.

Key points to remember about Stochastic Gradient Descent:

  1. SGD updates the weights after each individual training sample. The update direction is influenced by the gradient of a single sample at each iteration. This introduces noise and randomness into the weight updates. Consequently, the loss curve for SGD exhibits more fluctuations and irregularities over epochs.
  2. Since the algorithm makes many updates to the model parameters in each iteration, it usually converges faster.
  3. SGD’s random sampling of training samples helps it escape shallow local minima and saddle points, by allowing it to explore the parameter space more thoroughly which leads to faster convergence. But, this randomness can also cause it to give sub-optimal solution because of oscillations around global minima if learning rate is not chosen appropriately.
  4. It is unable to take the advantage of vectorization due to updates being done after every training example.

Batch Gradient Descent

In Batch Gradient Descent, the updates to the parameters are made after seeing the whole training data i.e. neural network sees the full training data in each iteration before making any updates to the parameters. For example, if a training set of size 1,000 samples is used, then in 1 epoch only 1 update will be made to the weights. Hence, the number of times the weights are updated is equal to the number of epochs the model is trained for. If the model runs for 10 epochs, then the model parameters will be updated only 10 times.

Key points to remember about Batch Gradient Descent:

  1. The graph of loss vs epochs for batch gradient descent tends to be smoother compared to SGD because the weights are updated after computing the average gradient over the entire training dataset. This means that the update direction is determined by considering the overall trend of the gradients across all training samples.
  2. Compared to SGD, batch gradient descent can be slower in terms of convergence speed. It takes a complete pass through the entire dataset to update the weights once, which can require a large number of iterations to converge to the desired solution.
  3. For a convex function, it will give better and more accurate gradients as compared to SGD, but since neural networks are generally non-convex functions, batch gradient descent can get stuck in a local minima because it does not have the capability to overshoot.
  4. Batch gradient descent requires memory to store the entire training dataset. If the dataset is large, batch gradient descent may consume a substantial amount of memory. For very large datasets that do not fit into memory, batch gradient descent may not be a feasible solution.

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a middle point between Stochastic Gradient Descent and Batch Gradient Descent. In this version of gradient descent, training data is divided into smaller subsets or mini-batches. Each mini-batch contains a fixed number of training examples. The weights are updated after each mini-batch has been processed. For example, if the training data contains 1000 samples and the batch size is 50, then in each epoch, the weights will be updated 20 times. And if you run the algorithm for 5 epochs, you will have 100 updates to the model parameters.

Key things to remember about Mini-Batch Gradient Descent:

  1. Mini-batch gradient descent typically converges faster than batch gradient descent while offering better convergence properties compared to SGD. The mini-batches allow for more frequent weight updates than batch gradient descent, leading to faster convergence. At the same time, the mini-batches help to smooth out the noise and instability introduced by stochastic gradient descent, resulting in improved convergence compared to SGD.
  2. The choice of mini-batch size is an important hyperparameter that needs to be tuned. A smaller mini-batch size can lead to more noisy updates and slower convergence, while a larger mini-batch size may reduce the level of noise but may require more memory and computational resources.
  3. Mini-batch gradient descent strikes a balance between exploration and exploitation of the parameter space. The mini-batches introduce some randomness and diversity in each iteration. This randomness enables the algorithm to explore different regions of the optimization landscape, helping it to escape shallow local minima and navigate through saddle points.

Limitations

While gradient descent algorithms, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, are widely used for training neural networks, they also have some common disadvantages:

  1. Learning rate sensitivity: Gradient Descent in general is very sensitive to the choice of learning rate. If the learning rate is too large, the updates may overshoot the optimal solution, leading to instability and the algorithm failing to converge. On the other hand, if the learning rate is too small, the updates may be too conservative, resulting in slow convergence or getting stuck in suboptimal solutions.
  2. Sensitivity to Initialization: The initial values of the model parameters in neural networks can significantly impact the convergence and performance of gradient descent algorithms. Poor initialization choices may lead to slow convergence or getting stuck in poor solutions.
  3. Local Minima and Saddle Points: Neural networks with multiple parameters and complex architectures can have high-dimensional optimization landscapes. Gradient descent algorithms may encounter challenges when navigating these landscapes, including getting stuck in suboptimal local minima or slow convergence in saddle points.

Next Steps

In this article, I went through a brief discussion on the first optimization algorithm which is gradient descent. We discussed the different versions of gradient descent and what are the limitations of gradient descent in general. In the next articles, we will discuss other optimization algorithms that try and overcome the limitations of gradient descent.

This article is supposed to be a place where I can come and revise my concepts quickly without having to search online for answers. This article is also meant to help people who are just starting in the field of deep learning.

Please feel free to give me feedback regarding the same.

Come say Hi! to me on Twitter.

--

--

Hemant Rattey
Nerd For Tech

Data Scientist | Writing about Deep Learning and NLP | Portfolio: hemantrattey.github.io