Mastering Gradient Descent: Optimizing Neural Networks with Precision.

om pramod
6 min read · Mar 10, 2024


Part 4: Types of Gradient Descent (Variants)

Batch gradient descent:

Batch Gradient Descent (BGD), also called vanilla gradient descent, calculates the gradient of the cost function using the entire dataset. Consider a dataset with ‘m’ observations: BGD uses all ‘m’ observations to compute the cost function ‘J’. One pass through the entire training set is called a training epoch. Because the whole training set is used for each gradient computation, the parameters are updated only once per epoch, after all instances of the dataset have been processed.

Here’s a step-by-step breakdown of how BGD works:

1. Calculate the cost for all training examples: The model makes predictions for all instances in the training set, and the cost (or error) of these predictions is calculated.

2. Compute the gradient of the cost function: This is done by taking the derivative of the cost function with respect to each parameter in the model.

3. Update the parameters: The gradient points in the direction of steepest ascent of the cost function, so the model parameters are updated in the direction opposite to the gradient.

These steps are repeated for a fixed number of iterations or until the cost function converges to a minimum. A significant amount of memory is needed to store the entire training set and the loss for each instance, which makes Batch Gradient Descent computationally intensive and time-consuming for large datasets. On the other hand, it often leads to more stable and reliable convergence to the minimum of the cost function.
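The steps above can be sketched in NumPy for a toy linear-regression problem. The data, learning rate, and epoch count here are illustrative assumptions, not values from the article; the point is that every update uses all ‘m’ examples at once.

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=100)

# Add a bias column so the intercept is learned as a weight.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # shape (m, 2)
theta = np.zeros(2)                            # parameters [bias, slope]
lr = 0.5                                       # learning rate (assumed)

for epoch in range(500):
    preds = Xb @ theta              # step 1: predictions for ALL m examples
    error = preds - y               #         and their errors
    grad = Xb.T @ error / len(y)    # step 2: gradient of the MSE cost
    theta -= lr * grad              # step 3: move opposite to the gradient

print(theta)  # should end up near [1, 2]
```

Note that `theta` changes only once per pass over the data, which is exactly the "one update per epoch" behaviour described above.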

Stochastic gradient descent:

Stochastic Gradient Descent (SGD) is a popular alternative to Batch Gradient Descent, and it operates a bit differently. Here’s how it works in one epoch:

1. Shuffle the dataset: First, we shuffle the dataset. This randomizes the order of updates so that they behave approximately like independent draws from the data.

2. For each instance in the dataset: We then loop over each instance in the dataset and feed it to the neural network:

a. Calculate the prediction: For the current instance, we calculate the prediction using the current model parameters.

b. Compute the error: We then compute the error of the prediction by comparing it to the actual output.

c. Calculate the gradients: We calculate the gradients of the loss function with respect to the model parameters. However, unlike in Batch Gradient Descent, we do this for the current instance only.

d. Update the model parameters: We update the model parameters in the direction of the negative gradient. The size of the step is determined by the learning rate. Each parameter update is known as an iteration.

3. Repeat: We repeat the process for a number of epochs until the model’s performance on the dataset is satisfactory.

SGD is faster per update than other variants of Gradient Descent such as Batch Gradient Descent, since each update uses only one example. It is also memory efficient, because it considers one observation at a time rather than the complete dataset.

Since each update is based on a single data point:

· the updates are noisy, and

· this introduces high variance in parameter updates, leading to oscillations in the convergence path.

This high variance and noise can slow convergence and make the optimization path more erratic and less smooth. As a result, SGD may not settle at the exact minimum and can sometimes end at a suboptimal solution. And while each individual update is computationally cheap, the overall optimization may require more iterations to reach convergence.
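The per-epoch loop described in steps 1–3 can be sketched as follows, on the same kind of hypothetical toy regression data (the data and learning rate are illustrative assumptions). Note how the gradient comes from one instance at a time:

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=100)
Xb = np.hstack([np.ones((100, 1)), X])  # bias column + feature

theta = np.zeros(2)
lr = 0.1  # learning rate (assumed)

for epoch in range(50):
    order = rng.permutation(len(y))  # step 1: shuffle each epoch
    for i in order:                  # step 2: one instance at a time
        pred = Xb[i] @ theta         # a. prediction for this instance
        err = pred - y[i]            # b. error for this instance
        grad = err * Xb[i]           # c. gradient from ONE example only
        theta -= lr * grad           # d. one iteration = one update
```

With 100 instances and 50 epochs, this performs 5,000 noisy updates; `theta` still lands near the true parameters, but each individual step jitters in a way the full-batch version does not.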

So far we’ve seen that if we use the entire dataset to calculate the cost function, it is known as Batch Gradient Descent, and if we use a single observation to calculate the cost, it is known as SGD.

Mini-batch gradient descent: It is a variation of the Gradient Descent algorithm that splits the training dataset into small batches, which are used to calculate model error and update model coefficients. Here’s how it works:

1. Divide the dataset into mini-batches: The entire training dataset is divided into multiple subsets, each containing a small number of instances. The size of these mini-batches is a hyperparameter that you can tune.

2. For each mini-batch: We then loop over each mini-batch.

a. Calculate the prediction: For the current mini-batch, we calculate the prediction using the current model parameters.

b. Compute the error: We then compute the error of the prediction by comparing it to the actual output.

c. Calculate the gradients: We calculate the gradients of the loss function with respect to the model parameters. However, unlike in Batch Gradient Descent, we do this for the current mini-batch only.

d. Update the model parameters: We update the model parameters in the direction of the negative gradient. The size of the step is determined by the learning rate.

3. Repeat: We repeat the process for a number of epochs until the model’s performance on the dataset is satisfactory.

Mini-Batch Gradient Descent is more computationally efficient than Batch Gradient Descent as it doesn’t need to use the entire dataset at each step. It offers a good balance between the stability of Batch Gradient Descent and the speed and memory efficiency of Stochastic Gradient Descent.
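The mini-batch loop above can be sketched as a middle ground between the two previous variants. As before, the toy data, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=100)
Xb = np.hstack([np.ones((100, 1)), X])

theta = np.zeros(2)
lr = 0.3         # learning rate (assumed)
batch_size = 16  # step 1: mini-batch size is a tunable hyperparameter

for epoch in range(200):
    order = rng.permutation(len(y))             # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]   # step 2: next mini-batch
        preds = Xb[idx] @ theta                 # a. predictions
        err = preds - y[idx]                    # b. errors
        grad = Xb[idx].T @ err / len(idx)       # c. gradient over the batch
        theta -= lr * grad                      # d. parameter update
```

Averaging the gradient over 16 instances smooths out much of SGD’s noise while still giving several updates per epoch, which is the balance described above.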


Bonus point —


Picture a basic 3D plot of the cost function against the parameters θ0 and θ1. Notice that there can be multiple local minimum values. Following the basic outline of optimization algorithms, our first step would be to randomly select parameter values, say θ0 = 0 and θ1 = 0 (the first initialization of parameters). By changing the parameter values in the direction where the cost decreases, we might reach local minimum 1. Our goal, however, is to reach the global minimum rather than a local minimum. Note that if we select different initial values, say θ0 = 0.5 and θ1 = 0.5, we might find some other local minimum (local minimum 2). One might ask, “Which minimum is correct?” This is a valid concern: the initial selection of parameter values can significantly impact which minimum is found. Ideally, the cost function should not have multiple minima. However, if we are given a cost function with multiple minima, the better initialization of parameters is the one that results in a lower value of the cost function.
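This initialization effect is easy to demonstrate in one dimension. The function below is a hypothetical example chosen because it has two minima; plain gradient descent started from two different points settles into two different valleys:

```python
def f_prime(x):
    # Derivative of f(x) = x**4 - 3*x**2 + x, which has two minima:
    # a global one near x = -1.30 and a local one near x = 1.13.
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent on f, starting from x."""
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

print(descend(-1.0))  # settles near -1.30 (global minimum)
print(descend(1.0))   # settles near  1.13 (local minimum)
```

Both runs follow exactly the same update rule; only the starting point differs, yet one finds the global minimum and the other gets trapped in a worse valley.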

Closing note — As we reach the end of Part 4, we’ve gained invaluable insights into the fundamentals of gradient descent optimization. Let this knowledge serve as a foundation for your continued exploration and experimentation in the realm of machine learning. Stay curious, stay innovative. Part 5 beckons, offering new horizons to explore. Until then, keep pushing boundaries, keep honing your skills, and let’s continue our journey towards optimization excellence together!

