Gradient Descent
Optimization
Whether it is statistics, machine learning, or any other field, optimization is used everywhere.
In Linear Regression we optimize the Intercept (C) and Slope (m), in Logistic Regression we optimize the S-curve, and in t-SNE we optimize clusters. These are just a few examples of optimization; there are many more.
So how do we learn all of these optimization techniques? The good news is we don't have to: Gradient Descent by itself can help us optimize all of these and many more.
Gradient Descent
Imagine we are in a mountain range and we would like to go to the lowest valley.
Let’s look at some of the steps that will help us getting to the lowest valley,
1. Start from a random point.
2. Check all directions to determine which direction will help us descend.
3. Take a step in that direction.
4. Repeat steps 2 and 3 until no direction helps us descend.
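The steps above can be sketched as a small loop on a one-variable function; the function f(x) = (x − 3)², starting point, learning rate, and tolerance below are illustrative choices, not from the article:

```python
# Gradient descent on a simple one-variable function, f(x) = (x - 3)^2.
# The derivative f'(x) = 2 * (x - 3) points uphill, so we move against it.
# The start, learning rate, step budget and tolerance are illustrative.

def gradient_descent_1d(start, learning_rate=0.1, max_steps=1000, tol=1e-9):
    x = start  # 1. start from a (here: given) point
    for _ in range(max_steps):
        slope = 2 * (x - 3)           # 2. which way is downhill?
        step = learning_rate * slope
        x -= step                     # 3. take a step in that direction
        if abs(step) < tol:           # 4. stop when steps no longer help
            break
    return x

minimum = gradient_descent_1d(start=0.5)
```

For this bowl-shaped function there is only one valley, so the loop settles at x = 3 regardless of the starting point.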
Now, since we are in a mountain range there could be many valleys, and we may not be in the lowest one at this point. If we can see other valleys from this point, then we are at a local minimum. If we cannot see any other valleys from this point, then we are at the global minimum.
As we can see in the figure above, A and B are local minima but C is the global minimum.
Gradient Descent is an optimization method that adjusts the parameters or features to minimize the Cost (or Loss) function, driving it as close to 0 as possible. To optimize the parameters or features, it tries to find the best weights and biases associated with them.
Gradient simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning.
Gradient Descent calculates the Cost Function via Forward Propagation and updates the weights and biases via Backward Propagation.
The good thing about Gradient Descent is that once we learn to optimize one cost function, we have effectively learned how to optimize many others.
Concepts
Cost Function (Regression)
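The formula is not reproduced here; a common cost function for regression, assumed in this sketch, is the mean squared error:

```python
# Mean squared error (MSE): the average squared residual between actual
# and predicted target values. The article's own formula is not shown,
# so MSE is an assumption of this sketch.

def mse_cost(y_actual, y_predicted):
    n = len(y_actual)
    return sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted)) / n
```

For example, `mse_cost([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])` averages the squared residuals 0, 0.25 and 1.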
Gradient Descent
Slope
The slope is the derivative of a function.
Step
A step is the move Gradient Descent takes towards the local minimum.
Now that we know some of the concepts let’s take an example to understand how Gradient descent works.
Local Minima/Minimum
A local minimum of a function is a point where the function value is smaller than at nearby points, but possibly greater than at a distant point.
Global Minima/Minimum
A global minimum is a point where the function value is smaller than at all other feasible points.
Iterations
Iterations is the number of batches of data the algorithm has seen (one parameter update per batch).
Epochs
Epochs is the number of times a learning algorithm sees the complete dataset.
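The relationship between the two can be sketched with illustrative sizes:

```python
# One iteration = one batch (one parameter update); one epoch = one
# full pass over the dataset. The sizes below are illustrative.

dataset_size = 1000
batch_size = 100
epochs = 5

iterations_per_epoch = dataset_size // batch_size   # 10 updates per epoch
total_iterations = epochs * iterations_per_epoch    # 50 updates in total
```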
Gradient Descent or Batch Gradient Descent (GD)
Batch Gradient Descent performs its calculations over the full training set, which makes it very slow on very large training datasets.
We will use the Linear Regression as an example to understand the Gradient Descent steps.
The equation below represents a linear regression: y = m · x + C, where m is the Slope (weight) and C is the Intercept (bias).
Forward Propagation
During the forward propagation we will calculate the cost function.
Step 1: Calculate the predicted value
First, Gradient Descent chooses random initial values for the weights and biases. Assume,
Plugging these values into the regression equation, we will get our first predicted values.
These predicted values can be way off, but we don't need to worry, as Gradient Descent will take care of reducing this difference.
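A sketch of this step, assuming the linear form y = m · x + c; the data, seed, and initialization range below are illustrative, not from the article:

```python
import random

# Step 1 sketch: pick random initial values for the weight (m) and
# bias (c), then compute the first predictions with y_hat = m*x + c.
# The data, seed and init range are illustrative assumptions.
random.seed(42)

x = [1.0, 2.0, 3.0, 4.0]      # example feature values
m = random.uniform(-1, 1)     # random initial slope / weight
c = random.uniform(-1, 1)     # random initial intercept / bias

y_pred = [m * xi + c for xi in x]
```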
Step 2: Calculate the residual error and cost function
Using the initial predictions we will calculate the residual errors and the cost function.
At this point we have gone through all the records once and we have our initial cost function to improve on. This also marks the end of the first epoch.
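A sketch of this step, assuming the MSE cost; the actual and predicted values below are illustrative:

```python
# Step 2 sketch: residual error for each record (actual - predicted)
# and the mean squared error cost over all records. Assumes the MSE
# cost; the data values are illustrative.

y_actual = [2.0, 4.0, 6.0, 8.0]
y_pred = [1.5, 3.0, 4.5, 6.0]          # e.g. from the initial m and c

residuals = [ya - yp for ya, yp in zip(y_actual, y_pred)]
cost = sum(r ** 2 for r in residuals) / len(residuals)
```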
Backward Propagation
During the backward propagation we will calculate new weights and biases and update the old ones.
Step 3: Calculate Slope and Steps
To calculate the new weights and biases we will follow a few steps.
Step 3.1 Calculate Slope
The slope is the first derivative of the cost function. Since we are calculating new values for both the weights and the biases, we will calculate the slope with respect to each of these parameters.
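A sketch of these two slopes, assuming the MSE cost J(m, c) = (1/n) · Σ(y − (m·x + c))²; the data and parameter values in the check below are illustrative:

```python
# Step 3.1 sketch: partial derivatives (slopes) of the assumed MSE cost
#   J(m, c) = (1/n) * sum((y_i - (m*x_i + c))**2)
# with respect to the weight m and the bias c:
#   dJ/dm = (-2/n) * sum(x_i * (y_i - y_hat_i))
#   dJ/dc = (-2/n) * sum(y_i - y_hat_i)

def slopes(x, y, m, c):
    n = len(x)
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    d_m = (-2.0 / n) * sum(xi * r for xi, r in zip(x, residuals))
    d_c = (-2.0 / n) * sum(residuals)
    return d_m, d_c
```

At a perfect fit every residual is zero, so both slopes are zero and Gradient Descent stops moving.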
Step 3.2 Calculate Step
The step gives Gradient Descent the direction to move in to reach the local minimum:
- If the slope is positive, Gradient Descent steps in the negative direction.
- If the slope is negative, Gradient Descent steps in the positive direction.
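A sketch of the step: the slope scaled by a learning rate (the scaling factor is an assumption of this sketch; the article does not name it) and then subtracted from the parameter:

```python
# Step 3.2 sketch: step = learning_rate * slope, which is SUBTRACTED
# from the parameter. A positive slope therefore moves the parameter in
# the negative direction, a negative slope in the positive direction.
# The learning rate value is illustrative.

learning_rate = 0.01

def take_step(value, slope):
    return value - learning_rate * slope   # subtract the step

moved_down = take_step(5.0, 2.0)    # positive slope -> smaller value
moved_up = take_step(5.0, -2.0)     # negative slope -> larger value
```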
A local minimum of a function is a point where the function value is smaller than at nearby points, but possibly greater than at a distant point. A global minimum is a point where the function value is smaller than at all other feasible points.
Step 4: Calculate new weights and bias
Once we have the Slope and Step we will calculate the new values of the weights and bias.
Step 5: Update weights and biases
Now it's time to update the old weights and biases with the new values.
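Steps 4 and 5 can be sketched together; the old parameter values, slopes, and learning rate below are illustrative assumptions:

```python
# Steps 4 and 5 sketch: compute the new weight and bias by subtracting
# the step (learning_rate * slope) from the old values, then replace
# the old values. All numbers here are illustrative.

m, c = 0.5, 0.1            # old weight and bias
d_m, d_c = -3.0, -1.5      # slopes from Step 3.1
learning_rate = 0.01

m_new = m - learning_rate * d_m   # Step 4: new weight
c_new = c - learning_rate * d_c   # Step 4: new bias

m, c = m_new, c_new               # Step 5: update the old values
```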
Step 6: Repeat steps 1 to 5
This is how Batch Gradient Descent works. The thing to notice here is that this optimization technique considers the entire training dataset for every update, which can lead to slow performance.
To improve this performance we have a modified version of Batch Gradient Descent called Stochastic Gradient Descent.
Stochastic Gradient Descent (SGD)
The overall steps remain unchanged; the main difference is, of course, the number of data points or samples to go through before each update of the parameters: 1 in the case of SGD, and all of them in the case of GD.
In short, GD spans the entire dataset once (the same as one epoch) before each update, whereas SGD randomly takes just one data point for each update.
In Gradient Descent the weights and biases are updated at the end of each epoch, while in SGD they are updated after each randomly selected sample within an epoch is processed.
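The per-sample update can be sketched as follows; the data (a noiseless y = 2x relationship), learning rate, and epoch count are illustrative:

```python
import random

# SGD sketch: update the weight m and bias c after EACH randomly chosen
# sample, rather than after a full pass over the data. Assumes the MSE
# cost, so the one-sample slopes are -2*r*x and -2*r (r = residual).
random.seed(0)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]     # true relationship: y = 2x
m, c = 0.0, 0.0
lr = 0.02

for _ in range(500):                                  # epochs
    for i in random.sample(range(len(x)), len(x)):    # shuffled sample order
        r = y[i] - (m * x[i] + c)      # residual for this one sample
        m -= lr * (-2 * r * x[i])      # one-sample slope for m
        c -= lr * (-2 * r)             # one-sample slope for c
```

Because the data is noiseless, the per-sample updates settle at the exact line m = 2, c = 0; on real data they would jitter around the minimum instead.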
Batch Gradient Descent
If we have a dataset like the one below, with features X1 to Xn and target Y, the cost function is calculated using the entire dataset.
Stochastic Gradient Descent
If we have a dataset like the one below, with features X1 to Xn and target Y, the cost function is calculated using only one randomly selected record.
This makes SGD highly efficient and fast.
Mini Batch Gradient Descent
Mini Batch is similar to SGD, but here we take a batch of samples instead of a single random sample.
For example, if we have 20 training samples, we can use 5 samples for one forward pass to calculate the cumulative error and then adjust the weights and biases.
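A sketch with the article's 20 samples and batches of 5; the data (y = 2x), learning rate, and epoch count are illustrative:

```python
# Mini-batch sketch: accumulate the error over a batch of 5 samples,
# make one update, then move to the next batch (4 updates per epoch).
# Assumes the MSE cost; data and hyperparameters are illustrative.

x = [float(i) for i in range(1, 21)]    # 20 training samples
y = [2.0 * xi for xi in x]              # true relationship: y = 2x
batch_size = 5
m, c = 0.0, 0.0
lr = 0.001

for _ in range(1000):                                  # epochs
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        residuals = [yi - (m * xi + c) for xi, yi in zip(xb, yb)]
        d_m = (-2.0 / batch_size) * sum(xi * r for xi, r in zip(xb, residuals))
        d_c = (-2.0 / batch_size) * sum(residuals)
        m -= lr * d_m                                  # one update per batch
        c -= lr * d_c
```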
Gradient Descent with Momentum
In this optimization technique we use Exponentially Weighted Averages. What the added momentum does is smooth out the update curve.
An Exponential Moving Average (EMA) is a type of Moving Average (MA) that places a greater weight and significance on the most recent data points. The exponential moving average is also referred to as the exponentially weighted moving average.
We still use mini-batches and still calculate dw and db, the derivatives for the weights and biases, but we apply an exponentially weighted average to them:
V = β · V + (1 − β) · 𝛿
Where,
- V: Moving Average
- β: Momentum Constant
- 𝛿: Gradient
Now the new weights and biases can be calculated by stepping along the moving averages instead of the raw gradients, i.e. w = w − α · V_dw and b = b − α · V_db, where α is the learning rate.
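A sketch of this technique on a one-variable objective; β = 0.9 is a typical momentum constant, and the function f(w) = (w − 3)², learning rate, and iteration count are illustrative:

```python
# Gradient descent with momentum: keep an exponentially weighted moving
# average V of the gradients and update the parameter with V instead of
# the raw gradient. Objective: f(w) = (w - 3)^2, so df/dw = 2*(w - 3).

beta = 0.9    # momentum constant (typical value, assumed)
lr = 0.05     # learning rate (illustrative)
w = 0.0
v_dw = 0.0

for _ in range(300):
    dw = 2 * (w - 3)                        # raw gradient
    v_dw = beta * v_dw + (1 - beta) * dw    # V = beta*V + (1 - beta)*gradient
    w -= lr * v_dw                          # step along the smoothed gradient
```

The smoothing damps oscillations from noisy per-batch gradients while preserving the average direction of descent.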
Summary
I hope this article provides you with a good understanding of Gradient Descent.
If you have any questions or if you find anything misrepresented please let me know.
Credit: Andrew Ng
Thanks!