Gradient Descent
Optimization
Whether it is statistics, machine learning, or any other field, optimization is used everywhere.
In Linear Regression we optimize the Intercept (C) and Slope (m), in Logistic Regression we optimize the S-curve, and in t-SNE we optimize clusters. These are just a few examples of optimization; there are many more.
So how do we learn all of these optimization techniques? The good news is we don't have to: Gradient Descent by itself can help us optimize all of these and many more.
Gradient Descent
Imagine we are in a mountain range and we would like to go to the lowest valley.
Let’s look at some of the steps that will help us getting to the lowest valley,
1. Start from a random point.
2. Check all directions to determine which direction will help us descend.
3. Take a step in that direction.
4. Repeat steps 2 and 3 until no direction helps us descend.
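The steps above can be sketched as a small loop on a one-variable function; the function f(x) = (x − 3)², starting point, learning rate, and tolerance below are illustrative choices, not from the article:

```python
# Gradient descent on a simple one-variable function, f(x) = (x - 3)^2.
# The derivative f'(x) = 2 * (x - 3) points uphill, so we move against it.
# The start, learning rate, step budget and tolerance are illustrative.

def gradient_descent_1d(start, learning_rate=0.1, max_steps=1000, tol=1e-9):
    x = start  # 1. start from a (here: given) point
    for _ in range(max_steps):
        slope = 2 * (x - 3)           # 2. which way is downhill?
        step = learning_rate * slope
        x -= step                     # 3. take a step in that direction
        if abs(step) < tol:           # 4. stop when steps no longer help
            break
    return x

minimum = gradient_descent_1d(start=0.5)
```

For this bowl-shaped function there is only one valley, so the loop settles at x = 3 regardless of the starting point.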
Now, since we are in a mountain range there could be many valleys, and we may not be in the lowest one at this point. If we can see other valleys from this point, then we are at a local minimum. If we cannot see any other valleys from this point, then we are at the global minimum.
As we can see in the figure above, A and B are local minima but C is the global minimum.
Gradient Descent is an optimization method that adjusts the parameters or features to minimize the Cost (or Loss) function, driving it as close to 0 as possible. To optimize the parameters or features, it tries to find the best weights and biases associated with them.
Gradient simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning.
Gradient Descent calculates the Cost Function via Forward Propagation and updates the weights and biases via Backward Propagation.
The good thing about Gradient Descent is that once we learn to optimize one cost function, we have effectively learned how to optimize many others.
Concepts
Cost Function (Regression)
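The formula is not reproduced here; a common cost function for regression, assumed in this sketch, is the mean squared error:

```python
# Mean squared error (MSE): the average squared residual between actual
# and predicted target values. The article's own formula is not shown,
# so MSE is an assumption of this sketch.

def mse_cost(y_actual, y_predicted):
    n = len(y_actual)
    return sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted)) / n
```

For example, `mse_cost([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])` averages the squared residuals 0, 0.25 and 1.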
Gradient Descent
Slope
The slope is the derivative of a function.
Step
A step is the move Gradient Descent takes towards the local minimum.
Now that we know some of the concepts let’s take an example to understand how Gradient descent works.
Local Minima/Minimum
A local minimum of a function is a point where the function value is smaller than at nearby points, but possibly greater than at a distant point.
Global Minima/Minimum
A global minimum is a point where the function value is smaller than at all other feasible points.
Iterations
Iterations is the number of batches of data the algorithm has seen (one parameter update per batch).
Epochs
Epochs is the number of times a learning algorithm sees the complete dataset.
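The relationship between the two can be sketched with illustrative sizes:

```python
# One iteration = one batch (one parameter update); one epoch = one
# full pass over the dataset. The sizes below are illustrative.

dataset_size = 1000
batch_size = 100
epochs = 5

iterations_per_epoch = dataset_size // batch_size   # 10 updates per epoch
total_iterations = epochs * iterations_per_epoch    # 50 updates in total
```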
Gradient Descent or Batch Gradient Descent (GD)
Batch Gradient Descent performs its calculations over the full training set, which makes it very slow on very large training datasets.
We will use the Linear Regression as an example to understand the Gradient Descent steps.
The equation below represents a linear regression: y = m · x + C, where m is the Slope (weight) and C is the Intercept (bias).
Forward Propagation
During the forward propagation we will calculate the cost function.
Step 1: Calculate the predicted value
First, Gradient Descent chooses random initial values for the weights and biases. Assume,
Plugging these values into the regression equation, we will get our first predicted values.
These predicted values can be way off, but we don't need to worry, as Gradient Descent will take care of reducing this difference.
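A sketch of this step, assuming the linear form y = m · x + c; the data, seed, and initialization range below are illustrative, not from the article:

```python
import random

# Step 1 sketch: pick random initial values for the weight (m) and
# bias (c), then compute the first predictions with y_hat = m*x + c.
# The data, seed and init range are illustrative assumptions.
random.seed(42)

x = [1.0, 2.0, 3.0, 4.0]      # example feature values
m = random.uniform(-1, 1)     # random initial slope / weight
c = random.uniform(-1, 1)     # random initial intercept / bias

y_pred = [m * xi + c for xi in x]
```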
Step 2: Calculate the residual error and cost function
Using the initial predictions we will calculate the residual errors and the cost function.
At this point we have gone through all the records once and we have our initial cost function to improve on. This also marks the end of the first epoch.
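A sketch of this step, assuming the MSE cost; the actual and predicted values below are illustrative:

```python
# Step 2 sketch: residual error for each record (actual - predicted)
# and the mean squared error cost over all records. Assumes the MSE
# cost; the data values are illustrative.

y_actual = [2.0, 4.0, 6.0, 8.0]
y_pred = [1.5, 3.0, 4.5, 6.0]          # e.g. from the initial m and c

residuals = [ya - yp for ya, yp in zip(y_actual, y_pred)]
cost = sum(r ** 2 for r in residuals) / len(residuals)
```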
Backward Propagation
During the backward propagation we will calculate new weights and biases and update the old ones.
Step 3: Calculate Slope and Steps
To calculate the new weights and biases we will follow a few steps.
Step 3.1 Calculate Slope
The slope is the first derivative of the cost function. Since we are calculating new values for both the weights and the biases, we will calculate the slope with respect to each of these parameters.
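A sketch of these two slopes, assuming the MSE cost J(m, c) = (1/n) · Σ(y − (m·x + c))²; the data and parameter values in the check below are illustrative:

```python
# Step 3.1 sketch: partial derivatives (slopes) of the assumed MSE cost
#   J(m, c) = (1/n) * sum((y_i - (m*x_i + c))**2)
# with respect to the weight m and the bias c:
#   dJ/dm = (-2/n) * sum(x_i * (y_i - y_hat_i))
#   dJ/dc = (-2/n) * sum(y_i - y_hat_i)

def slopes(x, y, m, c):
    n = len(x)
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    d_m = (-2.0 / n) * sum(xi * r for xi, r in zip(x, residuals))
    d_c = (-2.0 / n) * sum(residuals)
    return d_m, d_c
```

At a perfect fit every residual is zero, so both slopes are zero and Gradient Descent stops moving.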
Step 3.2 Calculate Step
The step gives Gradient Descent the direction to move in to reach the local minimum:
- If the slope is positive, Gradient Descent steps in the negative direction.
- If the slope is negative, Gradient Descent steps in the positive direction.
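A sketch of the step: the slope scaled by a learning rate (the scaling factor is an assumption of this sketch; the article does not name it) and then subtracted from the parameter:

```python
# Step 3.2 sketch: step = learning_rate * slope, which is SUBTRACTED
# from the parameter. A positive slope therefore moves the parameter in
# the negative direction, a negative slope in the positive direction.
# The learning rate value is illustrative.

learning_rate = 0.01

def take_step(value, slope):
    return value - learning_rate * slope   # subtract the step

moved_down = take_step(5.0, 2.0)    # positive slope -> smaller value
moved_up = take_step(5.0, -2.0)     # negative slope -> larger value
```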
A local minimum of a function is a point where the function value is smaller than at nearby points, but possibly greater than at a distant point. A global minimum is a point where the function value is smaller than at all other feasible points.
Step 4: Calculate new weights and bias
Once we have the Slope and Step we will calculate the new values of the weights and bias.
Step 5: Update weights and biases
Now it's time to update the old weights and biases with the new values.
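Steps 4 and 5 can be sketched together; the old parameter values, slopes, and learning rate below are illustrative assumptions:

```python
# Steps 4 and 5 sketch: compute the new weight and bias by subtracting
# the step (learning_rate * slope) from the old values, then replace
# the old values. All numbers here are illustrative.

m, c = 0.5, 0.1            # old weight and bias
d_m, d_c = -3.0, -1.5      # slopes from Step 3.1
learning_rate = 0.01

m_new = m - learning_rate * d_m   # Step 4: new weight
c_new = c - learning_rate * d_c   # Step 4: new bias

m, c = m_new, c_new               # Step 5: update the old values
```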
Step 6: Repeat steps 1 to 5
This is how Batch Gradient Descent works. The thing to notice here is that this optimization technique considers the entire training dataset for every update, which can lead to slow performance.
To improve this performance we have a modified version of Batch Gradient Descent called Stochastic Gradient Descent.
Stochastic Gradient Descent (SGD)
The overall steps remain unchanged; the main difference is, of course, the number of data points or samples to go through before each update of the parameters: 1 in the case of SGD, and all of them in the case of GD.
In short, GD spans the entire dataset once (the same as one epoch) before each update, whereas SGD randomly takes just one data point for each update.
In Gradient Descent the weights and biases are updated at the end of each epoch, while in SGD they are updated after each randomly selected sample within an epoch is processed.
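The per-sample update can be sketched as follows; the data (a noiseless y = 2x relationship), learning rate, and epoch count are illustrative:

```python
import random

# SGD sketch: update the weight m and bias c after EACH randomly chosen
# sample, rather than after a full pass over the data. Assumes the MSE
# cost, so the one-sample slopes are -2*r*x and -2*r (r = residual).
random.seed(0)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]     # true relationship: y = 2x
m, c = 0.0, 0.0
lr = 0.02

for _ in range(500):                                  # epochs
    for i in random.sample(range(len(x)), len(x)):    # shuffled sample order
        r = y[i] - (m * x[i] + c)      # residual for this one sample
        m -= lr * (-2 * r * x[i])      # one-sample slope for m
        c -= lr * (-2 * r)             # one-sample slope for c
```

Because the data is noiseless, the per-sample updates settle at the exact line m = 2, c = 0; on real data they would jitter around the minimum instead.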
Batch Gradient Descent
If we have a dataset like the one below, with features X1 to Xn and target Y, the cost function is calculated using the entire dataset.
Stochastic Gradient Descent
If we have a dataset like the one below, with features X1 to Xn and target Y, the cost function is calculated using only one randomly selected record.
This makes SGD highly efficient and fast.
Mini Batch Gradient Descent
Mini Batch is similar to SGD, but here we take a batch of samples instead of a single random sample.
For example, if we have 20 training samples, we can use 5 samples for one forward pass to calculate the cumulative error and then adjust the weights and biases.
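A sketch with the article's 20 samples and batches of 5; the data (y = 2x), learning rate, and epoch count are illustrative:

```python
# Mini-batch sketch: accumulate the error over a batch of 5 samples,
# make one update, then move to the next batch (4 updates per epoch).
# Assumes the MSE cost; data and hyperparameters are illustrative.

x = [float(i) for i in range(1, 21)]    # 20 training samples
y = [2.0 * xi for xi in x]              # true relationship: y = 2x
batch_size = 5
m, c = 0.0, 0.0
lr = 0.001

for _ in range(1000):                                  # epochs
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        residuals = [yi - (m * xi + c) for xi, yi in zip(xb, yb)]
        d_m = (-2.0 / batch_size) * sum(xi * r for xi, r in zip(xb, residuals))
        d_c = (-2.0 / batch_size) * sum(residuals)
        m -= lr * d_m                                  # one update per batch
        c -= lr * d_c
```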
Gradient Descent with Momentum
In this optimization technique we use Exponentially Weighted Averages. What the added momentum does is smooth out the update curve.
An Exponential Moving Average (EMA) is a type of Moving Average (MA) that places a greater weight and significance on the most recent data points. The exponential moving average is also referred to as the exponentially weighted moving average.
We still use mini-batches and still calculate dw and db, the derivatives for the weights and biases, but we apply an exponentially weighted average to them:
V = β · V + (1 − β) · 𝛿
Where,
- V: Moving Average
- β: Momentum Constant
- 𝛿: Gradient
Now the new weights and biases can be calculated by stepping along the moving averages instead of the raw gradients, i.e. w = w − α · V_dw and b = b − α · V_db, where α is the learning rate.
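A sketch of this technique on a one-variable objective; β = 0.9 is a typical momentum constant, and the function f(w) = (w − 3)², learning rate, and iteration count are illustrative:

```python
# Gradient descent with momentum: keep an exponentially weighted moving
# average V of the gradients and update the parameter with V instead of
# the raw gradient. Objective: f(w) = (w - 3)^2, so df/dw = 2*(w - 3).

beta = 0.9    # momentum constant (typical value, assumed)
lr = 0.05     # learning rate (illustrative)
w = 0.0
v_dw = 0.0

for _ in range(300):
    dw = 2 * (w - 3)                        # raw gradient
    v_dw = beta * v_dw + (1 - beta) * dw    # V = beta*V + (1 - beta)*gradient
    w -= lr * v_dw                          # step along the smoothed gradient
```

The smoothing damps oscillations from noisy per-batch gradients while preserving the average direction of descent.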
Summary
I hope this article provides you with a good understanding of Gradient Descent.
If you have any questions or if you find anything misrepresented please let me know.
Credit: Andrew Ng
Thanks!