Gradient Descent

Praveen Raj
4 min read · May 9, 2023


For a data scientist, it is essential to have a good grip on the gradient descent algorithm: it is widely used to optimize the objective function (loss function) of machine learning models such as regression and neural networks in order to learn their weights.

First, let’s break the term into its two parts.
Gradient: a slope.

Descent: the act of moving downward.

Introduction to Gradient Descent Algorithm

The gradient descent algorithm is an optimization algorithm used to minimize a function. In machine learning, this function is usually called the cost function or loss function.
It is the loss function that is optimized (minimized): gradient descent finds the values of the weights that make the loss as small as possible.
- The loss function measures the squared difference between the actual values and the predictions (for example, mean squared error).
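To make this concrete, here is a tiny sketch in Python (the numbers are invented purely for illustration) of how MSE is computed from actual values and predictions:

```python
# A toy MSE computation; the numbers are made up purely for illustration.
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.5, 8.0]

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```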

What is Gradient Descent?

The gradient of a function at any point gives the direction of steepest increase of the function at that point; moving in the opposite direction (the descent direction) decreases the function fastest.

Imagine a graph with the weight w on the horizontal axis and the Mean Squared Error (MSE) on the vertical axis.

The curve is bowl-shaped, which means the function is convex, so there is always a single global minimum. This is the point where the model produces the smallest possible Mean Squared Error.
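To see why the sign of the gradient tells us which way is downhill, here is a small sketch using an assumed convex curve f(w) = (w - 4)², whose minimum sits at w = 4:

```python
# For the convex curve f(w) = (w - 4) ** 2, the minimum sits at w = 4.
def derivative(w):
    return 2 * (w - 4)  # slope of f at the point w

print(derivative(1.0))  # -6.0: slope is negative, so downhill lies to the right
print(derivative(7.0))  #  6.0: slope is positive, so downhill lies to the left
# Either way, stepping AGAINST the sign of the derivative moves toward w = 4.
```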

The gradient descent algorithm is used to find this point. It is used all over the place in machine learning, not just for linear regression but also for training some of the most advanced neural network models in deep learning.

How Does Gradient Descent Work?

Step-1: Initialize the weights randomly

Step-2: Update the weights

a) We calculate the partial derivative of the MSE with respect to w.
b) We multiply this by alpha, the learning rate. Alpha lets us adjust how big our steps are going to be.

(The steps appear as arrows moving downhill in the bowl-shaped graph described earlier.)
c) We subtract this value from w; in other words, w := w - alpha * (∂MSE/∂w).
d) We go back to (a) and repeat until the MSE stops moving, which means the partial derivative of the MSE has become (practically) zero.
Note: you can also use other stopping conditions, such as a maximum number of iterations or a threshold on the change in MSE between consecutive iterations.

Here, alpha is the learning rate, typically a small value between 0 and 1.

The learning rate sets the size of the steps taken toward the minimum of the function; it is a hyperparameter that you choose rather than learn.

If the learning rate is too high, the algorithm will overshoot the minimum and keep bouncing from one side of the bowl to the other instead of converging.

If gradient descent is working properly, the cost function should decrease after every iteration. As for how many iterations it takes to reach the minimum, it could be 50, 50,000, or 5 million, depending on the task at hand.

Step-3: Keep updating the weights until the loss function stops decreasing, i.e. until it converges. A minimal end-to-end sketch of these steps is shown below.
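Putting the three steps together, here is a minimal sketch of the full loop for a one-weight linear model (y ≈ w·x). The data, the learning rate of 0.01, and the stopping threshold are all illustrative choices, not fixed parts of the algorithm:

```python
import random

# Made-up data that roughly follows y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.9, 9.2, 11.8]

w = random.uniform(-1, 1)  # Step-1: initialize the weight randomly
alpha = 0.01               # learning rate (a hyperparameter)

for _ in range(10_000):
    # Step-2a: partial derivative of the MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Steps 2b and 2c: scale by alpha and subtract from w
    w -= alpha * grad
    # Step-2d / Step-3: stop once the gradient is (almost) zero
    if abs(grad) < 1e-6:
        break

print(w)  # settles near 2.99 for this made-up data
```

For this particular data the loop stays stable only while alpha is small enough; raising it to roughly 0.14 or above makes every step overshoot, and the weight bounces further and further from the minimum, which is exactly the too-high learning rate behavior described above.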

Types of Gradient Descent

Batch Gradient Descent

At every iteration of gradient descent, we calculate the MSE (and its gradient) by iterating through all the data points in our dataset. This variant is called batch gradient descent.

Batch gradient descent works well when you have a small number of data points and a small number of features. But imagine you have hundreds or thousands of features and millions of data points: things get pretty slow in that scenario.
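As a rough sketch of what "all the data points per step" looks like with several features, here is a NumPy version; the dataset shape, the true weights, and the 500-iteration budget are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # 1,000 points, 3 features (invented)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
alpha = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient over ALL data points
    w -= alpha * grad

print(w)  # close to [1.5, -2.0, 0.5]
```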

Stochastic Gradient Descent

In this version, at each iteration we calculate the MSE (and its gradient) with only one data point. But since each update uses a single training example, when the number of training examples is large we need a lot of iterations in total to find good values for the weights.
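A minimal sketch of the stochastic variant, reusing the same made-up one-weight data from the earlier example; the only change is that each update uses the gradient from a single randomly chosen point:

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]  # same made-up data as before
ys = [3.1, 5.9, 9.2, 11.8]

w = 0.0
alpha = 0.01
for _ in range(5_000):
    i = random.randrange(len(xs))     # pick ONE training example at random
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]
    w -= alpha * grad

print(w)  # hovers near 3 rather than settling exactly, because of the noise
```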

Mini Batch gradient descent

In this version, at each iteration we calculate the MSE (and its gradient) on a subset of the data points, where the number of points in the subset is k and k < n.
This version is by far the most effective one in practice.
At each iteration we still use a good number of training examples, but not all of them, so each step can be processed faster. A matching sketch follows.
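Here is a sketch of the mini-batch variant with the same invented data; the batch size k = 2 out of n = 4 points is chosen only to keep the example tiny:

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]  # same made-up data; n = 4
ys = [3.1, 5.9, 9.2, 11.8]

w = 0.0
alpha = 0.01
k = 2                      # batch size k, with k < n (tiny, for illustration)
for _ in range(5_000):
    batch = random.sample(range(len(xs)), k)
    grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / k
    w -= alpha * grad

print(w)  # near 3, with less step-to-step noise than the single-point version
```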
