Gradient Descent Part 2: The Math

Harshwardhan Jadhav
Published in Analytics Vidhya · 5 min read · Jan 12, 2021

Inside the mathematics of Gradient Descent.

In the first part of this article, we built an intuitive understanding of Gradient Descent and covered some of the concepts required for its mathematical treatment. In this article, we are going to dive into the mathematical details of Gradient Descent. If you have not read the first, intuitive part of this blog, I suggest you read it first by clicking here.

Outline of this blog:

  1. What is optimization in machine learning?
  2. How can we perform optimization using Gradient Descent?

We are ready to go now,

1. What is optimization in machine learning?

You must be wondering why we are reading about optimization. The reason is simple: Gradient Descent is an optimization algorithm, and to understand it we must first know what optimization is. In simple terms, optimization is "the action of making the best or most effective use of a situation or resource." But in terms of Machine Learning, what are those situations or resources? We use machine learning algorithms to perform predictive tasks, and every algorithm has a loss function to measure its performance on such a task. This loss function tells us how well the machine learning algorithm is predicting the desired output: the lower the loss, the better the algorithm. So the aim of any machine learning task is to achieve minimum loss and thus maximum accuracy. Loss is basically an error measurement and is also known as the cost function; we can think of it as a function that calculates the difference between the desired output and what our model predicts.

The goal of optimization is to find the parameters of a particular algorithm that minimize the loss function. We have already seen what minima and maxima are in the previous blog; optimization tries to find the minimum of the loss function, i.e. the value of 'x' for which we get the minimum value of 'f(x)'.
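To make this concrete, here is a minimal sketch of finding such a minimum numerically. The function f(x) = (x - 3)² is a hypothetical example chosen for illustration; its minimum is clearly at x = 3, and repeatedly stepping against the derivative recovers exactly that value.

```python
# Minimal sketch: minimize f(x) = (x - 3)**2 numerically.
# The derivative is f'(x) = 2 * (x - 3), so each step moves x toward 3.
def grad(x):
    return 2 * (x - 3)

x = 0.0  # arbitrary starting point
r = 0.1  # learning rate (step size)
for _ in range(100):
    x = x - r * grad(x)

print(round(x, 4))  # -> 3.0, the minimum of f(x)
```

Notice that we never solve f'(x) = 0 analytically; we only follow the slope downhill, which is exactly what Gradient Descent does with a loss function.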

Cool, that was not very complicated.

2. How can we perform optimization using Gradient Descent?

As we discussed above, we build a machine learning model to perform a predictive task, so it always has a cost/loss function. Now in this section, we will see how we perform optimization using Gradient Descent. Look at the following beautiful graph,

Image Courtesy: https://paperswithcode.com/method/sgd

Here on the 'X' axis we have the 'w' variable, and on the 'Y' axis we have J(w), i.e. some function of 'w'.

Following are the steps to perform optimization:

First, initialize the parameters of the function J(w) i.e. set the value of ‘w’ randomly,

then, for each iteration calculate new ‘w’ by using the following formula,

w_new = w_old - r*(dJ(w)/dw)

and iterate until we find the minimum of J(w), i.e. until we get the value of 'w' that gives the minimum J(w), as shown in the graph above. Here (dJ(w)/dw) is the slope or gradient (the derivative of the loss function w.r.t. the parameter) and 'r' is the learning rate or the step size.

The learning rate or step size 'r' is a constant that scales how far we move along the gradient in each iteration: too small and convergence is slow, too large and the updates can overshoot the minimum and diverge.
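This effect of 'r' is easy to see on a toy loss. The function J(w) = w² below is a hypothetical choice (its gradient is simply 2w), and the two learning rates are illustrative values, not recommendations:

```python
# Sketch: how the learning rate 'r' affects convergence on J(w) = w**2,
# whose gradient is dJ/dw = 2 * w. The minimum is at w = 0.
def run(r, steps=50):
    w = 1.0
    for _ in range(steps):
        w = w - r * (2 * w)  # w_new = w_old - r * dJ/dw
    return w

small = run(0.1)  # each step shrinks |w| -> converges toward 0
large = run(1.1)  # each step overshoots and grows |w| -> diverges
print(abs(small) < 1e-3, abs(large) > 1e3)  # -> True True
```

With r = 0.1 each update multiplies w by 0.8, so it shrinks toward the minimum; with r = 1.1 each update multiplies w by -1.2, so it oscillates with growing magnitude.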

The math also is not that complicated, right? Now whatever we saw above we will see all that in the form of code,

weights = 0.01  # initial value; we can start from any random value
step_size = 0.001
for epoch in range(100):
    # evaluate_gradient is a placeholder for computing dJ/dw over the full dataset
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights = weights - step_size * weights_grad  # parameter update

Just like its mathematics, the code is also simple. Trust me, this is the core idea of optimization used in a lot of machine learning algorithms; if you understand this, you will be able to understand the other methods easily.

Okay, the Gradient Descent we saw above is also known as Batch Gradient Descent because it considers all the data points in the dataset in every iteration and updates the weights only after a full pass over the whole dataset. This is fine for small datasets, but for large datasets, considering all data points in every iteration becomes computationally very expensive.

So to resolve this problem, researchers came up with a nice variant of GD known as Stochastic Gradient Descent (SGD). In SGD, the only difference is that instead of computing the gradient over the whole dataset, we use one randomly chosen training example at a time, so a weight update happens for every data point rather than once per full pass. This makes each update much cheaper, even for large datasets. The rest is the same as Batch GD. Let's see the code for SGD,

weights = 0.01  # initial value; we can start from any random value
step_size = 0.001
for epoch in range(100):
    np.random.shuffle(data)  # visit the examples in a new random order each epoch
    for example in data:
        weights_grad = evaluate_gradient(loss_fun, example, weights)  # gradient from one example
        weights = weights - step_size * weights_grad

Well, there is another variant of GD: Mini-Batch Gradient Descent. Let's look at the code directly, and you can see for yourself how it differs from the other two variants.

weights = 0.01  # initial value; we can start from any random value
step_size = 0.001
batch_size = 50
for epoch in range(100):
    np.random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]  # one mini-batch of examples
        weights_grad = evaluate_gradient(loss_fun, batch, weights)
        weights = weights - step_size * weights_grad  # one update per mini-batch

I know you got the difference in this type of GD; yes, you are smart and guessed correctly that Mini-Batch Gradient Descent performs a weight update for every mini-batch of training examples, striking a balance between the stability of Batch GD and the cheap updates of SGD.
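To tie the pieces together, here is a fully runnable sketch of mini-batch gradient descent on a concrete loss. The data, the model y = w·x, and the squared-error loss are all hypothetical choices for illustration; the update rule is exactly w_new = w_old - r·(dJ/dw) from above.

```python
import numpy as np

# Sketch: mini-batch gradient descent fitting y = 2*x with a squared-error
# loss. All data and numbers here are hypothetical, chosen for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x  # true relationship the model should recover

w = 0.01
step_size = 0.1
batch_size = 50
for epoch in range(100):
    idx = rng.permutation(len(x))  # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]  # indices of one mini-batch
        # gradient of the mean squared error (w*x - y)**2 w.r.t. w
        grad = np.mean(2 * (w * x[b] - y[b]) * x[b])
        w = w - step_size * grad  # one update per mini-batch

print(round(w, 3))  # -> 2.0, the true weight
```

Every update uses only 50 of the 200 points, yet the weight still converges to the true value, which is exactly why mini-batches scale to large datasets.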

Great, so far you have learned about the very common and the most important Optimization algorithm used in the field of Machine Learning.

So what's next? We should check how this optimization works in actual Machine Learning algorithms. I will cover that in a future blog for sure. Thank you so much for making it to the end. See you in the next blog.
