A ‘run down’ on Gradient Descent

Pinaki Brahma
6 min read · Jun 21, 2024


Intuition, mathematics, and code for Gradient Descent. Know this well and you are ready for anything the ML world throws at you. Well, almost anything!

Image: Mr. BaaBaa is sliding down the hill to reach the lake at the bottom. (AI generated image based on author’s prompt)

For a long time, I have wanted to write about the most important algorithm out there in ML. For some reason, I didn't. Last week, while speaking to some of my mates, I realized how critical it is to revisit these concepts, whether we want to understand the internals of an ML or a DL algorithm. Gradient Descent is the cornerstone of Machine Learning and there is no escaping it. Not that I would want to! And you are in the right place if you want to know about Gradient Descent: its intuition, the mathematics behind it, and the code to implement it from scratch.

Intuition

In Machine Learning, we often estimate the solution by studying the pattern in the data, rather than actually solving it. How do we do that? You can think of it as an iterative approach where you make a smart guess of the solution and check how far off it is from the actual answer. Next, tweak your solution a bit to get closer to the original answer. Repeat this process a few times and there you have it — an optimal solution for the problem at hand, without actually solving it mathematically.
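In code, this guess-check-tweak loop looks something like the following toy example (the target number and step size here are made up purely for illustration, not taken from the post):

```python
# Toy guess-and-tweak loop: iteratively approach an unknown target value.
target = 10.0   # the "actual answer" we are trying to reach
guess = 0.0     # start with a (not very) smart guess
step = 0.1      # how strongly each tweak moves the guess

for _ in range(100):
    error = target - guess   # how far off are we?
    guess += step * error    # tweak the guess toward the answer

print(round(guess, 2))  # 10.0
```

Each pass shrinks the remaining error by a constant fraction, which is exactly the flavour of convergence we will see in gradient descent below.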

Image: Intuition behind gradient descent

Definition

Gradients are nothing but slopes in a multi-axis system. The algorithm helps us descend the slope of the loss function to approach its global minimum, and hence converge to a near-optimal solution. Thus the name: Gradient Descent.

Formal definition: Gradient Descent is an optimization algorithm used to minimize the cost or loss function in various machine learning models. By iteratively adjusting the model’s parameters in the direction of the steepest descent, gradient descent helps find the optimal set of parameters that minimize the loss function, thus improving the model’s performance.

Why is it essential?

Gradient Descent is a cornerstone in machine learning and optimization and is used across algorithms including simple linear regression, clustering, as well as complex deep learning models. It is vital as it provides an efficient, scalable, and adaptable method for optimizing machine learning models, forming the basis for many advanced optimization techniques.

Many advanced optimization techniques, such as those used in neural network training (e.g., Adam, RMSProp, AdaGrad), build upon the basic principles of gradient descent, enhancing its effectiveness and speed.

Mathematical Foundation

From a set of data points, how do we estimate a function that represents it? We can start with a random guess of the different parameters! Let's say we have only one feature that influences the target variable. To make this concrete, assume that the weight of a person is influenced by their height. So the target (or dependent) variable is weight, while the feature (or independent) variable is height. We also assume a linear relationship between the two variables.

Eq 1: weight = w0 + w1 · height, where the intercept (w0) and the coefficient (w1) are the parameters
Eq 2: y = f(X) = w0 + w1 · X (Eq 1 generalized to represent y as a function of X)

Steps:

  1. Start with a random guess for both w0 & w1.
  2. Calculate the predicted y (ŷ) using Eq 2 by substituting the w0 and w1 from Step 1.
  3. Calculate the loss, i.e., how far off the calculated ŷ is from the actual y.
Eq 3: Loss function, L = (1/N) · Σᵢ (yᵢ − ŷᵢ)²

4. Calculate the slope (or gradient) of the loss function. We do this to find the direction in which to update the parameters w0 and w1 to get closer to the actual y.

To calculate the slope, we would need to take the (partial) derivative of Eq 3 w.r.t. both w0 and w1. Remember that we would need to go down the hill by changing both the parameters.

Eq 4A: ∂L/∂w0 = −(2/N) · Σᵢ (yᵢ − ŷᵢ), the derivative of the loss function w.r.t. w0
Eq 4B: ∂L/∂w1 = −(2/N) · Σᵢ xᵢ · (yᵢ − ŷᵢ), the derivative of the loss function w.r.t. w1

5. Now that we know the slope, it is time to update w0 and w1 simultaneously. We keep the progression smooth by introducing a learning rate, α (generally initialized at 0.01).

Eq 5A: w0 := w0 − α · ∂L/∂w0 (update w0)
Eq 5B: w1 := w1 − α · ∂L/∂w1 (update w1)

6. Repeat Step 2 onwards (calculation of loss) till you meet either of these stopping criteria:

  • loss is less than a threshold
  • number of iterations is equal to a threshold
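The six steps above can be sketched end to end in a few lines of NumPy (a minimal sketch with made-up data; the function and variable names are mine, separate from the loop-based implementation later in the post):

```python
import numpy as np

def gd_step(x, y, w0, w1, alpha=0.01):
    """One gradient-descent update of intercept w0 and slope w1 (Steps 2-5)."""
    n = x.shape[0]
    y_hat = w0 + w1 * x                       # Step 2: predict with current guess
    loss = np.mean((y - y_hat) ** 2)          # Step 3: mean squared loss (Eq 3)
    dw0 = -2.0 / n * np.sum(y - y_hat)        # Step 4: gradient w.r.t. w0
    dw1 = -2.0 / n * np.sum(x * (y - y_hat))  # Step 4: gradient w.r.t. w1
    w0 = w0 - alpha * dw0                     # Step 5: update both parameters
    w1 = w1 - alpha * dw1
    return w0, w1, loss

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x                             # made-up data: true w0=2, w1=3
w0, w1 = 0.0, 0.0                             # Step 1: initial guess
for _ in range(5000):                         # Step 6: repeat until a stopping
    w0, w1, loss = gd_step(x, y, w0, w1, alpha=0.05)
    if loss < 1e-10:                          # criterion is met
        break
```

On this tiny noiseless dataset the loop recovers the true intercept and slope to several decimal places well before the iteration cap.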

Types of Gradient Descent

Think about it. The vanilla GD algorithm updates the model parameters (slightly) only after an entire epoch, i.e., after we pass the full dataset through it. This is sub-optimal, given that datasets can run to millions of rows. We can do better (in terms of how frequently the parameters get updated) by updating the model parameters after passing through a much smaller sample of rows. This is called Mini-Batch Gradient Descent.

Similarly, for different use cases, we have a few variants of GD:

  • Batch gradient descent (parameters are updated after one full pass)
  • Stochastic gradient descent (parameters are updated after every sample)
  • Mini batch gradient descent (parameters are updated after a set of samples)
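Here is a hedged sketch of the mini-batch variant (the data, seed, and hyperparameters are mine, not from the post): shuffle the rows each epoch, then update the parameters once per small batch rather than once per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=200)  # noisy line: w0=2, w1=3

w0, w1 = 0.0, 0.0
alpha, batch_size = 0.01, 32

for epoch in range(300):
    order = rng.permutation(len(x))            # reshuffle rows every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch of rows
        xb, yb = x[idx], y[idx]
        y_hat = w0 + w1 * xb
        # gradients are computed on this batch only, so the parameters are
        # updated several times per epoch instead of once
        w0 -= alpha * (-2.0 * np.mean(yb - y_hat))
        w1 -= alpha * (-2.0 * np.mean(xb * (yb - y_hat)))
```

Setting batch_size to the full dataset recovers batch gradient descent, and batch_size of 1 gives stochastic gradient descent, so all three variants share this one loop structure.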

Python Implementation

Here is the core function that calculates the gradients and applies one parameter update.

def gradient_descent(x, y, w:float, b:float, alpha:float=0.01):
    '''
    optimize parameters w, b using gradient descent
    '''

    dldw = 0
    dldb = 0
    N = x.shape[0]

    # loss = (y - yhat)**2
    # or, loss = (y - (w*x + b))**2
    for xi, yi in zip(x, y):
        dldw += -2 * xi * (yi - (w*xi + b))
        dldb += -2 * (yi - (w*xi + b))

    w = w - (alpha * (1/N) * dldw)
    b = b - (alpha * (1/N) * dldb)

    return w, b

We will also need some boilerplate code to make this work. Here is the chunk that calls the above function.

# libraries
import numpy as np

# synthetic data (the original inputs are not shown; any two 1-D arrays of
# equal length will do here)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * X + rng.normal(0, 1, size=100)

w = 0.0 # np.random.randn()
b = 0.0 # np.random.randn()
alpha = 0.01

losses = []
w_list = []
b_list = []
for epoch in range(400):
    w, b = gradient_descent(X, y, w, b, alpha)
    yhat = w*X + b
    loss = np.sum((y - yhat)**2) / X.shape[0]
    losses.append(loss)   # loss, w and b are scalars, so append (not extend)
    w_list.append(w)
    b_list.append(b)
    if epoch % 40 == 0: # print every 40th epoch
        print(f"{epoch} epoch: loss is {loss} | w={w} | b={b}")

Please note that this implementation is for batch gradient descent. You can try to tweak the code to create a version for mini-batch gradient descent.

Conclusion

If you have read this far, I am sure you have learnt a few useful things about Gradient Descent. I am confident that you will feel more at home the next time you come across more advanced optimization algorithms like Adam. This should give a serious boost to your learning, even in the domain of deep learning.
