Gradient Descent from Scratch Using NumPy Only (Linear Regression and Quantile Regression Examples)

6 min readMay 11, 2023

We have been using lots of fancy packages like Tensorflow and PyTorch. It’s time to anatomize how gradient descent can be built from scratch.

This topic has been profoundly and broadly gone through for years. I will not dive too deep into the theory. But talk about how to actualize in Python coding from scratch using NumPy.

Content

Introduction
Gradient Descent in a Nutshell
Linear Regression
Quantile Regression (Smooth Version)

Wait a minute. What is the point of knowing this?

One of the reasons is that it will be helpful if you know the technique to build/customize the script when you make your library or your function from scratch and you need the optimization technique. The other reason is that, personally, it provides a chance to practice calculus and programming. And in the learning process, I discover a technique to handle the function that is not smooth or has no gradient. :D

Gradient Descent in a Nutshell

The reason for using gradient descent is that it can help you to search the optimal point when you can’t find the optimal solution analytically (such as there is no closed-form solution for LASSO).

Minimize a quadratic function by gradient descent. — Image from author

Initialize the parameters: Choose an initial set of parameters for the function you want to optimize.
Define the loss function: Define a loss function that measures how well the function fits the data you’re working with. The goal of gradient descent is to minimize this loss function.
Calculate the gradient: Calculate the gradient of the loss function with respect to the parameters. The gradient is a vector that points in the direction of the steepest increase in the loss function.
Update the parameters: Adjust the parameters of the function in the direction of the negative gradient, i.e., the direction of the steepest decrease in the loss function. This is done by subtracting the gradient vector multiplied by a learning rate from the current parameter values.
Repeat: Repeat steps 3 and 4 until the loss function converges to a minimum or a stopping criterion is met.

Update the parameter θ in each iteration by its gradient; α is the learning rate. — Image from author

Linear Regression

Initialize the true linear model and estimate the beta directly.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# exploratory variables including intercept term
X = np.c_[np.ones(100), np.random.normal(0, 10, (100, 3))]

# true coefficients
b = np.random.normal(0, 3, 4) 

# error term assuming e ~ N(0, sigma2)
e = np.random.normal(0, 3, 100)

# observed data y
y = X @ b + e

# explicit solution of beta for comparison
beta_ols = np.linalg.inv(X.T @ X) @ X.T @ y

Gradient Descent Method

Instead, I used the matrix form to calculate the gradient in each iteration. In each iteration, each data point contributes to the gradient and the beta is updated after calculating the gradient of the beta.

# initialize the beta (b_hat) and parameters
np.random.seed(1)
b_hat = np.random.normal(0, 1, 4)
num_iter = 10000
learning_rate = 0.00001

# update the beta in each iteration
for j in range(num_iter):
    gradient = np.zeros_like(b)

    for i in range(X.shape[0]):
        gradient += (
            -2 * (y[i] - X[i].T @ b_hat) * X[i].T
        )
        
    b_hat -= gradient * learning_rate

Show the result

We can see that the estimated OLS beta is more or less the same as the gradient descent beta since the minimization problem is convex, i.e., there is a global minimum.

print(f'true beta loss: {np.mean((y - X @ b)**2):.4f}')
print(f'gradient descent beta loss: {np.mean((y - X @ b_hat)**2):.4f}')
print(f'OLS beta loss: {np.mean((y - X @ beta_ols)**2):.4f}')

"""
The printed output:

true beta loss: 7.1546
gradient descent beta loss: 7.0327
OLS beta loss: 7.0327
"""

print('true beta:', b)
print('gradient descent beta:', b_hat)
print('OLS beta:', beta_ols)

"""
The printed output:

true beta: [-2.48698503 -1.68054312  2.24188082  1.8311108 ]
gradient descent beta: [-2.25054155 -1.68927792  2.23323135  1.84840765]
OLS beta: [-2.25054157 -1.68927792  2.23323135  1.84840765]
"""

Quantile Regression

The objective function is clearly not smooth… but no worries, we can smoothen it by replacing the indicator function with a sigmoid function. Then, we can start to calculate the gradient of the loss function. :D

Be aware that there is a dx/dx term left due to the chain rule. — Image from author

To calculate the gradients of the beta in quantile regression, we must replace the x term and the dx/dx term in the sigmoid function accordingly.

It looks scary… But sometimes, “brute force” to calculate the terms make you less scared in the next time :D — Image from author

Coding Time

This time, we will compare our estimate with QuantileRegressor in Statsmodels. Before that, we have to build our own QuantileRegression class.

import numpy as np

class QuantileRegression:
    def __init__(self, theta, verbose=False):
        self.theta = theta # theta-quantile
        self.verbose = verbose # to print the loss
        self.G = 10 # to smoothen the indicator function
        
    def fit(self, X, y, num_iter=1000, learning_rate=0.001):
        self.n, self.p = X.shape
        self.params = np.random.uniform(0, 1, self.p)
        
        # initialize the params for gradient descent
        self.num_iter = num_iter
        self.learning_rate = learning_rate
        
        self.gradient_descent(X, y)
    
    def loss(self, X, y):
        # sum of absolute error
        y_hat = X @ self.params
        dev = y - y_hat
        return np.sum(
            (self.theta - self.sigmoid(dev)) * dev
        )
    
    def gradient_descent(self, X, y):
        y_hat = X @ self.params
        current_loss = self.loss(X, y)
        print(f'iter 0/{self.num_iter}: {current_loss:.4f}')

        for i in range(self.num_iter):
            # update params
            grads = self.gradient(X, y, self.params)
            self.params -= grads * self.learning_rate
            
            # print loss
            y_hat = X @ self.params
            current_loss = self.loss(X, y)
            print(f'iter {i+1}/{self.num_iter}: {current_loss:.4f}')
                
    def gradient(self, X, y, params):
        y_hat = X @ self.params
        dev = y - y_hat
        grads = np.zeros(self.p)

        for i in range(len(grads)):
            grads[i] = np.sum(
                (self.theta - self.sigmoid(dev)) * (-X[:, i]) +
                dev * - self.G * self.sigmoid(dev) * (self.sigmoid(dev) - 1) * (-X[:, i])
            )
        return grads
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(self.G * x))

Construct a linear model.

import numpy as np
import statsmodels.api as sm

X = np.c_[np.ones(100), np.random.normal(0, 5, size=(100, 5))]
beta = np.array([10, 1, 2, 3, 4, 5])
y = X @ beta + np.random.normal(0, 1, 100)

Compare with QuantileRegressor in Statsmodels:

quantile = 0.5

custom_qr = QuantileRegression(quantile)
custom_qr.fit(X, y)
print(custom_qr.params)

"""
The result shows:

iter 0/1000: 1308.5330
iter 1/1000: 1278.3226
iter 2/1000: 1248.9111
iter 3/1000: 1219.4484
iter 4/1000: 1189.9791
iter 5/1000: 1160.5096
.
.
.
iter 997/1000: 38.8499
iter 998/1000: 39.0141
iter 999/1000: 38.8499
iter 1000/1000: 39.0141

array([9.85715098, 0.9862178 , 2.01070597, 3.03001827, 3.9743032 ,
       5.02109943])
"""

sm_qr = sm.QuantReg(y, X)
res = sm_qr.fit(quantile)
print(res.summary())

"""
The result shows in below image.
"""

result from QuantReg. — Image from author

You can see that the estimated beta from gradient descent is very close to the true beta and the estimated beta from QuantReg.

Takeaway

No magic in “deep learning”. It’s just math, and you will find it not hard to understand when you zoom in. :D
Smoothening an indicator to make it differentiable.

Reference

[1] Quantile Regression — Wikipedia (Accessed May 10, 2023) Link: Quantile Regression — Wikipedia