“Understanding Gradient Descent: Intuition, Mathematical Formulation, and Derivation”

Batch Gradient Descent


Gradient Descent is a popular optimization algorithm used in machine learning to minimize a function by iteratively adjusting its parameters. The basic idea is to move the parameters in the direction opposite to the function's gradient, which is the direction of steepest descent.

In simpler terms, imagine you are at the top of a mountain, and you want to get to the bottom as quickly as possible. You could start walking downhill, taking steps in the direction that feels the steepest. However, you might not always be able to tell which direction is truly the steepest, especially if the terrain is complex.

Instead, you could use a tool that measures the steepness of the terrain in all directions, like a compass that points downhill. With this tool, you could take small steps in the direction that feels the steepest according to the compass, until you reach the bottom. Gradient Descent is like this compass for machine learning models: it tells the model how to adjust its parameters to minimize the function it’s trying to optimize.

Specifically, Gradient Descent works by computing the gradient of the function at the current point, which represents the direction of steepest ascent. The algorithm then takes a small step in the opposite direction, towards the steepest descent. This process is repeated many times, with the size of each step determined by a learning rate, until the algorithm reaches a minimum. Note that even with a fixed learning rate the steps are large at first and shrink as we approach the minimum, because the gradient itself becomes smaller there.

In machine learning, the function being optimized is usually a cost function that measures how well the model is performing on a given task. By using Gradient Descent to minimize this cost function, the model can improve its performance over time, learning to make better predictions or classifications.
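
To make this concrete before any derivations, here's a minimal sketch of gradient descent on a toy one-variable function, f(w) = (w − 3)², whose gradient is 2(w − 3). The starting point and learning rate below are illustrative choices, not taken from any particular model:

# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The gradient is f'(w) = 2 * (w - 3), so the minimum sits at w = 3.

def gradient(w):
    return 2 * (w - 3)

w = 10.0            # illustrative starting point
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # step opposite the gradient

print(w)   # very close to 3.0 after 50 steps

Notice how the updates shrink automatically as w approaches 3: the learning rate stays fixed, but the gradient itself goes to zero at the minimum.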

Here’s the stepwise mathematical formulation and derivation for finding the value of b in a simple linear regression using gradient descent, assuming we already have the value of m:

The equation for a simple linear regression is y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept.

Let’s say we have a set of n data points (x1, y1), (x2, y2), …, (xn, yn) and we want to find the values of m and b that minimize the sum of squared errors (SSE):

SSE = Σ (yi − (m·xi + b))², i = 1 to n

To find the value of b using gradient descent, we first need to calculate the partial derivative of SSE with respect to b:

∂SSE/∂b = −2 Σ (yi − (m·xi + b))

Next, we update the value of b using the gradient descent update rule:

b_new = b − alpha · ∂SSE/∂b

where alpha is the learning rate. We repeat this process until the value of b converges to a minimum.

Here’s the stepwise derivation:

  1. Initialize b to a random value.
  2. Calculate the gradient of SSE with respect to b: ∂SSE/∂b = −2 Σ (yi − (m·xi + b))
  3. Update b using the gradient descent update rule: b_new = b − alpha · ∂SSE/∂b
  4. Repeat steps 2 and 3 until the value of b converges to a minimum.

Here’s an example Python code for finding the value of b using gradient descent:

from sklearn.datasets import make_regression
import numpy as np
X,y = make_regression(n_samples=4, n_features=1, n_informative=1, n_targets=1,noise=80,random_state=13)
import matplotlib.pyplot as plt
plt.scatter(X,y)
Scatter plot of the generated data.
# Let's apply OLS
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
reg.coef_

>>>>>> array([78.35063668])
reg.intercept_

>>>>> 26.15963284313262
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red')
Line predicted by the OLS method.
# Let's apply Gradient Descent assuming the slope is constant, m = 78.35,
# and take b = 100 as the starting value for the intercept
y_pred = ((78.35 * X) + 100).reshape(4)
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred,color='#00a65a',label='b = 100')
plt.legend()
plt.show()
m = 78.35
b = 100
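# gradient of SSE with respect to b (∂SSE/∂b) at the current intercept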
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
loss_slope

>>>> 590.7223659179078
# Let's take learning rate = 0.1
lr = 0.1
step_size = loss_slope*lr
step_size

>>>>> 59.072236591790784
# Calculating the new intercept
b = b - step_size
b

>>>> 40.927763408209216
y_pred1 = ((78.35 * X) + b).reshape(4)
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred1,color='#00a65a',label='b = {}'.format(b))
plt.plot(X,y_pred,color='#A3E4D7',label='b = 100')
plt.legend()
plt.show()
Comparison of different values of b.
# Iteration 2
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
loss_slope

>>>>> 118.14447318358157
step_size = loss_slope*lr
step_size

>>>>> 11.814447318358157
b = b - step_size
b

>>>>> 29.11331608985106
y_pred2 = ((78.35 * X) + b).reshape(4)
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred2,color='#00a65a',label='b = {}'.format(b))
plt.plot(X,y_pred1,color='#A3E4D7',label='b = 40.93')
plt.plot(X,y_pred,color='#A3E4D7',label='b = 100')
plt.legend()
plt.show()
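# Iteration 3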
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
loss_slope

>>>> 23.62889463671634
step_size = loss_slope*lr
step_size

>>>> 2.362889463671634
b = b - step_size
b

>>>> 26.750426626179426
y_pred3 = ((78.35 * X) + b).reshape(4)
plt.figure(figsize=(15,15))
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred3,color='#00a65a',label='b = {}'.format(b))
plt.plot(X,y_pred2,color='#A3E4D7',label='b = 29.11')
plt.plot(X,y_pred1,color='#A3E4D7',label='b = 40.93')
plt.plot(X,y_pred,color='#A3E4D7',label='b = 100')
plt.legend()
plt.show()
b = -100
m = 78.35
lr = 0.01
epochs = 100
for i in range(epochs):
    loss_slope = -2 * np.sum(y - m*X.ravel() - b)
    b = b - (lr * loss_slope)
    y_pred = m * X + b
    plt.plot(X,y_pred)
plt.scatter(X,y)

“Visualizing Gradient Descent: A Step-by-Step Example of Finding the y-Intercept.”

Best Fit Line: trying to find b (the intercept) when we already have the slope.

Loss vs. epochs (red markers → learning rate)
Loss vs. epochs

The graph illustrates the relationship between the number of epochs and the loss, while also indicating the learning rate used in the optimization algorithm. As the number of epochs increases, the loss decreases, indicating that the model is learning and improving its performance on the task. However, it is important to note that the effective step size is not constant throughout training.

At the beginning of training the steps are large, which lets the model make big updates to its parameters and move quickly towards a minimum of the loss function. As the model gets closer to the minimum the steps shrink, either because the gradient itself becomes smaller or because the learning rate is deliberately decayed on a schedule; this avoids overshooting the minimum and gives a more stable convergence.
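
One explicit way to get this behavior is to decay the learning rate on a schedule. Here's a small sketch using lr_t = lr0 / (1 + decay · t) on the same fixed-slope, intercept-only problem; the tiny synthetic dataset and the schedule constants are made up for illustration:

import numpy as np

# Illustrative decay schedule: lr_t = lr0 / (1 + decay * t)
rng = np.random.default_rng(0)
X = rng.normal(size=20)
y = 78.35 * X + 26.16 + rng.normal(scale=5, size=20)

m, b = 78.35, -100.0        # slope fixed, intercept to be learned
lr0, decay = 0.01, 0.05

for t in range(100):
    lr_t = lr0 / (1 + decay * t)            # learning rate shrinks over time
    loss_slope = -2 * np.sum(y - m*X - b)   # ∂SSE/∂b at the current b
    b = b - lr_t * loss_slope

print(b)   # approaches the intercept used to generate the data (~26)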

The Learning Rate

The learning rate is a hyperparameter that controls the size of the steps taken in each iteration of the gradient descent algorithm. If the learning rate is too low, the algorithm may converge slowly, while if it is too high, the algorithm may overshoot the minimum and diverge.

Here’s a diagrammatic representation of how different learning rates affect gradient descent:

In the diagram, we have a simple 2D cost function with a single minimum. The blue dots represent data points, and the blue lines are the best-fit lines produced by the gradient descent algorithm with different learning rates.

On the left, we have a low learning rate. The algorithm takes small steps in the direction of the gradient and converges slowly to the minimum.

In the middle, we have a moderate learning rate. The algorithm takes larger steps and converges faster to the minimum.

On the right, we have a high learning rate. The algorithm takes very large steps and overshoots the minimum, causing it to oscillate around the minimum or even diverge.

It’s important to choose an appropriate learning rate for the specific problem at hand, as different problems may have different optimal learning rates. A common approach is to try different learning rates and observe the convergence behavior of the algorithm.
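
A simple version of that experiment: run the same intercept-only update with a few candidate learning rates and plot the loss curves side by side. The dataset and the candidate rates here are illustrative:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(13)
X = rng.normal(size=30)
y = 78.35 * X + 26.16 + rng.normal(scale=10, size=30)
m = 78.35

for lr in [0.001, 0.005, 0.02]:     # illustrative candidates
    b, losses = -100.0, []
    for _ in range(30):
        losses.append(np.sum((y - m*X - b) ** 2))    # SSE at the current b
        b = b - lr * (-2 * np.sum(y - m*X - b))      # gradient descent update
    plt.plot(losses, label='lr = {}'.format(lr))

plt.xlabel('epoch')
plt.ylabel('SSE')
plt.legend()
plt.show()

With these values the smallest rate crawls towards the minimum, the middle one converges quickly, and the largest one oscillates on its way down; push the rate much higher and the loss blows up instead of converging.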

Reducing Loss: Optimizing Learning Rate | Machine Learning | Google Developers

Here's the stepwise mathematical formulation and derivation for finding the values of m and b in a simple linear regression using gradient descent:

The equation for a simple linear regression is y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept.

Let’s say we have a set of n data points (x1, y1), (x2, y2), …, (xn, yn) and we want to find the values of m and b that minimize the sum of squared errors (SSE):

SSE = Σ (yi − (m·xi + b))², i = 1 to n

To find the values of m and b using gradient descent, we need to calculate the partial derivatives of SSE with respect to m and b:

∂SSE/∂m = −2 Σ xi (yi − (m·xi + b))

∂SSE/∂b = −2 Σ (yi − (m·xi + b))

Next, we update the values of m and b using the gradient descent update rule:

m_new = m − alpha · ∂SSE/∂m

b_new = b − alpha · ∂SSE/∂b

where alpha is the learning rate. We repeat this process until the values of m and b converge to a minimum.

Here’s the stepwise derivation:

  1. Initialize m and b to random values.
  2. Calculate the gradients of SSE with respect to m and b:
     ∂SSE/∂m = −2 Σ xi (yi − (m·xi + b))
     ∂SSE/∂b = −2 Σ (yi − (m·xi + b))
  3. Update m and b using the gradient descent update rule:
     m_new = m − alpha · ∂SSE/∂m
     b_new = b − alpha · ∂SSE/∂b
  4. Repeat steps 2 and 3 until the values of m and b converge to a minimum.

Here’s an example Python code for finding the values of m and b using gradient descent:

from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
X,y = make_regression(n_samples=100, n_features=1, n_informative=1, n_targets=1,noise=20,random_state=13)
plt.scatter(X,y)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)
print(lr.coef_)
print(lr.intercept_)
>>>> [28.12597332]
>>>>-2.271014426178382
y_pred = lr.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
>>>> 0.6345158782661013
class GDRegressor:

    def __init__(self,learning_rate,epochs):
        self.m = 100
        self.b = -120
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self,X,y):
        # calculate m and b using GD
        for i in range(self.epochs):
            loss_slope_b = -2 * np.sum(y - self.m*X.ravel() - self.b)
            loss_slope_m = -2 * np.sum((y - self.m*X.ravel() - self.b)*X.ravel())

            self.b = self.b - (self.lr * loss_slope_b)
            self.m = self.m - (self.lr * loss_slope_m)
        print(self.m,self.b)

    def predict(self,X):
        return self.m * X + self.b

gd = GDRegressor(0.001,50)
gd.fit(X_train,y_train)
>>>> 28.159367347119066 -2.3004574196824854
y_pred = gd.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
>>>>> 0.6343842836315579

“Convergence and Visualization: Using Gradient Descent to Find the Best Fit Line”

In this visualization, the darkest region of the valley represents the minimum point of the loss function, where the model parameters produce the best predictions. The shape of the contour plot can vary depending on the complexity of the model and the number of parameters being optimized.

3D representation of the loss surface.
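
A picture like that can be reproduced by evaluating the SSE on a grid of (m, b) values and drawing the contours; the darkest region is the floor of the valley. The dataset and the grid ranges below are illustrative:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(13)
X = rng.normal(size=30)
y = 78.35 * X + 26.16 + rng.normal(scale=10, size=30)

m_grid, b_grid = np.meshgrid(np.linspace(0, 160, 100), np.linspace(-80, 130, 100))
# SSE for every (m, b) pair on the grid
sse = ((y[:, None, None] - m_grid * X[:, None, None] - b_grid) ** 2).sum(axis=0)

plt.contourf(m_grid, b_grid, sse, levels=30, cmap='viridis')
plt.colorbar(label='SSE')
plt.xlabel('m (slope)')
plt.ylabel('b (intercept)')
plt.show()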

Now let's formulate gradient descent for multiple linear regression.

The multiple linear regression model is defined as:

y = β0 + β1x1 + β2x2 + … + βnxn

where β0, β1, β2, …, βn are the coefficients (also known as weights or parameters) of the model that we want to optimize.

To find the optimal values of β0, β1, β2, …, βn, we use batch gradient descent, which involves the following steps:

Note: Batch Gradient Descent is not a different algorithm; it is the same Gradient Descent described above, with the gradient computed over the entire training set.

  1. Initialization: We start with some initial values for the coefficients, say β0 = 0, β1 = 0, β2 = 0, …, βn = 0.
  2. Compute the cost function: We compute the cost function J(β) that measures how well the model fits the data. The cost function is typically defined as the mean squared error (MSE), which is the average of the squared differences between the predicted values and the actual values of the output variable y.
  3. Compute the gradient: We compute the gradient of the cost function with respect to the coefficients β0, β1, β2, …, βn. The gradient is a vector of n+1 elements, where the first element corresponds to β0, and the remaining elements correspond to β1, β2, …, βn.
  4. Update the coefficients: We update the coefficients using the gradient descent update rule:

βj := βj − α · ∂J(β) / ∂βj

where α is the learning rate (a hyperparameter that controls the size of the updates), and j = 0, 1, 2, …, n.

  5. Repeat steps 2–4 until convergence: we iterate until the cost function stops decreasing significantly.

Here’s an example of how batch gradient descent can be applied to a high-dimensional multiple linear regression problem:

Suppose we have a dataset with n = 10 features and m = 1000 observations. The goal is to predict the price of a house based on the features such as square footage, number of bedrooms, number of bathrooms, etc.

We can represent the dataset as an m x (n+1) matrix X, where the first column of X is a vector of 1’s (for the intercept term), and the remaining columns represent the n features. We also have a vector y of m elements, which represents the actual prices of the houses.

We can initialize the coefficients β0, β1, β2, …, βn to zero, and set a learning rate α (e.g., 0.01). We can then compute the cost function J(β) as:

J(β) = 1/(2m) · Σ (Xβ − y)²

where Xβ is the predicted prices of the houses based on the current values of the coefficients β, and the sum is taken over all m observations.

We can then compute the gradient of the cost function as:

∂J(β) / ∂βj = 1/m · Σ ((Xβ − y) · Xj)

where Xj is the jth column of the matrix X, and the sum is taken over all m observations.

Using the gradient descent update rule, we can update the coefficients as:

βj := βj − α · ∂J(β) / ∂βj

for j = 0, 1, 2, …, n. This update rule essentially moves the coefficients in the direction of the steepest descent of the cost function, until we reach the minimum.

We can repeat the computation of the cost function and gradient, and the update of the coefficients, until the cost function converges (i.e., stops decreasing significantly). Once the coefficients converge, we have the optimal values of β0, β1, β2, …, βn that can best fit the data.
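
In vectorized NumPy the whole procedure collapses to a few lines. Here's a sketch under the cost J(β) = 1/(2m) · Σ (Xβ − y)² defined above, using the dimensions of the house-price example but with synthetic data invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
m_obs, n_feat = 1000, 10
X = np.hstack([np.ones((m_obs, 1)),              # first column of 1's (intercept)
               rng.normal(size=(m_obs, n_feat))])
true_beta = rng.normal(size=n_feat + 1)
y = X @ true_beta + rng.normal(scale=0.1, size=m_obs)

beta = np.zeros(n_feat + 1)   # step 1: initialize the coefficients to zero
alpha = 0.01                  # learning rate

for _ in range(5000):
    gradient = X.T @ (X @ beta - y) / m_obs   # steps 2-3: gradient of J(β)
    beta = beta - alpha * gradient            # step 4: update rule

print(beta)   # close to true_beta once the cost has converged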

Batch gradient descent has the advantage of being able to converge to a global minimum of the cost function, but it can be computationally expensive for large datasets. There are other optimization algorithms, such as stochastic gradient descent and mini-batch gradient descent, that can be more efficient for large datasets.

Python Code

from sklearn.datasets import load_diabetes
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
X,y = load_diabetes(return_X_y=True)
print(X.shape)
print(y.shape)
>>>> (442, 10)
>>>>> (442,)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
reg = LinearRegression()
reg.fit(X_train,y_train)
LinearRegression()
print(reg.coef_)
print(reg.intercept_)
>>>> [ -9.16088483 -205.46225988 516.68462383 340.62734108 -895.54360867
561.21453306 153.88478595 126.73431596 861.12139955 52.41982836]
>>>> 151.88334520854633
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)
>>>>> 0.4399387660024645
X_train.shape
>>>>>> (353, 10)
class GDRegressor:

    def __init__(self,learning_rate=0.01,epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self,X_train,y_train):
        # initialize the coefficients
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # update all the coefficients and the intercept in each epoch
            y_hat = np.dot(X_train,self.coef_) + self.intercept_
            #print("Shape of y_hat",y_hat.shape)
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_,self.coef_)

    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_
gdr = GDRegressor(epochs=1000,learning_rate=0.5)
gdr.fit(X_train,y_train)
>>>>> 152.0135263267291 [ 14.38915082 -173.72674118 491.54504015 323.91983579 -39.32680194
>>>>> -116.01099114 -194.04229501 103.38216641 451.63385893 97.57119174]
y_pred = gdr.predict(X_test)
r2_score(y_test,y_pred)
>>>>> 0.4534524671450598

Gradient Descent in this form is also known as Batch Gradient Descent.

Here's a more detailed comparison of the convergence of Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MBGD); a code sketch contrasting their update loops follows the lists:

BGD:

  • Computes the gradient of the cost function over the entire training set.
  • Updates the model parameters once per iteration.
  • Converges more slowly but steadily towards the minimum of the cost function.
  • Can be more stable than SGD and MBGD, but may take longer to converge.
  • Requires more memory and computational resources than SGD and MBGD.

SGD:

  • Computes the gradient of the cost function over a single training example.
  • Updates the model parameters once per example.
  • Converges faster but with more oscillations towards the minimum of the cost function.
  • Can be more sensitive to the choice of learning rate and the randomness of the data.
  • Requires less memory and computational resources than BGD.

MBGD:

  • Computes the gradient of the cost function over small random subsets of the training data (mini-batches).
  • Updates the model parameters once per mini-batch.
  • Offers a compromise between the speed of SGD and the stability of BGD.
  • Converges faster than BGD and with fewer oscillations than SGD.
  • Requires less memory and computational resources than BGD, but more than SGD.
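
To make the contrast concrete, here's a sketch of one training epoch under each scheme. The synthetic data mirrors the example above, and the batch size of 32 is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 10))])
y = X @ rng.normal(size=11) + rng.normal(scale=0.1, size=1000)
alpha, n = 0.01, X.shape[0]

def bgd_epoch(beta):
    # BGD: one update per epoch, gradient over the full training set
    return beta - alpha * X.T @ (X @ beta - y) / n

def sgd_epoch(beta):
    # SGD: one update per training example, visited in shuffled order
    for i in rng.permutation(n):
        beta = beta - alpha * X[i] * (X[i] @ beta - y[i])
    return beta

def mbgd_epoch(beta, batch_size=32):
    # MBGD: one update per mini-batch of batch_size examples
    idx = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        beta = beta - alpha * Xb.T @ (Xb @ beta - yb) / len(yb)
    return beta

beta = np.zeros(11)
for _ in range(100):
    beta = mbgd_epoch(beta)   # swap in bgd_epoch or sgd_epoch to compare
print(beta)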


ajaymehta

Meet Ajay, a blogger and AI/DS expert, sharing insights on cutting-edge tech, machine learning, data analysis, and their real-world applications.