“Understanding Gradient Descent: Intuition, Mathematical Formulation, and Derivation”

Batch Gradient Descent


Gradient Descent is a popular optimization algorithm used in machine learning to minimize a function by iteratively adjusting its parameters. The basic idea is to move the parameters in the direction opposite to the function's gradient, which is the direction of steepest descent.

In simpler terms, imagine you are at the top of a mountain, and you want to get to the bottom as quickly as possible. You could start walking downhill, taking steps in the direction that feels the steepest. However, you might not always be able to tell which direction is truly the steepest, especially if the terrain is complex.

Instead, you could use a tool that measures the steepness of the terrain in all directions, like a compass that points downhill. With this tool, you could take small steps in the direction that feels the steepest according to the compass, until you reach the bottom. Gradient Descent is like this compass for machine learning models: it tells the model how to adjust its parameters to minimize the function it’s trying to optimize.

Specifically, Gradient Descent works by computing the gradient of the function at the current point, which represents the direction of steepest ascent. The algorithm then takes a small step in the opposite direction, towards the steepest descent. This process is repeated many times, with the size of each step determined by a learning rate, until the algorithm reaches a minimum. Note that even with a fixed learning rate the steps are large at first and shrink as we approach the minimum, because the gradient itself becomes smaller there.

In machine learning, the function being optimized is usually a cost function that measures how well the model is performing on a given task. By using Gradient Descent to minimize this cost function, the model can improve its performance over time, learning to make better predictions or classifications.
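
To make this concrete before any derivations, here's a minimal sketch of gradient descent on a toy one-variable function, f(w) = (w − 3)², whose gradient is 2(w − 3). The starting point and learning rate below are illustrative choices, not taken from any particular model:

# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The gradient is f'(w) = 2 * (w - 3), so the minimum sits at w = 3.

def gradient(w):
    return 2 * (w - 3)

w = 10.0            # illustrative starting point
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # step opposite the gradient

print(w)   # very close to 3.0 after 50 steps

Notice how the updates shrink automatically as w approaches 3: the learning rate stays fixed, but the gradient itself goes to zero at the minimum.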

Here’s the stepwise mathematical formulation and derivation for finding the value of b in a simple linear regression using gradient descent, assuming we already have the value of m:

The equation for a simple linear regression is y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept.

Let’s say we have a set of n data points (x1, y1), (x2, y2), …, (xn, yn) and we want to find the values of m and b that minimize the sum of squared errors (SSE):

SSE = Σ (yi − (m·xi + b))², i = 1 to n

To find the value of b using gradient descent, we first need to calculate the partial derivative of SSE with respect to b:

∂SSE/∂b = −2 Σ (yi − (m·xi + b))

Next, we update the value of b using the gradient descent update rule:

b_new = b − alpha · ∂SSE/∂b

where alpha is the learning rate. We repeat this process until the value of b converges to a minimum.

Here’s the stepwise derivation:

  1. Initialize b to a random value.
  2. Calculate the gradient of SSE with respect to b: ∂SSE/∂b = −2 Σ (yi − (m·xi + b))
  3. Update b using the gradient descent update rule: b_new = b − alpha · ∂SSE/∂b
  4. Repeat steps 2 and 3 until the value of b converges to a minimum.

Here’s an example Python code for finding the value of b using gradient descent:

from sklearn.datasets import make_regression
import numpy as np
X,y = make_regression(n_samples=4, n_features=1, n_informative=1, n_targets=1,noise=80,random_state=13)
import matplotlib.pyplot as plt
plt.scatter(X,y)
Scatter plot of the generated data.
# Let's apply OLS
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
reg.coef_

>>>>>> array([78.35063668])
reg.intercept_

>>>>> 26.15963284313262
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red')
Line predicted by the OLS method.
# Let's apply Gradient Descent assuming the slope is constant, m = 78.35,
# and take b = 100 as the starting value for the intercept
y_pred = ((78.35 * X) + 100).reshape(4)
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred,color='#00a65a',label='b = 100')
plt.legend()
plt.show()
m = 78.35
b = 100
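# gradient of SSE with respect to b (∂SSE/∂b) at the current intercept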
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
loss_slope

>>>> 590.7223659179078
# Let's take learning rate = 0.1
lr = 0.1
step_size = loss_slope*lr
step_size

>>>>> 59.072236591790784
# Calculating the new intercept
b = b - step_size
b

>>>> 40.927763408209216
y_pred1 = ((78.35 * X) + b).reshape(4)
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred1,color='#00a65a',label='b = {}'.format(b))
plt.plot(X,y_pred,color='#A3E4D7',label='b = 100')
plt.legend()
plt.show()
Comparison of different values of b.
# Iteration 2
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
loss_slope

>>>>> 118.14447318358157
step_size = loss_slope*lr
step_size

>>>>> 11.814447318358157
b = b - step_size
b

>>>>> 29.11331608985106
y_pred2 = ((78.35 * X) + b).reshape(4)
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred2,color='#00a65a',label='b = {}'.format(b))
plt.plot(X,y_pred1,color='#A3E4D7',label='b = 40.93')
plt.plot(X,y_pred,color='#A3E4D7',label='b = 100')
plt.legend()
plt.show()
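# Iteration 3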
loss_slope = -2 * np.sum(y - m*X.ravel() - b)
loss_slope

>>>> 23.62889463671634
step_size = loss_slope*lr
step_size

>>>> 2.362889463671634
b = b - step_size
b

>>>> 26.750426626179426
y_pred3 = ((78.35 * X) + b).reshape(4)
plt.figure(figsize=(15,15))
plt.scatter(X,y)
plt.plot(X,reg.predict(X),color='red',label='OLS')
plt.plot(X,y_pred3,color='#00a65a',label='b = {}'.format(b))
plt.plot(X,y_pred2,color='#A3E4D7',label='b = 29.11')
plt.plot(X,y_pred1,color='#A3E4D7',label='b = 40.93')
plt.plot(X,y_pred,color='#A3E4D7',label='b = 100')
plt.legend()
plt.show()
b = -100
m = 78.35
lr = 0.01
epochs = 100
for i in range(epochs):
    loss_slope = -2 * np.sum(y - m*X.ravel() - b)
    b = b - (lr * loss_slope)
    y_pred = m * X + b
    plt.plot(X,y_pred)
plt.scatter(X,y)

“Visualizing Gradient Descent: A Step-by-Step Example of Finding the y-Intercept.”

Best Fit Line: trying to find b (the intercept) when we already have the slope.

Loss vs. epochs (red markers → learning rate)
Loss vs. epochs

The graph illustrates the relationship between the number of epochs and the loss, while also indicating the learning rate used in the optimization algorithm. As the number of epochs increases, the loss decreases, indicating that the model is learning and improving its performance on the task. However, it is important to note that the effective step size is not constant throughout training.

At the beginning of training the steps are large, which lets the model make big updates to its parameters and move quickly towards a minimum of the loss function. As the model gets closer to the minimum the steps shrink, either because the gradient itself becomes smaller or because the learning rate is deliberately decayed on a schedule; this avoids overshooting the minimum and gives a more stable convergence.
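
One explicit way to get this behavior is to decay the learning rate on a schedule. Here's a small sketch using lr_t = lr0 / (1 + decay · t) on the same fixed-slope, intercept-only problem; the tiny synthetic dataset and the schedule constants are made up for illustration:

import numpy as np

# Illustrative decay schedule: lr_t = lr0 / (1 + decay * t)
rng = np.random.default_rng(0)
X = rng.normal(size=20)
y = 78.35 * X + 26.16 + rng.normal(scale=5, size=20)

m, b = 78.35, -100.0        # slope fixed, intercept to be learned
lr0, decay = 0.01, 0.05

for t in range(100):
    lr_t = lr0 / (1 + decay * t)            # learning rate shrinks over time
    loss_slope = -2 * np.sum(y - m*X - b)   # ∂SSE/∂b at the current b
    b = b - lr_t * loss_slope

print(b)   # approaches the intercept used to generate the data (~26)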

The Learning Rate

The learning rate is a hyperparameter that controls the size of the steps taken in each iteration of the gradient descent algorithm. If the learning rate is too low, the algorithm may converge slowly, while if it is too high, the algorithm may overshoot the minimum and diverge.

Here’s a diagrammatic representation of how different learning rates affect gradient descent:

In the diagram, we have a simple 2D cost function with a single minimum. The blue dots represent data points, and the blue lines are the best-fit lines produced by the gradient descent algorithm with different learning rates.

On the left, we have a low learning rate. The algorithm takes small steps in the direction of the gradient and converges slowly to the minimum.

In the middle, we have a moderate learning rate. The algorithm takes larger steps and converges faster to the minimum.

On the right, we have a high learning rate. The algorithm takes very large steps and overshoots the minimum, causing it to oscillate around the minimum or even diverge.

It’s important to choose an appropriate learning rate for the specific problem at hand, as different problems may have different optimal learning rates. A common approach is to try different learning rates and observe the convergence behavior of the algorithm.
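
A simple version of that experiment: run the same intercept-only update with a few candidate learning rates and plot the loss curves side by side. The dataset and the candidate rates here are illustrative:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(13)
X = rng.normal(size=30)
y = 78.35 * X + 26.16 + rng.normal(scale=10, size=30)
m = 78.35

for lr in [0.001, 0.005, 0.02]:     # illustrative candidates
    b, losses = -100.0, []
    for _ in range(30):
        losses.append(np.sum((y - m*X - b) ** 2))    # SSE at the current b
        b = b - lr * (-2 * np.sum(y - m*X - b))      # gradient descent update
    plt.plot(losses, label='lr = {}'.format(lr))

plt.xlabel('epoch')
plt.ylabel('SSE')
plt.legend()
plt.show()

With these values the smallest rate crawls towards the minimum, the middle one converges quickly, and the largest one oscillates on its way down; push the rate much higher and the loss blows up instead of converging.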

Reducing Loss: Optimizing Learning Rate | Machine Learning | Google Developers

Here's the stepwise mathematical formulation and derivation for finding the values of m and b in a simple linear regression using gradient descent:

The equation for a simple linear regression is y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept.

Let’s say we have a set of n data points (x1, y1), (x2, y2), …, (xn, yn) and we want to find the values of m and b that minimize the sum of squared errors (SSE):

SSE = Σ (yi − (m·xi + b))², i = 1 to n

To find the values of m and b using gradient descent, we need to calculate the partial derivatives of SSE with respect to m and b:

∂SSE/∂m = −2 Σ xi (yi − (m·xi + b))

∂SSE/∂b = −2 Σ (yi − (m·xi + b))

Next, we update the values of m and b using the gradient descent update rule:

m_new = m − alpha · ∂SSE/∂m

b_new = b − alpha · ∂SSE/∂b

where alpha is the learning rate. We repeat this process until the values of m and b converge to a minimum.

Here’s the stepwise derivation:

  1. Initialize m and b to random values.
  2. Calculate the gradients of SSE with respect to m and b:
     ∂SSE/∂m = −2 Σ xi (yi − (m·xi + b))
     ∂SSE/∂b = −2 Σ (yi − (m·xi + b))
  3. Update m and b using the gradient descent update rule:
     m_new = m − alpha · ∂SSE/∂m
     b_new = b − alpha · ∂SSE/∂b
  4. Repeat steps 2 and 3 until the values of m and b converge to a minimum.

Here’s an example Python code for finding the values of m and b using gradient descent:

from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
X,y = make_regression(n_samples=100, n_features=1, n_informative=1, n_targets=1,noise=20,random_state=13)
plt.scatter(X,y)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)
print(lr.coef_)
print(lr.intercept_)
>>>> [28.12597332]
>>>>-2.271014426178382
y_pred = lr.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
>>>> 0.6345158782661013
class GDRegressor:

    def __init__(self,learning_rate,epochs):
        self.m = 100
        self.b = -120
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self,X,y):
        # calculate m and b using GD
        for i in range(self.epochs):
            loss_slope_b = -2 * np.sum(y - self.m*X.ravel() - self.b)
            loss_slope_m = -2 * np.sum((y - self.m*X.ravel() - self.b)*X.ravel())

            self.b = self.b - (self.lr * loss_slope_b)
            self.m = self.m - (self.lr * loss_slope_m)
        print(self.m,self.b)

    def predict(self,X):
        return self.m * X + self.b

gd = GDRegressor(0.001,50)
gd.fit(X_train,y_train)
>>>> 28.159367347119066 -2.3004574196824854
y_pred = gd.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
>>>>> 0.6343842836315579

“Convergence and Visualization: Using Gradient Descent to Find the Best Fit Line”

In this visualization, the darkest region of the valley represents the minimum point of the loss function, where the model parameters produce the best predictions. The shape of the contour plot can vary depending on the complexity of the model and the number of parameters being optimized.

3D representation of the loss surface.
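
A picture like that can be reproduced by evaluating the SSE on a grid of (m, b) values and drawing the contours; the darkest region is the floor of the valley. The dataset and the grid ranges below are illustrative:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(13)
X = rng.normal(size=30)
y = 78.35 * X + 26.16 + rng.normal(scale=10, size=30)

m_grid, b_grid = np.meshgrid(np.linspace(0, 160, 100), np.linspace(-80, 130, 100))
# SSE for every (m, b) pair on the grid
sse = ((y[:, None, None] - m_grid * X[:, None, None] - b_grid) ** 2).sum(axis=0)

plt.contourf(m_grid, b_grid, sse, levels=30, cmap='viridis')
plt.colorbar(label='SSE')
plt.xlabel('m (slope)')
plt.ylabel('b (intercept)')
plt.show()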

Now let's formulate gradient descent for multiple linear regression.

The multiple linear regression model is defined as:

y = β0 + β1x1 + β2x2 + … + βnxn

where β0, β1, β2, …, βn are the coefficients (also known as weights or parameters) of the model that we want to optimize.

To find the optimal values of β0, β1, β2, …, βn, we use batch gradient descent, which involves the following steps:

Note: Batch Gradient Descent is not a different algorithm; it is the same Gradient Descent described above, with the gradient computed over the entire training set.

  1. Initialization: We start with some initial values for the coefficients, say β0 = 0, β1 = 0, β2 = 0, …, βn = 0.
  2. Compute the cost function: We compute the cost function J(β) that measures how well the model fits the data. The cost function is typically defined as the mean squared error (MSE), which is the average of the squared differences between the predicted values and the actual values of the output variable y.
  3. Compute the gradient: We compute the gradient of the cost function with respect to the coefficients β0, β1, β2, …, βn. The gradient is a vector of n+1 elements, where the first element corresponds to β0, and the remaining elements correspond to β1, β2, …, βn.
  4. Update the coefficients: We update the coefficients using the gradient descent update rule:

βj := βj − α · ∂J(β) / ∂βj

where α is the learning rate (a hyperparameter that controls the size of the updates), and j = 0, 1, 2, …, n.

  5. Repeat steps 2–4 until convergence: we iterate until the cost function stops decreasing significantly.

Here’s an example of how batch gradient descent can be applied to a high-dimensional multiple linear regression problem:

Suppose we have a dataset with n = 10 features and m = 1000 observations. The goal is to predict the price of a house based on the features such as square footage, number of bedrooms, number of bathrooms, etc.

We can represent the dataset as an m x (n+1) matrix X, where the first column of X is a vector of 1’s (for the intercept term), and the remaining columns represent the n features. We also have a vector y of m elements, which represents the actual prices of the houses.

We can initialize the coefficients β0, β1, β2, …, βn to zero, and set a learning rate α (e.g., 0.01). We can then compute the cost function J(β) as:

J(β) = 1/(2m) · Σ (Xβ − y)²

where Xβ is the predicted prices of the houses based on the current values of the coefficients β, and the sum is taken over all m observations.

We can then compute the gradient of the cost function as:

∂J(β) / ∂βj = 1/m · Σ ((Xβ − y) · Xj)

where Xj is the jth column of the matrix X, and the sum is taken over all m observations.

Using the gradient descent update rule, we can update the coefficients as:

βj := βj − α · ∂J(β) / ∂βj

for j = 0, 1, 2, …, n. This update rule essentially moves the coefficients in the direction of the steepest descent of the cost function, until we reach the minimum.

We can repeat the computation of the cost function and gradient, and the update of the coefficients, until the cost function converges (i.e., stops decreasing significantly). Once the coefficients converge, we have the optimal values of β0, β1, β2, …, βn that can best fit the data.
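
In vectorized NumPy the whole procedure collapses to a few lines. Here's a sketch under the cost J(β) = 1/(2m) · Σ (Xβ − y)² defined above, using the dimensions of the house-price example but with synthetic data invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
m_obs, n_feat = 1000, 10
X = np.hstack([np.ones((m_obs, 1)),              # first column of 1's (intercept)
               rng.normal(size=(m_obs, n_feat))])
true_beta = rng.normal(size=n_feat + 1)
y = X @ true_beta + rng.normal(scale=0.1, size=m_obs)

beta = np.zeros(n_feat + 1)   # step 1: initialize the coefficients to zero
alpha = 0.01                  # learning rate

for _ in range(5000):
    gradient = X.T @ (X @ beta - y) / m_obs   # steps 2-3: gradient of J(β)
    beta = beta - alpha * gradient            # step 4: update rule

print(beta)   # close to true_beta once the cost has converged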

Batch gradient descent has the advantage of being able to converge to a global minimum of the cost function, but it can be computationally expensive for large datasets. There are other optimization algorithms, such as stochastic gradient descent and mini-batch gradient descent, that can be more efficient for large datasets.

Python Code

from sklearn.datasets import load_diabetes
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
X,y = load_diabetes(return_X_y=True)
print(X.shape)
print(y.shape)
>>>> (442, 10)
>>>>> (442,)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
reg = LinearRegression()
reg.fit(X_train,y_train)
LinearRegression()
print(reg.coef_)
print(reg.intercept_)
>>>> [ -9.16088483 -205.46225988 516.68462383 340.62734108 -895.54360867
561.21453306 153.88478595 126.73431596 861.12139955 52.41982836]
>>>> 151.88334520854633
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)
>>>>> 0.4399387660024645
X_train.shape
>>>>>> (353, 10)
class GDRegressor:

    def __init__(self,learning_rate=0.01,epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self,X_train,y_train):
        # initialize the coefficients
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # update all the coefficients and the intercept in each epoch
            y_hat = np.dot(X_train,self.coef_) + self.intercept_
            #print("Shape of y_hat",y_hat.shape)
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_,self.coef_)

    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_
gdr = GDRegressor(epochs=1000,learning_rate=0.5)
gdr.fit(X_train,y_train)
>>>>> 152.0135263267291 [ 14.38915082 -173.72674118 491.54504015 323.91983579 -39.32680194
>>>>> -116.01099114 -194.04229501 103.38216641 451.63385893 97.57119174]
y_pred = gdr.predict(X_test)
r2_score(y_test,y_pred)
>>>>> 0.4534524671450598

Gradient Descent in this form is also known as Batch Gradient Descent.

Here's a more detailed comparison of the convergence of Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MBGD); a code sketch contrasting their update loops follows the lists:

BGD:

  • Computes the gradient of the cost function over the entire training set.
  • Updates the model parameters once per iteration.
  • Converges more slowly but steadily towards the minimum of the cost function.
  • Can be more stable than SGD and MBGD, but may take longer to converge.
  • Requires more memory and computational resources than SGD and MBGD.

SGD:

  • Computes the gradient of the cost function over a single training example.
  • Updates the model parameters once per example.
  • Converges faster but with more oscillations towards the minimum of the cost function.
  • Can be more sensitive to the choice of learning rate and the randomness of the data.
  • Requires less memory and computational resources than BGD.

MBGD:

  • Computes the gradient of the cost function over small random subsets of the training data (mini-batches).
  • Updates the model parameters once per mini-batch.
  • Offers a compromise between the speed of SGD and the stability of BGD.
  • Converges faster than BGD and with fewer oscillations than SGD.
  • Requires less memory and computational resources than BGD, but more than SGD.
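
To make the contrast concrete, here's a sketch of one training epoch under each scheme. The synthetic data mirrors the example above, and the batch size of 32 is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 10))])
y = X @ rng.normal(size=11) + rng.normal(scale=0.1, size=1000)
alpha, n = 0.01, X.shape[0]

def bgd_epoch(beta):
    # BGD: one update per epoch, gradient over the full training set
    return beta - alpha * X.T @ (X @ beta - y) / n

def sgd_epoch(beta):
    # SGD: one update per training example, visited in shuffled order
    for i in rng.permutation(n):
        beta = beta - alpha * X[i] * (X[i] @ beta - y[i])
    return beta

def mbgd_epoch(beta, batch_size=32):
    # MBGD: one update per mini-batch of batch_size examples
    idx = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        beta = beta - alpha * Xb.T @ (Xb @ beta - yb) / len(yb)
    return beta

beta = np.zeros(11)
for _ in range(100):
    beta = mbgd_epoch(beta)   # swap in bgd_epoch or sgd_epoch to compare
print(beta)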


ajaymehta

Meet Ajay, a blogger and AI/DS expert, sharing insights on cutting-edge tech, machine learning, data analysis, and their real-world applications.