Mastering Gradient Descent: Optimizing Neural Networks with Precision.

om pramod
Mar 10, 2024

Part 5: Mathematical Insights into Gradient Descent.

Here is the mathematical formula for gradient descent:

θ := θ − α · ∇J(θ)

This formula is the core of the Gradient Descent algorithm: it is the update rule for the parameters (θ) of the model. Let’s break down each component and explain its meaning:

θ represents the parameters (or weights) of the model. In the context of linear regression, for example, θ would include the slope and the intercept of the line.

α is the learning rate.

∇J(θ) is the gradient (or derivative) of the cost function J(θ) with respect to the parameters θ. It is a vector that points in the direction of the steepest increase of the cost function. The negative of the gradient (−∇J(θ)) points in the direction of the steepest decrease, which is the direction we want to move to minimize the cost function.

The Gradient Descent algorithm repeats this update rule iteratively until the cost function converges to a minimum. The updated parameters are then used to make predictions.
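
As a quick sketch of what this looks like in code (assuming we already have a function that returns the gradient ∇J(θ); the example cost below is illustrative):

```python
def gradient_descent(grad, theta, alpha=0.1, n_iters=100):
    # Repeatedly apply the update rule: theta := theta - alpha * grad(theta)
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)
    return theta

# Example: minimize J(theta) = (theta - 3)**2, whose gradient is 2 * (theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3), theta=0.0))  # converges towards 3.0
```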

Let’s consider a cost function J(θ₁, θ₂) with two variables or parameters, θ₁ and θ₂.

Our goal is to find the values of θ₁ and θ₂ that minimize J(θ₁, θ₂). In mathematical terms, we want to solve the optimization problem: minimize J(θ₁, θ₂) over θ₁ and θ₂.

Update rules are used to iteratively adjust the parameters in the direction that reduces the function’s value the most, aiming to find its minimum. The update rules for the parameters θ₁ and θ₂ of a function J(θ₁, θ₂) are as follows:

θ₁ := θ₁ − α · ∂J(θ₁, θ₂)/∂θ₁
θ₂ := θ₂ − α · ∂J(θ₁, θ₂)/∂θ₂

Here, ∂J(θ₁, θ₂)/∂θ₁ and ∂J(θ₁, θ₂)/∂θ₂ represent the partial derivatives of the function with respect to θ₁ and θ₂, respectively. They give the rate of change of the function with respect to each parameter.

In each iteration of the algorithm, the parameters θ₁ and θ₂ are updated according to these rules. This process continues until the algorithm converges, i.e., until the change in the function’s value between iterations falls below a predefined threshold.

Now let’s find the derivatives of the cost function J(θ₁, θ₂) = θ₁² + θ₂² with respect to θ₁ and θ₂. These derivatives are used in the update rules of the Gradient Descent algorithm to adjust the parameters θ₁ and θ₂ and minimize the cost function.

The derivatives are as follows:

∂J/∂θ₁ = 2θ₁
∂J/∂θ₂ = 2θ₂

These derivatives represent the slopes of the cost function at a given point (θ₁, θ₂) with respect to θ₁ and θ₂, respectively. They tell us how much J(θ₁, θ₂) will change if we make a small change in θ₁ or θ₂. In the context of Gradient Descent, these derivatives are used to update the parameters θ₁ and θ₂ (as in backpropagation) as follows:

θ₁ := θ₁ − α · 2θ₁
θ₂ := θ₂ − α · 2θ₂

Here, α (alpha) is the learning rate, which controls the size of the steps taken in the direction of the negative gradient.

Here’s a Python implementation of this problem:
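
A minimal sketch of such an implementation (assuming the cost J(θ₁, θ₂) = θ₁² + θ₂² discussed above, a fixed learning rate, and a fixed number of iterations):

```python
import numpy as np

def cost(theta):
    # J(theta1, theta2) = theta1^2 + theta2^2
    return theta[0] ** 2 + theta[1] ** 2

def gradient(theta):
    # Partial derivatives: dJ/dtheta1 = 2*theta1, dJ/dtheta2 = 2*theta2
    return np.array([2 * theta[0], 2 * theta[1]])

alpha = 0.1                    # learning rate
theta = np.array([4.0, -3.0])  # arbitrary starting point

for i in range(50):
    theta = theta - alpha * gradient(theta)  # theta := theta - alpha * grad J(theta)

print(theta, cost(theta))  # both parameters, and the cost, approach 0
```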

Here’s a simpler example for better understanding:
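
Take, for instance, the one-parameter cost J(θ) = θ² with starting point θ = 4 and learning rate α = 0.1. The gradient is 2θ, so each update is θ := θ − 0.1 · 2θ = 0.8θ:

Iteration 1: θ = 0.8 × 4 = 3.2
Iteration 2: θ = 0.8 × 3.2 = 2.56
Iteration 3: θ = 0.8 × 2.56 ≈ 2.05

The parameter keeps shrinking towards 0, the minimum of J(θ), and the steps get smaller as the gradient itself gets smaller near the minimum.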

Now, let’s break down how Gradient Descent works for Linear Regression with a Mean Squared Error (MSE) cost function:

Step 1: Define the Model

Assuming a simple linear regression model with one feature x, the linear regression equation is given by: y = mx + b

where:

  • y is the dependent variable (output),
  • x is the independent variable (input),
  • m is the slope of the line (weight), and
  • b is the y-intercept (bias).
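
In code, this model is just a one-line function (a tiny sketch):

```python
def predict(x, m, b):
    # Linear model: y_hat = m * x + b
    return m * x + b

print(predict(2.0, m=3.0, b=1.0))  # 3 * 2 + 1 = 7.0
```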

Step 2: Initialize Parameters

Choose initial values for the parameters (e.g., the weights in a linear regression model). The initial values can be zeros, ones, small random numbers, or values learned from a previous training session.
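
For example (a small sketch for the single-feature case):

```python
import numpy as np

# Common choices for the initial parameter values
m, b = 0.0, 0.0                          # start from zeros
# m, b = 0.01 * np.random.randn(), 0.0   # or from small random values
```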

Step 3: Define the Cost Function

A commonly used cost function for linear regression is the Mean Squared Error (MSE). The Mean Squared Error for a set of predictions ŷ and actual values y is given by:

J(m, b) = (1/N) Σᵢ (ŷᵢ − yᵢ)² = (1/N) Σᵢ ((m·xᵢ + b) − yᵢ)²

where N is the number of training examples.

We want to find the best values for m and b that minimize the MSE.
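
As a small sketch of the cost computation (assuming the data is stored in NumPy arrays):

```python
import numpy as np

def mse(x, y, m, b):
    # Mean Squared Error: average of (prediction - actual)^2 over all N examples
    y_hat = m * x + b
    return np.mean((y_hat - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(mse(x, y, m=2.0, b=0.0))  # 0.0, a perfect fit
```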

Step 4: Compute the Derivative of the Cost Function

To perform gradient descent, we need to compute the derivative of the cost function with respect to the model parameters m and b. Let’s find the partial derivatives of the MSE with respect to m and b.

We first want to find the partial derivative of J(m, b) with respect to m:

Let’s define a variable u:

u = (m·xᵢ + b) − yᵢ

so that each term of the cost function becomes u².

In calculus, when we have a function nested inside another function, we often need to use the chain rule to differentiate it. This is called a composite function.

In our case, the composite function is u², where u = (m·xᵢ + b) − yᵢ. Here, u² is the outer function and u = (m·xᵢ + b) − yᵢ is the inner function.

The chain rule states that to differentiate a composite function, you differentiate the outer function and then multiply it by the derivative of the inner function.

So, if we apply the chain rule to our function: we first differentiate the outer function u² with respect to u, which gives us 2u. Then we differentiate the inner function u = (m·xᵢ + b) − yᵢ with respect to m. When we take the derivative with respect to m, we treat all other variables as constants.

  1. Derivative of m·xᵢ with respect to m: here, xᵢ is treated as a constant. The derivative of m (with respect to m) is 1. So, the derivative of m·xᵢ is 1 · xᵢ = xᵢ.
  2. Derivative of a constant with respect to m: the derivative of any constant with respect to a variable is 0. In this case, b − yᵢ is treated as a constant because it does not contain the variable m. So, its derivative with respect to m is 0.

Therefore, the derivative of (m·xᵢ + b) − yᵢ with respect to m is xᵢ − 0 = xᵢ.

So, we have:

∂u/∂m = xᵢ

Substituting this back into the chain rule expression gives the partial derivative of the cost with respect to m:

∂J/∂m = (1/N) Σᵢ 2u · xᵢ = (2/N) Σᵢ xᵢ ((m·xᵢ + b) − yᵢ)

Repeat the same process to find the partial derivative of J(m, b) with respect to b:

Again, define the same variable u:

u = (m·xᵢ + b) − yᵢ

If we apply the chain rule to our function, we first differentiate the outer function u² with respect to u, which gives us 2u. Then we differentiate the inner function u = (m·xᵢ + b) − yᵢ with respect to b. When we take the derivative with respect to b, we treat all other variables as constants.

  1. Derivative of b with respect to b: the derivative of a variable with respect to itself is always 1. This is because if you increase b by a tiny amount, b itself increases by the same amount. So, the derivative of b with respect to b is 1.
  2. Derivative of a constant with respect to b: the derivative of any constant with respect to a variable is 0. In this case, m·xᵢ − yᵢ is treated as a constant because it does not contain the variable b. So, its derivative with respect to b is 0.

Therefore, the derivative of (m·xᵢ + b) − yᵢ with respect to b is 1 − 0 = 1.

So, we have:

∂u/∂b = 1

Substituting this back into the chain rule expression gives the partial derivative of the cost with respect to b:

∂J/∂b = (1/N) Σᵢ 2u · 1 = (2/N) Σᵢ ((m·xᵢ + b) − yᵢ)

These partial derivatives give us the gradient (direction of steepest ascent) of the cost function. In gradient descent, we want to move in the opposite direction (direction of steepest descent) to minimize the cost function.
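
A quick way to sanity-check these formulas is to compare them against a numerical finite-difference approximation (a small sketch with illustrative toy data and an arbitrary parameter setting):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m, b = 0.5, -1.0
N = len(x)

def J(m, b):
    # MSE cost for the current parameters
    return np.mean((m * x + b - y) ** 2)

# Analytic partial derivatives from the formulas derived above
dJ_dm = (2.0 / N) * np.sum(x * (m * x + b - y))
dJ_db = (2.0 / N) * np.sum(m * x + b - y)

# Finite-difference (numerical) approximations of the same derivatives
eps = 1e-6
num_dm = (J(m + eps, b) - J(m - eps, b)) / (2 * eps)
num_db = (J(m, b + eps) - J(m, b - eps)) / (2 * eps)

print(dJ_dm, num_dm)  # the analytic and numerical values should match closely
print(dJ_db, num_db)
```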

Step 5: Update Parameters

Subtract a fraction of the gradient from the current parameters to move in the direction of steepest decrease of the cost function. The size of the step is determined by the learning rate. So, in each iteration of the gradient descent algorithm, we update m and b as follows:

m := m − α · ∂J/∂m = m − α · (2/N) Σᵢ xᵢ ((m·xᵢ + b) − yᵢ)
b := b − α · ∂J/∂b = b − α · (2/N) Σᵢ ((m·xᵢ + b) − yᵢ)

where α is the learning rate.

Repeat Steps 3–5: This process is repeated until the cost function converges to the minimum value. Goal: min J(m,b)

Note: you can alternatively find the partial derivatives with respect to the slope (m) and the intercept (b) as follows:

Note: the Mean Squared Error (MSE) cost function is defined as:

MSE = (1/N) · Σ (actual − predicted)²

Here, (N) is the total number of observations. The division by (N) is used to calculate the average of the squared differences between the predicted and actual values, hence the name “Mean Squared Error”.

One Half Mean Squared Error: we multiply the MSE cost function by ½ because, when you take the derivative of the cost function with respect to the model parameters, the 2 coming from (actual − predicted)² cancels with the ½, resulting in a cleaner derivative expression. So, while it’s technically correct to write the cost function as MSE = (1/N) · Σ (actual − predicted)², it’s common to simplify the derivative by writing the cost function as MSE = (1/(2N)) · Σ (actual − predicted)².

The division by 2 is sometimes used in the cost function for computational convenience. When you take the derivative of the cost function during the gradient descent algorithm, the 2 from the exponent cancels out with the ½ factor, which simplifies the calculations. However, whether you include the ½ factor or not doesn’t change the location of the minimum of the cost function; it only changes the scale of the cost function. The parameters m and b that minimize the cost function are the same in either case.
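
To see the cancellation concretely, write a single error term as e = (actual − predicted) and treat it as a function of a parameter θ; then:

d/dθ [ ½ · e² ] = ½ · 2e · de/dθ = e · de/dθ

whereas without the ½ the derivative is 2e · de/dθ. The extra constant factor of 2 simply rescales every gradient and can equally well be absorbed into the learning rate α.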

Below is a simple Python implementation of Gradient Descent for linear regression using the Mean Squared Error (MSE) as the cost function. This example assumes a univariate linear regression (one feature). You can extend it for multivariate linear regression by adjusting the feature matrix accordingly.
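
A minimal sketch of such an implementation (assuming NumPy, a fixed learning rate, and a fixed number of iterations; the data below is a hypothetical toy set generated from y = 2x + 1):

```python
import numpy as np

def gradient_descent_linreg(x, y, alpha=0.01, n_iters=5000):
    """Fit y = m*x + b by gradient descent on the MSE cost."""
    m, b = 0.0, 0.0                          # Step 2: initialize parameters
    N = len(x)
    for _ in range(n_iters):
        error = (m * x + b) - y              # prediction minus actual
        dm = (2.0 / N) * np.sum(x * error)   # Step 4: dJ/dm
        db = (2.0 / N) * np.sum(error)       # Step 4: dJ/db
        m -= alpha * dm                      # Step 5: update parameters
        b -= alpha * db
    return m, b

# Hypothetical toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.shape)

m, b = gradient_descent_linreg(x, y)
print(m, b)  # should end up close to 2 and 1
```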

Illustration of how Gradient Descent finds the optimal parameters for Linear Regression.

Now, let’s have a look at a common error made while implementing gradient descent:
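
One classic pitfall, for instance, is updating the parameters sequentially instead of simultaneously, so that the second gradient is evaluated at a partially updated point (a small sketch with illustrative toy data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
N, alpha = len(x), 0.1
m, b = 0.0, 0.0

def dJ_dm(m, b):
    return (2.0 / N) * np.sum(x * (m * x + b - y))

def dJ_db(m, b):
    return (2.0 / N) * np.sum(m * x + b - y)

# Incorrect: sequential update, so b's gradient is computed with the already-updated m
m_bad = m - alpha * dJ_dm(m, b)
b_bad = b - alpha * dJ_db(m_bad, b)

# Correct: compute both gradients at the same point (m, b), then update together
dm, db = dJ_dm(m, b), dJ_db(m, b)
m_new, b_new = m - alpha * dm, b - alpha * db

print(m_bad, b_bad, "vs", m_new, b_new)
```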

Closing note — As we conclude Part 5, we’ve delved into the mathematical underpinnings of gradient descent, unraveling its complexity. Let this knowledge deepen your understanding and guide your pursuit of optimization mastery. Stay determined, stay curious. Part 6 is just ahead, promising more insights. Until then, keep exploring, keep learning, and let’s continue our journey through the fascinating world of machine learning together!
