Gradient descent algorithm explained with linear regression example

Dhanoop Karunakaran
Intro to Artificial Intelligence
6 min read · May 21, 2020
Gradient descent algorithm. Source:[6]

Gradient descent is an optimisation algorithm used to find the parameter values that minimise a loss function. For instance, the algorithm iteratively adjusts the parameters of a neural network, such as its weights and biases, to find the values that minimise the loss function. Before we jump into the algorithm, we need to explain partial derivatives and the gradient.

Partial derivatives and gradient

A partial derivative of a function of multiple variables is its derivative with respect to one of those variables, with the others held constant. It gives the rate of change of the function in the direction of that variable, i.e. it determines how changes in that variable influence the function.

Consider the multivariable function f(x, y) = xy² + x³. We can find the partial derivatives of the function, that is, the derivatives of the function w.r.t. each of the variables x and y (a bit of calculus knowledge is required to compute them).

The partial derivative of the function f w.r.t. the variable x is:

∂f/∂x = y² + 3x²

The partial derivative of the function f w.r.t. the variable y is:

∂f/∂y = 2xy

These are the rates of change of the function in the directions of x and y. Now we can define the concept of a gradient. For a function of several variables, the gradient is the vector made up of all the partial derivatives of the function. In the above case, the partial derivatives w.r.t. x and y are the components of the gradient vector. In a lower-dimensional (single-variable) function, the gradient is simply the ordinary derivative.

The gradient of the function f(x, y) is represented as below:

∇f(x, y) = [∂f/∂x, ∂f/∂y] = [y² + 3x², 2xy]
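To make this concrete, here is a small Python sketch (my own illustration, not from the original article) that compares the analytical partial derivatives of f(x, y) = xy² + x³ with a numerical finite-difference approximation:

```python
# Check the analytical gradient of f(x, y) = x*y**2 + x**3 numerically.

def f(x, y):
    return x * y**2 + x**3

def analytical_gradient(x, y):
    # df/dx = y^2 + 3x^2, df/dy = 2xy (from the partial derivatives above)
    return y**2 + 3 * x**2, 2 * x * y

def numerical_gradient(x, y, h=1e-6):
    # Central finite differences approximate each partial derivative.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

print(analytical_gradient(2.0, 3.0))  # (21.0, 12.0)
print(numerical_gradient(2.0, 3.0))   # approximately (21.0, 12.0)
```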

The slope of a tangent line. Source: [7]

Intuitively, the derivative of a function is the slope of the tangent line, which gives the rate of change at a given point, as shown above. If the function is higher-dimensional, we compute the partial derivatives to find the rate of change at a given point, and the gradient, the vector of those partial derivatives, plays the role of the slope. In a lower-dimensional function, the gradient is simply the slope of the tangent line that determines the rate of change at a given point.

Slope measures both the direction and the steepness of the line

The gradient gives the direction of maximum change, and its magnitude indicates the maximum rate of change. The gradient always points in the direction of the steepest increase of the objective function.

J(θ0, θ1) is the cost function, and θ0 and θ1 are its variables or parameters. Gradient descent finds the optimal values of these parameters that minimise the cost function. Source: [2]

Gradient descent is the algorithm that iteratively updates the variables or parameters of a cost function in the direction of the negative gradient until the minimum of the cost function is reached.

Let’s explain this concept in the context of linear regression.

Overall, we need to train the regression model with historical data (X, Y) so that it can predict Y from X with high accuracy. Linear regression works by finding the coefficients of the line that best fits the historical data. We can represent a line with the equation Y = mX + b, where m and b are the coefficients or variables of the function. Our task is to use the gradient descent approach to optimise the coefficients of the regression model in such a way that it has a low prediction error.

Steps for the gradient descent

The steps below are a modified version of the pseudo-code from source [4].

1. Initialise the coefficients m and b with random values

For example, take m = 1 and b = 2; substituting into the line equation y = mx + b gives y = 1*x + 2. This can give a line that does not fit the historical data well, as shown below.

The line is not a good fit, and the idea is to find the optimal values of the line's coefficients that best fit the historical data. Once we find the best-fitting line, we can use this regression model for prediction.
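As a rough sketch of this step in Python (the data values and variable names are illustrative, not from the article), initialising the coefficients and drawing the corresponding first line could look like this:

```python
import random

# Hypothetical historical data (purely illustrative).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.1, 4.9, 7.2, 9.1, 10.8]

# Step 1: initialise the coefficients with random values.
m = random.uniform(0.0, 2.0)   # e.g. m = 1
b = random.uniform(0.0, 3.0)   # e.g. b = 2

# Predictions of this initial, probably badly fitted, line y = m*x + b.
Y_pred = [m * x + b for x in X]
print(Y_pred)
```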

2. Compute the gradient.

We use the Sum of Squared Errors (SSE) as our loss/cost function to minimise the prediction error. In this case, the gradient of the SSE consists of the partial derivative of SSE w.r.t. m and the partial derivative of SSE w.r.t. b.

The SSE equation is as follows:

SSE = (1/2) Σ (yᵢ − (m·xᵢ + b))²

where (xᵢ, yᵢ) are the data points and m·xᵢ + b is the prediction for data point i.

The partial derivative of SSE w.r.t. m is:

∂SSE/∂m = −Σ (yᵢ − (m·xᵢ + b))·xᵢ

The partial derivative of SSE w.r.t. b is:

∂SSE/∂b = −Σ (yᵢ − (m·xᵢ + b))

Finally, the gradient is made up of all the partial derivatives, i.e. both partial derivatives of SSE, as shown below:

∇SSE = [∂SSE/∂m, ∂SSE/∂b]

As we discussed, the gradient gives the direction of movement of m and b w.r.t. the SSE.

In the first step, we initialised m and b with random values. In each subsequent iteration, m and b are updated using step 3. In this step, we compute the partial derivative of SSE w.r.t. m and the partial derivative of SSE w.r.t. b using the above equations for each data point in X. Finally, we calculate the sum of all the partial derivatives w.r.t. m and all the partial derivatives w.r.t. b. In other words, we compute the gradient of SSE for the data X.
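A minimal sketch of this gradient computation in Python (illustrative data and names, not the article's code) might look like this:

```python
# Compute the gradient of SSE w.r.t. m and b by summing the per-point
# partial derivatives over the whole data set X, Y.

X = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative historical data
Y = [3.1, 4.9, 7.2, 9.1, 10.8]

def sse_gradient(X, Y, m, b):
    grad_m, grad_b = 0.0, 0.0
    for x, y in zip(X, Y):
        y_pred = m * x + b
        grad_m += -(y - y_pred) * x   # d(SSE)/dm for one data point
        grad_b += -(y - y_pred)       # d(SSE)/db for one data point
    return grad_m, grad_b

print(sse_gradient(X, Y, m=1.0, b=2.0))
```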

3. Update coefficients in the direction of optimal m and b

We can update the coefficients m and b using the gradient calculated from the above equations.

The update rule of coefficient m is:

m = m − r·(∂SSE/∂m)

The update rule of coefficient b is:

b = b − r·(∂SSE/∂b)

Here r is the learning rate.

4. Use new m and b for prediction

We use the data X with the new m and b, computed in the above step, to draw the line that fits the data. We calculate the squared error of each data point in X and sum them to find the total SSE, which is the sum of squared errors of the data points divided by 2. The total SSE indicates the error rate of the model's prediction.
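As a sketch (assuming the same illustrative data as above), computing the total SSE of the updated line could look like this:

```python
# Total SSE: half the sum of squared errors over all data points.

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.1, 4.9, 7.2, 9.1, 10.8]

def total_sse(X, Y, m, b):
    return 0.5 * sum((y - (m * x + b)) ** 2 for x, y in zip(X, Y))

print(total_sse(X, Y, m=1.0, b=2.0))   # error of the current line
```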

5. Repeat steps 2, 3 and 4

This diagram shows the overall steps of the gradient descent algorithm: compute the gradient of the loss function at a specific point on the graph, then update the weight w (the variable of the loss function) using the gradient to move towards the minimum of the loss. This process repeats until the minimum is reached. Source: [5]

We need to repeat steps 2, 3, and 4 until the optimal values of the coefficients m and b are found, i.e. the values that reduce the SSE to its minimum. The optimal values of m and b enable the model to predict Y with the highest accuracy.
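Putting steps 1 to 5 together, a minimal end-to-end sketch in Python (the data, learning rate, and iteration count are illustrative choices, not from the article) could look like this:

```python
# Gradient descent for simple linear regression, following steps 1-5 above.

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [3.1, 4.9, 7.2, 9.1, 10.8]

m, b = 1.0, 2.0          # step 1: initial coefficients
r = 0.01                 # learning rate
n_iterations = 1000

for _ in range(n_iterations):
    # Step 2: gradient of SSE w.r.t. m and b over all data points.
    grad_m = sum(-(y - (m * x + b)) * x for x, y in zip(X, Y))
    grad_b = sum(-(y - (m * x + b)) for x, y in zip(X, Y))

    # Step 3: update the coefficients against the gradient.
    m -= r * grad_m
    b -= r * grad_b

    # Step 4: total SSE of the updated line (watch it decrease over iterations).
    sse = 0.5 * sum((y - (m * x + b)) ** 2 for x, y in zip(X, Y))

print(f"m = {m:.3f}, b = {b:.3f}, SSE = {sse:.4f}")
```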

We have found the line that best fits the data for our regression model using the gradient descent algorithm.

This is an overall, intuitive explanation of the gradient descent algorithm. When you think about the gradient descent algorithm for a neural network, the approach explained here is the basis.

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
