A Simple Introduction to Gradient Descent

Hunter Phillips
10 min read · May 17, 2023


Gradient descent is one of the most common optimization algorithms in machine learning. Understanding its basic implementation is fundamental to understanding all the advanced optimization algorithms built off of it.

Background

This article is supplementary to An Introduction to Machine Learning in Python: Simple Linear Regression. It should be read first or in conjunction with this article. It would also be beneficial to have a basic understanding of partial derivatives in calculus because this article also examines the partial derivatives of several variations of the Mean Squared Error (MSE).

Optimization Algorithms

Image by Author

In machine learning, optimization is the process of finding the ideal parameters, or weights, to maximize or minimize a cost or loss function. The global maximum is the largest value of the function over its domain, whereas the global minimum is the smallest value. While there is only one global maximum and/or minimum value, there can be many local maxima and minima. The global minimum or maximum of a cost function indicates where a model’s parameters generate predictions that are close to the actual targets. The local maxima and minima can cause problems when training a model, so their presence should always be considered. The plot above shows an example of each.

There are a few major algorithm groups within this category: bracketing, local descent, first-order, and second-order. The focus of this article will be first-order algorithms that use the first derivative for optimization. Within this category, the gradient descent algorithm is the most popular.

Gradient Descent in One Dimension

Gradient descent is a first-order, iterative optimization algorithm used to minimize a cost function. By using partial derivatives, a direction, and a learning rate, gradient descent decreases the error, or difference, between the predicted and actual values.

The idea behind gradient descent is that the derivative of each weight will reveal its direction and influence on the cost function. In the image below, the cost function is f(w) = w², which is a parabola. The minimum is at (0,0), and the current weight is -5.6. The current loss is 31.36, and the line in orange represents the derivative, or current rate of change for the weight, which is -11.2. This indicates the weight needs to move “downhill” — or become more positive — to reach a loss of 0. This is where gradient descent comes in.

Image by Author

By scaling the gradient with a value known as the learning rate and subtracting the scaled gradient from the weight’s current value, the weight moves toward the minimum and the cost decreases. This can be seen in the image below, where a learning rate of 0.1 is used to minimize the cost function over ten iterations (w₀ to w₉).

Image by Author

In the steps for the algorithm below, a weight is represented by w, with the subscript j denoting its value at the current iteration and j+1 its updated value. The cost function that measures the error is represented by f, and the partial derivative of f with respect to a weight is that weight’s component of the gradient. The learning rate is represented by α.

  • select a learning rate and the number of iterations
  • choose random values for the parameters
  • update the parameters with the equation below
  • repeat step three until the max number of iterations is reached
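
In symbols, the update rule for each weight is:

wⱼ₊₁ = wⱼ − α · ∂f(wⱼ)/∂w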

When taking the partial derivative, or gradient, of a function, only one parameter can be assessed at a time, and the other parameters are treated as constants. For the example above, f(w) = w², there is only one parameter, so the derivative is f′(w) = 2w. The formula for updating the parameter follows:
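wⱼ₊₁ = wⱼ − α · 2wⱼ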

Using a learning rate of 0.1 and a starting weight of -5.6, the first ten iterations follow:

Table by Author

The table demonstrates how each component of the formula helps minimize the loss. By subtracting the scaled gradient, the new weight becomes more positive, and the slope at the new weight is less steep. As the gradient approaches zero, each iteration yields a smaller update.
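
To make these steps concrete, here is a minimal Python sketch of the same procedure (the variable names are illustrative, not code from the original article):

```python
# Minimal sketch: gradient descent on f(w) = w**2, whose derivative is 2w
learning_rate = 0.1
w = -5.6  # starting weight w0

print(f"w0 = {w:.4f}, loss = {w ** 2:.4f}")
for i in range(1, 10):
    gradient = 2 * w                   # f'(w) = 2w
    w -= learning_rate * gradient      # new weight = old weight - α * gradient
    print(f"w{i} = {w:.4f}, loss = {w ** 2:.4f}")
```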

This basic implementation of gradient descent can be applied to almost any cost function, including those with numerous weights. A few variations of the mean squared error can be considered.

Gradient Descent with the Mean Squared Error (MSE)

What is the MSE?

A popular cost function for machine learning is the Mean Squared Error (MSE).
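
MSE = (1/n) Σ (Ŷ − Y)²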

This function finds the difference between the model’s prediction (Ŷ) and the expected output (Y). It then squares the difference to ensure the output is always positive, which means either Ŷ or Y can come first when calculating the difference. This is repeated across a set of n points. By summing the squared differences of all the points and dividing by n, the output is the mean squared difference, or error. It is an easy way of assessing the model’s performance on all the points simultaneously. A simple example can be seen below:

Table by Author
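
The same calculation is easy to verify in Python (the sample values here are illustrative, not the ones from the table):

```python
# Computing the MSE by hand for a few illustrative points
y_pred = [2.5, 0.0, 2.0, 8.0]   # model predictions, Ŷ
y_true = [3.0, -0.5, 2.0, 7.0]  # expected outputs, Y

n = len(y_true)
mse = sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)) / n
print(mse)  # 0.375
```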

In this formula, Ŷ represents a model’s prediction. In regression, the model’s equation may contain one or more weights depending on the requirements of the training data. The table below reflects these situations.

Table by Author
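
Concretely, the variations examined in the rest of this article correspond to models of the following forms (matching the one-, two-, and three-weight cases below):

  • one weight: Ŷ = wX
  • two weights: Ŷ = w₀ + w₁X
  • three weights: Ŷ = w₀ + w₁X₁ + w₂X₂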

Now, to perform gradient descent with any of these equations, their gradients must be calculated. The gradient contains the partial derivatives for a function:
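∇f = [ ∂f/∂w₀, ∂f/∂w₁, …, ∂f/∂wₖ ]ᵀ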

Each weight’s partial derivative has to be calculated. A partial derivative is calculated in the same manner as a normal derivative, but every variable that is not being considered must be treated as a constant. The gradients for the MSE variations listed above can be examined below.

One Weight

When taking the gradient of the MSE with only one weight, the derivative can be calculated with respect to w. X, Y, and n must be treated as constants. With this in mind, the fraction and sum can be moved outside of the derivative:
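∂MSE/∂w = (1/n) Σ ∂/∂w (wX − Y)²

(Here the single-weight model is taken to be Ŷ = wX, and the sum runs over the n training points.)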

From here, the chain rule can be used to calculate the derivative with respect to w:
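∂MSE/∂w = (1/n) Σ 2(wX − Y) · X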

Now, this can be simplified:
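∂MSE/∂w = (2/n) Σ (wX − Y) X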

Two Weights

When taking the gradient of the MSE with two weights, the partial derivatives must be taken with respect to both parameters, w₀ and w₁. When taking the partial derivative of w₀, the following variables are treated as constants: X, Y, n, and w₁. When taking the partial derivative of w₁, the following variables are treated as constants: X, Y, n, and w₀. The same steps as the previous example can be repeated. First, the fraction and sum can be moved outside the derivative.
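
With a two-weight model of the form Ŷ = w₀ + w₁X:

∂MSE/∂w₀ = (1/n) Σ ∂/∂w₀ (w₀ + w₁X − Y)²
∂MSE/∂w₁ = (1/n) Σ ∂/∂w₁ (w₀ + w₁X − Y)²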

From here, the chain rule can be used to calculate the derivative with respect to each weight:
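∂MSE/∂w₀ = (1/n) Σ 2(w₀ + w₁X − Y) · 1
∂MSE/∂w₁ = (1/n) Σ 2(w₀ + w₁X − Y) · X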

Finally, they can be simplified.
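
∂MSE/∂w₀ = (2/n) Σ (w₀ + w₁X − Y)
∂MSE/∂w₁ = (2/n) Σ (w₀ + w₁X − Y) X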

Notice that the only difference between the equations is X.

Three Weights

When taking the gradient of the MSE with three weights, the partial derivatives must be taken with respect to each parameter. When taking the partial derivative of one weight, X, Y, n, and the other two weights will be treated as constants. The same steps as the previous example can be repeated. First, the fraction and sum can be moved outside the derivative.
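
With a three-weight model of the form Ŷ = w₀ + w₁X₁ + w₂X₂:

∂MSE/∂w₀ = (1/n) Σ ∂/∂w₀ (w₀ + w₁X₁ + w₂X₂ − Y)²
∂MSE/∂w₁ = (1/n) Σ ∂/∂w₁ (w₀ + w₁X₁ + w₂X₂ − Y)²
∂MSE/∂w₂ = (1/n) Σ ∂/∂w₂ (w₀ + w₁X₁ + w₂X₂ − Y)²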

From here, the chain rule can be used to calculate the derivative with respect to each weight:
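∂MSE/∂w₀ = (1/n) Σ 2(w₀ + w₁X₁ + w₂X₂ − Y) · 1
∂MSE/∂w₁ = (1/n) Σ 2(w₀ + w₁X₁ + w₂X₂ − Y) · X₁
∂MSE/∂w₂ = (1/n) Σ 2(w₀ + w₁X₁ + w₂X₂ − Y) · X₂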

Finally, they can be simplified.
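
∂MSE/∂w₀ = (2/n) Σ (w₀ + w₁X₁ + w₂X₂ − Y)
∂MSE/∂w₁ = (2/n) Σ (w₀ + w₁X₁ + w₂X₂ − Y) X₁
∂MSE/∂w₂ = (2/n) Σ (w₀ + w₁X₁ + w₂X₂ − Y) X₂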

As mentioned previously, the only difference between each partial derivative is the input feature, X. This can be generalized for k weights in the next example.

More Than Three Weights

When taking the gradient of the MSE with k weights, the partial derivatives must be taken with respect to each parameter. When taking the partial derivative of one weight, X, Y, n, and the other k-1 weights will be treated as constants. As seen in the previous example, only the input feature of each partial derivative changes when there are more than two weights.
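
In symbols, the j-th partial derivative follows the same pattern: ∂MSE/∂wⱼ = (2/n) Σ (Ŷ − Y) Xⱼ, where Xⱼ is the input feature multiplied by wⱼ (with X₀ = 1 for the bias weight).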

Matrix Derivation

The formulas above show how to use gradient descent without explicitly taking advantage of vectors and matrices. However, most of machine learning is best expressed with vector and matrix operations. For a quick overview, see A Simple Introduction to Tensors.

The rest of this article will be dedicated to using matrix calculus to derive the derivative of the MSE. To start, Ŷ and Y should be understood as matrices with sizes of (n samples, 1). Both are matrices with 1 column and n rows, or they can be viewed as column vectors, which would change their notation to lowercase:
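ŷ = [ŷ₁, ŷ₂, …, ŷₙ]ᵀ   y = [y₁, y₂, …, yₙ]ᵀ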

The MSE is element-wise vector subtraction between ŷ and y, followed by the dot product of the difference with itself. Remember, the dot product can only occur if sizes are compatible. Since the goal is to have a scalar output, the first vector must be transposed.
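
MSE = (1/n) (ŷ − y)ᵀ(ŷ − y)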

From here, ŷ can be replaced with Xw for regression. X is a matrix with a size of (n samples, num features), and w is a column vector with a size of (num features, 1).
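
MSE = (1/n) (Xw − y)ᵀ(Xw − y)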

The next step is to simplify the equation before taking the derivative. Notice that w and X switch positions when the first factor is transposed, which keeps their multiplication valid: (1, num features) x (num features, n samples) = (1, n samples).
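
MSE = (1/n) (wᵀXᵀ − yᵀ)(Xw − y)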

These error calculations can then be multiplied together.
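
MSE = (1/n) (wᵀXᵀXw − wᵀXᵀy − yᵀXw + yᵀy)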

Notice that the third term is a scalar, so it can be rewritten by transposing it (a scalar equals its own transpose). Then, it can be added to the second term.
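
MSE = (1/n) (wᵀXᵀXw − 2wᵀXᵀy + yᵀy)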

Now, the partial derivative of the MSE can be taken with respect to the weight.

This is equivalent to taking the derivative of each term:
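∂MSE/∂w = (1/n) [ ∂(wᵀXᵀXw)/∂w − 2 ∂(wᵀXᵀy)/∂w + ∂(yᵀy)/∂w ]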

Each term that is not w can be treated as a constant. The derivative of each component can be computed using these rules:
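  • ∂(wᵀa)/∂w = ∂(aᵀw)/∂w = a, when the vector a does not depend on w
  • ∂(wᵀAw)/∂w = 2Aw, when the matrix A is symmetric and does not depend on w
  • ∂c/∂w = 0, when the term c does not contain w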

The first term, wᵀXᵀXw, follows the second rule with A = XᵀX (which is symmetric). The second term, −2wᵀXᵀy, follows the first rule with a = Xᵀy, and the last term, yᵀy, contains no w, so its derivative is zero by the third rule.
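
Putting these together:

∂MSE/∂w = (1/n) (2XᵀXw − 2Xᵀy) = (2/n) Xᵀ(Xw − y)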

This equation can be used in gradient descent to simultaneously calculate all the partial derivatives:
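(2/n) Xᵀ(Xw − y) = [ ∂MSE/∂w₀, ∂MSE/∂w₁, …, ∂MSE/∂wₖ ]ᵀ

The matrix form of the update rule is then wⱼ₊₁ = wⱼ − α · (2/n) Xᵀ(Xwⱼ − y), where wⱼ is the entire weight vector at iteration j. As a rough sketch of how this might look in code (assuming NumPy is available; the function name and the toy data below are illustrative, not from the original article):

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """Minimize the MSE using the matrix-form gradient (2/n) * X.T @ (X @ w - y)."""
    n, num_features = X.shape
    w = np.zeros((num_features, 1))               # arbitrary starting weights
    for _ in range(iterations):
        gradient = (2 / n) * X.T @ (X @ w - y)    # all partial derivatives at once
        w = w - learning_rate * gradient          # update rule
    return w

# Illustrative data for y = 1 + 2x, with a column of ones acting as the bias feature
x = np.linspace(0, 1, 50).reshape(-1, 1)
X = np.hstack([np.ones_like(x), x])               # shape: (n samples, num features)
y = 1 + 2 * x
print(gradient_descent(X, y, learning_rate=0.5, iterations=2000).ravel())  # approaches [1. 2.]
```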

Conclusion

The gradients for the variations of the MSE cost function can be easily used in gradient descent by plugging them into the formula. An example of gradient descent can be found in An Introduction to Machine Learning in Python: Simple Linear Regression.

Please don’t forget to like and follow! :)

References

  1. Symbolab Partial Derivative Calculator
  2. Mathematics Stack Exchange on Deriving the Normal Equation
  3. Matrix Calculus
