Gradient Descent: Explained!

Rowan Curry
4 min read · Nov 19, 2021


Gradient descent is a popular and effective optimization strategy used when training data-based models. Its popularity comes from its generality: it can be paired with almost any model whose cost function is differentiable. Read on to learn more!

Before we get into gradient descent as an optimization strategy, we’ll need to review what a gradient is.

gradient vectors of the surface of cos(x)sin(y)

You might remember the term gradient from calculus as describing the slope of a function. For our purposes, the definition is essentially the same.

A gradient simply measures how much the error changes in response to a small change in each of the weights. This means that the lower the gradient, the flatter the slope, and the slower a model learns. The opposite is also true: the higher the gradient, the steeper the slope, and the faster a model will learn.
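
To make this concrete, here is a small sketch that measures a gradient numerically, weight by weight. The toy error surface and all names below are illustrative choices of mine, not from the article:

```python
import numpy as np

def loss(w):
    # A toy quadratic error surface with its minimum at w = [2, -3]
    return (w[0] - 2) ** 2 + (w[1] + 3) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Central differences: how the error changes as each weight changes
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

print(numerical_gradient(loss, np.array([0.0, 0.0])))  # approximately [-4., 6.]
```

The large components of the gradient at [0, 0] reflect a steep slope there, which is exactly the "faster learning" regime described above.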

Now we can define gradient descent. This optimization algorithm finds the values of a function’s parameters that minimize the cost function. The cost function is simply the method of evaluation chosen to communicate the performance of an algorithm. For a more in depth look at cost functions, check out this article.

Gradient descent finds these optimal parameters by determining the local minimum of a differentiable function through an iterative process. The following equation describes gradient descent:

Gradient Descent: A Simplified Equation

b = a − 𝛾∇F(a)

In this equation:

  1. b describes the next value of the parameters
  2. a describes their current value: gradient descent starts from a random value of a and keeps updating it based on the first-order partial derivatives
  3. the minus sign reflects the minimization aspect of gradient descent: we step against the gradient
  4. 𝛾 describes the learning rate, which will be discussed more in depth further on in the article
  5. the gradient term points in the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent
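
The update rule above can be sketched in a few lines of Python. This is a minimal illustration: the toy cost function F(a) = a² (whose gradient is 2a) and all parameter values are my own choices, not from the article:

```python
# Minimal sketch of the update rule b = a - gamma * gradient(a).
def gradient_descent(grad_F, a, gamma=0.1, iterations=100):
    for _ in range(iterations):
        a = a - gamma * grad_F(a)  # b = a - gamma * gradient; b becomes the new a
    return a

# Minimize F(a) = a^2, whose gradient is 2a; the minimum is at a = 0.
minimum = gradient_descent(lambda a: 2 * a, a=5.0)
print(minimum)  # very close to 0
```

Each iteration steps against the gradient, so with this particular F the current value shrinks by a constant factor of (1 − 2𝛾) per step.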

The learning rate, which is chosen by the builder of the model, determines how large (or small) a step gradient descent takes toward the local minimum at each iteration.

learning rate visualizations thanks to Jeremy Jordan

Determining an adequate learning rate is important. If the steps are too large, gradient descent may overshoot the local minimum and never reach it, as displayed above. If the steps are too small, gradient descent will eventually reach the local minimum, but it will take far too long to get there.

You can check if your learning rate is doing well by plotting the number of iterations on the x-axis and the value of your cost function on the y-axis as the optimization runs:

learning rate visualization thanks to Aditya Rakhecha

This is an excellent way to spot at a glance whether your learning rate is appropriate. If gradient descent is working well, the cost function will decrease after every iteration. When the cost function stops decreasing, we say that gradient descent has converged.

While there are some algorithms that can automatically tell you when convergence has occurred, they require you to define a convergence threshold beforehand, which is immensely difficult to predict. For this reason, plots like the one above are your best bet for examining the performance of your gradient descent optimization, as well as for figuring out when it has converged.
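
One way to sketch this diagnostic in code: record the cost at every iteration, and treat the run as converged once the cost stops decreasing by more than a small threshold. The toy cost function, the threshold value, and all names here are illustrative assumptions:

```python
# Sketch: track the cost per iteration and stop once the improvement
# falls below a chosen convergence threshold (tol).
def gradient_descent_with_history(grad_F, F, a, gamma=0.1, tol=1e-8, max_iter=1000):
    costs = [F(a)]
    for _ in range(max_iter):
        a = a - gamma * grad_F(a)
        costs.append(F(a))
        if costs[-2] - costs[-1] < tol:  # cost stopped decreasing: converged
            break
    return a, costs

a, costs = gradient_descent_with_history(lambda a: 2 * a, lambda a: a * a, a=5.0)
# Plotting range(len(costs)) against costs (e.g. with matplotlib) gives
# the iterations-vs-cost diagnostic curve described above.
```

Picking tol is exactly the difficult part the article warns about, which is why eyeballing the plot often works better in practice.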

There are three different types of gradient descent:

  1. Batch Gradient Descent: This is classical gradient descent, also referred to as “vanilla” gradient descent. This method calculates the error of each example within the training dataset, but only updates the model after all training samples have been evaluated. This whole process is one cycle, called a training epoch. While this process is computationally efficient and produces a stable error gradient and stable convergence, that very stability sometimes leads to a state of convergence that isn’t the absolute best the model can do.
  2. Stochastic Gradient Descent (SGD): This method of gradient descent calculates the error of each training observation individually. SGD uses each observation to estimate the gradient, and then takes a step in that direction. While each individual observation provides a poor estimate of the true gradient, if there is enough randomness, the parameters will converge to an adequate global estimate. SGD also works very well for large datasets: since it only considers one observation at a time, it can handle extremely large datasets that don’t fit in memory.
  3. Mini Batch Gradient Descent: This method of gradient descent is the best of both worlds. It splits the training dataset into small batches and performs an update for each of these batches. This strikes a balance between the efficiency of batch gradient descent and the robustness of SGD. It is the go-to algorithm when training a neural network, as well as the most common type of gradient descent associated with deep learning.
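
Mini-batch gradient descent can be sketched for least-squares linear regression as follows. The synthetic data, batch size, learning rate, and epoch count are all illustrative choices of mine:

```python
import numpy as np

# Synthetic regression data: y = X @ true_w plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(2)
gamma, batch_size = 0.1, 32
for epoch in range(100):
    order = rng.permutation(len(X))  # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of MSE on this batch
        w -= gamma * grad  # one update per mini-batch

print(w)  # close to [3.0, -1.0]
```

Setting batch_size to len(X) recovers batch gradient descent, and batch_size of 1 recovers SGD, which is why this variant is described as the best of both worlds.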

So there we have it! Gradient Descent, explained. Stay tuned for an article (from me) on how to implement gradient descent — and in the meantime, check out these great resources:

Implementing Gradient Descent in Python

Implementing Gradient Descent Optimization from Scratch

Linear Regression using Gradient Descent

A Step-by-Step Implementation of Gradient Descent and Backpropagation

