Gradient Descent

An intuitive introduction with animations

Lance Galletti
5 min read · Apr 11, 2023

Gradient Descent is an optimization technique for finding local minima of functions. But what is it optimizing exactly?

Sometimes we can use calculus to find critical points of functions: we can take the derivative and find all values that make the derivative 0. But we may not always be able to do this: try to find the critical points of f(x) = x⁴ − 3x² − x − 1 using this method, for example.
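For this particular f, setting the derivative to zero means solving the cubic 4x³ − 6x − 1 = 0, which has no obvious factorization. A quick numeric sketch (using NumPy's polynomial root finder) shows the critical points do exist; we just can't easily write them down in closed form:

```python
import numpy as np

# f(x) = x^4 - 3x^2 - x - 1, so f'(x) = 4x^3 - 6x - 1.
# Setting f'(x) = 0 means solving a cubic with no obvious
# factorization, so we fall back on a numeric root finder.
crit = np.roots([4, 0, -6, -1])  # coefficients of f'
print(np.sort(crit.real))  # roughly -1.13, -0.17, 1.30
```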

To find the minimum of a function you could, in principle, evaluate the function at every possible input and report the smallest value. In practice, since you can't evaluate every input, you must limit yourself to a range of input values:
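Here is a minimal sketch of that brute-force idea, using the same quartic as above (the range [-2, 2] and the grid resolution are arbitrary choices):

```python
import numpy as np

def f(x):
    return x**4 - 3 * x**2 - x - 1

# Evaluate f on a (necessarily arbitrary) range of inputs
# and report the input with the smallest observed value.
xs = np.linspace(-2, 2, 1001)
best = xs[np.argmin(f(xs))]
print(best, f(best))  # smallest value found on this grid
```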

There isn’t a good way to pick a range without knowing more about the function, so we often just pick one arbitrarily. The larger the range, the more values we can observe, but the more computationally expensive the search becomes.

If you think of this range as a window within which you can observe the function values, then you can observe more of the function by repeatedly sliding the window around:

By strategically choosing the direction in which to move the window, we can find smaller and smaller values. For example, if the smallest value is, say, in the top right corner of the window, you would move the window to the top right. If the smallest value is within the window then, great, we’ve found a minimum!

This strategy works even if we reduce the size of the window, which actually makes things more efficient because we only really care about the points at the border of the window:

What happens if we reduce the size of the window to the smallest possible size it can be (i.e. a point)? How can we know where the function is decreasing if we only have a single point to work with?

Recall that derivatives represent the instantaneous rate of change of a function! In multiple dimensions, the gradient is the vector of axis-wise partial derivatives (i.e. the rates of change in each coordinate direction).
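One way to make this concrete is to compute the gradient analytically and check it against a central-difference approximation of each partial derivative. The sketch below uses f(x, y) = sin(x² + y²), the same loss as the template code later in this post:

```python
import numpy as np

def loss(p):
    return np.sin(np.sum(p**2))

def gradient(p):
    # Analytic gradient: each partial derivative is
    # 2 * p_i * cos(p_1^2 + ... + p_n^2).
    return 2 * p * np.cos(np.sum(p**2))

def numeric_gradient(f, p, h=1e-6):
    # Approximate each partial derivative with a central difference:
    # (f(p + h*e_i) - f(p - h*e_i)) / (2h)
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([-0.5, -0.5])
print(gradient(p))
print(numeric_gradient(loss, p))  # the two should agree closely
```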

Since the gradient points in the direction of steepest ascent, by following the gradient you would eventually end up at a maximum of the function, should one exist. Conversely, if you follow the direction opposite to the gradient, you would eventually end up at a minimum of the function, should one exist.

Assuming we are evaluating the function at a specific point p, to follow the gradient means to nudge p by a certain amount (called a step size, or learning rate) in that direction.

Gradient Descent Algorithm

1. Define a step size α (a tuning parameter)
and a number of iterations (called epochs)
2. Initialize p to a random point
3. pnew = p − α ∇f(p)
4. p ← pnew
5. Repeat 3 & 4 for the defined number of epochs
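The steps above can be sketched in a few lines of NumPy, again with the loss f(x, y) = sin(x² + y²) from the template below. The starting point is fixed rather than random here so the run is reproducible:

```python
import numpy as np

def loss(p):
    return np.sin(np.sum(p**2))

def gradient(p):
    return 2 * p * np.cos(np.sum(p**2))

alpha = 0.1                      # 1. step size
epochs = 100                     #    number of iterations
p = np.array([-0.5, -0.5])       # 2. initial point (fixed for reproducibility)
for _ in range(epochs):          # 5. repeat for the defined number of epochs
    p = p - alpha * gradient(p)  # 3. & 4. nudge p against the gradient
print(p, loss(p))  # ends up near the local minimum at the origin
```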

Implementation

Here is template code that I used to generate these animations. I recommend you try it out and make changes to the function (and its gradient), the initial point, the learning rate etc. to get a better understanding of how it all works:

import numpy as np
from PIL import Image as im
import matplotlib.pyplot as plt

TEMPFILE = "temp.png"

def snap(x, y, pts, losses, grad):
    fig = plt.figure(figsize=(14, 9))
    ax = plt.axes(projection='3d')
    ax.view_init(20, -20)
    ax.plot_surface(x, y, loss(np.array([x, y])), color='r', alpha=.4)
    ax.plot(np.array(pts)[:, 0], np.array(pts)[:, 1], losses, 'o-', c='b', markersize=10, zorder=10)
    ax.plot(np.array(pts)[-1, 0], np.array(pts)[-1, 1], -1, 'o-', c='b', alpha=.5, markersize=7, zorder=10)

    # Plot Gradient Vector
    X, Y, Z = [pts[-1][0]], [pts[-1][1]], [-1]
    U, V, W = [-grad[0]], [-grad[1]], [0]
    ax.quiver(X, Y, Z, U, V, W, color='g')
    fig.savefig(TEMPFILE)
    plt.close()
    return im.fromarray(np.asarray(im.open(TEMPFILE)))

def loss(x):
    return np.sin(sum(x**2))  # change this

def gradient(x):
    return 2 * x * np.cos(sum(x**2))  # change this

def gradient_descent(x, y, init, learning_rate, epochs):
    images, losses, pts = [], [loss(init)], [init]
    for _ in range(epochs):
        grad = gradient(init)
        images.append(snap(x, y, pts, losses, grad))
        init = init - learning_rate * grad
        losses.append(loss(init))
        pts.append(init)
    return images

init = np.array([-.5, -.5])  # change this
learning_rate = 1.394  # change this
x, y = np.meshgrid(np.arange(-2, 2, 0.1), np.arange(-2, 2, 0.1))  # change this
images = gradient_descent(x, y, init, learning_rate, 12)

images[0].save(
    'gradient_descent.gif',
    optimize=False,
    save_all=True,
    append_images=images[1:],
    loop=0,
    duration=500
)

Limitations and Considerations

1. Locality

The minimum or maximum found with this method will always be a local minimum or maximum. That is, there may exist a smaller minimum or a greater maximum elsewhere.

2. Initial Condition

The initial point largely influences which local minimum you find.
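A small sketch of this, using the one-dimensional quartic f(x) = x⁴ − 3x² − x − 1 from earlier: the same step size started from two different points lands in two different local minima (the locations given are approximate):

```python
def f_prime(x):
    # derivative of f(x) = x^4 - 3x^2 - x - 1
    return 4 * x**3 - 6 * x - 1

def descend(x, alpha=0.01, epochs=500):
    for _ in range(epochs):
        x = x - alpha * f_prime(x)
    return x

# Same step size, different starting points -> different local minima.
print(descend(1.0))    # converges near x ≈ 1.30
print(descend(-1.0))   # converges near x ≈ -1.13
```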

3. Choice of Step Size

The size of the nudge the point p gets is controlled not just by alpha, but also by the magnitude of the gradient. You can see above how the steps become smaller as we get closer to the minimum. This is because the gradient shrinks as we approach a critical point.
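A one-dimensional sketch on f(x) = x² makes this visible: alpha stays fixed, yet each step is smaller than the last because the gradient 2x shrinks as x approaches the minimum at 0:

```python
def grad(x):
    # gradient of f(x) = x^2
    return 2 * x

x, alpha = 1.0, 0.1
for _ in range(5):
    step = alpha * grad(x)
    print(f"x = {x:.4f}, step size = {abs(step):.4f}")
    x = x - step
# alpha never changes, but each printed step is smaller than the last
```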

With a certain combination of alpha and gradient, you can even end up going the “wrong” way:

You can also overshoot some nearby minima:

You can even get into loops:
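All three behaviours can be reproduced on the simple function f(x) = x², where the update is x ← x − α · 2x. With α = 1 the update maps x to −x exactly, so the iterates loop forever between two points; with α > 1 every step overshoots by more than it gained, and the iterates diverge:

```python
def grad(x):
    # gradient of f(x) = x^2
    return 2 * x

def run(x, alpha, epochs=10):
    for _ in range(epochs):
        x = x - alpha * grad(x)
    return x

print(run(1.0, 0.1))   # converges toward the minimum at 0
print(run(1.0, 1.0))   # alpha too big: x -> -x forever, a two-point loop
print(run(1.0, 1.1))   # even bigger: each step overshoots further, diverging
```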

Hope this helped you better understand Gradient Descent!


Lance Galletti

Lecturer @Boston University. Principal Software Engineer @Red Hat. Watch my DS videos https://youtube.com/@howithinkabout Always refining.