Gradient Descent: How Machine Learning Learns (Part I)

Sameer T. · Hoyalytics · Mar 18, 2023

How does a linear regression model determine the optimal slope and intercept? You might have heard that it chooses the values that minimize its error function, but how does it actually go about that?

In this article, we’ll cover the algorithm that runs under the hood of many machine learning, deep learning, and general convex optimization processes: gradient descent. It may be helpful to first learn the concepts in 2D with a simple function like a parabola, but the visual examples below should help carry the intuition into 3D!

Intuition

Imagine that the valley pictured below is our loss function, some measure of how far off our predictions are from the true values. We aim to minimize this function to create the best model (one that predicts as close to the actual values as possible).

Optimizing the loss function is analogous to reaching the bottom of the valley!

Analogy for loss function of a model

To visualize this, I’ve superimposed x, y, and z = f(x, y) axes on our image to symbolize some loss function f(x, y). The arrow represents the goal we’re trying to reach: the bottom of the valley, achieved at the coordinate (x, y) that results in the minimum f(x, y) output.

Multivariable interpretation of reaching the bottom of the valley by minimizing loss function output f(x,y)

Understanding Gradient Vectors

Logically, the quickest way to move up or down the mountain is to follow the slope of the mountain itself.

In 2D, the rate of change in some quantity (height, in this case) at a point will simply be the derivative (or instantaneous slope) of y with respect to x at that point. This would be a scalar.

In 3D, you move in 2 ways: front-back (x-axis in this picture) OR side-to-side (y-axis). Following the natural slope of the mountain requires your movement to have both x and y components, like the vectors drawn below.

Natural slope of mountain — gradient vectors

In 3D, the rate of change in height must be calculated individually for each component x and y:

  • By holding y constant, we find the rate of change in height with respect to only x at a point via ∂f/∂x, the partial derivative with respect to x
  • By holding x constant, we find the rate of change in height with respect to only y at a point via ∂f/∂y, the partial derivative with respect to y

In conclusion, if you moved in the direction of the gradient vector (∂f/∂x, ∂f/∂y), you would be following the natural slope of the mountain, the direction of steepest change. Whether that carries you toward the bottom of the valley or away from it is what we’ll work out next.
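To make this concrete, here’s a minimal Python sketch. The paraboloid f(x, y) = x² + y² is an assumption standing in for the valley pictured above, and the function names are mine:

    # A stand-in "valley" loss function (an assumption for illustration,
    # not the exact surface pictured above).
    def f(x, y):
        return x**2 + y**2

    # Its partial derivatives, found by hand: hold y constant for ∂f/∂x,
    # hold x constant for ∂f/∂y.
    def gradient(x, y):
        df_dx = 2 * x
        df_dy = 2 * y
        return df_dx, df_dy

    print(gradient(3.0, -2.0))  # (6.0, -4.0): points uphill, away from (0, 0)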

Now, back to what we’re trying to find: the (x, y) inputs that correspond to the minimum z at the bottom of the valley. We always want to change (x, y) so that we end up closer to that minimum than when we started.

Update Rule

The distinction between (x, y) and z = f(x, y) here is important. For example, in the top left corner of the picture, both the x coordinate and the y coordinate are less than the (x, y) corresponding to the minimum. We need to increase both x and y to decrease z.

  • Generalizing, for a given point (x, y) on the right side of the picture, the mountain slopes upward: the partial derivatives of f are positive, so the gradient vector points away from the minimum. If we added this gradient vector to our (x, y), we would be increasing z and going upwards, away from our goal.
  • This next case might be more confusing, but if we took an (x, y) on the left side, the partial derivatives of f are negative, since z decreases as x and y increase. The gradient vector’s components are therefore negative, so it too points away from the minimum, back up the left slope. Adding it to (x, y) would also increase z and mean moving upwards, away from our goal.

So on either side of the valley, following the direction of the gradient will send us in the exact opposite direction of our goal at the bottom of the valley.
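We can sanity-check this with the stand-in paraboloid from the sketch above: on both sides of its valley, taking a small step along the gradient increases f.

    # Left side: partials are negative, so the gradient points further left
    # (uphill). Right side: partials are positive, pointing further right
    # (also uphill).
    for x, y in [(-3.0, -2.0), (3.0, 2.0)]:
        df_dx, df_dy = gradient(x, y)
        print(f(x, y), "->", f(x + 0.1 * df_dx, y + 0.1 * df_dy))
        # 13.0 -> 18.72 in both cases: f increases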

What if we reversed the direction then, and moved with the negative gradient?

On the left side of the image, negating the gradient flips its negative components to positive, increasing (x, y) and stepping closer to the bottom. On the right, it flips the positive components to negative, decreasing (x, y) and also stepping toward the bottom. The update rule of gradient descent is derived from this notion:

x = x − α * ∂f/∂x, with the partial evaluated at the current (x, y)

y = y − α * ∂f/∂y, with the partial evaluated at the current (x, y)

These equations simply say that at a given initialized point, we calculate the rate of change of f(x, y) with respect to x only and with respect to y only. Then, we scale these partials by some learning rate α, pronounced alpha (more on this later), and subtract the results from our current point (x, y).

In our analogy, we measure the slope of the mountain, scale it by our learning rate, and backtrack from where we are.

Importantly, note that we must calculate both partials first, rather than calculating ∂f/∂x, modifying x, and then finding ∂f/∂y. Otherwise, we would be finding the component rates of change at 2 different points (following the slope of the mountain at 2 different places).
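A minimal sketch of one update step, reusing the hypothetical gradient function from earlier (the name step and parameter alpha are my own choices):

    def step(x, y, alpha):
        # Evaluate BOTH partials at the current (x, y) first. If we updated x
        # and only then evaluated ∂f/∂y, we would be measuring the slope of
        # the mountain at two different places.
        df_dx, df_dy = gradient(x, y)
        return x - alpha * df_dx, y - alpha * df_dy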

And that’s it! We’ve covered all that the 3D gradient descent algorithm does: it initializes a point (x, y), calculates the partials, updates x and y simultaneously to move toward the minimum, and repeats this process until convergence. Different models define convergence differently: some simply run a fixed number of iterations (say, 100), while others continue until an update of the coordinates decreases f(x, y) by less than some small tolerance like 0.001.
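Putting it all together, here is one hedged sketch of the whole loop. The 0.001 tolerance and 1,000-iteration cap mirror the convergence criteria just mentioned but are otherwise arbitrary choices:

    def gradient_descent(f, gradient, x, y, alpha=0.1, tol=1e-3, max_iters=1000):
        # Repeat the simultaneous update until an iteration changes f(x, y)
        # by less than tol, or until we hit the iteration cap.
        prev_loss = f(x, y)
        for _ in range(max_iters):
            df_dx, df_dy = gradient(x, y)
            x, y = x - alpha * df_dx, y - alpha * df_dy
            loss = f(x, y)
            if abs(prev_loss - loss) < tol:
                break
            prev_loss = loss
        return x, y

    # On the stand-in paraboloid, this walks from (3, -2) toward the minimum at (0, 0):
    print(gradient_descent(f, gradient, 3.0, -2.0))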

Convergence

As you can imagine based on the aforementioned conditions for convergence, where we initialize our first point can make a huge difference in model results.

  • For example, we may converge to a local minimum rather than the absolute minimum if a dip behind a ridge in the valley was close to where we started. For this reason, some algorithms use multi-start testing, meaning gradient descent is run from multiple initialization points (see the sketch after this list). For the purpose of our simple algorithm from scratch, we will avoid analyzing functions with several local minima.
  • Additionally, if you initialized on a steep outer edge of the valley, you might never get going in the correct direction of a minimum and instead diverge. For our algorithm, we’ll initialize at a reasonable estimate of our goal.
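A minimal sketch of multi-start testing, reusing gradient_descent from above (the starting points are made up for illustration; in practice they are often random):

    starts = [(3.0, -2.0), (-4.0, 1.0), (0.5, 0.5)]

    # Run gradient descent from each start and keep whichever run lands lowest.
    runs = [gradient_descent(f, gradient, x0, y0) for x0, y0 in starts]
    best_x, best_y = min(runs, key=lambda p: f(p[0], p[1]))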

Perhaps most important for our model’s performance and ability to converge is what we choose for the learning rate. In my experience, 3 scenarios arise from a suboptimal alpha:

Too small of a learning rate means the model will take much longer to converge to our goal, analogous to crawling down the mountain with baby steps and dying of starvation.

The algorithm could also converge too early. If the gradient at the point is small (i.e., the slope is gradual), a small gradient multiplied by a small alpha may fail to change the coordinates much, and by extension fail to change f(x, y) by more than the 0.001 that the computer checks for convergence.

Example of an alpha that’s too small

Too large of a learning rate is inefficient because it can cause the model to shoot past the goal. You’re taking the long way down, bouncing back and forth between the sides of the valley.

Example of an alpha that’s too large

An excessively large learning rate can even cause the model to diverge completely, failing to ever find a solution for the minimum. Think of this as getting lost and leaving the valley entirely.

Example of an alpha that’s much too large

A good gradient descent algorithm will smoothly and efficiently arrive at a minimum, ideally the global minimum:

Example of a good choice for alpha!
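You can reproduce all four regimes by rerunning the loop from earlier on the stand-in paraboloid; the alphas below are illustrative guesses, not tuned recommendations:

    for alpha in (0.001, 0.1, 0.95, 1.1):
        x, y = gradient_descent(f, gradient, 3.0, -2.0, alpha=alpha)
        # On f(x, y) = x² + y², each step scales the coordinates by (1 - 2 * alpha):
        # 0.001 crawls and stops early, 0.1 converges smoothly, 0.95 bounces
        # between the sides of the valley, and 1.1 diverges.
        print(alpha, (x, y), f(x, y))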

Conclusion

To bring all this back to computer science applications, machine learning models train themselves by tweaking their parameters to minimize some loss function that measures the error of their predictions on training data. Better model parameters = better predictions = smaller loss function output and less error.

Take the example of linear regression: we’re searching for a regression slope and intercept that minimizes the sum of squared residuals:

f(m, b) = Σ (yᵢ − (m * xᵢ + b))², summed over all data points (xᵢ, yᵢ)

Linear regression loss function: the sum of squared differences between predicted and actual values

This is just like our examples above, except instead of some function f(x, y) taking a physical coordinate in the valley as its input, our f(m, b) calculates the sum of squared residuals from a given regression’s parameters: its slope and intercept.
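As a preview of Part II, here is a hedged sketch of that loss and its partials; the data and names are hypothetical, and the update rule then treats (m, b) exactly as it treated (x, y):

    # Sum of squared residuals for a candidate slope m and intercept b.
    def ssr(m, b, xs, ys):
        return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    # Partials via the chain rule:
    # ∂f/∂m = Σ -2 * xᵢ * (yᵢ - (m * xᵢ + b)),  ∂f/∂b = Σ -2 * (yᵢ - (m * xᵢ + b))
    def ssr_gradient(m, b, xs, ys):
        df_dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
        df_db = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
        return df_dm, df_db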

In Part II, we’ll put everything we’ve learned together with code and try to optimize a linear regression ourselves!
