Gradient Descent

Ana P · Published in The Startup · 5 min read · Feb 27, 2020

Image by the author, from Tara Canyon

One of the most widely used mathematical algorithms in data science and machine learning is gradient descent. Although we rarely code it by hand, it is the underlying optimization algorithm for a wide variety of models, helping us update the parameters of whatever model we are fitting, from a simple linear regression to a neural network. While some optimization techniques are specific to a particular type of model, gradient descent can optimize almost all of them. Gradient descent behaves a bit differently with neural networks, so for now I will focus on the basics.

The gradient descent update rule: θ_new = θ_old − α · ∇J(θ_old), where α is the learning rate and ∇J is the gradient of the loss function

Before diving in, I’d like to review some key concepts that will help us understand gradient descent. The first is the derivative, a calculus concept that is central here. We can take the derivative at a single point or the derivative of a whole function. At a point, the derivative is the instantaneous rate of change of the function: picture a curved line and a point on it, along with the tangent line, the line that just touches the curve at exactly that point. The slope of this tangent line is the derivative. While these tangent lines can have positive or negative slopes, in machine learning we are especially interested in the points where the slope is 0, since a slope of 0 marks a maximum or a minimum. Maxima and minima also come in two types, global and local, which we will cover later.
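As a concrete sketch (the function and helper names here are my own, for illustration), the slope of the tangent line at a point can be approximated numerically with a small finite difference:

```python
def derivative_at(f, x, h=1e-6):
    """Approximate f'(x), the slope of the tangent line at x,
    using a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2  # a simple curved line

slope_at_3 = derivative_at(f, 3.0)  # positive slope: move left to descend
slope_at_0 = derivative_at(f, 0.0)  # slope of (roughly) 0: a minimum
```

For f(x) = x², the true derivative is 2x, so the approximation returns about 6 at x = 3 and about 0 at x = 0, the minimum.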

Another important term is optimization, which, simply put, means finding the maximum or minimum, depending on the problem. For example, in support vector machines we want to maximize the margin, while in linear regression we want to minimize the error. In this post I focus on minimizing, but maximizing is just as easy: the maximum of a function is the minimum of its negative.
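A tiny illustration of that last point (the function and grid here are made up for the example): to maximize f, minimize −f. A coarse grid search stands in for a real optimizer just to make the idea visible:

```python
f = lambda x: -(x - 2) ** 2 + 5   # peaks at x = 2
neg_f = lambda x: -f(x)           # flipping the sign turns the peak into a valley

grid = [i / 100 for i in range(0, 401)]  # candidate x values from 0.00 to 4.00
x_best = min(grid, key=neg_f)            # minimizing -f recovers the maximum of f
```

The minimizer of −f lands at x = 2.0, exactly where f is maximized.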

The next important concept is the loss function. A loss function describes the discrepancy between our model’s outputs and the target values, and we want to make this error as small as possible. It should not be confused with a utility function, which is the opposite: we want to maximize a utility function, such as an accuracy score, whereas for a loss function like the mean squared error (MSE) we want to get as close to 0 as possible.
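The MSE mentioned above is simple to write down. A minimal sketch (the toy targets and predictions are illustrative):

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions; 0 is a perfect fit."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

perfect = mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])     # 0.0, no error
off_by_one = mse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # 1.0, each prediction off by 1
```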

Finally, our last concept is convexity. For the purposes of this post, a convex function is any u-shaped function. This is one major point where classical machine learning algorithms differ from neural nets, as the loss surfaces of neural nets are non-convex and far more complex.

Combining these concepts will help us understand the basics of the gradient descent algorithm. Gradient descent optimizes a function through an iterative process, repeating the same update until it converges on the best parameters. To start, we make an initial guess to give gradient descent something to improve upon. The algorithm is often described as walking down into a valley toward the lowest point.

Source: dsdeepdive.blogspot.com/2016/03/

There are a few clues that tell us which direction to move down the slope. Given the initial guess, if the slope of the tangent at that point is positive, we move to the left; if it is negative, we move to the right. In other words, we move in the opposite direction of the slope. Besides the direction, we also have to choose the step size. The step size, also known as the learning rate or “alpha”, is an important number with potential pitfalls. Thinking about the hill again: if we are far from the bottom (the global minimum), we want to take larger steps. But if we keep taking big steps, we might overshoot the minimum, and the algorithm will bounce back and forth across the “valley”, getting worse as the step and the gradient keep growing. The opposite problem is a step size that is too small: the algorithm either takes far too long to find the minimum or grinds to a halt before reaching it.
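These failure modes are easy to reproduce. Here is a small sketch (the learning rates and starting point are illustrative) on f(x) = x², whose gradient is 2x:

```python
def run_gd(lr, x0=5.0, steps=50):
    """Run plain gradient descent on f(x) = x**2 (gradient 2x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * (2 * x)  # step = learning rate * slope
    return x

good = run_gd(lr=0.1)    # converges close to the minimum at 0
tiny = run_gd(lr=0.001)  # barely moves in 50 steps: too slow
big = run_gd(lr=1.1)     # overshoots and bounces ever farther from the valley
```

With lr = 0.1 the iterate shrinks toward 0; with lr = 0.001 it has hardly left the starting point after 50 steps; with lr = 1.1 the updates grow without bound, exactly the bouncing-out-of-the-valley behavior described above.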

Source: https://srdas.github.io/DLBook/GradientDescentTechniques.html

Another pitfall is converging to a local minimum instead of the global minimum. Depending on the initial guess and the shape of the loss function, we can get different answers, which happens quite frequently with neural nets. Finally, the function itself can be too flat: if the slope stays very gradual for a long stretch, the algorithm can plateau before finding the minimum.

Source: oreilly.com

Many of these problems come down to the step size, since gradient descent is sensitive to the learning rate. Fortunately, there are a few techniques we can use to avoid these pitfalls.

Remember that gradient descent takes big steps when it is far from a minimum and smaller steps as the slope flattens out near one, continuing from the initial guess until it reaches the best value. One fix is to tie the step size to the slope, making sure we neither overestimate nor underestimate the step. Another is to restart from several different initial guesses: once we get the same result multiple times, we can be reasonably confident we have found the global minimum. A third option is to use a different variant of gradient descent, such as adaptive gradient descent or stochastic gradient descent.
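The multiple-restarts idea can be sketched in a few lines. The non-convex function below is one I picked for illustration: f(x) = (x² − 1)² + 0.3x has two valleys, and the deeper one sits near x ≈ −1. Restarting from several guesses and keeping the lowest loss finds it:

```python
def f(x):
    return (x ** 2 - 1) ** 2 + 0.3 * x   # two valleys; the left one is deeper

def grad(x):
    return 4 * x * (x ** 2 - 1) + 0.3    # derivative of f

def gd(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the initial guess x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

starts = [-2.0, -0.5, 0.5, 2.0]          # several different initial guesses
candidates = [gd(x0) for x0 in starts]   # each settles in one of the two valleys
x_best = min(candidates, key=f)          # keep the result with the lowest loss
```

Starts to the left of the hump converge near x ≈ −1 (the global minimum) and starts to the right converge near x ≈ 1 (the local minimum); taking the best over all restarts recovers the global one.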

In essence, gradient descent first takes the derivative of the loss function with respect to each parameter at the initial guess. It then moves “downhill” by a step equal to the slope times the learning rate. Next, the new parameters are calculated from the old parameters and the step. This is where the iterative process begins: the algorithm plugs the new parameters back into the derivative and repeats, until the step size becomes very small or the algorithm halts. Although this process can be time-consuming when working with millions or billions of data points, gradient descent remains a very powerful and useful algorithm.
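The loop just described can be sketched end to end for a one-variable linear regression with an MSE loss. The toy data and hyperparameters here are illustrative, and stopping when the step becomes tiny mirrors the halting condition above:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]               # generated by y = 2x + 1

def gradient_descent(xs, ys, lr=0.05, tol=1e-8, max_iters=100_000):
    m, b = 0.0, 0.0                      # initial guess for slope and intercept
    n = len(xs)
    for _ in range(max_iters):
        # Derivative of the MSE loss with respect to each parameter
        dm = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m_new = m - lr * dm              # new parameter = old parameter - step
        b_new = b - lr * db
        if abs(m_new - m) < tol and abs(b_new - b) < tol:
            break                        # step size is tiny: stop iterating
        m, b = m_new, b_new
    return m, b

m, b = gradient_descent(xs, ys)          # approaches m = 2, b = 1
```

Each pass plugs the updated parameters back into the derivatives, exactly the iterative loop described above, and the fitted line recovers the slope and intercept the data was generated from.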
