Gradient Descent: A Roadmap to the Bottom of the Hill

Hrithick Gokul Yeddula · Published in Nerd For Tech · May 4, 2021

Introduction

Gradient descent is an iterative optimization algorithm used for finding a local or global minimum of a differentiable function. Since its whole job is to find a minimum, it is widely used in machine learning to find the parameters that minimize the cost value.

In this article, we will see how this algorithm works, the mathematics behind it, and the types of gradient descent.

How exactly is it used in machine learning?

Let’s say we have a task. The goal is to create an algorithm that can predict whether a given image of an animal is a cat or not: for each input image, the algorithm should output 0 (if it’s not a cat) or 1 (if it is a cat). First, we’ll train the algorithm with several pictures of cats and other animals so that it builds its own logic for classifying new images. Then we’ll use this algorithm to predict several new images and finally calculate the average cost, which tells us how wrong our predictions are.

What if the algorithm performed poorly? How could we make it perform better?

The algorithm depends on two things: the input values with which we train it, and the parameters of the algorithm. So, if the algorithm performed badly, it is because of one of these two factors, and we can change either the inputs or the parameters until we get a low cost.

In either case, we need to minimize the cost function, and this is where gradient descent comes in handy. We’ll use gradient descent to find the parameters where the cost is minimum. Let us see how.

θ0 and θ1 are the parameters, x is the input

If you want to learn more about the cost function, training an algorithm, etc., check out my other article, which gives you a simple introduction to machine learning.

Walking down the Hill

Imagine you are standing at the top of a hill and you somehow need to get to the bottom, where there is a river. You would take bigger steps where the slope is steep and smaller steps where the slope is gentle; in other words, you decide how big the next step is by looking at your current position. This is how gradient descent works too. It reaches the minimum by “walking down the hill”, taking either a big step or a small one depending on the slope.
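To make the “walking down the hill” idea concrete, here is a minimal Python sketch of my own (not one of the article’s figures) that walks down the simple curve f(x) = x², whose slope we will work out in the next section. Each step is proportional to the slope at the current position, scaled by a small constant (the learning rate α, explained later in the article):

```python
def slope(x):
    # slope (derivative) of f(x) = x**2, derived in the next section
    return 2 * x

x = 5.0            # starting position on the hill
step_factor = 0.1  # small scaling constant (the learning rate, covered later)

for step in range(20):
    x = x - step_factor * slope(x)  # steep slope -> big step, gentle slope -> small step
    print(f"step {step + 1:2d}: x = {x:.4f}, slope here = {slope(x):.4f}")
```

The steps start out large and shrink on their own as the ground flattens near the bottom, exactly like the walk described above.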

“Walking down the hill”

The Slope of a function

Slope is calculated by finding the ratio of the “vertical change” to the “horizontal change” between (any) two distinct points on a line.

We know that the derivative of a function of a real variable measures how much one value (Value A) changes with respect to a change in another value (Value B). Here, if Value A is the “vertical change” and Value B is the “horizontal change”, then the derivative gives us the slope.

If you didn’t know that, let me explain it quickly. Everyone else can skip ahead to the Gradient Descent section.

Consider a function f(x) = x². The graph of the function would look like this.

Let’s zoom in and see what a tangent line looks like.

This is just a rough diagram, since the true height d(f) = 0.010 is ten times the width d(x) = 0.001

This diagram shows how the derivative gives the slope of the tangent line.

When x = 5, f(x) = 25.

As the image shows, if we move x from 5 to 5.001, then f(x) moves from 25 to f(5.001) = (5.001)² ≈ 25.010. Value A (f(x)) changed 10 times as much as Value B (x), so the change in Value A with respect to the change in Value B is d(f)/d(x) = 0.010/0.001 = 10. Hence the slope is 10.

Similarly, let’s take a point x = 2, then f(x) = 4.

Now if we move x from 2 to 2.001, then f(x) becomes f(2.001) = (2.001)² ≈ 4.004. f(x) changed 4 times as much as x, so the change in Value A with respect to the change in Value B is d(f)/d(x) = 0.004/0.001 = 4. Hence the slope is 4.

In both cases, the slope is twice the x value (when x = 5, slope = 10; when x = 2, slope = 4). Therefore we can conclude that the derivative of the function f(x) = x² is d(f)/d(x) = 2x.

Since f(x) isn’t a straight line, the slope of the tangent line will be different at each point of the function.
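You can check this numerically in Python, using the same “nudge x by 0.001” idea as the worked examples above:

```python
def f(x):
    return x ** 2

dx = 0.001
for x in (5.0, 2.0):
    df = f(x + dx) - f(x)  # how much f(x) moved when x moved by dx
    print(f"x = {x}: d(f)/d(x) = {df / dx:.3f}   (2x = {2 * x})")
```

This prints roughly 10.001 and 4.001; the small extra 0.001 appears because dx is tiny but not infinitesimal, and as dx shrinks the ratio settles at exactly 2x.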

Gradient Descent

We’ve seen how gradient descent is used in machine learning, how it finds the minimum of a function, and what a slope is. Now let’s put all this together and see the math behind gradient descent: the formula.

The cost function

For each parameter θ0, θ1 we’ll find the derivative d(θj) (the slope) of the cost function J(θ0, θ1) with respect to that parameter.

Now, for each parameter θj, if the slope d(θj) is not close to 0, we update θj by subtracting from it the slope d(θj) (scaled by the learning rate α shown in the formula below). In other words, if the place where you are standing is steep, you haven’t reached the bottom yet, so you take one step down towards the bottom and repeat this until the slope becomes 0.

Updating θj using all m training examples

Here, (hθ(x^(i))-y^(i))xj^(i) is the derivative term for a single training example; averaged over all m examples it gives d(θj).

If we repeat this process enough times, the cost with respect to the parameters will eventually reach its minimum, the slope will become 0, and you will have reached the river.
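Putting it all together, here is a minimal sketch of batch gradient descent for a simple linear hypothesis hθ(x) = θ0 + θ1·x with the usual squared-error cost. The toy data, the learning rate, and the number of iterations are my own illustrative choices, not values from the article:

```python
import numpy as np

# Toy data generated from y = 1 + 2x, so the best parameters are theta0 = 1, theta1 = 2.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = len(X)

theta0, theta1 = 0.0, 0.0
alpha = 0.05  # learning rate

for _ in range(2000):
    predictions = theta0 + theta1 * X
    errors = predictions - y                 # h(x^(i)) - y^(i) for every example
    d_theta0 = errors.sum() / m              # slope of the cost with respect to theta0
    d_theta1 = (errors * X).sum() / m        # slope of the cost with respect to theta1
    theta0 -= alpha * d_theta0               # one step down the hill
    theta1 -= alpha * d_theta1

print(theta0, theta1)                        # close to 1 and 2
```

Each pass of the loop sums the per-example derivative terms over all m examples before updating the parameters, which is exactly what makes this the “batch” version described next.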


Learning Rate

α is called the learning rate. We usually set the learning rate to a small value; otherwise the parameters would take a very long step, overshoot the minimum, and end up diverging back up towards the top.
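Here is a quick way to see this, again on the toy curve f(x) = x² whose slope is 2x (the exact values where things blow up are specific to this function, not a general rule):

```python
def descend(alpha, x=5.0, steps=5):
    positions = [x]
    for _ in range(steps):
        x = x - alpha * 2 * x  # gradient descent step on f(x) = x**2
        positions.append(round(x, 3))
    return positions

print(descend(alpha=0.1))  # [5.0, 4.0, 3.2, 2.56, 2.048, 1.638] -> creeping towards 0
print(descend(alpha=1.1))  # [5.0, -6.0, 7.2, -8.64, 10.368, -12.442] -> diverging
```

With a small α each step lands a little further down the hill; with a large α each step jumps clean over the minimum and ends up higher than where it started.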


This particular version of the algorithm is also called Batch Gradient Descent. The term batch refers to the fact that we look at all of the training examples at every step.

Stochastic Gradient Descent

The problem with batch gradient descent is that if the number of training examples is large, computing the derivative term d(θj) can be very expensive, because it requires summing over all m examples.

Let’s say we have millions of training examples. To find the derivative of the cost function with respect to the parameters, we would have to go through all the million training examples and accumulate these sums, and all that work moves us just one step closer to the minimum. So it could take a long time for the parameters to converge.

Stochastic gradient descent is a lot like batch gradient descent, but rather than waiting to sum the derivative term over all m training examples, it updates the parameters using just one training example at a time, so it starts making progress towards the global minimum right away.
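For comparison, here is a sketch of the stochastic version on the same toy linear regression problem used above; the data, the random seed, and the learning rate are again illustrative assumptions of mine:

```python
import numpy as np

# Same toy data as before, generated from y = 1 + 2x.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

theta0, theta1 = 0.0, 0.0
alpha = 0.05  # learning rate

rng = np.random.default_rng(0)
for _ in range(500):                    # passes over the (shuffled) data
    for i in rng.permutation(len(X)):   # take one example at a time
        error = (theta0 + theta1 * X[i]) - y[i]
        theta0 -= alpha * error         # update immediately, using only example i
        theta1 -= alpha * error * X[i]

print(theta0, theta1)                   # again close to 1 and 2
```

The only structural difference from the batch sketch is that the update happens inside the inner loop, after every single example, instead of once per full pass over all m examples.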

Updating θj using one training example at a time

Conclusion

Now you should have a better understanding of:

  1. Why gradient descent is used in machine learning.
  2. The mathematics behind it.
  3. The difference between batch and stochastic gradient descent.

Gradient descent is very important in machine learning. It helps us find the set of parameters that minimizes the cost value, which leads to good results.

Thank you!!

I hope you found this article helpful. Thank you for taking the time to read this far. If you have any suggestions, leave a comment. Visit my LinkedIn to learn more about me and my work.
