Shreyan Goswami
Jul 10, 2017 · 6 min read

A hidden truth in calculus. The method of gradient descent.

When I first looked up gradient descent on Wikipedia in college, I felt that the symbols on that page would fit perfectly on the walls of an Egyptian pyramid. I struggled a lot to understand the method, but I finally built a little intuition for it. In this post I will try to give readers that same intuition.

My main intention in writing this is to offer some soul food to readers who want to see an application of calculus. We go through high school learning the basic tools of calculus, but we hardly ever see those tools applied to real problems. The technique I show here is a bit more advanced, but it has lots of uses, and I hope this post conveys the intuition behind it.

The method of gradient descent is a very popular optimisation technique. There are pre-defined functions for it in MATLAB and Octave (fminunc).

I am currently taking a machine learning course on Coursera. When I learn something a second time I get a deeper understanding. I had already taken a neural networks course in college, but at that time I had a lot on my plate; I scraped through the course learning whatever the professor taught, with little time to think over any deeper meaning.

Now, before taking a deep dive, let me set the context. It will give readers an idea of why we need gradient descent.


When I learnt calculus in high school, I was taught that calculus measures the rate at which a function changes.

A function takes an input (or a set of inputs) and produces an output that is unique for that input.

Calculus allows us to compute the rate at which a function changes. A useful consequence is that by analysing this rate we can find the values at which the function reaches a maximum or a minimum: at those values, the rate at which the function changes is 0.

A simple function that we might work with would look like this:

y = mx + c (a simple linear equation)

In machine learning it looks like this:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ (a “simple” equation in ML)

In ML the input values cannot be changed, i.e. the x’s are fixed. What we can change are the θ values, and that is the main aim in ML: to find the set of θ’s for which the value of the function is minimum. It’s the reverse of what is usually done, but the concept is the same.
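To make this concrete, here is a minimal sketch in Python. The function and variable names (`hypothesis`, `cost`) are my own for illustration, and the squared-error cost is just one common choice; the point is that for fixed data, the cost is a function of θ alone.

```python
# Hypothesis of the form h(x) = theta0 + theta1*x1 + ... + thetan*xn.
# The inputs xs are fixed data; only the thetas are ours to tune.
def hypothesis(thetas, xs):
    # thetas has one more entry than xs: thetas[0] is the intercept.
    return thetas[0] + sum(t * x for t, x in zip(thetas[1:], xs))

# A squared-error cost: for fixed data, it depends only on the thetas.
def cost(thetas, data):
    return sum((hypothesis(thetas, xs) - y) ** 2 for xs, y in data)

data = [([1.0], 2.0), ([2.0], 4.0)]   # points on the line y = 2x
print(cost([0.0, 2.0], data))          # thetas that fit the data perfectly
print(cost([0.0, 1.0], data))          # worse thetas give a larger cost
```

Changing the data changes the cost function; changing the θ’s moves us around on that function, which is exactly the search gradient descent performs.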

If we follow the usual way of finding a minimum (take the partial derivative of the function with respect to each of the θ’s, equate it to 0, and solve for the different θ’s), it becomes a very daunting task. If there are 100 θ’s, we have to solve 100 simultaneous equations, and solving simultaneous equations is difficult. We need a better way to search for the input values at which the function is minimum. Gradient descent is one such way.

Let’s start by taking a very simple function which has just one variable.

y = x² (a quadratic equation)

To understand the fundamental idea behind gradient descent, it helps to graph the function.

Quadratic curve

Our main aim here is to find the value of x for which y is minimum. Since the equation has a single variable, it is evident from the graph where the minimum is.

Function with two variables

To visualise a function of many variables, a traditional plot is not enough. A better visualisation is a contour plot: a projection of the 3D graph onto a 2D surface. We can see the contour plot at the bottom of the figure; the contours are the “shadows” of the graph. Most explanations of gradient descent use contour plots to visualise how the algorithm is working.

It is very easy to calculate the rate of change of the function. The derivative of the quadratic function is below.

dy/dx = 2x (derivative of the quadratic function)

Let’s take a point on the curve, say x = 3; then y = 9. The derivative at x = 3 is 6. Since the derivative is positive, the function is increasing at x = 3. Hence if we take a very small number δ and add it to 3, the value of the function at x = 3 is less than the value of the function at x = 3 + δ.

But what about the value of the function at 3 - δ? The value at x = 3 - δ is less than the value at x = 3. Hence if we move to 3 - δ we get a value that is less than the value at x = 3. Again, if at x = 3 - δ we subtract δ once more, we reach a value less than the one at x = 3 - δ. We can continue subtracting δ from x until we reach a value of y that is more than the previous value. We can then be sure that the previous value is a minimum (or close to a minimum).

We have to decide the value of δ, i.e. by what amount we change x each time until we reach the value of x for which y is minimum. The value of δ cannot be one constant that we use for every function that comes our way; it must depend on the function.

Recall that the magnitude of the derivative tells us by how much the function is changing, and the sign of the derivative tells us whether the function is increasing or decreasing. Hence, depending on whether the function is increasing or decreasing, we need to decrease or increase x.

So our δ is directly proportional to the value of the derivative at that point. Let’s break out a bit of math.

δ ∝ df(x)/dx, so δ = step · df(x)/dx (two simple steps)

The constant of proportionality is called the step. This is an important constant that determines how far from x we deviate. If the step is too high, f(x) will overshoot the minimum. If the step is too small, it will take more time to find the minimum. A good technique is to take the step between 0.1 and 0.3; if we are overshooting the minimum, decrease the step by a tenth and try again.
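The effect of the step size can be seen on the same f(x) = x², whose derivative is 2x. This is a toy sketch; the function name `descend` and the step values 0.1 and 1.1 are my own illustrative choices:

```python
# One gradient-descent run on f(x) = x**2 (derivative 2x), starting at x = 3.
def descend(step, iterations=50):
    x = 3.0
    for _ in range(iterations):
        x = x - step * 2 * x     # x := x - step * df(x)/dx
    return x

print(descend(0.1))   # small step: x shrinks steadily towards the minimum at 0
print(descend(1.1))   # step too large: x overshoots the minimum and blows up
```

With step 0.1, each update multiplies x by 0.8, so it decays towards 0; with step 1.1, each update multiplies x by -1.2, so it oscillates with growing magnitude.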

Depending on whether the derivative is positive or negative, we have to move x in the opposite direction; a negative sign in the update takes care of that.

x := x - step · df(x)/dx (behold the greatness)

We finally have our formula. It’s always a good habit to take a second look at a formula we derive. If the derivative is negative for a particular value of x, then -step · df(x)/dx is positive, so x increases and moves “towards” where f(x) is decreasing.

To apply gradient descent, repeat this update until the function stops decreasing: at every iteration, check whether the previous value of the function is more than the current value.
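Putting it all together, the loop might look like this in Python. It is a sketch of gradient descent on the same f(x) = x²; the name `gradient_descent` and the iteration cap are my own choices, and the stopping rule is the one described above:

```python
# Gradient descent on a one-variable function f with derivative df.
def gradient_descent(f, df, x, step=0.1, max_iters=10_000):
    for _ in range(max_iters):
        new_x = x - step * df(x)       # x := x - step * df(x)/dx
        if f(new_x) >= f(x):           # the function stopped decreasing
            break
        x = new_x
    return x

# f(x) = x**2 has derivative 2x; start the search at x = 3.
minimum = gradient_descent(f=lambda x: x ** 2, df=lambda x: 2 * x, x=3.0)
print(minimum)  # close to 0, the true minimum
```

The same loop works for any differentiable single-variable function: only f, df, the starting point, and the step need to change.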

I urge readers to try out the method of gradient descent in Octave or in a language they are familiar with.


That’s it for now. If you liked it please like and share. If you didn’t like it let me know what could be done better.

Error: in the second equation the summation runs from 1 to n, not from 0 to n.

