Explaining the Mathematical Part of the Gradient Descent Algorithm

Adarsh Pathak · Published in Mathematics and AI · Apr 19, 2020 · 4 min read

One of the terms you will hear most often in machine learning is optimisation, or optimising a cost function. We will start our machine learning journey with gradient descent optimisation. So what is it? And why is it so useful in machine learning?

[Figure: Gradient descent example]

Gradient descent is an optimisation algorithm used for minimising the cost of a function.

Still not clear what the gradient descent algorithm is? Let's understand it with an example. Suppose you are paragliding and you notice a problem with your parachute, so you decide to land on a nearby hill to avoid any damage. You are now at the top of a hill, and you know that if you can reach the bottom of the hill, someone will help you get home. So how do you find the bottom of the hill? First, you have to decide in which direction to move. Suppose you reach a junction of three roads: one is the road you came from, the second goes downhill, and the third climbs up to another peak. If you are running very fast and cannot tell the difference, it is quite possible that you choose the wrong path. At the end of that wrong path you find yourself at the top of a hill again, only this time at the second peak. To avoid this mistake, you have to decide how fast to move; in simple words, you have to decide your step size.

The gradient descent algorithm is likewise searching for the lowest point of a hill, with the help of two things: a direction and a step size.

[Figure: Gradient descent example]

In the figure above, the gradient descent algorithm is trying to reach the lowest point, starting from point A, with the help of a direction and a step size. The length of each arrow is the step size, and the direction of the arrow is the direction in which the algorithm moves.

Gradient descent formula:

NextPosition = CurrentPosition - stepSize * Direction

The above equation can be written as:

J := J - alpha * dJ

Both equations are the same; you can compare them term by term:

alpha -> step size

dJ -> direction
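
For instance, here is a minimal Python sketch of this update rule. The cost function J² (with derivative dJ = 2J) is just an assumed toy example to make the loop concrete:

# A minimal sketch of the update rule J := J - alpha * dJ,
# using the toy cost function f(J) = J**2, whose derivative is 2*J.
def gradient_descent(start, alpha, num_steps):
    j = start
    for _ in range(num_steps):
        dj = 2 * j            # direction: derivative of J**2 at the current point
        j = j - alpha * dj    # J := J - alpha * dJ
    return j

# Starting from J = 10, the updates walk down towards the minimum at J = 0.
print(gradient_descent(start=10.0, alpha=0.1, num_steps=50))  # roughly 0.00014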

Let’s take another example to understand this concept mathematically.

Suppose you have a line Y = mx + c, and two data points: x1 = 10, Y1 = 6 and x2 = 20, Y2 = 11. Your task is to find the values of m and c that give the least error when calculating Y.

Y_hat is the value of Y calculated using random values of m and c.

error = (Y_hat - Y)²

We are using the mean squared error because we want to focus only on the size of the difference between the real value and the calculated value, not on whether its sign is negative or positive. Our aim here is to minimise this error:

error = ((m*x1 + c - Y1)² + (m*x2 + c - Y2)²)/2

How will you calculate the best values of m and c so that the error is minimum? Differentiate this equation with respect to m and c.

dm = d(error)/dm = (m*x1 + c - Y1)*x1 + (m*x2 + c - Y2)*x2

dc = d(error)/dc = (m*x1 + c - Y1) + (m*x2 + c - Y2)
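
Where do these come from? Differentiating each squared term brings down a factor of 2 by the chain rule, and that 2 cancels the 1/2 in the error. For m:

d(error)/dm = (2*(m*x1 + c - Y1)*x1 + 2*(m*x2 + c - Y2)*x2)/2 = (m*x1 + c - Y1)*x1 + (m*x2 + c - Y2)*x2

and similarly for c, where the inner derivative with respect to c is just 1.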

dm and dc tell us the direction in which our m and c values should move to minimise the error.

m := m - alpha * dm

c := c - alpha * dc

Here alpha tells us how fast we have to move, and dm and dc tell us in which direction we have to move to get a better result.

Initially, assume any values for m, c and alpha:

say m = 8, c = 3, alpha = 0.0001; x1 = 10, Y1 = 6, x2 = 20, Y2 = 11

error = ((m*x1 + c - Y1)² + (m*x2 + c - Y2)²)/2

error = 14516.5
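
To check this value: 8*10 + 3 - 6 = 77 and 8*20 + 3 - 11 = 152, so error = (77² + 152²)/2 = (5929 + 23104)/2 = 14516.5.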

m := 8 - 0.0001 * ((8*10 + 3 - 6)*10 + (8*20 + 3 - 11)*20) = 8 - 0.0001 * 3810 = 7.619

c := 3 - 0.0001 * ((7.619*10 + 3 - 6) + (7.619*20 + 3 - 11)) = 3 - 0.0001 * 217.57 = 2.978243

(Note that c is updated using the already-updated m = 7.619; all the numbers below follow this sequential update.)

error: 14516.5
m: 7.619
c: 2.978243

Now calculate the new error with the new values of m and c.

If you repeat this process 10 times, you will get:

1st iteration
error: 13096.447052877047
m: 7.257115271
c: 2.957576005587

2nd iteration
error: 11815.310801857855
m: 6.913386779433239
c: 2.937944330047583

3rd iteration
error: 10659.501720246768
m: 6.586903607471434
c: 2.9192960303591593

4th iteration
error: 9616.759676844307
m: 6.276800539006786
c: 2.901581769536067

5th iteration
error: 8676.023887830148
m: 5.982255766747838
c: 2.884754685881916

6th iteration
error: 7827.315590603369
m: 5.702488714352801
c: 2.8687702688016814

7th iteration
error: 7061.632195054385
m: 5.436757967828756
c: 2.8535862408444346

8th iteration
error: 6370.85178948882
m: 5.184359310714784
c: 2.839162445664121

9th iteration
error: 5747.646988259523
m: 4.9446238578420525
c: 2.825460741601462

10th iteration
error: 5185.4072072540985
m: 4.716916282725146
c: 2.8124449006049663
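
If you want to reproduce these numbers, here is a minimal Python sketch. Note that, as above, c is updated using the already-updated m:

# Gradient descent on the two data points, matching the iterations above.
x1, Y1 = 10, 6
x2, Y2 = 20, 11
m, c, alpha = 8.0, 3.0, 0.0001

for i in range(100):
    error = ((m*x1 + c - Y1)**2 + (m*x2 + c - Y2)**2) / 2
    dm = (m*x1 + c - Y1)*x1 + (m*x2 + c - Y2)*x2  # d(error)/dm
    m = m - alpha * dm
    dc = (m*x1 + c - Y1) + (m*x2 + c - Y2)        # d(error)/dc, using the new m
    c = c - alpha * dc
    print(f"iteration {i}")
    print(f"error: {error}")
    print(f"m: {m}")
    print(f"c: {c}\n")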

This error is still very bad, but if you repeat the process 100 times you will get:

95th iteration
error: 1.0658150106056041
m: 0.46037831265621054
c: 2.5670644456671083

96th iteration
error: 0.9854683784524862
m: 0.4576582036863987
c: 2.5668780581669157

97th iteration
error: 0.9129808303159086
m: 0.455074659327578
c: 2.5666994585772995

98th iteration
error: 0.8475835449031676
m: 0.4526208279854672
c: 2.566528256201628

99th iteration
error: 0.7885829111876942
m: 0.45029020181758894
c: 2.5663640799449348

and your predicted Y at x1 = 10 with this m and c is 7.069266098120824, which is still not very close to our expected value of 6.

Applying this algorithm to only two values of x did not give a great result. But when we are working with more than, say, 1000 values of x and have to find the m and c that best predict Y, this algorithm works well.
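
As a rough illustration of that claim, here is a sketch that fits m and c on 1000 synthetic points. The data (generated around the line Y = 0.5x + 2) and the learning rate are assumptions chosen for this example, not values from the article:

import random

# Hypothetical data: 1000 points scattered around Y = 0.5*x + 2.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [0.5 * x + 2 + random.gauss(0, 0.1) for x in xs]

m, c, alpha = 0.0, 0.0, 0.05
for _ in range(5000):
    # gradients of the mean squared error over all points
    dm = sum((m * x + c - y) * x for x, y in zip(xs, ys)) / len(xs)
    dc = sum((m * x + c - y) for x, y in zip(xs, ys)) / len(xs)
    m = m - alpha * dm
    c = c - alpha * dc

print(m, c)  # should come out close to 0.5 and 2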

In this article I did not get into the coding part in depth; I tried to explain only the mathematical part of the algorithm. I will write one more article explaining it in more detail.

I hope you liked this article and gained more intuition about the gradient descent algorithm. If you did, give it a clap and share it with your friends. Follow for more articles on the mathematics of deep learning.
