Gradient Descent Algorithm | Convergence, Performance, Behaviour

Adarsh Pathak
Published in Mathematics and AI
Apr 20, 2020

Let's move one step further and explore the deeper mathematics of the gradient descent algorithm. This article is part 2 of a series, so please read the previous article first if you have not done so yet. If you are already familiar with the gradient descent algorithm, you can skip part 1. Here is the link to part 1: https://medium.com/mathematics-and-ai/mathematics-and-ai-optimisation-algorithms-gradient-descent-algorithm-781e350027e1

In the previous article we saw how gradient descent calculates the values that give our function the least error. Now let's explore the algorithm in more detail.

Why does this algorithm make sense?

This is the example we ended with last time. The first question is: why does the point move from A to B? What would happen if A were on the opposite side (the increasing part of the curve)? Would it still descend to B?

In the gradient descent algorithm we use the update formula A := A - alpha * dA, where alpha is the learning rate and

dA = df/dA = the slope of the function at A
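To make this concrete, here is a minimal Python sketch of the update rule. The example function f(A) = A², its derivative 2A, and the values of alpha and the starting point are my own illustrative assumptions, not taken from the article:

```python
# Minimal gradient descent sketch, assuming the example function
# f(A) = A**2 with derivative df/dA = 2*A (an illustrative choice).

def df(A):
    """Slope of f(A) = A**2 at the point A."""
    return 2 * A

alpha = 0.1   # learning rate (assumed value)
A = -5.0      # start on the decreasing part of the curve

for step in range(50):
    A = A - alpha * df(A)   # A := A - alpha * dA

print(A)  # approaches 0, the minimum of f
```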

(Figure: gradient descent moving along the curve towards the minimum)

If we start from point A, the slope there is negative, so dA is negative and the update A := A - alpha * dA increases A, moving it forward towards the minimum. If we start from point B, the slope there is positive, so B := B - alpha * dB decreases B, again moving it towards the minimum.

When will it stop moving?

When dA or dB equals zero, in other words when we reach a stationary (optimal) point, A or B stops moving, and that is our optimal solution.
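In code, one way to express this stopping condition is to loop until the slope is numerically close to zero. The tolerance value and the example function below are illustrative assumptions:

```python
# Stop when the slope is effectively zero (|dA| below a small
# tolerance). f(A) = A**2 and tol = 1e-8 are assumed for illustration.

def df(A):
    return 2 * A

alpha = 0.1
A = -5.0
tol = 1e-8   # treat slopes smaller than this as "zero"

while abs(df(A)) > tol:
    A = A - alpha * df(A)

print(A)  # the optimal solution: dA is (numerically) zero here
```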

How does alpha affect the algorithm?

What if alpha is very large, say alpha = 80?

(Figure: erratic updates when alpha is large)

If alpha is very large, the algorithm behaves erratically: each step overshoots the minimum, so it sometimes oscillates and often diverges. It may never converge to the optimal value.

What if alpha is too small, say alpha = 0.00001?

(Figure: slow convergence when alpha is small)

If alpha is very small, the algorithm converges very slowly; it may take a huge number of steps (and a lot of time) to reach the optimal point.

So the performance of the gradient descent algorithm depends largely on the value of alpha. Common choices are 0.1, 0.05, 0.01, and 1; you can tune these values for your problem.
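A small experiment, again on the assumed example function f(A) = A², shows all three behaviours side by side:

```python
# Effect of the learning rate on the illustrative function
# f(A) = A**2 (df/dA = 2*A). Each alpha runs for the same number
# of steps from the same starting point.

def df(A):
    return 2 * A

for alpha in [80, 0.1, 0.00001]:
    A = 5.0
    for _ in range(100):
        A = A - alpha * df(A)
    print(f"alpha={alpha}: A after 100 steps = {A}")

# alpha=80      -> |A| explodes to an astronomically large value (diverges)
# alpha=0.1     -> A is essentially 0 (converged)
# alpha=0.00001 -> A has barely moved from 5.0 (far too slow)
```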

This algorithm does not guarantee the absolute (global) minimum; it can settle in a local one. This may be its biggest drawback. More advanced optimisation algorithms are used in deep learning, which I will explain in later articles.

Another drawback is that it uses the whole dataset to calculate a single value of dA or dB. If our dataset is very large, this algorithm does not work well.

(Figures: calculation of J, dW, and db, and the resulting updates to W and b)

You can put the learned w and b values into the equation of the line, Y = w*x + b, to make predictions.

J is our error function, often referred to as the cost function.
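Putting these pieces together, here is a sketch of the J, dW, and db calculations for the line Y = w*x + b, assuming a mean-squared-error cost J; the dataset and hyperparameters are made up for illustration:

```python
import numpy as np

# Illustrative sketch of the J, dW, db calculations for the line
# Y = w*x + b, assuming J is the mean squared error. The dataset
# below is made up for demonstration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 2x + 1

w, b = 0.0, 0.0
alpha = 0.05
n = len(x)

for _ in range(2000):
    y_hat = w * x + b                        # current predictions
    J = np.mean((y_hat - y) ** 2)            # cost function J
    dW = (2 / n) * np.sum((y_hat - y) * x)   # dJ/dw over the whole dataset
    db = (2 / n) * np.sum(y_hat - y)         # dJ/db over the whole dataset
    w = w - alpha * dW                       # w := w - alpha * dW
    b = b - alpha * db                       # b := b - alpha * db

print(w, b)            # close to the true values 2 and 1
print(w * 5.0 + b)     # prediction Y = w*x + b at x = 5
```

Note how dW and db each sum over every sample in the dataset on every iteration; this is exactly the full-dataset cost mentioned above.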

I hope you understand these basic concepts; we will use them when implementing machine learning algorithms later. I will keep writing new articles regularly, so you can follow me for updates. Mathematics and AI is the collection where I upload my articles, so follow that page as well. If this article helped you understand the gradient descent algorithm, please give it some claps. Thank you!
