Gradient Descent for Linear Regression

Shreedhar Vellayaraj
6 min read · Oct 6, 2017
time for GD

This is an article about gradient descent and its application to linear regression. I have decided to explain the so-called “Gradient Descent” algorithm using some intuitive images and memes to get you involved with the topic. I am a great fan of the tutor whose video sessions first got me learning about these algorithms, so I have also taken a lot of screenshots from those sessions to make things more understandable. I have already written articles about the types of Artificial Intelligence (an intro to ML) and about the basics of linear regression and the cost function; you can check them out here: https://medium.com/@Shreedharvellay/hai-human-and-artificial-intelligence-e398dfa2917b and https://medium.com/@Shreedharvellay/data-science-is-a-hot-topic-that-has-been-evolving-in-the-recent-days-3e6281851846

So, first off, let me explain the key idea of Gradient Descent and what it really means. Gradient Descent is an optimization algorithm: an algorithm used to minimize our cost function so that our model becomes more accurate and less error prone. It works very well for supervised learning algorithms like linear regression, logistic regression, etc. In this article I have decided to explain Gradient Descent with linear regression, because that’s the foothill of all algorithms.

screenshot from the video tutorial

As you can see on the left side of the screenshot above, it says “repeat until convergence,” which means we are going to repeat this update, much like a good old for loop, until we get our expected output (here, until we minimize our cost function). The “theta j” term names the parameter of our model that we are updating, and the “:=” operator means assignment: the updated value computed on the RHS of the equation is fed back in on the next pass, because, as I said, this runs in a loop until we have successfully minimized our cost function.

The extreme right side of the screenshot simply reminds you of the cost function that we are applying gradient descent to in order to minimize it.
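
In code, that screenshot boils down to something like the following minimal Python sketch, assuming a single-feature hypothesis h(x) = theta0 + theta1*x (the names cost, gradient_step, alpha, theta0 and theta1 are my own choices for illustration, not from the video):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x            # hypothesis h(x) = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

def gradient_step(theta0, theta1, x, y, alpha):
    """One 'repeat until convergence' iteration: theta_j := theta_j - alpha * dJ/dtheta_j."""
    m = len(y)
    error = (theta0 + theta1 * x) - y
    new_theta0 = theta0 - alpha * np.sum(error) / m       # partial derivative w.r.t. theta0
    new_theta1 = theta1 - alpha * np.sum(error * x) / m   # partial derivative w.r.t. theta1
    return new_theta0, new_theta1                         # both updated simultaneously
```

Notice that both new values are computed from the old parameters before either one is overwritten; that simultaneous update is exactly what the “:=” assignment in the screenshot implies.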

real time hill example for gradient descent

The above image is a perfect example to explain the key idea of Gradient Descent. Let us assume you are standing at the top of a hill and I tell you to get to the foothill ASAP. At first, let us say you don’t want to take any risks on the unknown tracks, so you simply try to reach the foothill along the path that is clearly visible to you.

The foothill here is the global optimum, the point where your cost function is minimized. Each step of your descent is marked by the bold ‘X’ marks on the hill. So we keep adjusting our path (here, adjusting our parameters, and with them the value of the cost function) so that we can get down the hill and achieve a successful minimization.

hill example-II

This image shows another way to get down the hill. Say I am a risk-taking person and I wish to trek to the foothill by taking an unknown but much shorter route than the first one. That is basically what a data scientist should do: we must always look for the path that gives better accuracy while also being more efficient than the other routes. Our ultimate objective here is to reach the foothill by taking the shortest possible route.

learning rate

What about the learning rate (alpha)? Well, the learning rate alpha is what controls the size of each step we take toward the foothill. Of course, we might be professional trekkers, but we still need a guide, and here our guide is the learning rate.

There are two important pitfalls when choosing the learning rate. I) If we choose a learning rate that is too large, gradient descent can overshoot and fail to reach the desired convergence point (the global minimum), or even diverge. II) If we choose a learning rate that is too small, it may take a very long time to reach the convergence point; we would still get to the global minimum, but it becomes very time consuming.

So it is generally recommended to try learning rates such as 0.001, 0.003, 0.1, 0.3, etc., and pick one based on your requirements.
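
To see the two pitfalls above in action, here is a small self-contained sketch on made-up data (the helper run_gd and the particular alpha values are hypothetical, chosen only to illustrate the behaviour):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                      # a line we already know: theta0 = 1, theta1 = 2

def run_gd(alpha, steps=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    for _ in range(steps):
        error = (theta0 + theta1 * x) - y
        theta0, theta1 = (theta0 - alpha * np.sum(error) / m,
                          theta1 - alpha * np.sum(error * x) / m)
    return theta0, theta1

print(run_gd(alpha=0.5))      # too large: overshoots and blows up (overflow warnings, nan)
print(run_gd(alpha=0.0001))   # too small: still far from (1, 2) after 1000 steps
print(run_gd(alpha=0.05))     # a reasonable value: converges to roughly (1, 2)
```

With the large value the error actually grows on every step, the tiny value only crawls toward the minimum, and the middle value recovers the line we started from.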

convex function (bowl-shaped)

For linear regression, the cost function that gradient descent works on visually looks like the bowl-shaped surface in the image above, which is called a convex function. Take a real bowl and roll a ball around inside it; after some time the ball settles at a single point at the bottom of the bowl. That point corresponds to the global minimum that gradient descent is trying to reach, and because the function is convex there is only one such minimum to find.

real time gradient descent on a linear regression model

The above is a real example of how gradient descent works with linear regression. The LHS of the image shows the plotted data points, and the blue line represents the fit produced by our first hypothesis, which is also represented on the RHS of the image. Our goal is to get as close as possible to the best line of fit by descending from this first hypothesis through successive points until we reach our global optimum.

line of best fit using gradient descent

Finally, we got our line of best fit by gradually descending, step by step, to our convergence point. The principle is the same one we discussed with the hill example, i.e., we have tried to take the shortest path to our global minimum (the convergence point).
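
Putting it all together, here is a compact sketch that produces such a line of best fit with batch gradient descent on made-up noisy data (the values of alpha and the iteration count are hypothetical choices, not numbers from the article):

```python
import numpy as np

# Made-up training data: y is roughly 3x + 4 plus some noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 4.0 + rng.normal(0.0, 2.0, size=x.shape)

alpha = 0.01                 # learning rate (hypothetical choice)
iterations = 5000
m = len(y)
theta0, theta1 = 0.0, 0.0    # start from an arbitrary first hypothesis

for _ in range(iterations):
    error = (theta0 + theta1 * x) - y
    grad0 = np.sum(error) / m            # dJ/dtheta0
    grad1 = np.sum(error * x) / m        # dJ/dtheta1
    theta0 -= alpha * grad0              # simultaneous update of both parameters
    theta1 -= alpha * grad1

final_cost = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)
print(f"line of best fit: y = {theta1:.2f}x + {theta0:.2f}, final cost = {final_cost:.3f}")
```

The recovered slope and intercept should land close to the 3 and 4 used to generate the data, which is the code version of the plot gradually settling onto the line of best fit.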

So, this is it, guys, we’ve reached the end. I hope you now understand the idea of what gradient descent really means, and if you wish to implement this idea in your own code, the video sessions I mentioned earlier could really help you get started.

In the next article, I plan to discuss the normal equation and how it compares with gradient descent, and also how to effectively choose the right optimization algorithm for your own data model.

Sayonara, for now…….
