Gradient Descent for Machine Learning

Arvind Natarajan · Published in RedBlackTree · Dec 17, 2017

Gradient descent is an optimization algorithm that works iteratively and aims to minimize a function, ideally all the way down to its global minimum (for non-convex functions it may settle in a local minimum instead). It is a general-purpose optimization algorithm used in many areas of mathematics.
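To make this concrete, here is a rough sketch in Python (my own illustration, not from any library) of the core update rule on a simple one-dimensional function: start somewhere, compute the derivative, and keep stepping in the opposite direction. The learning rate and iteration count are arbitrary values picked for the example.

# Minimal gradient descent sketch: minimize f(x) = (x - 3)^2
# The derivative f'(x) = 2 * (x - 3) tells us which direction is "downhill".

def f(x):
    return (x - 3) ** 2

def gradient(x):
    return 2 * (x - 3)

x = 0.0              # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    x = x - learning_rate * gradient(x)   # move a small step downhill

print(x)   # converges close to 3, the minimizer of f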

Fig 1: Fitting a function with gradient descent

The animated figure above illustrates how gradient descent fits the function curve iteratively. Although the animation moves quickly, the fit improves with every iteration. To make sense of it, let me bring up a simple linear regression function.

Fig 2: Fitting a simple linear regression function

This figure shows what we call a simple linear regression function. Once fit, the line sits where the pink lines (the error for each data point) are collectively smallest, i.e. where the overall cost of fitting all the points is at its minimum.
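To make the idea of "cost" concrete, here is a small Python sketch with made-up example data (my own illustration): the mean squared error averages the squared lengths of those pink lines, so a better-fitting line produces a lower cost.

import numpy as np

# Assumed toy data: inputs x and targets y, values made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def cost(w, b):
    # Mean squared error: the average squared length of the "pink lines"
    predictions = w * x + b
    return np.mean((predictions - y) ** 2)

print(cost(2.0, 0.0))   # a line close to the data has a low cost
print(cost(0.5, 1.0))   # a worse line has a higher cost

# Once w and b are fit (e.g. by gradient descent), predicting is just w * x_new + b
w_fit, b_fit = 2.0, 0.0
print(w_fit * 5.0 + b_fit)   # prediction for a new input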

Once this function is fit with trained parameters, future predictions can be made from the model. The image below is another depiction of fitting a regression with gradient descent, but with a slightly more complex function surface.

Fig 3: Changing regression surface by varying parameters

There are a few variations of gradient descent that are used for specific purposes in machine learning:

Batch gradient descent:

This method uses the entire accumulated dataset for every update and fits the model once and for all. The fitted model is then used to predict future outcomes. Batch gradient descent is the right choice when you have already accumulated some data and want to analyse it, and you don't need to account for new data or changing preferences in the future. Of the three variants it follows the most direct path to the minimum, since each step uses the exact gradient over all the data. This method is used for offline learning in ML; a code sketch follows the examples below.
Examples:
Linear regression - housing price prediction
Logistic regression - spam detection in mail and text messages
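Here is a rough sketch of batch gradient descent for a linear regression, with made-up toy data and a hand-picked learning rate; the important part is that every single update uses the gradient averaged over the entire dataset.

import numpy as np

# Assumed toy dataset: every update uses ALL of it (that is what makes this "batch").
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(1000):
    predictions = w * x + b
    error = predictions - y
    # Gradients of the mean squared error, averaged over the whole dataset
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # ends up close to the slope and intercept of the data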

Stochastic gradient descent:

This and the mini-batch method are variations on how training data is fed to the optimization algorithm. In batch gradient descent we feed all available training data to the algorithm at once, which works when the data is fixed. In online learning, however, data is constantly being accumulated and there is little time to react to changes, so we instead update the parameters in real time for each individual training example. Each iteration then (hopefully) yields slightly better parameters, which helps the business case for better predictions while the algorithm keeps improving. SGD is computationally inexpensive per update, but it performs slightly worse on noisy data because there is no averaging; see the sketch after the example below.
Example:
Offering discounted rates to users based on their predicted probability of converting.
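Here is a rough sketch of stochastic gradient descent on the same kind of toy problem (again with made-up data); notice that each parameter update comes from a single training example, which keeps updates cheap but noisy.

import numpy as np

# Assumed toy dataset; in true online learning these examples would arrive one at a time.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(100):
    indices = np.random.permutation(len(x))   # visit examples in a random order
    for i in indices:
        prediction = w * x[i] + b
        error = prediction - y[i]
        # Gradient from a SINGLE example -- cheap, but noisy
        w -= learning_rate * 2 * error * x[i]
        b -= learning_rate * 2 * error

print(w, b)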

Mini-Batch gradient descent:
Stochastic gradient descent acts on each training example individually, which can be very useful for real-time learning but is often unnecessary. Where mini-batch excels is in averaging out the noise of individual examples, the noise that causes SGD to wander and get pulled towards local optima on the cost surface. This is apparent in the figures above if you look at the path taken by each method: SGD often follows a wandering path towards the global optimum, while mini-batch traces it more smoothly (see the sketch after the example below).
Example:
Anything that can be done with the other methods can adopt mini-batch and get more computationally viable results.
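And here is a rough mini-batch sketch (made-up data, arbitrary batch size of 4); each update averages the gradient over a small, randomly chosen batch, which smooths out the noise of single-example updates while staying cheap thanks to vectorized operations.

import numpy as np

# Assumed toy dataset and batch size, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

w, b = 0.0, 0.0
learning_rate = 0.01
batch_size = 4

for epoch in range(500):
    indices = np.random.permutation(len(x))   # shuffle, then walk through in batches
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        error = (w * xb + b) - yb
        # Averaging over the batch smooths the noise of single-example updates,
        # and the vectorized NumPy operations keep each update cheap.
        w -= learning_rate * 2 * np.mean(error * xb)
        b -= learning_rate * 2 * np.mean(error)

print(w, b)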

So, which is better?
There is no single best algorithm here; it all depends on the use case. In production, however, batch gradient descent is often avoided for large datasets because it is computationally expensive even with vectorization. In those cases, mini-batch gradient descent is used with vectorization, which performs much better computationally, and its results are very close to what batch gradient descent would produce.

Credits:

Fig 1: https://media.giphy.com/media/jUvoKxaLa7kxG/giphy.gif
Fig 2: https://media.giphy.com/media/Lyejb62QjQepG/source.gif
Fig 3: http://www.math.yorku.ca/SCS/spida/lm/
