Gradient Descent: Why and How?

A theoretical approach to gradient descent and its variants

Murli Jadhav
Analytics Vidhya
4 min read · Sep 30, 2019


Minimising the cost function is very important in a machine learning model. Gradient descent is an iterative technique for minimising the cost function. The cost function is the error in prediction (Y_Actual − Y_Predicted).
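As a minimal illustration (the arrays and values below are made up for the example, not taken from any real model), a squared-error cost can be computed like this:

import numpy as np

# Hypothetical ground-truth values and model predictions
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.5, 9.4])

errors = y_actual - y_predicted      # Y_Actual - Y_Predicted
cost = np.mean(errors ** 2)          # mean squared error as the cost
print(cost)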

Gradient Descent

In the above image, a person is at the top (or somewhere in the middle) of a mountain and wants to get down in the minimum number of steps. The person does this iteratively (taking small steps) until he reaches the bottom of the mountain.

Gradient descent works the same way. Here the mountain is the cost surface over the data, the step size is the learning rate, and the bottom (goal point) is where the cost function has minimal error.

Learning Rate:-

From the above example, the size of the steps taken to reach the goal point (the optimum) is called the learning rate.

How to choose the learning rate?

Learning rate comparison

If the learning rate is too high, the updates bounce across the convex function and may not reach the local minimum efficiently.

If the learning rate is too small, it reaches the local minimum, but it takes too much time.

We have to choose the learning rate such that the cost function decreases steadily. Plotting the cost function against the iteration number helps check whether the learning rate is correct.

Choosing a good learning rate over iterations
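A minimal sketch of this check, assuming a toy cost function J(w) = (w − 3)² (chosen only for illustration): run gradient descent with a few learning rates and plot the cost per iteration.

import numpy as np
import matplotlib.pyplot as plt

def cost(w):
    return (w - 3.0) ** 2            # toy cost function J(w) = (w - 3)^2

def gradient(w):
    return 2.0 * (w - 3.0)           # dJ/dw

for lr in [0.01, 0.1, 0.9]:          # too small, reasonable, too high
    w = 0.0
    history = []
    for _ in range(50):
        history.append(cost(w))
        w = w - lr * gradient(w)     # gradient descent update step
    plt.plot(history, label=f"lr = {lr}")

plt.xlabel("iteration")
plt.ylabel("cost")
plt.legend()
plt.show()

A learning rate whose curve decreases smoothly and quickly, without oscillating or flattening out too early, is a reasonable choice.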

In gradient descent we calculate the cost over all data points and perform a weight update in each iteration. For example, if we have 10,000 points in the training set, all of them are used to calculate the weights W and the cost in a single iteration.

The problem with gradient descent: suppose the mountain has two minimum points. The first minimum we reach is called a local minimum and the second (lowest) one is called the global minimum. Gradient descent can stop reducing the cost function once it reaches the local minimum.

Minimum and maximum points [Image Source: click_here]

It is always better to optimise the cost function so that it reaches the global minimum. For this we can use other gradient descent techniques. These are as follows:-

  1. Batch-Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent

1. Batch-Gradient Descent:-

It is also called the vanilla gradient descent technique. In this technique we calculate the error for each observation but perform the update only after all observations have been evaluated.

Batch gradient descent is computationally expensive. It is very slow, and large data-sets may not fit into memory.
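A minimal batch gradient descent sketch for simple linear regression (the data, learning rate and number of epochs below are illustrative assumptions, not from the article):

import numpy as np

# Hypothetical data: y = 2x + 1 plus noise, 10,000 training points
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=10_000)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=10_000)

w, b = 0.0, 0.0
lr = 0.5

for epoch in range(200):
    error = (w * X + b) - y                 # error over ALL observations
    w -= lr * 2.0 * np.mean(error * X)      # one weight update per full pass
    b -= lr * 2.0 * np.mean(error)

print(w, b)                                 # should approach 2 and 1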

2. Stochastic Gradient Descent:-

SGD performs a weight update for each observation in every iteration. Here we first need to shuffle the data-set so that we get a completely randomised ordering.

Because the data-set is randomised and the error and weights are updated for every single example, the weights have a better chance of escaping a local minimum and arriving near the global minimum.

SGD is faster than the batch gradient technique. However, it is also computationally redundant, because it computes an update for one example at a time.
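A minimal SGD sketch on the same kind of hypothetical data, updating the weights after every single observation:

import numpy as np

# Same hypothetical data set as in the batch example
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=10_000)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=10_000)

w, b = 0.0, 0.0
lr = 0.1

for epoch in range(5):
    order = rng.permutation(len(X))          # shuffle before every pass
    for i in order:
        error = (w * X[i] + b) - y[i]        # error on ONE observation
        w -= lr * 2.0 * error * X[i]         # update the weights immediately
        b -= lr * 2.0 * error

print(w, b)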

3. Mini-Batch Gradient Descent:-

This technique is a combination of SGD and the batch gradient technique. First it splits the data-set into small batches and then performs an update on each batch. Because we take the data in batch samples, it reduces the noise, i.e. the variance of the weight updates. It is fast compared to both the batch and SGD methods.
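A minimal mini-batch sketch under the same assumptions, with a hypothetical batch size of 32:

import numpy as np

# Same hypothetical data set, now processed in mini-batches
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=10_000)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=10_000)

w, b = 0.0, 0.0
lr = 0.1
batch_size = 32

for epoch in range(10):
    order = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]            # error on one mini-batch
        w -= lr * 2.0 * np.mean(error * X[idx])      # gradient averaged over the batch
        b -= lr * 2.0 * np.mean(error)

print(w, b)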

Also see the next part, i.e. the programmatic approach to these techniques.
