Gradient Descent and Stochastic Gradient Descent

Harsh Arora
5 min read · Oct 14, 2020

Gradient Descent: It is an algorithm used to update a set of parameters in a way that minimizes the loss function. In gradient descent, every single update of a parameter uses all of the data in your training set. It can be used both in deep learning (neural networks), where it updates the weights during backpropagation, and in classical machine learning, where it is most commonly applied.

How does Gradient Descent work with Deep Learning?

Suppose we are working with a multi-layer neural network. After a forward pass over the training data we get our loss. To reduce the loss (or cost) function we go with backpropagation and update the weights using gradient descent. The weight update formula is given below (a short code sketch of one update step follows the list).

W_new = Wx − a × (dLoss/dWx)

  1. Wx = the initial (old) weight.
  2. a (alpha) = the learning rate.
  3. dLoss/dWx = the derivative of the error or loss with respect to the old weight.
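
As a minimal sketch of a single update step (the toy loss function and numbers below are made up purely for illustration, not from any real network):

```python
# One gradient descent update for a single weight: W_new = Wx - a * dLoss/dWx.
learning_rate = 0.1            # a (alpha)
w_old = 2.0                    # Wx, the initial weight

def loss(w):
    return (w - 3.0) ** 2      # a toy loss with its global minimum at w = 3

def d_loss_dw(w):
    return 2.0 * (w - 3.0)     # derivative of the loss with respect to the weight

w_new = w_old - learning_rate * d_loss_dw(w_old)
print(w_old, "->", w_new)      # 2.0 -> 2.2, the weight moves toward the minimum at 3
```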

We know that the derivative gives us the slope of the loss at a particular point (weight), and we have to check whether that slope is positive or negative. The main aim of gradient descent is to reach the global minimum.

Global Minimum: A global minimum, also known as an absolute minimum, is the smallest overall value of a set, function, etc., over its entire range. For us, it is the point where the loss is lowest.

Learning Rate: It determines the step size at each iteration while moving toward a minimum of the loss function. Looking at the weight update formula above, we have to decide the value of the learning rate, and we should select it very carefully. In practice, adaptive optimizers can adjust the effective step size for us.

The main question is: should the value of the learning rate be high, low, or medium? Let's check how the learning rate behaves in each case.

1. High learning rate: It is dangerous to set a high learning rate, because the updates can overshoot the global minimum and show undesirable divergent behavior.

2. Low learning rate: If we set the learning rate too low, it will take more and more time to find the global minimum, because each update changes the weights by only a very small amount. This can be a major problem when training neural networks.

3. Medium value: It is better to take a moderate value for the learning rate; it balances training speed against the time needed to find the global minimum. It is still very hard to decide the exact value. If the loss is not decreasing during training, there may be a problem with the learning rate. (A toy comparison of the three cases is sketched below.)
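
As a rough sketch of how the three cases behave, here are a few update steps on the same toy loss used earlier (the learning rates and step counts are made-up illustrative values, not recommendations):

```python
# Compare learning rates on the toy loss L(w) = (w - 3)^2, starting from w = 10.
def d_loss_dw(w):
    return 2.0 * (w - 3.0)

def run(learning_rate, steps=10, w=10.0):
    for _ in range(steps):
        w = w - learning_rate * d_loss_dw(w)
    return w

print(run(1.5))    # too high: the weight oscillates and diverges away from the minimum
print(run(0.001))  # too low: after 10 steps the weight has barely moved from 10
print(run(0.1))    # moderate: the weight moves steadily toward the minimum at 3
```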

Now let's talk about how the derivative works in gradient descent.

How does it work if our initial weight is on the right side of the minimum? Picture a U-shaped loss curve with the weight on the horizontal axis.

If our initial weight is on the right side, the slope there is positive. That simply means the calculated derivative will be positive, and the update tries to reduce the old weight so that it moves toward the global minimum.

In other words, if the derivative of the loss with respect to the weight is positive and the learning rate is positive, then the whole term is subtracted from the old weight. Our initial weight gets reduced and moves toward the global minimum.

What if our initial weights are on the left side?

When the initial weight is on the left side, the slope of the loss with respect to the weight is negative, so the derivative is negative. The product of the (positive) learning rate and the (negative) derivative is therefore negative, which means we subtract a negative number from the old weight. The weight gets increased by some value, which again helps us reach the global minimum.
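
As a quick worked example on the same toy loss L(W) = (W − 3)² with a learning rate of 0.1 (made-up numbers for illustration): starting on the right side at W = 5, the derivative is 2 × (5 − 3) = +4, so the update gives W = 5 − 0.1 × 4 = 4.6 and the weight decreases toward the minimum at 3. Starting on the left side at W = 1, the derivative is 2 × (1 − 3) = −4, so the update gives W = 1 − 0.1 × (−4) = 1.4 and the weight increases, again toward the minimum.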

Stochastic Gradient Descent (SGD)

We already saw that gradient descent takes all the data points at one time to calculate a single update, but SGD is different: it considers one data point at a time and updates the weights after each one. A single gradient descent update always needs more computational power than a single SGD update, because gradient descent processes all the data at once.

SGD is often used in machine learning, for example for solving linear regression problems. For updating the weights in a neural network, we generally use mini-batch gradient descent. In mini-batch gradient descent, we have to specify a parameter (the batch size) that tells the algorithm how many data points it should take at one time, so the computational cost per update is lower than in full gradient descent. There is also a problem with mini-batch gradient descent: the updates are noisy because of their zig-zag movements, but we can smooth out this noise by using an exponential moving average.
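
Here is a minimal mini-batch sketch on a toy linear regression problem (the data, model, learning rate, and batch size below are all made-up assumptions for illustration):

```python
import numpy as np

# Toy data from the line y = 2x + 1, made up for illustration.
X = np.random.rand(1000)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 32   # how many data points each update uses
                  # batch_size = 1 would be SGD, batch_size = len(X) would be full gradient descent

for epoch in range(20):
    order = np.random.permutation(len(X))          # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        x_b, y_b = X[idx], y[idx]
        error = (w * x_b + b) - y_b
        dw = 2.0 * np.mean(error * x_b)            # gradient of the mean squared error
        db = 2.0 * np.mean(error)                  # computed on this mini-batch only
        w -= learning_rate * dw
        b -= learning_rate * db

print(w, b)   # should end up close to the true values 2 and 1
```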

If we plot the loss over training steps, we can see the noise from these optimizers: stochastic gradient descent and mini-batch gradient descent both produce noisy curves, so in order to smooth out that noise we can use an exponential moving average.

Exponential Moving Average: It is a type of average that gives more importance to recent data points than to older ones.

While using stochastic gradient descent we get this kind of noise in the loss curve, so to reduce it we apply an exponential moving average. It turns the noisy curve into a much smoother line.
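
A minimal sketch of how an exponential moving average smooths a noisy loss curve (the smoothing factor and the fake noisy losses are assumptions for illustration):

```python
import random

# Fake noisy "loss per step" values, made up for illustration.
noisy_losses = [1.0 / (step + 1) + random.uniform(-0.05, 0.05) for step in range(100)]

beta = 0.9                 # smoothing factor: higher beta = more weight on the past
ema = noisy_losses[0]
smoothed = []
for loss in noisy_losses:
    # The newest loss gets weight (1 - beta); the running average keeps weight beta,
    # so recent points matter more than old ones and the jumps get averaged out.
    ema = beta * ema + (1 - beta) * loss
    smoothed.append(ema)

print(noisy_losses[-5:])   # still jumps around
print(smoothed[-5:])       # a much smoother curve
```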

Thank You

If you have any queries, you can ask me on LinkedIn.

Link: https://www.linkedin.com/in/harsharora0703/
