Optimizers & Loss Functions in Neural Networks

Gaurav Rajpal
Published in Analytics Vidhya
11 min read · Oct 3, 2020

In this blog we will cover the optimizers and loss functions most commonly used while training neural networks. The prerequisite is a basic understanding of how the gradient descent algorithm works, so I would highly recommend referring to my previous blog, Introduction To Neural Network, which covers gradient descent: the algorithm used during backward propagation to update the weights and minimize the cost function, with the aim of reaching the global minimum.

After going through the basics of the gradient descent algorithm, we will look at the following:

1. Variants of Gradient Descent.

2. Challenges / Problems with Gradient Descent.

3. Different types of optimizers.

Let us recall the gradient descent update equation: w = w - alpha * dE/dw, where dE/dw is the rate of change of the error with respect to the weights, alpha is the learning rate, and 'E' is the cost function.

There are multiple ways to calculate this cost function, and based on how we calculate it there are different variants of gradient descent. Let us understand them in a better way.

VARIANTS OF GRADIENT DESCENT.

BATCH GRADIENT DESCENT : Say there are 'm' observations in our dataset and we use all of them to calculate the cost function 'E'. We take the entire training set, perform forward propagation and calculate the cost function. Then we update the parameters (the weights) using the rate of change of this cost function with respect to the weights. Note that since we use the entire training set, the parameters are updated only once per epoch.

STOCHASTIC GRADIENT DESCENT : If we use a single observation to calculate the cost function, it is called stochastic gradient descent, commonly abbreviated as SGD. We take one observation, pass it through the neural network, calculate the error and update the parameters. Then we take the second observation and perform similar steps. This continues for every observation in the dataset, so in one epoch there are 'm' updates, where 'm' is the number of observations.

MINI BATCH GRADIENT DESCENT : Here we take a subset of the entire training dataset and calculate the cost function on it. This variant is used most often while training deep learning models.
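To make the three variants concrete, here is a minimal NumPy sketch on a toy linear-regression problem. The data, loss, learning rate, epoch counts and batch size are illustrative choices of mine, not from the article:

```python
import numpy as np

# Toy dataset: 100 observations, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def gradient(w, Xb, yb):
    # dE/dw for the squared-error cost E = mean((Xb @ w - yb)**2) / 2
    return Xb.T @ (Xb @ w - yb) / len(yb)

alpha = 0.1

# Batch gradient descent: one parameter update per epoch, using all m rows.
w = np.zeros(3)
for epoch in range(100):
    w -= alpha * gradient(w, X, y)

# Stochastic gradient descent: m updates per epoch, one observation each.
w_sgd = np.zeros(3)
for epoch in range(20):
    for i in rng.permutation(len(y)):
        w_sgd -= alpha * gradient(w_sgd, X[i:i + 1], y[i:i + 1])

# Mini-batch gradient descent: one update per subset (here, batches of 16).
w_mb = np.zeros(3)
for epoch in range(50):
    for start in range(0, len(y), 16):
        batch = slice(start, start + 16)
        w_mb -= alpha * gradient(w_mb, X[batch], y[batch])
```

All three end up near the true weights; the difference is how many updates happen per epoch and how noisy each update is.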

Let us see the variation of the cost functions :

Source : Analytics Vidhya

With batch gradient descent, since we take the entire dataset, the cost function decreases smoothly. With SGD it is not that smooth: we update the parameters based on a single observation, so there are many iterations involved, and the model may also start to learn noise. The cost curve for mini-batch gradient descent is smoother than SGD's, since we do not update the parameters after every single observation but after every subset of the data.

Let us see the computation cost and time:

Source : Analytics Vidhya

We can see that mini-batch gradient descent gives a better trade-off than the others, which is why it is used most often while building deep learning models.

So, these are the variants of gradient descent, and you will come across them quite often. Hopefully you are now well versed with these terms.

Let us now jump to the challenges we face when using gradient descent.

CHALLENGES OF GRADIENT DESCENT.

In this part we will focus on the challenges of gradient descent. Let us take a look at it.

  1. It gets stuck at local minima (solved by SGD with Momentum)

Let us say below is the graph of our cost function. Since we want to minimize the cost, we would like to reach the global minimum: the point with the minimum cost value in the entire cost function. A point whose cost is lowest only among its neighbors is called a local minimum, and this is where gradient descent generally gets stuck.

Source : Analytics Vidhya

Using the gradient descent algorithm we calculate the slope at each point, and after some iterations we reach a local minimum. The slope there is 0, so the dE/dw term (the gradient) becomes 0, the parameters stop getting updated, and we get stuck.

Source : Analytics Vidhya

Our aim is not to reach a local minimum but the global minimum, so we need some push at the local minimum that takes us out of this scenario.

Let us understand this with an example:

Consider a ball at a certain height with initial speed 'u'. The ball rolls down the hill with some speed 'v', and clearly v > u since the ball is rolling downhill. By the time the ball reaches the local minimum it has gained some speed, which gives it a push, and the ball comes out of the local minimum. So, to push the ball out of the local minimum we need some kind of accumulated speed. In terms of a neural network, this accumulated speed is equivalent to the weighted gradients.

Source : Analytics Vidhya

At the local minimum the slope, i.e. the current gradient at time 't', is 0. But we are still left with the previously accumulated gradient at 't-1', so the weighted gradient 'Vt' has some value, which gives the required push to come out of the local minimum. The 'beta' value tells us how much weight to give to the current versus the previously accumulated gradient. Generally beta is taken as 0.9, which gives 10% weightage to the current gradient and 90% to the previously accumulated gradient. 'Vt', being our new gradient, is then used to update the parameters of the neural network. This is known as Stochastic Gradient Descent with Momentum.
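The update described above can be sketched in a few lines. The function name and the learning rate are my own illustrative choices; beta = 0.9 as in the text:

```python
def sgd_momentum_step(w, v, grad, alpha=0.01, beta=0.9):
    """One SGD-with-momentum update as described above.

    v_t = beta * v_{t-1} + (1 - beta) * grad   (accumulated gradient)
    w   = w - alpha * v_t
    """
    v = beta * v + (1 - beta) * grad
    w = w - alpha * v
    return w, v

# At a local minimum the current gradient is 0, but the previously
# accumulated velocity v is non-zero, so the parameters still move.
w, v = 1.0, 0.5
w_new, v_new = sgd_momentum_step(w, v, grad=0.0)
```

Even with grad = 0, v_new is 0.9 * 0.5 = 0.45, so the weights keep moving: that is the "push" out of the local minimum.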

So, using the concept of Stochastic Gradient Descent with Momentum we can solve the problem of getting stuck at local minima.

2. Same learning rate throughout the training process. (RMSProp)

This is the second issue we encounter while training a neural network. Some parameters may converge faster than others, but if we apply the same learning rate throughout the training phase, we force them to stay in sync, which can lead to slower convergence.

Our aim is that as training progresses, the learning rate should also be updated according to the cost function. Let us see how we can get this resolved.

Going back to our gradient descent equation, dE/dw is the gradient that gets updated during training, so we can use this term to update the learning rate. But there is an issue: some gradients may be positive and some negative at different points, so they might cancel each other out. To remove the sign, we can take the sum of squares of the gradients and use this value to update the learning rate.

Source : Analytics Vidhya

There can be another issue with this approach: the square of any number, negative or positive, is always positive, so the accumulated sum keeps growing. This keeps shrinking the learning rate, and after some iterations the learning rate tends towards zero; the parameters then barely change from their previous values, which leads to slower convergence.

Source : Analytics Vidhya

To overcome this issue we can use an SGD-with-momentum style equation, which assigns weights to the current gradient and the previously accumulated gradients, and we square the gradients to nullify the effect of negative values.

Source : Analytics Vidhya

This is the updated equation of RMSProp. As we can see from the above equation, alpha is divided by the square root of the weighted average plus some small epsilon. The epsilon in the denominator ensures the denominator never becomes 0. Generally this value is very small, e.g. 1e-8.

Now let us understand how this equation updates the learning rate during training: when the squared gradients are high, the weighted average is high, which in turn reduces the effective learning rate. Similarly, when the squared gradients are low, the weighted average is low, which in turn increases the effective learning rate.
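A minimal sketch of the RMSProp update described above. The function name and default values are illustrative assumptions on my part:

```python
import numpy as np

def rmsprop_step(w, s, grad, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update following the equations above.

    s_t = beta * s_{t-1} + (1 - beta) * grad**2  (weighted avg of squared grads)
    w   = w - alpha * grad / (sqrt(s_t) + eps)
    """
    s = beta * s + (1 - beta) * grad ** 2
    w = w - alpha * grad / (np.sqrt(s) + eps)
    return w, s

# Two parameters whose gradients differ by a factor of 100.
w, s = np.array([1.0, 1.0]), np.array([0.0, 0.0])
grad = np.array([10.0, 0.1])
w_new, s_new = rmsprop_step(w, s, grad)
```

Because each parameter's step is divided by the root of its own squared-gradient average, the first step here is (almost) the same size for both parameters despite the 100x difference in gradient magnitude: each parameter gets its own effective learning rate.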

Source : Analytics Vidhya

DIFFERENT TYPES OF OPTIMIZERS.

  1. Stochastic Gradient Descent with Momentum.
  2. RMSProp.
  3. ADAM.

So far we have covered the two optimizers discussed in the section above, which overcome the challenges of gradient descent. SGD with momentum resolves the issue of getting stuck at local minima using the weighted sum of previously accumulated gradients. RMSProp resolves the problem of a single learning rate for all parameters using the weighted sum of squared gradients.

Now we will look at the most widely used optimizer, ADAM. It combines SGD with momentum, to resolve the local-minima problem, with RMSProp, which uses the squares of previous gradients to resolve the same-learning-rate issue.

Source : Analytics Vidhya

From the updated equation of ADAM, we can see that 'Vt', the weighted sum of current and previously accumulated gradients, is used as the gradient to resolve the local-minima issue, while the square root of the weighted sum of squared gradients resolves the same-learning-rate issue.
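A sketch of an Adam step combining both ideas. Note that standard Adam also includes bias correction (the division by 1 - beta**t), which the intuition above does not dwell on; the names and defaults below are illustrative:

```python
import numpy as np

def adam_step(w, v, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (v) plus RMSProp-style scaling (s)."""
    v = beta1 * v + (1 - beta1) * grad        # momentum term (SGD with momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2   # squared-gradient term (RMSProp)
    v_hat = v / (1 - beta1 ** t)              # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# One step from w = 1.0 on the cost E = w**2, whose gradient is 2 * w.
w, v, s = 1.0, 0.0, 0.0
w_new, v, s = adam_step(w, v, s, grad=2 * w, t=1)
```

On the very first step the bias-corrected update works out to roughly alpha times the sign of the gradient, so w moves from 1.0 to about 0.999 regardless of the gradient's raw magnitude.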

So, this was the detailed intuition behind the various types of optimizers which are used in building our neural network. I hope you enjoyed learning it.

LOSS FUNCTIONS:

In this section we will learn what loss functions are and which of them are most commonly used in deep learning.

Before looking at loss functions, let us recap the neural network a bit. The figure below shows how our predictions are calculated and how the error function helps us update the weights and biases in the network, improving the predictions and performance of the model.

Source : Analytics Vidhya
  1. MEAN SQUARED ERROR: Until now we have been using MSE (Mean Squared Error) to calculate the error / loss function / cost function. Let us see how we can calculate MSE.

Source : Analytics Vidhya

MSE is generally used when we have regression type of problem and target variable is continuous.

2. MEAN ABSOLUTE ERROR: MAE is another metric which is used to calculate the loss function. Let us see how we can calculate MAE.

Source : Analytics Vidhya

MAE is also used when we have regression type of problem and target variable is continuous.

3. ROOT MEAN SQUARED ERROR: RMSE is just the square root of MSE. Since we know how to calculate MSE, for RMSE we simply take the square root of the value obtained.

Source : Analytics Vidhya
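All three regression losses can be computed in a few lines of NumPy. This is a minimal sketch; the sample values are made up for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between actual and predicted values.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean of the absolute differences.
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Square root of MSE.
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
# errors: 0.5, 0.0, -2.0
# MSE = (0.25 + 0 + 4) / 3, MAE = (0.5 + 0 + 2) / 3
```

Note how MSE penalizes the large error (2.0) much more heavily than MAE does, because of the squaring.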

So far we have seen some of the loss functions for regression tasks. Let us now look at the loss functions used for classification tasks.

Classification tasks can be further divided into binary classification and multiclass classification. In binary classification we have only 2 classes in our target variable, whereas in multiclass classification there can be more than 2 classes to predict. Let us understand the loss function used in both:

1. BINARY CROSS ENTROPY / LOG LOSS.

“It is the negative average of the log of corrected predicted probabilities”

It is the most common type of loss function used for classification problems. It compares each predicted probability to the actual class output, which can either be 0 or 1, and calculates a score that penalizes the probabilities based on their distance from the actual value.

Let us understand what are the corrected probabilities with an example.

Source : Analytics Vidhya

Here, we see the corrected probabilities column against each observation. The predicted probabilities column contains the probability of class 1. When the actual value is 0, the predicted probability still represents class 1, so the probability that the observation belongs to class 0 is obtained by subtracting that value from 1. This happens for ID8, ID2 and ID5 in our example.

So, for observations whose actual class is 1 we take the predicted probability as is, and where the actual class is 0 we subtract the predicted probability from 1. We then take the log of these corrected probabilities, which gives a smaller penalty for smaller differences (e.g. if the actual is 1 and the predicted probability is 0.9 we assign a small penalty; if the predicted probability is 0.6 we assign a somewhat higher one).

Source : Analytics Vidhya

Instead of calculating the corrected probability explicitly, we can use the formula written below. Suppose the actual class is 1: then the (1-y) term becomes 0. Similarly, if the actual class is 0, the y*log(p) term becomes 0.

Source : Analytics Vidhya
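Putting the formula together, here is a minimal NumPy sketch of binary cross entropy, using the same probabilities as the corrected-probability example (actual 1 predicted at 0.9 and 0.6, actual 0 predicted at 0.1 and 0.4). The clipping constant is a common implementation detail of mine, added to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    # Clip predictions away from exactly 0 or 1 so the log is defined.
    p = np.clip(y_pred, eps, 1 - eps)
    # Negative average of y*log(p) + (1-y)*log(1-p); for y = 0 this is
    # log(1 - p), i.e. the log of the "corrected" probability.
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.6, 0.4])
loss = binary_cross_entropy(y_true, y_pred)
```

The corrected probabilities here are 0.9, 0.9, 0.6 and 0.6, so the loss is the negative average of their logs, roughly 0.308; the 0.6 predictions contribute most of the penalty.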

So, this is how we calculate binary cross entropy, a very useful loss function for classification problems.

CONCLUSION

So, we learnt about various optimizers: Stochastic Gradient Descent with momentum, which resolves the issue of local minima; RMSProp, which resolves the issue of a single learning rate; and ADAM, which combines both. Then we learnt about various loss functions that can be used in regression and classification problems.

Do connect with me on LinkedIn : https://www.linkedin.com/in/gaurav-rajpal/

Stay tuned for further updates on demo projects where we play with image dataset in Deep Learning.

Regards,

Gaurav Rajpal (gauravrajpal1994@gmail.com)
