Never Forget Gradient Descent and Loss Function Ever Again

Understand why we use gradient descent and which loss function best suits your needs

Dhruv Kumar Patwari
Nerd For Tech
7 min readJul 18, 2021


In the previous article, we found out about the various activation functions and which one best suits our needs. After we choose an activation function, we are ready to feed data to our neural network. After the first record, we get a predicted value that is way off from the original value.

How do we tell the network that its predicted value is off by some amount, so that it can learn and try again?

Loss and Optimizer functions.

We will learn about the loss functions now and optimisers in a later article.

But before we calculate the loss, we need to decide how many records to process before adjusting the weights. Do we send the entire dataset and then update the weights, update them after each row, or something in between? The way we pass the data determines how memory-intensive training is and how quickly the network can find a pattern in the data.

Gradient Descent

Gradient descent is the underlying principle by which any “learning” happens. We want to reduce the difference between the predicted value and the original value, also known as loss. If you have spent any time learning machine learning, you must have seen a graph like this where we want to reach the bottom of that graph.

Error surface of a linear neuron with two input weights | Image from Wikimedia

But what does it mean?

Gradient descent finds the degree to which each weight needs to change so that the model can eventually reach the point where it has the lowest loss. In other words, we say that a minimum is found.

The above statement can be written as an equation as follows.

New Weight = Old Weight − (a small change in W)

We can write the small change in W as ΔW, so the updated equation is

W_new = W_old − ΔW

How do we calculate ΔW? That is where gradient descent comes in. To better understand gradient descent, let us consider an example.

I have taken a linear regression example as the math is more straightforward, but the same can be applied to a neural network.

Graph 1: Image by author | Example dataset plotted in a graph

As you can see in Graph 1, the line in red fits the data best, but how did we get to this line? How did the model know that this, and only this line, best fits the data?

The answer is gradient descent.

We start by picking a random intercept, that is, the value of c in the equation y = mx + c. Assume the slope m is fixed at 0.5.

Graph 2: Image by author | Intercept is 0.

For c = 0 and the first row in the data, we get a predicted value of y = 0.5. Check out Graph 2.

The difference between 0.5 and the actual y value of 1.5 gives a loss of 1. If we move the intercept closer to the cluster, say c = 0.5, the loss shrinks. If we continue to push the intercept, we get values like the ones shown in the third graph.

Graph 3: Image by author | Loss vs intercept

From Graph 3, we can see that the loss is lowest when the intercept is 1; that point is called the minimum.
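The intercept sweep above can be sketched in code. This is a minimal illustration using a hypothetical dataset and a squared-difference loss (the exact points and loss form are assumptions, chosen so the minimum lands at c = 1 as in Graph 3); the slope is fixed at 0.5 and only the intercept c varies.

```python
# Hypothetical dataset in the spirit of Graph 1; slope fixed at 0.5.
xs = [1.0, 2.0, 3.0]
ys = [1.5, 2.0, 2.5]

def total_squared_loss(c, slope=0.5):
    """Sum of squared differences between predicted and actual y values."""
    return sum((y - (slope * x + c)) ** 2 for x, y in zip(xs, ys))

# Sweep the intercept and watch the loss trace out a curve like Graph 3.
for c in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"c = {c:.1f}  loss = {total_squared_loss(c):.2f}")
# The loss bottoms out at c = 1.0, the minimum.
```

Trying only a handful of candidate intercepts like this is exactly the brute-force approach the next paragraphs improve upon.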

But what if you skipped this value of the intercept in your list of possible intercepts? And how do you know which intercept value to pick?

To counter this issue, we have gradient descent, which picks the size of each step based on the slope of the loss curve (Graph 4). The steeper the slope, the larger the change in the intercept, and vice versa. This continues until the slope is zero, at which point we can safely conclude that we have reached a minimum and the model has achieved the lowest possible loss.

The way to find the gradient is differentiation. If we differentiate the loss curve, we get the tangent at a point, whose slope is the gradient. To learn more about differentiation, you can check out this video by 3b1b: Gradient descent, how neural networks learn | Chapter 2, Deep learning.

Graph 4: Graph of a parabola and tangent line | image from Wikimedia

You can also set a maximum limit on the amount by which a weight can change in one step, called the learning rate. The lower the learning rate, the smaller the weight change.

So ΔW, from the equation we saw earlier, can be written as the learning rate η times the gradient of the loss with respect to the weight:

ΔW = η · ∂L/∂W

Substituting this value of ΔW into the formula above gives us the equation for gradient descent:

W_new = W_old − η · ∂L/∂W
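The update rule can be sketched as a short loop. This is a minimal illustration, not a full training routine: it assumes a toy one-weight loss L(w) = (w − 1)², whose derivative 2(w − 1) we can write down by hand.

```python
# Gradient descent on a single weight for the toy loss L(w) = (w - 1)**2.
# Its derivative is dL/dw = 2 * (w - 1), and the minimum sits at w = 1.
def gradient(w):
    return 2.0 * (w - 1.0)

w = 0.0    # random starting weight
lr = 0.1   # learning rate: caps how much w changes per step

for _ in range(100):
    w = w - lr * gradient(w)  # W_new = W_old - lr * dL/dW

print(round(w, 4))  # converges towards the minimum at w = 1
```

Notice that the step size shrinks automatically as w approaches 1, because the gradient itself shrinks; that is the "steeper slope, bigger step" behaviour described above.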

We now understand how gradient descent is calculated. Let us now look at the different modes of passing the data.

Types of Gradient Descent

There are three types.

Batch gradient descent passes the entire dataset and calculates the average loss before updating the weights. This gives a good picture of the whole dataset, but it is slow and memory-intensive.

The next option is Mini-batch gradient descent. We define a batch size, say n; n randomly chosen records are selected, the loss is calculated for those data points, and the weights are updated accordingly. If n equals the number of rows, it becomes batch gradient descent.

It is lighter on memory, as we can control the batch size. Still, there is higher volatility, as the randomly chosen records might not generalise the entire dataset well.

Stochastic gradient descent is another way of updating weights: the weights are updated after every single record. It is quick and less memory-intensive but highly volatile, so it might take a long time to converge to a minimum.
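The three schemes differ only in how the data is sliced into batches before each weight update. Here is a minimal sketch of that slicing, using a stand-in list of records (the data and function name are illustrative assumptions):

```python
import random

data = list(range(20))  # stand-in for 20 training records

def batches(records, batch_size):
    """Yield shuffled batches. batch_size = len(records) gives batch GD,
    batch_size = 1 gives stochastic GD, anything in between is mini-batch GD."""
    shuffled = records[:]
    random.shuffle(shuffled)
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

print(len(list(batches(data, len(data)))))  # 1 update per epoch  -> batch GD
print(len(list(batches(data, 4))))          # 5 updates per epoch -> mini-batch
print(len(list(batches(data, 1))))          # 20 updates per epoch -> stochastic
```

One weight update happens per yielded batch, which is why stochastic gradient descent is both the fastest to react and the noisiest.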

None of these ways of passing the data is wrong; choose whichever method suits your needs.

Loss Function

Once we have a predicted value of y, we want to check how close that value is to the original. The simplest way to find out is to subtract one from the other and take the difference as the loss.

The problem with this approach is that when we average all the losses, the positive and negative losses can cancel each other out. To avoid this, we can use the absolute value of each difference, also called L1 loss or Mean Absolute Error.
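The cancellation problem is easy to demonstrate with two made-up records (the numbers here are illustrative):

```python
# A +2 error and a -2 error average to zero, even though the model
# is clearly wrong on both records.
actual    = [3.0, 5.0]
predicted = [5.0, 3.0]

raw_mean = sum(a - p for a, p in zip(actual, predicted)) / len(actual)
mae      = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(raw_mean)  # 0.0 -- misleadingly "perfect"
print(mae)       # 2.0 -- L1 loss / Mean Absolute Error
```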

Using the absolute value of the loss makes sense if the data might contain outliers. But it is not so great if the data is tightly packed. If we want a loss that highlights the differences between tightly packed data points, we can square the difference.

L2 loss, or Mean Squared Error, squares the losses and averages them. We should avoid it if our data has outliers, as they can skew the loss one way, leading to a poorly fitting model.
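A quick comparison shows how much more a single outlier dominates L2 than L1 (the per-record errors below are made up for illustration):

```python
# Hypothetical per-record errors; the 10.0 is an outlier.
errors = [1.0, 1.0, 1.0, 10.0]

mae = sum(abs(e) for e in errors) / len(errors)   # L1
mse = sum(e ** 2 for e in errors) / len(errors)   # L2

print(mae)  # 3.25
print(mse)  # 25.75 -- the one outlier contributes 100 of the 103 squared units
```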

But what if your data has both outliers and tightly packed data? Which should you use? In comes Huber loss.

Huber Loss

It takes the best of both L1 and L2 loss. If the loss is larger than a threshold value delta, it uses the absolute (L1-style) loss; otherwise, it uses the squared (L2-style) loss.
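The standard form of Huber loss can be sketched directly from that description (the 0.5 factors are the usual convention that makes the two branches join smoothly at delta):

```python
def huber(error, delta=1.0):
    """Huber loss: squared for small errors, linear (absolute) for large ones."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

print(huber(0.5))   # 0.125 -- behaves like L2 near zero
print(huber(10.0))  # 9.5   -- grows linearly, like L1, for outliers
```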

Now, these losses we discussed are for regression problems. That is, our neural network has only one output. What if we have multiple results, like in the case of a classification problem?

One of the functions we can use is the cross-entropy function.

Cross entropy might sound complex, but it is essentially the negative sum of the products of the expected value and the log of the predicted value, over all the classes.

For example, suppose we have three classes, red, green, and blue, and we want to find the loss of the predicted value with respect to the actual value. The cross-entropy loss in this case is:

In classification problems, the expected values of all classes other than the true one are zero, so the cross-entropy loss in that case reduces to the negative log of the predicted probability of the true class.

Why do we use log instead of squares? The simple answer is that the log heavily penalises predicted probabilities near zero for the true class, that is, confident wrong predictions, which is exactly what we want; squaring the error would punish those mistakes far more gently.
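The three-class example can be worked through in a few lines. The probability values here are made up for illustration; the true class is "red", so its one-hot expected vector is [1, 0, 0] and only its term survives the sum.

```python
import math

# Three classes: red, green, blue. Actual class is red.
expected  = [1.0, 0.0, 0.0]  # one-hot encoding of the true class
predicted = [0.7, 0.2, 0.1]  # model's predicted probabilities

loss = -sum(e * math.log(p) for e, p in zip(expected, predicted) if e > 0)
print(round(loss, 4))  # 0.3567

# A confident wrong prediction (true class given only 1% probability)
# is punished far harder by the log:
print(round(-math.log(0.01), 4))  # 4.6052
```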

There are a lot more loss functions. You can read more about them here.

Conclusion

We discussed what gradient descent is and why its equation looks the way it does. We also looked at the various ways of passing data to a neural network, and finally, we learnt about the different loss functions. I hope this helped you understand why we use certain equations and how we use them. The following article will be about optimisers and their types. Do keep your feedback coming, and let me know what you want me to talk about in the coming articles.
