Loss Functions Explained

Harsha Bommana
Sep 30, 2019 · 8 min read

In any deep learning project, configuring the loss function is one of the most important steps to ensure the model will work in the intended manner. The loss function gives a lot of practical flexibility to your neural network, and it defines exactly how the output of the network is connected to the learning process.

There are several tasks neural networks can perform, from predicting continuous values like monthly expenditure to classifying discrete classes like cats and dogs. Each task requires a different type of loss, since the output format is different. And for very specialized tasks, it's up to us how we want to define the loss.

From a very simplified perspective, the loss function (J) can be defined as a function which takes in two parameters:

  1. Predicted Output
  2. True Output
[Figure: Neural Network Loss Visualization]

This function essentially calculates how poorly our model is performing by comparing what the model is predicting with the actual value it is supposed to output. If Y_pred is very far off from Y, the loss value will be very high. However, if both values are almost the same, the loss value will be very low. Hence we need a loss function that can penalize the model effectively while it is training on a dataset.

If the loss is very high, this huge value will propagate through the network during training and the weights will be changed a little more than usual. If it is small, then the weights won't change that much, since the network is already doing a good job.

This scenario is somewhat analogous to studying for exams. If one does poorly in an exam, we can say the loss is very high, and that person will have to change a lot of things within themselves in order to get a better grade next time. However if the exam went well, then they wouldn’t do anything very different from what they are already doing for the next exam.
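To make this concrete, here is a minimal sketch in Python. It uses a toy model with a single weight and a squared-error loss (both hypothetical choices, purely for illustration) to show that a larger error produces a larger change to the weight:

def sgd_step(w, x, y, lr=0.1):
    # Prediction of a one-weight "network": y_pred = w * x
    error = y - w * x
    # Gradient of the squared-error loss (y - w * x) ** 2 with respect to w
    grad = -2 * x * error
    # The update is proportional to the error: bigger loss, bigger change
    return w - lr * grad

w = 0.0
print(sgd_step(w, x=1.0, y=5.0))  # far-off target -> w jumps to 1.0
print(sgd_step(w, x=1.0, y=0.5))  # nearby target  -> w only moves to 0.1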

Now let’s look at classification as a task and understand how the loss functions work in this case.

Classification Losses

[Figure: Classification Neural Network Output Format]

The number of nodes of the output layer will depend on the number of classes present in the data. Each node will represent a single class. The value of each output node essentially represents the probability of that class being the correct class.

Pr(Class 1) = Probability of Class 1 being the correct class
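For example, with three classes the output layer produces one probability per node. A minimal NumPy sketch (the class names and probability values are made up for illustration):

import numpy as np

class_names = ["cat", "dog", "bird"]             # hypothetical classes
probs = np.array([0.10, 0.75, 0.15])             # one probability per output node
predicted_class = class_names[np.argmax(probs)]  # the most probable class wins
print(predicted_class)                           # -> "dog"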

Once we get the probabilities of all the different classes, we take the class with the highest probability as the predicted class for that instance. First, let's explore how binary classification is done.

Binary Classification

[Figure: Sigmoid Function Graph Visualization]

For binary classification, the network has a single output node, and we pass its value through the sigmoid activation function. As the input to the sigmoid becomes larger and tends towards plus infinity, the output of the sigmoid tends to 1; as the input becomes smaller and tends towards negative infinity, the output tends to 0. We are therefore guaranteed to always get a value between 0 and 1, which is exactly what we need since we require probabilities.

If the output is above 0.5 (50% probability), we consider it to fall under the positive class, and if it is below 0.5, we consider it to fall under the negative class. For example, if we are training a network to classify between cats and dogs, we can assign dogs the positive class, so the target value for dogs in the dataset will be 1; similarly, cats will be assigned the negative class, so their target value will be 0.
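A minimal NumPy sketch of this step (the raw output value is made up for illustration):

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

raw_output = 2.0                          # hypothetical pre-activation output
y_pred = sigmoid(raw_output)              # ≈ 0.88
label = "dog" if y_pred > 0.5 else "cat"  # dog = positive class, cat = negative class
print(y_pred, label)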

The loss function we use for binary classification is called binary cross entropy (BCE). This function effectively penalizes the neural network for the binary classification task. Let's see what this function looks like.

[Figure: Binary Cross Entropy Loss Graphs]

As you can see, there are two separate functions, one for each value of Y. When we need to predict the positive class (Y = 1), we will use

Loss = -log(Y_pred)

And when we need to predict the negative class (Y = 0), we will use

Loss = -log(1-Y_pred)

As you can see in the graphs, for the first function, when Y_pred is equal to 1, the loss is equal to 0, which makes sense because Y_pred is exactly the same as Y. As Y_pred gets closer to 0, the loss increases at a very high rate, and as Y_pred reaches 0 it tends to infinity (with the natural log, -log(0.9) ≈ 0.1 while -log(0.01) ≈ 4.6). This is because, from a classification perspective, 0 and 1 have to be polar opposites, since they each represent completely different classes. So when Y_pred is 0 while Y is 1, the loss has to be very high in order for the network to learn from its mistakes more effectively.

[Figure: Binary Classification Loss Comparisons]

We can represent the entire loss function in a single equation as follows:

[Figure: Binary Cross Entropy Full Equation]
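Written out in the same notation as above, the combined equation is:

Loss = -[Y * log(Y_pred) + (1 - Y) * log(1 - Y_pred)]

When Y = 1, the second term vanishes and we are left with -log(Y_pred); when Y = 0, the first term vanishes and we are left with -log(1 - Y_pred). Averaged over the dataset, this can be sketched in NumPy as follows (the function and variable names are my own):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so we never take log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # -[y*log(p) + (1-y)*log(1-p)], averaged over all samples
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.4])
print(binary_cross_entropy(y_true, y_pred))  # ≈ 0.415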

This loss function is also called Log Loss. This is how the loss function is designed for a binary classification neural network. Now let's move on to see how the loss is defined for a multiclass classification network.

Multiclass Classification

The activation function we use in this case is softmax. This function ensures that all the output nodes have values between 0 and 1 and that the sum of all the output node values always equals 1. The formula for softmax is as follows:

[Figure: Softmax Formula]
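In plain form, for output values x_1, x_2, …, x_n:

softmax(x_i) = exp(x_i) / (exp(x_1) + exp(x_2) + … + exp(x_n))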

Let’s visualize this with an example:

[Figure: Softmax Example Visualization]

So as you can see, we simply pass all the values through an exponential function. After that, to make sure they are all in the range of 0 to 1 and that the sum of all the output values equals 1, we just divide each exponential by the sum of all the exponentials.

So why do we have to pass each value through an exponential before normalizing? Why can't we just normalize the values themselves? This is because the goal of softmax is to push one value very high (close to 1) and all other values very low (close to 0), and the exponential exaggerates the differences between the values so that this happens. We then normalize because we need probabilities.
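A quick NumPy sketch of this difference (the input values are made up for illustration):

import numpy as np

def softmax(x):
    # Subtracting the max is a standard trick for numerical stability
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

x = np.array([1.0, 2.0, 5.0])
print(x / np.sum(x))  # plain normalization: [0.125 0.25  0.625]
print(softmax(x))     # softmax:             [0.017 0.047 0.936]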

Now that our outputs are in a proper format, let's look at how we configure the loss function. The good thing is that it is essentially the same as for binary classification: we apply log loss to each output node with respect to its target value and then sum this across all the output nodes.

[Figure: Categorical Cross Entropy Visualization]
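As a rough NumPy sketch, assuming one-hot target vectors (the numbers are made up for illustration):

import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Log loss on every output node against its target, summed over the nodes
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])     # one-hot target: the second class is correct
y_pred = np.array([0.05, 0.90, 0.05])  # softmax output of the network
print(categorical_cross_entropy(y_true, y_pred))  # ≈ 0.105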

This loss is called Categorical Cross Entropy. Now let's move on to a special case of classification called multilabel classification.

Multilabel Classification

In multilabel classification, a single instance can belong to more than one class at the same time. For this we can't use softmax, because softmax pushes one class towards 1 and forces the others towards 0. So instead we simply keep a sigmoid on each of the output nodes, since we are trying to predict each class's individual probability.

As for the loss, we can directly apply log loss to each node and sum the results, similar to what we did in multiclass classification.
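A minimal NumPy sketch, assuming three independent labels (the values are made up for illustration):

import numpy as np

def multilabel_log_loss(y_true, y_pred, eps=1e-12):
    # Binary log loss on every node independently, then summed
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_node = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.sum(per_node)

y_true = np.array([1.0, 0.0, 1.0])  # an instance can belong to several classes at once
y_pred = np.array([0.8, 0.1, 0.6])  # sigmoid output of each node
print(multilabel_log_loss(y_true, y_pred))  # ≈ 0.84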

Now that we have covered classification, let’s now move on to regression.

Regression Loss

Regression involves predicting a single continuous value rather than a discrete class. Some common examples are:

  • House price prediction
  • Person age prediction

In regression models, our neural network will have one output node for every continuous value we are trying to predict. Regression losses are calculated by performing direct comparisons between the output value and the true value.

The most popular loss function we use for regression models is the mean squared error (MSE) loss function. Here we simply calculate the square of the difference between Y and Y_pred and average it over all the data. Suppose there are n data points:

[Figure: Mean Squared Error Loss Function]
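In plain form:

MSE = (1/n) * Σ (Y_i - Y_pred_i)²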

Here Y_i and Y_pred_i refer to the i-th Y value in the dataset and the corresponding Y_pred from the neural network for the same data point.
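A minimal NumPy sketch (the values are made up for illustration):

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences over all n data points
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([300.0, 150.0, 500.0])   # e.g. true house prices (in thousands)
y_pred = np.array([320.0, 140.0, 480.0])   # the network's predictions
print(mean_squared_error(y_true, y_pred))  # -> 300.0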

That concludes this article. Hopefully now you have a deeper understanding of how loss functions are configured for various tasks in deep learning. Thank you for reading!

Read more Deep Learning articles at https://deeplearningdemystified.com
