Loss Functions

Published in Artificialis · 19 min read · Aug 14, 2021

Loss functions explanations and examples

Good morning! Today is a new day, a day of adventure and mountain climbing! So like the good student you are, you attended today’s class but didn’t understand :( Luckily, you got me, your personal professor. I asked your classmates about today’s class and they told me that the professor taught you about Loss Functions, some even told me that he taught them how to climb down from different mountains. Well, grab your hiking gear and follow my lead, we are going to climb down from a high mountain, higher than Everest itself.

Loss Functions are also called Cost Functions, especially in the context of CNNs (Convolutional Neural Networks).

Do you remember that the objective of training the neural network is to try to minimize the loss between the predictions and the actual values? If yes, good! If not, read my previous blog.

The Loss Function tells us how badly our machine performed and what’s the distance between the predictions and the actual values. There are many different Loss Functions for many different problems and I’m going to teach you the famous ones.

Mean Squared Error(MSE)

The Mean Squared Error or MSE calculates the squared error or in other words, the squared difference between the actual output and the predicted output for each sample. Sum them up and take their average. I might say that this Error Function is the most famous one and the most simple one, too.

Note: Some people call the MSE by the name of L2 Loss

MSE = (1/n) ⋅ Σᵢ (ŷᵢ - yᵢ)²

ŷᵢ - Predicted Output
yᵢ - Actual Output
n - Number of training samples in each minibatch (if not using minibatch training, then n = the total number of training samples). If we have 1000 training samples and we are using a batch size of 100, we need 10 iterations, and in each iteration there are 100 training samples, so n = 100.
No matter if you do (ŷᵢ - yᵢ) or (yᵢ - ŷᵢ), you will get the same result because, in the end, you square the distance.

Variations of Mean Squared Error

Some people use Half of the MSE and some use the Root MSE.

Half of the MSE (HMSE) is used to make the derivative cleaner: when you differentiate the squared error, the exponent 2 comes down and cancels the half, so the factor 1/(2n) becomes 1/n. The loss values differ by a constant factor, but that is not important for training.

MSE, HMSE and RMSE all reach their minimum at the same predictions; they only differ in scale (RMSE is in the same units as the target). Different applications use different variations.
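To make the three variations concrete, here is a minimal sketch in plain Python. The function names (`mse`, `half_mse`, `rmse`) and the toy numbers are my own, made up just for illustration.

```python
import math

def mse(y_pred, y_true):
    # Mean of the squared differences
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n

def half_mse(y_pred, y_true):
    # Half of the MSE: the 1/2 cancels the 2 that appears when differentiating
    return 0.5 * mse(y_pred, y_true)

def rmse(y_pred, y_true):
    # Root MSE: back in the same units as the target
    return math.sqrt(mse(y_pred, y_true))

y_pred = [2.0, 4.0]  # made-up predictions
y_true = [1.0, 2.0]  # made-up targets

print(mse(y_pred, y_true))       # 2.5
print(half_mse(y_pred, y_true))  # 1.25
print(rmse(y_pred, y_true))      # square root of 2.5
```

All three go to 0 only when every prediction equals its target, which is why they lead to the same best predictions.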

Example of MSE

So, like always, your professor gave you homework! Lovely :D He gave you a dataset and asked you to calculate the Loss Function using the MSE.

The dataset:

So what do we know?

n = 3 because there are 3 samples
ŷ₁ = 48, ŷ₂ = 51, ŷ₃=57
y₁ = 60, y₂ = 53, y₃ = 60

The equation that we got

Now that we understand our dataset, it’s time to calculate the squared error for each one of the samples before summing them up:

L₁ = (ŷ₁ - y₁)² = (48 - 60)² = 144
L₂ = (ŷ₂ - y₂)² = (51 - 53)² = 4
L₃ = (ŷ₃ - y₃)² = (57 - 60)² = 9

Now that we found the Squared Error for each one of the samples, it’s time to find the MSE by summing them all up and multiplying the sum by 1/3 (because we have 3 samples):

MSE = (1/3) ⋅ (144 + 4 + 9) = 157/3 ≈ 52.3

What! 52.3!! This is huge! We must minimize it, do you remember how? Of course, you do! We need to train our neural network!
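If you want to double-check the homework, a few lines of Python are enough. This is only a minimal sketch, not a library implementation; the function name `mse` and the variable names are my own.

```python
def mse(y_pred, y_true):
    """Mean Squared Error: average of the squared differences."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n

y_pred = [48, 51, 57]  # predicted outputs from the example
y_true = [60, 53, 60]  # actual outputs from the example

loss = mse(y_pred, y_true)
print(round(loss, 1))  # 52.3
```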

Note: If there is more than one output neuron, you would add the error for each output neuron, in each of the training samples.

For more than one output neuron. j = number of output neurons.

MSE is high for large errors and decreases as the error approaches 0. For example, with a distance of 3 the squared error is 9, and with a distance of 0.5 the squared error is 0.25, so the loss is much lower.
The MSE is a quadratic, convex function of the predictions
As a function of the predictions, it has one global minimum to find
For a linear model this means there is no local minimum to get stuck in (with a neural network, the loss is generally not convex in the weights)

Mean Absolute Error(MAE)

Very similar to MSE, but instead of squaring the distance, we take the absolute value of the error. It measures the average magnitude of the error across the predictions.

MAE = (1/n) ⋅ Σᵢ |ŷᵢ - yᵢ|

ŷᵢ - Predicted Output
yᵢ - Actual Output
n - Number of training samples in each minibatch (if not using minibatch training, then n = the total number of training samples). If we have 1000 training samples and we are using a batch size of 100, we need 10 iterations, and in each iteration there are 100 training samples, so n = 100.
No matter if you do (ŷᵢ - yᵢ) or (yᵢ - ŷᵢ), you will get the same result because, in the end, you take the absolute distance.

Example of MAE

The professor of course didn’t want you to practice only on one Loss Function but on every loss function that he taught you so here is the given dataset:

So what do we know?

n = 2 because there are 2 samples
ŷ₁ = 1, ŷ₂ = 0
y₁ = 0.8, y₂ = 0.6

Time to calculate the errors:

L₁ = |1–0.8| = |0.2|= 0.2
L₂ = |0–0.6| = |-0.6| = 0.6

Now we need to find the MAE:

MAE = (1/2) ⋅ (0.2 + 0.6) = 0.4
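In code, the same calculation could look like this. It is a minimal sketch; the function name `mae` is my own choice.

```python
def mae(y_pred, y_true):
    """Mean Absolute Error: average of the absolute differences."""
    n = len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n

y_pred = [1, 0]      # predicted outputs from the example
y_true = [0.8, 0.6]  # actual outputs from the example

loss = mae(y_pred, y_true)
print(round(loss, 2))  # 0.4
```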

When to use L1 and when to use L2?

Okay, Tomer, you taught us two Loss Functions that are very similar, but why teach us several loss functions if we can use only one? Well, each loss function has its own purpose for its own problem. To explain which one to use for which problem, I need to teach you what outliers are.

An outlier is basically a data point that deviates from the rest of your data points.

So, for example, if you fit a straight line to your samples, most of them will lie close to the line. But a few points may lie very far from it. Those deviating points are called outliers.

So, an outlier is a data point that deviates from the original pattern of your data points, or from most of the data points.

L1 or L2 loss?

L2 Loss (MSE) is more sensitive to outliers than L1 Loss (MAE). When there are large deviations, the error is big, and when squaring a big number, it gets bigger. For example, if the error is 10, then MAE would give 10 and MSE would give 100.
Consequently, the L1 Loss Function is more robust and is much less affected by outliers. On the contrary, the L2 Loss Function will try to adjust the model according to these outlier values, even at the expense of the other samples. Hence, the L2 Loss Function is highly sensitive to outliers in the dataset.
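To see the difference yourself, here is a small sketch with made-up numbers: the same predictions once without and once with a single outlier. The helper names `mse` and `mae` are my own.

```python
def mse(y_pred, y_true):
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def mae(y_pred, y_true):
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

y_true = [1, 2, 3, 4]
clean = [2, 3, 4, 5]     # every prediction is off by 1
outlier = [2, 3, 4, 14]  # the last prediction is off by 10

print(mae(clean, y_true), mse(clean, y_true))      # 1.0 1.0
print(mae(outlier, y_true), mse(outlier, y_true))  # 3.25 25.75
```

One bad sample roughly triples the MAE, but it makes the MSE more than 25 times bigger.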

Huber Loss

What if we combine the L1 and L2 losses (the best of both)?

What we really would like is that when we approach the minima, use the MSE (squaring a small number becomes smaller), and when the error is big and there are some outliers, use MAE (the error is linear). We can achieve this using the Huber Loss (Smooth L1 Loss), a combination of L1 (MAE) and L2 (MSE) losses.

Can be called Huber Loss or Smooth MAE
Less sensitive to outliers in data than the squared error loss
It’s basically an absolute error that becomes quadratic when the error is small. How small that error has to be to make it quadratic depends on a hyperparameter, δ(delta), which can be tuned. The choice of the delta is critical because it determines what you’re willing to consider as an outlier.

So when the error is smaller than the hyperparameter delta it will use the MSE Loss Function otherwise it will use the MAE Loss Function.

For example, let’s say that delta equals 1. When the absolute error is smaller than 1, we are close to zero, so we want the quadratic part: half of the squared error, ½ ⋅ (y - ŷ)². The half is there for the differentiation: later on, in backpropagation, the exponent 2 comes down and cancels the half, so you end up with just (y - ŷ), up to sign. If you don’t include the half, you get two times the error when you differentiate. When the absolute error is bigger than 1, the loss is the absolute error minus 0.5: |y - ŷ| - 0.5.

This is Huber Loss, the combination of L1 and L2 losses.

Quadratic (Like MSE) for small values, and linear for large values (like MAE). The Huber loss combines both MSE and MAE.

Here you can see the graph with the different tuning of the hyperparameter delta.
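The two branches can be written down in a few lines. This is a minimal sketch of the standard Huber definition, assuming the loss is averaged over the samples; the function name `huber` and the test values are my own.

```python
def huber(y_pred, y_true, delta=1.0):
    """Huber loss averaged over the samples."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        error = abs(p - t)
        if error <= delta:
            total += 0.5 * error ** 2               # quadratic, like MSE
        else:
            total += delta * (error - 0.5 * delta)  # linear, like MAE
    return total / len(y_true)

print(huber([0.5], [0.0]))  # 0.125 (small error, quadratic branch)
print(huber([3.0], [0.0]))  # 2.5   (large error, linear branch)
```

With delta = 1, the linear branch is exactly the absolute error minus 0.5, as described above.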

Binary Cross-Entropy

Oh! Did you hear about it? Indeed, well, this is the most famous and the most useful loss function for classification problems using neural networks.

So, the Cross-Entropy function is basically the negative of the logarithmic function, -log(x)

This is pretty simple: the more your input increases, the lower the output gets. If you have a small input (x = 0.5), the output is going to be high (y ≈ 0.693, using the natural logarithm). As your input approaches zero, the output becomes extremely high.

The Binary Cross Entropy is usually used when output labels have values of 0 or 1
It can also be used when the output labels have values between 0 and 1
It is also widely used when we have only two classes(0 or 1)(example: yes or no)

y - Actual class label (0 or 1)
p - Predicted probability for the class
c - Number of classes
n- Number of samples

Ahhhhhh…..Tomer? What is this? Too complicated for me…too complicated. EXPLAIN!!

Alright, let’s look at the case where we have two classes either 1 or 0 class.

We have only one neuron in the output even though we have two classes, because one probability is enough: we can get the probability of the second class from the probability of the first class. For example, if the probability of the first class is 0.8, then the probability of the second class is 1 - 0.8 = 0.2

We don’t need the second sum because we only have one output neuron so it will not sum anything.

For each sample we are going to take one equation:

If the label is 1: -yᵢ ⋅ log(pᵢ) = -log(pᵢ)
If the label is 0: -(1-yᵢ) ⋅ log(1-pᵢ) = -log(1-pᵢ)
This happens because if yᵢ = 1, then (1-yᵢ) ⋅ log(1-pᵢ) = (1-1) ⋅ log(1-pᵢ) = 0 ⋅ log(1-pᵢ) = 0. If yᵢ = 0, then yᵢ ⋅ log(pᵢ) = 0 ⋅ log(pᵢ) = 0

from math import log

def CrossEntropy(y, p):
    # y: actual label (0 or 1), p: predicted probability of class 1
    if y == 1:
        return -log(p)
    else:
        return -log(1 - p)

We do this procedure for all samples n and then take the average.
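Averaging over all samples could be sketched like this. The function name `binary_cross_entropy` and the data are made up for illustration, and I assume the natural logarithm and probabilities strictly between 0 and 1.

```python
from math import log

def binary_cross_entropy(y_true, p_pred):
    """Average binary cross-entropy over all samples (natural log)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Only one of the two terms is non-zero for a 0/1 label
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1]        # made-up labels
p_pred = [0.9, 0.2, 0.6]  # made-up predicted probabilities for class 1

print(round(binary_cross_entropy(y_true, p_pred), 3))  # about 0.28
```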

Okay Tomer, you taught how to solve it when we have two classes but what will happen if there are more than 2 classes? The professor gave us a task with 4 classes! HELP!!

The given task with the output neurons values

The good thing is that the professor gave you only one sample so we can remove one sum function and we will get the new equation:

But then he told us something about Multi-Label Classification, what? What’s Multi-Label Classification? Well, this type of classification requires you to classify multiple labels for example:

What kind of classes can you spot?

Beach
Dog
Husky
Sitting
Laying
Waves
Hill

This is multi-label classification, you just detected more than one label!

So how does the BCE work in multi-label classification? Well, let’s explore the maths!

We are going to use this equation and let’s consider that n equals 1

If the label is 1 and the prediction is 0.1 -> -y ⋅ log(p) = -log(0.1) -> Loss is High => Minimize!!!
If the label is 1 and the prediction is 0.9 -> -y ⋅ log(p) = -log(0.9) -> Loss is Low
If the label is 0 and the prediction is 0.9 ->-(1-y)⋅ log(1-p)=-log(1–0.9) = -log(0.1) -> Loss is High => Minimize!!!
If the label is 0 and the prediction is 0.1 ->-(1-y)⋅ log(1-p)=-log(1–0.1) = -log(0.9) -> Loss is Low

After we minimize the loss we should get:

When label is 1 and prediction is 1 -> -log(1) = 0
When label is 0 and prediction 0 -> -log(1–0) = 0

These are the ideal cases.
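Here is how the multi-label case could look in code: every label gets its own BCE term. It is a minimal sketch; averaging over the labels is my assumption (summing them is also common), and the labels, the extra "cat" class, and the probabilities are hypothetical.

```python
from math import log

def bce_single(y, p):
    """Binary cross-entropy for one label y (0 or 1) and one probability p."""
    return -(y * log(p) + (1 - y) * log(1 - p))

def multi_label_bce(labels, probs):
    # Each label is scored on its own, then the losses are averaged
    return sum(bce_single(y, p) for y, p in zip(labels, probs)) / len(labels)

# Hypothetical image: beach, dog and hill are present, cat is not
labels = [1, 1, 1, 0]
good = [0.9, 0.9, 0.9, 0.1]  # confident and correct
bad = [0.1, 0.1, 0.1, 0.9]   # confident and wrong

print(round(multi_label_bce(labels, good), 3))  # 0.105
print(round(multi_label_bce(labels, bad), 3))   # 2.303
```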

The Cross-Entropy Loss is usually used for classification problems.

So the cross-entropy loss penalizes the probability of the correct class only, which means the loss is calculated only from the probability predicted for the correct class. If you relate it to the binary cross-entropy loss, we are basically only taking the first term.

CE = -Σᵢ yᵢ ⋅ log(ŷᵢ), summed over the classes i = 1, ..., c

i - Class number
c - Number of classes
yᵢ - Actual Label
ŷᵢ - Predicted probability for class i

The actual labels should be in the form of a one-hot vector in this case. A one-hot vector is a vector with a value of one at the index of the correct class and zeros everywhere else.

If Label = 0 (wrong) -> No Loss Calculation
If Label = 1 (correct) -> Loss Calculation
The loss is only calculated for the correct class! It penalizes the probability of the correct class only! In other words, you don’t care what the probabilities of the wrong classes are, because only the predicted probability of the correct class enters the calculation.

For example

And so we come back to our lovely professor who gives us more homework than before. So what we got?

Suppose you have 4 different classes to classify. For a single training example:

The ground truth (actual) labels are: [1, 0, 0, 0]
The predicted labels (after softmax(an activation function)) are: [0.1, 0.4, 0.2, 0.3]

Cross Entropy Loss = -(1 ⋅ log(0.1) + 0 + 0+ 0) = -log(0.1) = 2.303 -> Loss is High!!

We ignore the loss for 0 labels
The loss doesn’t depend on the probabilities for the incorrect classes!

Oh wow! It was pretty easy. Indeed, because the one-hot vector has exactly one correct class for each sample, the summation over the classes c collapses to a single term.

Sometimes, the cross entropy loss is averaged over the training samples n:

n -> Mini-batch size if using mini-batch training
n -> Complete training samples if not using mini-batch training

Well of course it will never be that easy. The professor gave us another problem but this time the prediction is almost correct!

Suppose you have 4 different classes to classify. For a single training example:

The ground truth (actual) labels are: [1, 0, 0, 0]
The predicted labels (after softmax(an activation function)) are: [0.9, 0.01, 0.05, 0.04]

Cross Entropy Loss = -(1 ⋅ log(0.9) + 0 + 0 + 0) = -log(0.9) = 0.105 -> Loss is Low!!

We ignore the loss for 0 labels
The loss doesn’t depend on the probabilities for the incorrect classes!
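Both homework examples can be checked with a short sketch. The function name `cross_entropy` is my own, and I use the natural logarithm, which is what gives 2.303 for -log(0.1).

```python
from math import log

def cross_entropy(y_true, y_pred):
    """Cross-entropy for one sample with a one-hot label (natural log)."""
    # Terms with a 0 label contribute nothing, so they are skipped
    return -sum(t * log(p) for t, p in zip(y_true, y_pred) if t > 0)

label = [1, 0, 0, 0]

print(round(cross_entropy(label, [0.1, 0.4, 0.2, 0.3]), 3))     # 2.303
print(round(cross_entropy(label, [0.9, 0.01, 0.05, 0.04]), 3))  # 0.105
```

Changing the probabilities of the three wrong classes does not change the result, as long as the probability of the correct class stays the same.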

Kullback Leibler divergence(KL divergence)

Okay, we can stop here, go to sleep and yeah. Bye bye! If you’re still here good job if not, enjoy your day. Now, from this part the professor started to teach us loss functions that none of us heard before nor used before. Take a paper and a pen and start to write notes.

KL divergence measures how different two probability distributions P(x) and Q(x) are. If the two distributions are different, KL divergence gives a high value. If the two distributions are similar, KL divergence gives a low value. It is 0 when the two distributions are equal. So, minimizing it makes the two distributions similar to each other.

We want to approximate the probability distribution P with a normal distribution Q:

An example of the idea behind KL divergence

The equation for it is the difference between the cross-entropy loss and the entropy:

KL(y ‖ ŷ) = Σᵢ yᵢ ⋅ log(yᵢ / ŷᵢ) = Σᵢ yᵢ ⋅ log(yᵢ) - Σᵢ yᵢ ⋅ log(ŷᵢ)

It is never negative and only 0 when yᵢ = ŷᵢ since log(1) = 0

And like always another task!

We need to find the error here.

The answer for the task

KL divergence is not symmetric -> You can’t switch yᵢ and ŷᵢ in the equation

But why learn it if it’s not that useful? Well, the answer is simple. One major use of KL divergence is in Variational Autoencoders (more on that later in my blogs).

The professor was a little bit mad at you for not listening and gave you another question: “Is KL divergence the same as cross entropy for image classification?” The answer is yes, but why? Because in image classification we use one-hot encoding for our labels, so the entropy term yᵢ ⋅ log(yᵢ) always vanishes. When yᵢ is the correct label, it equals 1 -> log(1) = 0, and the term is cancelled. When yᵢ is not the correct label, it equals 0, and the term is also cancelled out.

The equation in Image Classification

Therefore, KL divergence = Cross Entropy in image classification tasks.
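A short sketch can confirm both claims: KL divergence is not symmetric, and with a one-hot label it equals the cross entropy. The function names and the distributions `p` and `q` are made up for illustration; terms where the first distribution is 0 are treated as 0.

```python
from math import log

def kl_divergence(p, q):
    """KL(P || Q) = sum of p_i * log(p_i / q_i); terms with p_i = 0 count as 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # made-up distribution
q = [0.4, 0.4, 0.2]  # made-up distribution

print(kl_divergence(p, q), kl_divergence(q, p))  # two different values
print(kl_divergence(p, p))                       # 0.0

one_hot = [1, 0, 0, 0]
pred = [0.9, 0.01, 0.05, 0.04]
# Both are -log(0.9), up to floating-point rounding
print(kl_divergence(one_hot, pred), cross_entropy(one_hot, pred))
```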

Contrastive Loss

This is a different type of loss, a type that we didn’t meet before. This loss is used to measure the distance or similarity between two inputs. For example, let’s take the inputs as images.

The first two images are very similar because they are from the same person. So we encourage their distance to be small. The loss will minimize the distance between these two images since they show the same person. The other 2 images are from different people. So, we encourage the distance to be large because we want the model to predict that these two images aren’t similar.

Contrastive Loss is a distance-based Loss Function (as opposed to prediction-error-based losses like cross entropy) used to learn discriminative features for images.
Like any distance-based loss, it tries to ensure that semantically similar examples are embedded close together. It is calculated on pairs.
This loss measures the similarity between two inputs.
Each sample is composed of two images (a positive pair or a negative pair). Our goal is to maximize the distance between negative pairs and minimize the distance between positive pairs.
We want a small distance between the positive pairs (because they are similar images/inputs), and a distance greater than some margin m for negative pairs.

Contrastive Loss Equation

d is the Euclidean distance and y is the label

Euclidean distance Equation

y = 1 when the two images are similar
y = 0 when the two images are dissimilar
O(x) represent the image features
During training, an image pair is fed into the model with their ground truth relationship y
margin is used for confidence. If two images in pair are dissimilar, then their distance should be at least margin, or a loss will be incurred.

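Without the margin, the loss would keep pushing dissimilar images apart forever; with the margin, a dissimilar pair that is already at least the margin apart adds no loss, so the model can focus on the pairs that are still too close.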
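To make the role of the margin concrete, here is a sketch of one common form of the contrastive loss, using the convention above (y = 1 for similar, y = 0 for dissimilar). This is an assumption on my side: the exact equation may carry extra constant factors such as ½, and the two-number "embeddings" are made up.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - z) ** 2 for x, z in zip(a, b)))

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """y = 1 for a similar pair, y = 0 for a dissimilar pair."""
    d = euclidean_distance(emb1, emb2)
    similar_term = y * d ** 2                               # pull similar pairs together
    dissimilar_term = (1 - y) * max(0.0, margin - d) ** 2   # push dissimilar pairs apart, up to the margin
    return similar_term + dissimilar_term

print(contrastive_loss([0, 0], [3, 4], y=1))    # 25.0 (similar pair that is far apart)
print(contrastive_loss([0, 0], [3, 4], y=0))    # 0.0  (dissimilar pair beyond the margin)
print(contrastive_loss([0, 0], [0.5, 0], y=0))  # 0.25 (dissimilar pair inside the margin)
```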

Hinge Loss

The Hinge Loss is associated usually with SVM(Support Vector Machine). To start with this loss, we need to understand the 0/1 Loss.

Consider y to be the actual label (-1 or 1) and ŷ to be the predictions. Let’s try to multiply the two together: y ⋅ ŷ

If the label is -1 and the prediction is -1: -1(-1) = +1 -> Positive. If we follow the graph, any positive will give us 0 loss.
If the label is +1 and the prediction is +1: +1(+1) = +1 -> Positive. If we follow the graph, any positive will give us 0 loss.
If the label is -1 and the prediction is +1: -1(+1) = -1 -> Negative. If we follow the graph, any negative will give us 1 loss.
If the label is +1 and the prediction is -1: +1(-1) = -1 -> Negative. If we follow the graph, any negative will give us 1 loss.

Rather than penalizing with 1, we make the penalization linear/proportional to the error. But what if we include a margin of 1? We can introduce confidence to the model! We keep optimizing until the margin is reached, rather than stopping at any prediction with the correct sign.

When signs match -> (-)(-) = (+)(+) = + -> Correct Classification and no loss

When signs don’t match -> (-)(+) = (+)(-) = - -> Wrong Classification and loss

A marginal loss, usually used for SVMs
Used when labels are [-1,1]
It penalizes not only wrong predictions, but correct predictions which are not confident enough
Often cheaper to compute than cross entropy, but accuracy may be lower

The Hinge Loss Equation

import numpy as np

def Hinge(yhat, y):
    return np.maximum(0, 1 - yhat * y)

Where y is the actual label (-1 or 1) and ŷ is the prediction

The loss is 0 when the signs of the label and the prediction match and the prediction is confident enough (y ⋅ ŷ ≥ 1).

And like always, it’s just for another task!

The actual label is -1
And we want to consider the prediction of: [0.3,-0.8,-1.1,-1,1]

Let’s start!

max[0,1-(-1⋅ 0.3)] = max[0, 1.3] = 1.3 -> Loss is High
max[0,1-(-1⋅ -0.8)] = max[0, 0.2] = 0.2-> Loss is Low
max[0,1-(-1⋅ -1.1)] = max[0, -0.1] = 0 -> No Loss!!!
max[0,1-(-1⋅ -1)] = max[0, 0] = 0-> No Loss!!!
max[0,1-(-1⋅ 1)] = max[0, 2] = 2 -> Loss is very High!!!
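The whole task can be reproduced in a few lines. This is a plain-Python sketch (no NumPy needed here); the function name `hinge` is my own, and the results are rounded to hide floating-point noise.

```python
def hinge(y_pred, y_true):
    # y_true is -1 or 1, y_pred is the raw prediction
    return max(0.0, 1.0 - y_pred * y_true)

label = -1
predictions = [0.3, -0.8, -1.1, -1, 1]

losses = [round(hinge(p, label), 2) for p in predictions]
print(losses)  # [1.3, 0.2, 0.0, 0.0, 2.0]
```

Notice that -0.8 has the correct sign but still gets a small loss, because it is not confident enough to clear the margin of 1.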

Triplet Ranking Loss

The Triplet Ranking Loss is very similar to the Hinge Loss, but this time we work with triplets rather than pairs. So each input consists of a triplet!

The sample above consists of a triplet (e.g. three images) rather than a pair. One image is the reference (anchor) image Iₐ, another is a positive image Iₚ, which is similar to (or from the same class as) the anchor image, and the last image is a negative image Iₙ, which is dissimilar to (or from a different class than) the anchor image. These three images are fed as a single sample to the network.

Triplet Ranking Loss Equation

The Objective is to Minimize the distance between the anchor and the positive image and maximize it between the anchor and the negative image.

Another example of the Triplet Ranking Loss

There are 3 situations:

First situation

max(0, negative value) =0 -> No Loss. Distance of the negative sample is far from the anchor. Perfect!

Second situation

max(0, m + positive value) = m + positive value -> Loss is greater than m. The negative sample is closer to the anchor than the positive. BAD!!

Third situation

Triplets where the negative is not closer to the anchor than the positive, but which still have a positive loss. In that case, the negative is inside the margin, and the loss is positive but smaller than m. Okay, but we encourage it to be better (further out, beyond the margin).

Training with Easy Triplets should be avoided, since their resulting loss will be 0. Therefore, how you choose the triplet images is crucial.
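The three situations can be played through with a tiny sketch. I assume the usual form max(0, m + d(anchor, positive) - d(anchor, negative)) and pass the two distances in directly; the function name, the distances, and the "hard" and "semi-hard" names in the comments (common names for the second and third situation) are not from the class.

```python
def triplet_loss(d_pos, d_neg, margin=1.0):
    """d_pos: anchor-positive distance, d_neg: anchor-negative distance."""
    return max(0.0, margin + d_pos - d_neg)

print(triplet_loss(d_pos=0.5, d_neg=2.0))  # 0.0 -> first situation (easy triplet)
print(triplet_loss(d_pos=2.0, d_neg=0.5))  # 2.5 -> second situation (hard triplet)
print(triplet_loss(d_pos=1.0, d_neg=1.5))  # 0.5 -> third situation (semi-hard triplet)
```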

How should we choose the triplets?

Online Triplet mining: Triplets are defined for every batch during the training. This results in better training efficiency and performance than offline mining (choosing the triplets before training).

Selecting Negative Examples (Hard Negatives)

Hard Negatives are the negative data points that are the worst within the mini-batch. They are the False Positives: points that are predicted as positive while they are actually negative. These false positives are called hard negatives, and the process of selecting them is called Hard Negative Mining.

Train for one or more epochs to find the hard negatives
Congratulations, you found the hard negative data! Now add these negatives to the training set and re-train the model.
Because of that, your network’s performance will be better and it will predict fewer such false positives.
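Finding the hard negatives after a round of training could be sketched like this. The function name `find_hard_negatives`, the 0.5 threshold, and the scores are all assumptions of mine for illustration; a real pipeline would take the scores from your trained model.

```python
def find_hard_negatives(scores, labels, threshold=0.5):
    """Return indices of false positives: predicted positive, actually negative."""
    return [i for i, (s, y) in enumerate(zip(scores, labels))
            if s >= threshold and y == 0]

scores = [0.9, 0.7, 0.2, 0.8, 0.1]  # made-up model scores
labels = [1, 0, 0, 0, 1]            # 1 = positive, 0 = negative

print(find_hard_negatives(scores, labels))  # [1, 3]
```

The samples at these indices are the ones you would add back to the training set before re-training.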

Wow! It was hard and long!! I’m proud of you for going on this journey with me, the journey of loss functions. At least now the professor will know that you listened in class and will even give you extra credits for solving his personal question!!!

I heard that the next class is going to be in a week or two so take a rest and relax. Visit your family, go to the park, meet new friends or do something else. Do things that make you happy since you learned a lot and you need some rest!! See you in a week or two!!

What now?

You can stick with me, as I’ll publish more and more blogs, guides and tutorials.
Until the next one, have a great day!
Tomer
