Loss Functions in Machine Learning

A short tutorial on common loss functions used in machine learning, including cross entropy loss, L1 loss, L2 loss and hinge loss, with practical details for PyTorch.

Benjamin Wang
The Startup
4 min read · Jan 13, 2021


Cross Entropy

Cross entropy loss is commonly used in classification tasks both in traditional ML and deep learning.

Image from this post: a neural network's raw outputs (logits) are passed through a softmax to produce class probabilities.

Note: logit here is used to refer to the unnormalized output of a NN, as in the Google ML glossary. Admittedly, however, this term is overloaded, as discussed in this post.

In this figure, the raw unnormalized output from a neural network is converted into probabilities by a softmax function.

Image from this post: the cross entropy loss formula, L = −Σᵢ yᵢ · ln(pᵢ), where yᵢ is the true (one-hot) label for class i and pᵢ is the predicted softmax probability.

Now suppose we have a training sample that is a dog. The target label is [1, 0, 0, 0], the NN's raw output is [3.2, 1.3, 0.2, 0.8], and the softmax probability output is [0.775, 0.116, 0.039, 0.07]. What would be the cross entropy loss?
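As a quick sanity check, the softmax step can be reproduced in PyTorch (a minimal sketch using the numbers above):

import torch

raw_output = torch.tensor([3.2, 1.3, 0.2, 0.8])  # raw NN output for [dog, cat, horse, cheetah]
probs = torch.softmax(raw_output, dim=0)
print(probs)  # ≈ [0.775, 0.116, 0.039, 0.070]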

Plugging in the values, only the dog term contributes (the other target entries are 0), so the loss is −(1 · ln 0.775) ≈ 0.255.

Note: the natural log is used, as is common practice.

After some weight updates, we have new raw outputs for the same sample, and the softmax probability becomes [0.9, 0.05, 0.03, 0.02]. The new loss value is −ln 0.9 ≈ 0.105.

This new loss is lower than the previous one, indicating that the NN is learning. Intuitively, we can also observe that the softmax probabilities are closer to the true distribution.

The perfect loss is 0, reached when the softmax output exactly matches the true distribution. However, that would typically mean extreme overfitting.

Another practical note: in PyTorch, if one uses nn.CrossEntropyLoss, the input must be the unnormalized raw values (aka logits), and the target must be class indices instead of one-hot encoded vectors.

See the PyTorch documentation on CrossEntropyLoss.

The same pen-and-paper calculation, done with nn.CrossEntropyLoss, would look roughly like this (a minimal sketch using the example values above):
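import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Input: raw, unnormalized logits with shape (batch_size, num_classes)
logits = torch.tensor([[3.2, 1.3, 0.2, 0.8]])
# Target: the class index (0 = dog), not a one-hot vector
target = torch.tensor([0])

loss = loss_fn(logits, target)
print(loss.item())  # ≈ 0.255, i.e. -ln(0.775)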

Note that the input is the raw logits, and the target is the class index, ranging from 0 to 3 in this case and representing dog, cat, horse and cheetah respectively.

Also, in this example we only considered a single training sample; in reality, we normally work with mini-batches, and by default PyTorch uses the average cross entropy loss over all samples in the batch, as sketched below.
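For example (a small sketch; the second sample's logits below are made up purely for illustration), the default reduction='mean' averages the per-sample losses, while reduction='sum' adds them up:

import torch
import torch.nn as nn

# A mini-batch of two samples; the second row is an invented example
logits = torch.tensor([[3.2, 1.3, 0.2, 0.8],
                       [1.2, 2.5, 0.1, 0.3]])
targets = torch.tensor([0, 1])

mean_loss = nn.CrossEntropyLoss()(logits, targets)               # default: reduction='mean'
sum_loss = nn.CrossEntropyLoss(reduction='sum')(logits, targets)
print(mean_loss.item(), sum_loss.item())  # the sum is twice the mean for a batch of two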

One might wonder: what is a good value for cross entropy loss? How do I know if my training loss is good or bad?

Some intuitive guidelines, from a MachineLearningMastery post, for a mean loss based on the natural log:

  • Cross-Entropy = 0.00: Perfect probabilities.
  • Cross-Entropy < 0.02: Great probabilities.
  • Cross-Entropy < 0.05: On the right track.
  • Cross-Entropy < 0.20: Fine.
  • Cross-Entropy > 0.30: Not great.
  • Cross-Entropy > 1.00: Terrible.
  • Cross-Entropy > 2.00: Something is broken.

Binary cross entropy is a special case where the number of classes is 2. In practice, it is often exposed through separate APIs. In PyTorch, there are nn.BCELoss and nn.BCEWithLogitsLoss: the former requires the input to be normalized sigmoid probabilities, whereas the latter takes raw unnormalized logits.
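A minimal sketch contrasting the two (the logit value 0.8 and the target are arbitrary, just for illustration):

import torch
import torch.nn as nn

logit = torch.tensor([0.8])    # raw, unnormalized output
target = torch.tensor([1.0])   # binary target, as a float

# nn.BCEWithLogitsLoss applies the sigmoid internally
loss_with_logits = nn.BCEWithLogitsLoss()(logit, target)

# nn.BCELoss expects probabilities, so apply the sigmoid first
loss_bce = nn.BCELoss()(torch.sigmoid(logit), target)

print(loss_with_logits.item(), loss_bce.item())  # both ≈ 0.371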

Root Mean Squared Error and others

MAE stands for Mean Absolute Error, MSE for Mean Squared Error, and RMSE for Root Mean Squared Error.

This post provides a concise overview of MAE, MSE and RMSE.

MAE is also known as L1 Loss, and MSE is also known as L2 Loss.
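In PyTorch these correspond to nn.L1Loss and nn.MSELoss; there is no dedicated RMSE loss, but it can be obtained by taking the square root of the MSE. A minimal sketch with made-up numbers:

import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0, 8.0])   # invented predictions
true = torch.tensor([3.0, -0.5, 2.0, 7.0])  # invented targets

mae = nn.L1Loss()(pred, true)   # L1 loss
mse = nn.MSELoss()(pred, true)  # L2 loss
rmse = torch.sqrt(mse)          # RMSE derived from MSE

print(mae.item(), mse.item(), rmse.item())  # 0.5, 0.375, ≈ 0.612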

Hinge loss

Hinge loss is commonly used for SVMs.

Image source: post (a plot of the hinge loss as a function of the prediction score).

This loss is used for max-margin classifiers, such as SVMs. Suppose the decision boundary is at the origin:

  • If an instance is classified correctly and with a sufficient margin (distance > 1), the loss is 0
  • If an instance is classified correctly but very close to the margin (0 < distance < 1), a small positive loss is incurred
  • If an instance is misclassified, a positive loss is used as a penalty that grows with the distance

From Wikipedia, the hinge loss of a prediction y is defined as ℓ(y) = max(0, 1 − t · y),

where t represents the true label (in SVMs, we use −1 and +1 to label the two classes) and y represents the prediction score (the raw output of the model, similar to a logit), which can be intuitively understood as how far the prediction is from the boundary.

If t and y have the same sign (t·y > 0), the classification is correct. However, as explained above, for 0 < t·y < 1 we still incur a small penalty because the prediction is too close to the boundary.

If t·y < 0, meaning the classification is wrong, a positive loss is incurred as a penalty that grows with the distance y.
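PyTorch does not expose this exact binary hinge loss as a single built-in (nn.HingeEmbeddingLoss and nn.MultiMarginLoss use different formulations), so here is a minimal hand-rolled sketch:

import torch

def hinge_loss(y, t):
    # y: raw prediction scores, t: true labels in {-1, +1}
    return torch.clamp(1 - t * y, min=0).mean()

t = torch.tensor([1.0, 1.0, 1.0])
y = torch.tensor([2.5, 0.3, -1.2])
# per-sample losses: 0.0 (correct, outside the margin), 0.7 (correct, inside the margin), 2.2 (misclassified)
print(hinge_loss(y, t).item())  # mean ≈ 0.967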

Notes

  1. The cross entropy part of this post draws inspiration from: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
  2. The hinge loss part uses https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1 as a reference
