Overview of loss functions for Machine Learning

Elizabeth Van Campen · Analytics Vidhya · Feb 17, 2021

Machine learning is, at its core, an optimization problem. With any optimization problem, we need ways to calculate how far our predictions are from the truth to determine in what direction we need our model to change.

We do this by minimizing the loss and the cost function of a model.

Loss functions compare a prediction with the actual value or label of the data and output an error metric. This error determines how the model's weights should shift. The cost function is the average loss across the entire sample of data and predictions.
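To make that concrete, here is a tiny sketch of the idea (the one-weight model, the learning rate, and the numbers are mine, purely for illustration): the loss tells us how far off one prediction is, and its gradient tells us which way the weight should shift.

```python
# One data point and a one-weight model: prediction = w * x
x, y_true = 2.0, 6.0
w = 1.0          # starting weight

for step in range(3):
    y_pred = w * x
    loss = (y_true - y_pred) ** 2        # per-sample squared-error loss
    grad = -2 * (y_true - y_pred) * x    # d(loss)/dw: the direction the weight should shift
    w -= 0.05 * grad                     # move against the gradient
    print(step, round(loss, 3), round(w, 3))
# The loss shrinks each step; averaging this loss over a whole dataset gives the cost.
```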

Classification problems and regression problems use different types of loss functions because of the nature of their outputs. In classification, the model must label the input with the correct class; in regression, the model predicts numbers. As these two examples show, no single loss function works for every type of problem.

Regression Loss

MSE (Mean Squared Error/ Squared Error Loss)

Also known as L2(Ridge) Loss in Machine Learning

As its name suggests, it is the mean of the squared distances between the actual values and the predictions. Linear regression commonly uses MSE.
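As a minimal NumPy sketch (the function name and numbers are mine, not a library call):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.833...
```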

If data is prone to outliers, the MSE may not be a suitable choice since squaring a large outlier will make the loss very large. Disproportionately large numbers affect the gradient descent’s stability.

MSE is also not well suited to logistic regression, since pairing it with a sigmoid output makes the loss surface non-convex.

MAE (Mean Absolute Error)

Also known as the L1 (Lasso) loss

It is the mean of the absolute differences between the predicted and the actual values.

It is also more robust to outliers than MSE, but the absolute value is harder to work with in other mathematical expressions (it complicates the derivative, which is undefined at zero).
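A sketch of MAE in the same style, with a made-up example showing how much less a single outlier inflates it than it would inflate MSE:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences between labels and predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# One large outlier: MAE grows linearly with it, MSE would grow with its square
print(mae([1.0, 2.0, 100.0], [1.0, 2.0, 3.0]))  # 32.33...
```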

[Figure: squared error loss versus absolute error loss]

As this graph illustrates, the MSE gradient is smaller than the MAE gradient once the error drops below 1, and it keeps shrinking as the error approaches zero, while the MAE gradient stays constant.

Huber Loss/ smooth mean absolute error

Huber loss combines MSE and MAE so that the function is quadratic for small errors and linear for large errors, with the transition point set by delta.

It approaches the absolute error as delta approaches zero and the mean squared error as delta approaches infinity.

Although Huber loss combines the best aspects of MSE and MAE, delta becomes another hyperparameter you need to tune.

It is used in robust regression, M-estimation, and additive modeling, as well as in classification.
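A sketch of the usual piecewise definition, with delta as the tunable threshold between the quadratic and linear regions (the function name and numbers are illustrative):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear (and outlier-resistant) beyond it."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    squared = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(np.abs(error) <= delta, squared, linear))

print(huber([1.0, 2.0, 10.0], [1.2, 1.9, 3.0], delta=1.0))  # ~2.175
```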

Log Cosh Loss

Log-cosh loss behaves similarly to the mean squared error while being resistant to outliers. Unlike Huber loss, it is also continuously twice differentiable, which some ML methods require in their calculations.

It has trouble with incorrect predictions that are large in value.
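A minimal sketch: log(cosh(x)) behaves like x²/2 for small errors and like |x| − log 2 for large ones, which is where its MSE-like but outlier-resistant behavior comes from.

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Mean of log(cosh(prediction - label))."""
    error = np.asarray(y_pred) - np.asarray(y_true)
    return np.mean(np.log(np.cosh(error)))

print(log_cosh([1.0, 2.0, 3.0], [1.5, 1.5, 5.0]))  # ~0.52
```

(For very large errors np.cosh overflows; numerically stable versions rewrite the term as |error| + log1p(exp(-2|error|)) − log 2.)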

Quantile Loss

Quantile loss uses the spread of predictions to produce a prediction interval rather than a single point. It does not require constant variance or a normal distribution to give an informative prediction interval.

It estimates a chosen quantile of the dependent variable as a function of the independent variables. At the 50th percentile it reduces to the MAE (up to a constant factor).

Depending on which percentile you choose as the quantile, it penalizes overestimation or underestimation more heavily.
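One common formulation, as a sketch (names and numbers are mine): under-prediction is weighted by q and over-prediction by 1 − q, so a q above 0.5 punishes underestimation more.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    """Pinball loss: weights under-prediction by q and over-prediction by (1 - q)."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * error, (q - 1) * error))

# Same absolute error, very different penalties at q = 0.9
print(quantile_loss([10.0], [8.0], q=0.9))   # 1.8  (under-predicted)
print(quantile_loss([10.0], [12.0], q=0.9))  # 0.2  (over-predicted)
```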

Binary Classification Loss

We assign an object to one of two classes. One example of Binary classification is labeling an image as one that contains a cat or not.

Binary Cross Entropy Loss (Log Loss)

Since we map the output to the probability space, we can measure the loss by what probability the model says the input is in a class.

Cross-entropy is calculated using the probabilities of the events from the true distribution P and the predicted distribution Q: H(P, Q) = −Σ P(x) log Q(x).

A large cross-entropy means there is more uncertainty in the model, so we wish to minimize this function, since entropy measures how uncertain we are about a given distribution. But we do not know the actual distribution of what we are predicting; the Kullback-Leibler divergence instead calculates the loss relative to an assumed distribution.

Binary cross-entropy loss is often grouped together with log loss. At its core, it also uses the negative log of the Bernoulli distribution, turning the likelihood-maximization problem into a minimization. It is computationally equivalent to minimizing the negative log-likelihood, although this does not mean that log loss and cross-entropy loss are exactly the same in every setting.

The activation used with this loss is often the sigmoid function, since its output lies in [0, 1], which is easy to split into a binary classification: above 0.5 is one class and below is the other.
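A sketch of the formula applied to sigmoid outputs (the function name is mine; the clipping constant is only there to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Negative log-likelihood of the 0/1 labels under the predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105  (confident and correct)
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303  (confident and wrong)
```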

Hinge Loss

Hinge loss is used for support vector machines and classifies with labels -1 and 1 rather than 0 and 1. As a result, the last activation layer of the network is typically a hyperbolic tangent (tanh).

This loss function acts like binary cross-entropy, except that it also penalizes predictions that are labeled correctly but are not confident (those that fall inside the margin), though only up to a certain point.

It does this by calculating the distance between the actual value of the data and the predicted linear boundary.

[Image from Ref. 5]

As the graph shows, the hinge error is zero once a prediction is correct with enough confidence, and it increases linearly otherwise.

[Image from Ref. 5]

Intuitively, it uses differences in the sign of the classification to calculate the loss. The error is larger if the predicted value has the opposite sign of the actual value.
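A sketch of that intuition with -1/+1 labels (the example scores are made up): correct predictions beyond the margin cost nothing, correct-but-unconfident ones cost a little, and wrong-signed ones cost the most.

```python
import numpy as np

def hinge_loss(y_true, y_pred):
    """y_true in {-1, +1}; zero loss only for correct predictions beyond the margin."""
    y = np.asarray(y_true, dtype=float)
    return np.mean(np.maximum(0.0, 1.0 - y * np.asarray(y_pred, dtype=float)))

print(hinge_loss([1], [2.0]))    # 0.0  correct and confident
print(hinge_loss([1], [0.3]))    # 0.7  correct but inside the margin
print(hinge_loss([1], [-1.0]))   # 2.0  wrong sign
```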

There are also extensions of this loss, the squared hinge loss and the cubed hinge loss. These smooth the error function and make the derivatives easier to compute numerically, although they may not perform better.

Despite its features, this loss does not reliably perform better than cross-entropy.

Multi-class Classification Loss

Categorical/Multi-Class Cross Entropy Loss

Categorical cross entropy uses one-hot vectors to generalize from binary cross entropy loss.

Each sample's target assigns probability 1 to its own class and 0 to every other class label.

When there are thousands or hundreds of thousands of labels and relatively little data, this becomes a sparse multi-class cross-entropy problem. In that case the target variable is sometimes left as an integer index instead of being one-hot encoded, to save memory.

It is used with the Softmax activation, which means the predicted class probabilities are not independent, since the outputs of Softmax must sum to one.
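A sketch with one-hot targets and softmax-style rows that sum to one (the labels and probabilities are made up):

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    """Cross-entropy between one-hot targets and predicted class probabilities."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return np.mean(-np.sum(np.asarray(y_true_onehot) * np.log(p), axis=1))

y_true = np.array([[0, 1, 0], [1, 0, 0]])               # one-hot labels for two samples
p_pred = np.array([[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]])   # softmax outputs (rows sum to 1)
print(categorical_cross_entropy(y_true, p_pred))        # ~0.367
```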

KL- Divergence

The KL Divergence measures how different a probability distribution is from another distribution. Lower divergence means the distribution is closer to the actual distribution.

It is used more for approximating complex distributions than for plain classification, for example in auto-encoders that must learn dense feature representations.

In comparison to cross-entropy loss, it measures the relative difference between two probability distributions rather than the total entropy between them, which is why KL divergence is also sometimes called the relative entropy.

It is also used in deep generative models like variational auto-encoders (VAEs).
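A sketch over two discrete distributions (P and Q here are made-up toy distributions):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum p * log(p / q); zero only when the two distributions match."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model's distribution
print(kl_divergence(p, q))      # ~0.085
```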

References:

  1. (Regression Functions) https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
  2. (Cross Entropy Loss/KL) https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
  3. https://liyanxu.blog/2018/10/28/review-of-cross-entropy/
  4. https://machinelearningmastery.com/cross-entropy-for-machine-learning/
  5. (Hinge Loss) https://math.stackexchange.com/questions/782586/how-do-you-minimize-hinge-loss
