Objective Functions in Deep Learning

Mustafa Qamaruddin
Apr 30, 2019 · 4 min read

In this report, I summarize the objective functions ( loss functions ) most commonly used in Machine Learning and Deep Learning. I have given priority to loss functions implemented in both Keras and PyTorch, since availability in both frameworks is a good reflection of popularity and wide adoption. For each loss function, I provide the formula, the pros, and the cons. Loss functions can be grouped into two categories based on the inference task: Regression and Classification. However, the split is not exclusive, since some loss functions can be used with slight or no modification for either task.

Regression Objective Functions:

  • Mean Absolute Error ( MAE )
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ| ( L1-norm )

It is a quite simple objective function, but its gradient has a constant magnitude and is undefined at zero error, which can make training unstable near the minimum.
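As a minimal NumPy sketch ( illustrative, not the Keras or PyTorch implementation ):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of absolute differences: the L1-norm averaged over samples
    return np.mean(np.abs(y_true - y_pred))
```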

  • Mean Square Error ( MSE )
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)² ( L2-Norm )

MSE penalizes errors by squaring them, which exaggerates large mistakes and forces the model to correct them aggressively. However, the same squaring makes MSE highly sensitive to outliers, and several modifications, such as the Huber loss below, address this.
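A minimal NumPy sketch of the same formula:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences; large errors dominate the average
    return np.mean((y_true - y_pred) ** 2)
```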

  • Huber
Huber Loss: L_δ(e) = ½e² if |e| ≤ δ, otherwise δ( |e| − ½δ ), where e = y − ŷ

This method is less sensitive to outliers since it squares the difference only when its absolute value is below a predefined threshold delta, and grows linearly beyond it, which caps the influence of extreme errors.
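A NumPy sketch of the piecewise definition ( delta is the threshold hyper-parameter ):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    # Quadratic branch for small errors, linear branch for large ones
    quadratic = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))
```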

  • Poisson
Poisson Loss: L = (1/n) Σᵢ ( ŷᵢ − yᵢ log ŷᵢ )
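The Poisson loss is typically used when the targets are counts ( e.g. event frequencies ). A NumPy sketch of the Keras-style formulation, with a small epsilon assumed for numerical stability:

```python
import numpy as np

def poisson_loss(y_true, y_pred, eps=1e-7):
    # mean(y_pred - y_true * log(y_pred)); eps guards against log(0)
    return np.mean(y_pred - y_true * np.log(y_pred + eps))
```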
  • Cosine
Cosine Loss: L = −( y · ŷ ) / ( ‖y‖ ‖ŷ‖ )

This is the loss counterpart of cosine similarity: it is minimized by maximizing the normalized dot product between the output vector and the ground truth vector. A value of −1 indicates maximum similarity ( parallel vectors ) and 0 indicates maximum dissimilarity ( orthogonal vectors ).
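A NumPy sketch of the negative cosine similarity between two vectors:

```python
import numpy as np

def cosine_loss(y_true, y_pred):
    # Normalize both vectors, then negate their dot product:
    # -1 for parallel vectors, 0 for orthogonal ones
    y_true = y_true / np.linalg.norm(y_true)
    y_pred = y_pred / np.linalg.norm(y_pred)
    return -np.dot(y_true, y_pred)
```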

Classification Objective Functions:

  • Hinge
Hinge Loss: L = (1/n) Σᵢ max( 0, 1 − yᵢ ŷᵢ )

Most commonly used for optimizing Support Vector Machine ( SVM ) models, but it suffers from the fact that its derivative is discontinuous at the margin boundary, which is why a squared-hinge variant was introduced to make the derivative continuous.
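A NumPy sketch, assuming labels encoded as −1 / +1:

```python
import numpy as np

def hinge(y_true, y_pred):
    # Zero loss once the margin y_true * y_pred reaches 1
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))
```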

  • Binary Cross Entropy
Log Loss: L = −(1/n) Σᵢ [ yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ) ]

The special case of cross-entropy where the target is a binary classification label. A large cross-entropy indicates divergence between the two distributions and a small cross-entropy indicates similarity. It converges much faster than MSE on classification tasks, and its derivative has favorable properties, being easy to compute and well behaved with non-linear activations such as the sigmoid.
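A NumPy sketch; predictions are clipped away from 0 and 1 to keep the logarithms finite, as most framework implementations do:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # Clip probabilities to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```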

  • Categorical Cross Entropy
Log Loss ( multi-class ): L = −(1/n) Σᵢ Σ_c y_{i,c} log ŷ_{i,c}

The extension of cross-entropy to the multi-class classification problem, where the target is a one-hot vector over mutually exclusive classes.
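A NumPy sketch over one-hot targets and softmax-style probability rows:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true: one-hot rows; y_pred: rows of predicted class probabilities
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))
```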

  • Kullback-Leibler
KL Loss: D_KL( p ‖ q ) = Σᵢ pᵢ log( pᵢ / qᵢ )

It minimizes the distance between two probability distributions, and it is used, among other places, in Generative Adversarial Networks ( GANs ). It converges by bringing the probability distribution of the predicted output very close to that of the ground truth. It is based on concepts from Shannon's information theory and cross-entropy.
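A NumPy sketch of the divergence between two probability vectors ( note that it is not symmetric in p and q ):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-7):
    # D_KL(p || q) = sum p * log(p / q); clip to avoid division by zero
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))
```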

  • Negative Logarithmic Likelihood
NLL Loss: L = −(1/n) Σᵢ log p( yᵢ | xᵢ )

It is very similar to log loss ( cross-entropy ) and is equivalent to maximizing the likelihood that a given sample is generated from the target class distribution. In Bayesian terms, it corresponds to Maximum A Posteriori ( MAP ) estimation under a uniform prior.
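A NumPy sketch in the style of PyTorch's NLLLoss, which expects log-probabilities ( e.g. log-softmax outputs ) and integer class indices:

```python
import numpy as np

def nll(log_probs, targets):
    # Pick each sample's log-probability of its target class, negate, average
    return -np.mean(log_probs[np.arange(len(targets)), targets])
```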

  • Cauchy-Schwarz
Cauchy-Schwarz Loss: D_CS( p, q ) = −log( ( p · q ) / ( ‖p‖ ‖q‖ ) )

According to On Loss Functions for Deep Neural Networks in Classification ( Katarzyna Janocha et al., 2017 ), the Cauchy-Schwarz divergence, when used as an objective function on both MNIST and CIFAR-10, performs better than log loss.
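A NumPy sketch of one common form of the Cauchy-Schwarz divergence, the negative log of the cosine similarity between the two distributions ( the paper's exact formulation may differ in details ):

```python
import numpy as np

def cauchy_schwarz_divergence(p, q, eps=1e-7):
    # -log of the normalized inner product; 0 when p and q coincide
    num = np.dot(p, q)
    den = np.linalg.norm(p) * np.linalg.norm(q)
    return -np.log(num / den + eps)
```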

Conclusion

The choice of the objective function affects both the speed of learning and the final inference accuracy. Each learning task, whether regression, classification, super-resolution, style transfer, or generation, calls for an objective function adapted to it. In practice, one should benchmark the results obtained from different objective functions and fine-tune the hyper-parameters that best suit the task at hand.

According to Support Vector Machines by Ingo Steinwart and Andreas Christmann, loss functions can be categorized as margin-based or distance-based. The book dives deeper into the convexity and continuity properties of each objective function and how they affect the statistical learning process.


Check out my course on PacktPub:
Machine Learning for Algorithmic Trading Bots with Python
https://www.packtpub.com/application-development/machine-learning-algorithmic-trading-bots-python-video

Support Vector Machines, Steinwart and Christmann:
https://www.springer.com/gp/book/9780387772417

On Loss Functions for Deep Neural Networks in Classification, Janocha et al., 2017:
https://arxiv.org/abs/1702.05659
