Objective Functions in Deep Learning

Mustafa Qamaruddin · Published in Sci-Net · Apr 30, 2019

In this report, I shall summarize the objective functions ( loss functions ) most commonly used in Machine Learning & Deep Learning. I have given priority to loss functions implemented in both Keras and PyTorch, since availability in both frameworks is a reasonable proxy for popularity and wide adoption. For each loss function, I shall provide the formula, the pros, and the cons. Loss functions can be grouped into two categories based on the inference task: regression and classification. The grouping is not exclusive, however, since some loss functions can be used for either task with little or no modification.

Regression Objective Functions:

  • Mean Absolute Error ( MAE )
L1-norm: MAE = (1/n) · Σ_i | y_i − ŷ_i |

It is a quite simple objective function that is fairly robust to outliers, but its gradient has constant magnitude and is undefined at zero error, which can make convergence near the minimum less stable.
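
As a concrete reference, here is a minimal NumPy sketch of MAE; the arrays y_true and y_pred are illustrative placeholders. Keras ships the same loss as MeanAbsoluteError and PyTorch as torch.nn.L1Loss.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - yhat_i| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# A single large outlier contributes only linearly to the loss.
print(mae([1.0, 2.0, 3.0], [1.1, 1.9, 8.0]))  # ~1.73
```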

  • Mean Square Error ( MSE )
L2-Norm: MSE = (1/n) · Σ_i ( y_i − ŷ_i )²

MSE penalizes errors by squaring them, which amplifies large deviations and pushes the model to correct its biggest mistakes first. However, the same squaring makes MSE highly sensitive to outliers, and several modifications ( such as the Huber loss below ) have been proposed to address this.
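
Reusing the same illustrative arrays as in the MAE sketch makes the outlier sensitivity visible: the single large error now dominates the average.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - yhat_i)^2 over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# The same outlier as in the MAE example is squared, so it dominates the loss.
print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 8.0]))  # ~8.34
```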

  • Huber
Huber Loss: L_δ(r) = ½ · r² if |r| ≤ δ, otherwise δ · ( |r| − ½ · δ ), with r = y − ŷ

This loss is less sensitive to outliers because it squares the difference only when its absolute value is below a predefined threshold delta; beyond delta it grows linearly, like MAE.
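
Below is a sketch of the piecewise definition. The threshold delta is a hyper-parameter; Keras defaults to delta = 1.0 in its Huber loss, and recent PyTorch versions expose torch.nn.HuberLoss.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond the threshold delta."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return np.mean(np.where(residual <= delta, quadratic, linear))

# The outlier from the MSE example now contributes linearly instead of quadratically.
print(huber([1.0, 2.0, 3.0], [1.1, 1.9, 8.0], delta=1.0))  # ~1.50
```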

  • Poisson
Poisson Loss: L = (1/n) · Σ_i ( ŷ_i − y_i · log ŷ_i )

It is derived from the Poisson likelihood and is suited to regression targets that represent counts.

  • Cosine
Cosine Loss: L = − ( y · ŷ ) / ( ‖y‖ · ‖ŷ‖ )

This is simply the negative of the cosine similarity: the loss is minimized by maximizing the normalized dot product between the output vector and the ground truth vector. A value of −1 indicates maximum similarity ( parallel vectors ), 0 indicates orthogonal vectors, and +1 indicates maximum dissimilarity ( vectors pointing in opposite directions ).
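
A minimal sketch of the negative cosine similarity follows; the vectors are illustrative and would normally be a prediction and its ground truth.

```python
import numpy as np

def cosine_loss(y_true, y_pred, eps=1e-12):
    """Negative cosine similarity: -1 for parallel vectors, 0 for orthogonal ones."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    similarity = np.dot(y_true, y_pred) / (np.linalg.norm(y_true) * np.linalg.norm(y_pred) + eps)
    return -similarity

print(cosine_loss([1.0, 0.0], [2.0, 0.0]))  # -1.0 (parallel, maximum similarity)
print(cosine_loss([1.0, 0.0], [0.0, 3.0]))  #  0.0 (orthogonal)
```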

Classification Objective Functions:

  • Hinge
Hinge Loss: L_i = Σ_{j ≠ y_i} max( 0, s_j − s_{y_i} + 1 )

Most commonly used for optimizing Support Vector Machine ( SVM ) models, but it suffers from the fact that its derivative is discontinuous at the hinge point, where the margin term s_j − s_{y_i} + 1 crosses zero. That is why the squared hinge variant was introduced, which squares each margin term in order to obtain a continuous derivative.
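
A sketch of the multi-class hinge loss for a single sample, where scores are raw class scores and target is the index of the correct class; the names and numbers are illustrative.

```python
import numpy as np

def multiclass_hinge(scores, target, margin=1.0):
    """Sum over wrong classes j of max(0, s_j - s_target + margin)."""
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[target] + margin)
    margins[target] = 0.0  # the correct class does not contribute
    return margins.sum()

# The correct class (index 0) barely beats class 1, so only that pair contributes.
print(multiclass_hinge([3.0, 2.5, -1.0], target=0))  # 0.5
```

Replacing max(0, ·) with its square in the sum gives the squared-hinge variant mentioned above.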

  • Binary Cross Entropy
Log Loss: L = − (1/n) · Σ_i [ y_i · log ŷ_i + ( 1 − y_i ) · log( 1 − ŷ_i ) ]

This is the special case of log loss where the target is a binary classification label. It's based on the cross-entropy between two probability distributions: a large cross entropy indicates divergence, while a small cross entropy indicates similarity. When paired with sigmoid outputs it converges much faster than MSE, and its derivative has favorable properties, being easy to compute and well suited to non-linear activations.
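
A minimal sketch of binary cross entropy computed on predicted probabilities ( i.e., after a sigmoid ); clipping avoids log(0). In practice the framework implementations, such as Keras binary_crossentropy or PyTorch BCEWithLogitsLoss, should be preferred for numerical stability.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1.0 - y_true) * np.log(1.0 - p_pred))

# Confident correct predictions give a small loss; confident wrong ones a large loss.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.14
```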

  • Categorical Cross Entropy
Log Loss: L = − (1/n) · Σ_i Σ_c y_{i,c} · log ŷ_{i,c}

The extension of cross-entropy to the multi-class classification problem, where the target is a one-hot vector over the set of classes.
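
A sketch with one-hot targets and softmax outputs for two illustrative samples:

```python
import numpy as np

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average over samples of -sum_c y_c * log(p_c), with one-hot rows in y_true."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))

y_true = [[1, 0, 0], [0, 0, 1]]              # one-hot labels for two samples
p_pred = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # softmax outputs
print(categorical_cross_entropy(y_true, p_pred))  # ~0.36
```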

  • Kullback-Leibler
KL Loss: D_KL( P ‖ Q ) = Σ_x P(x) · log( P(x) / Q(x) )

It minimizes the divergence between two probability distributions, and it's mostly used in Generative Adversarial Networks ( GANs ). It converges by making the probability distribution of the predicted output and that of the ground truth very close to each other. It's based on concepts from Shannon's information theory and cross entropy.
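
A sketch of the divergence between two discrete distributions; note that it is not symmetric, so D_KL( P ‖ Q ) generally differs from D_KL( Q ‖ P ). PyTorch's KLDivLoss expects the predicted distribution in log-space. The distributions below are illustrative.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * np.log(np.clip(p, eps, 1.0) / q))

p = [0.5, 0.3, 0.2]  # ground-truth distribution
q = [0.4, 0.4, 0.2]  # predicted distribution
print(kl_divergence(p, q))  # ~0.025; kl_divergence(q, p) gives a different value
```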

  • Negative Logarithmic Likelihood
NLL Loss: L = − (1/n) · Σ_i log ŷ_{i, y_i}, the negative log-probability assigned to the true class

It's very similar to log loss ( cross entropy ): minimizing it is equivalent to maximizing the probability that a given sample was generated from the target class distribution. With a uniform prior over the parameters, it also coincides with Maximum A Posteriori ( MAP ) estimation in Bayesian inference.
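
A short PyTorch sketch of this relationship: applying NLL to log-softmax outputs reproduces the cross entropy computed directly from the raw scores. The logits and targets are illustrative.

```python
import torch
import torch.nn.functional as F

# Raw scores (logits) for 2 samples and 3 classes, plus the target class indices.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
targets = torch.tensor([0, 1])

log_probs = F.log_softmax(logits, dim=1)
nll = F.nll_loss(log_probs, targets)   # NLL on log-probabilities
ce = F.cross_entropy(logits, targets)  # cross entropy applied directly to the logits

print(nll.item(), ce.item())  # identical: cross entropy = log_softmax followed by NLL
```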

  • Cauchy-Schwarz
Cauchy-Schwarz Loss: D_CS( p, q ) = − log( ( Σ_k p_k · q_k ) / ( ‖p‖ · ‖q‖ ) )

According to On Loss Functions for Deep Neural Networks in Classification ( Katarzyna Janocha et al., 2017 ), the Cauchy-Schwarz divergence, when used as an objective function on both MNIST and CIFAR-10, performs better than log loss.
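
As a rough sketch of the divergence itself ( written here for discrete distributions, following the usual definition rather than the paper's exact implementation ), with illustrative inputs:

```python
import numpy as np

def cauchy_schwarz_divergence(p, q, eps=1e-12):
    """D_CS(p, q) = -log( <p, q> / (||p|| * ||q||) ); zero when p and q coincide."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + eps)
    return -np.log(cosine + eps)

p = [0.5, 0.3, 0.2]  # ground-truth distribution
q = [0.4, 0.4, 0.2]  # predicted distribution
print(cauchy_schwarz_divergence(p, q))  # ~0.03, and exactly 0.0 when q == p
```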

Conclusion

The choice of the objective function has an impact on the speed of learning and on the overall inference accuracy. Each learning task, whether regression, classification, super-resolution, style transfer, or generation, calls for an objective function adapted to it. In practice, one should benchmark the results obtained with different objective functions and fine-tune the hyper-parameters to best suit the task at hand.

According to Support Vector Machines by Ingo Steinwart and Andreas Christmann, loss functions can be categorized as margin-based or distance-based losses. The book dives deeper into the convexity, concavity, and continuity properties of each objective function and how they affect the statistical learning process.

Check out my course on PacktPub:
Machine Learning for Algorithmic Trading Bots with Python
https://www.packtpub.com/application-development/machine-learning-algorithmic-trading-bots-python-video

Support Vector Machines, Steinwart and Christmann:
https://www.springer.com/gp/book/9780387772417

On Loss Functions for Deep Neural Networks in Classification, Janocha et al., 2017:
https://arxiv.org/abs/1702.05659
