Loss Function | The Secret Ingredient to Building High-Performance AI Models

Neha Purohit
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
8 min read · Sep 27, 2023

Introduction

Both loss functions and optimizers are pivotal components that collaborate to enable effective model training.

As seen last week, optimizers are specialized algorithms used to reduce the chosen loss function’s value during model training. They improve the efficiency and convergence of optimization by incorporating elements such as adaptive learning rates and momentum.


A loss function takes the following two parameters:

Predicted output (y’)

Target value (y)

This determines the performance of the model: if the deviation between y’ and y is large, the loss will be large; if the deviation is small or y’ and y are identical, the loss will be small or negligible.

This is how a loss function penalizes a model during training on the provided dataset. The appropriate loss function changes based on the problem statement that the algorithm is trying to solve.

Loss functions are essential tools in machine learning and optimization. They assess the error between a model’s predictions and actual target values, indicating how well the model aligns with desired outcomes. “Loss” signifies the penalty incurred when the model fails to meet expectations. These functions guide model training, enabling parameter adjustments to minimize errors and improve predictive accuracy.
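To make this concrete, here is a minimal sketch of the idea, using NumPy (my choice; the article does not prescribe a library) and a toy loss that simply averages the deviation between predictions and targets:

```python
import numpy as np

def toy_loss(y_pred, y_true):
    """A toy loss: the average absolute deviation between predictions and targets."""
    return np.mean(np.abs(y_pred - y_true))

y = np.array([3.0, 5.0, 7.0])  # target values

print(toy_loss(np.array([3.0, 5.0, 7.0]), y))   # identical -> 0.0
print(toy_loss(np.array([3.1, 4.9, 7.2]), y))   # small deviation -> ~0.13
print(toy_loss(np.array([10.0, 0.0, 1.0]), y))  # large deviation -> 6.0
```

The loss functions below differ mainly in how they measure and penalize that deviation.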

History

The history of loss functions in machine learning can be traced back to the 18th century, when Pierre-Simon Laplace introduced the concept. However, loss functions did not become widely used in machine learning until the 20th century.

In the 1980s and 1990s, new loss functions were developed for specific machine learning tasks, such as classification and regression. In recent years, the development of new loss functions has accelerated, due in part to the rise of deep learning, to help train these models more effectively.

Categories

Loss functions can be broadly categorised into three types:

1) Regression Losses: These are used when the model’s goal is to predict continuous values, like estimating a person’s age. Common metrics used as regression losses include:

Mean Absolute Error Loss (MAE)

It calculates the mean of the absolute discrepancies between predicted and actual values, and is typically applied in regression scenarios where errors should be penalized in proportion to their absolute size. The formula for MAE is:

MAE = (1/n) · Σᵢ |xᵢ − yᵢ|

Here, xᵢ represents the actual value, yᵢ the predicted value, and n the number of samples.

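As a quick sanity check of the formula, here is a minimal NumPy sketch (the function name is mine, not from the article):

```python
import numpy as np

def mean_absolute_error(y_actual, y_pred):
    """MAE = (1/n) * sum(|x_i - y_i|): the average of the absolute errors."""
    return np.mean(np.abs(y_actual - y_pred))

y_actual = np.array([2.0, 4.0, 6.0, 8.0])
y_pred   = np.array([2.5, 3.5, 7.0, 8.0])

# Absolute errors: 0.5, 0.5, 1.0, 0.0 -> mean = 0.5
print(mean_absolute_error(y_actual, y_pred))  # 0.5
```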

Mean Squared Error Loss (MSE)

It calculates the mean of the squared discrepancies between predicted and actual values, and is frequently used in regression, particularly when you want large errors to be penalized heavily. The formula for MSE is:

MSE = (1/n) · Σᵢ (xᵢ − yᵢ)²

Here, xᵢ represents the actual value, yᵢ the predicted value, and n the number of samples.

This can be pictured with a scatter plot and a fitted line (source: freecodecamp.org):

The purple dots are the data points, each with an x-coordinate and a y-coordinate.

The blue line is the prediction line, the line that best fits those points. It contains the predicted points.

The red segments between each purple dot and the prediction line are the errors. Each error is the distance from a point to its predicted point.
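Running the same values as in the MAE sketch through a minimal MSE version shows how squaring makes the largest error dominate (again an illustrative NumPy sketch, not code from the article):

```python
import numpy as np

def mean_squared_error(y_actual, y_pred):
    """MSE = (1/n) * sum((x_i - y_i)^2): the average of the squared errors."""
    return np.mean((y_actual - y_pred) ** 2)

y_actual = np.array([2.0, 4.0, 6.0, 8.0])
y_pred   = np.array([2.5, 3.5, 7.0, 8.0])

# Errors 0.5, 0.5, 1.0, 0.0 -> squared 0.25, 0.25, 1.0, 0.0 -> mean = 0.375
print(mean_squared_error(y_actual, y_pred))  # 0.375
```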

2) Classification Losses: These come into play when the model is making predictions for discrete values, such as determining whether an email is spam or not.

Negative Log-Likelihood Loss

Negative Log-Likelihood (NLL) Loss is closely related to the concept of maximum likelihood estimation (MLE) and is often used to optimize models that predict probabilities for different classes. Computing it involves predicting a probability for each class, identifying the true class label, taking the negative log of the probability assigned to that class, and summing over the dataset. The formula for NLL is:

NLL = −Σᵢ Σⱼ yᵢⱼ · log(pᵢⱼ)

NLL is the negative log-likelihood.

n is the number of data points in the dataset, indexed by i.

C is the number of classes in the classification problem, indexed by j.

yᵢⱼ is a binary indicator (0 or 1) that equals 1 if class j is the correct class for data point i, and 0 otherwise.

pᵢⱼ is the predicted probability that data point i belongs to class j.

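Under the definitions above, a minimal NumPy sketch of NLL with one-hot targets could look like this (illustrative only; deep learning frameworks ship their own, numerically safer implementations):

```python
import numpy as np

def negative_log_likelihood(y_onehot, probs, eps=1e-12):
    """NLL = -sum_i sum_j y_ij * log(p_ij); only each row's true class contributes."""
    return -np.sum(y_onehot * np.log(probs + eps))

# Three data points, three classes; each row of probs sums to 1.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

# -(log 0.7 + log 0.8 + log 0.4) ≈ 1.50
print(negative_log_likelihood(y_onehot, probs))
```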

Cross-Entropy Loss

Unlike other loss functions, such as squared loss, Cross-Entropy strongly penalizes predictions that are confident but incorrect. It also penalizes correct predictions made with low confidence, though less severely. Binary Cross-Entropy (BCE) is a common variant used in binary classification models with just two classes. Cross-Entropy is widely used for probabilistic modeling and, with appropriate weighting, on imbalanced data.


Binary cross-entropy loss:

Binary cross-entropy, also referred to as logarithmic loss or log loss, serves as a model metric that monitors the model’s misclassification of data class labels. It imposes penalties on the model when deviations in probability lead to incorrect labeling decisions.

Historically, this loss function has a foundation in logistic regression, which emerged in the mid-20th century. It was introduced as a means to measure the disagreement between predicted probabilities and actual binary outcomes. The concept of log loss is rooted in information theory and entropy, particularly Claude Shannon’s work. Over time, it transcended logistic regression and became a fundamental loss function in various machine learning applications, including neural networks and deep learning, for binary classification tasks. Its historical roots in statistical modeling, likelihood estimation, and information theory have contributed to its widespread use in modern machine learning.

Formula:

BCE = −(1/N) · Σₙ wₙ · [ yₙ · log(xₙ) + (1 − yₙ) · log(1 − xₙ) ]

Where x is the input (the predicted probability), y is the binary target, w is an optional weight, and N spans the mini-batch dimension.
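A minimal NumPy sketch of binary cross-entropy over a mini-batch, following the formula above with the optional weights left at 1:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """BCE = -(1/N) * sum(y*log(x) + (1-y)*log(1-x)) over the mini-batch."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.4])  # predicted probability of the positive class

print(binary_cross_entropy(y_true, y_prob))  # ≈ 0.31
```

Confident, correct predictions (0.9 for class 1, 0.1 for class 0) contribute little to the loss; the less confident ones (0.6 and 0.4) contribute most of it.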

Categorical Cross-Entropy loss:

Categorical Cross-Entropy, also known as Softmax Loss, is a loss function commonly used for multiclass classification in neural networks. It combines softmax activation with Cross-Entropy loss to generate predicted probability distributions across multiple classes for each input. In multiclass classification, where one-hot encoding is typically employed, only the true class contributes to the loss calculation, ensuring that the loss focuses on the correct class prediction.

Formula:

CCE = −Σᵢ yᵢ · log(ŷᵢ)

Here the sum runs over the C classes; yᵢ is 1 for the true class and 0 otherwise (one-hot encoding), and ŷᵢ is the softmax probability predicted for class i.
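A minimal sketch for a single example, combining a softmax with the formula above (the helper functions are mine, for illustration only):

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over classes."""
    exp = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exp / exp.sum()

def categorical_cross_entropy(y_onehot, logits, eps=1e-12):
    """CCE = -sum_i y_i * log(softmax(logits)_i); only the true class contributes."""
    probs = softmax(logits)
    return -np.sum(y_onehot * np.log(probs + eps))

logits   = np.array([2.0, 1.0, 0.1])  # raw network outputs for 3 classes
y_onehot = np.array([1, 0, 0])        # the true class is class 0

print(categorical_cross_entropy(y_onehot, logits))  # ≈ 0.42
```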

3) Ranking Losses: These are used in situations where the model’s objective is to predict the relative distances or rankings between inputs, such as ranking products by relevance in an e-commerce search.

Ranking Loss functions

Hinge Loss

Hinge Loss is commonly used for ranking and support vector machine (SVM) applications. It is particularly useful when you want to encourage a margin between the predicted scores of positive and negative samples.

Formula:

L = max(0, 1 − x · ŷ)

Where x is the true binary label for the data point (+1 or −1, where +1 represents the positive class and −1 represents the negative class) and ŷ is the model’s prediction, often a real-valued score or output before applying a threshold.

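A minimal NumPy sketch of the hinge loss for a batch of scores, following the formula above:

```python
import numpy as np

def hinge_loss(labels, scores):
    """Hinge loss = mean(max(0, 1 - label * score)), with labels in {+1, -1}."""
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))

labels = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.3, -1.5, 0.4])  # raw real-valued model outputs

# Per sample: 0 (correct, beyond margin), 0.7 (correct but inside margin),
#             0 (correct, beyond margin), 1.4 (wrong side of the margin)
print(hinge_loss(labels, scores))  # (0 + 0.7 + 0 + 1.4) / 4 = 0.525
```

Predictions that are correct and beyond the margin incur no loss at all, which is what encourages a margin between positive and negative scores.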

Margin Ranking Loss

Margin loss functions are commonly used for tasks that involve learning representations or embeddings, such as triplet loss and margin ranking loss. These loss functions aim to create a margin or gap in the learned representations to improve the separation between different classes or samples.

Formula:

loss(x1, x2, y) = max(0, −y · (x1 − x2) + margin)

With the Margin Ranking Loss, the loss is calculated from two inputs, x1 and x2, together with a label tensor y (containing 1 or −1). When y == 1, the first input is assumed to be the larger value and is ranked higher than the second input. If y == −1, the second input is ranked higher.
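A minimal NumPy sketch of this loss (frameworks such as PyTorch provide an equivalent built-in; this hand-rolled version is only for illustration):

```python
import numpy as np

def margin_ranking_loss(x1, x2, y, margin=0.0):
    """loss = mean(max(0, -y * (x1 - x2) + margin)); y = +1 means x1 should rank higher."""
    return np.mean(np.maximum(0.0, -y * (x1 - x2) + margin))

x1 = np.array([0.8, 0.2])
x2 = np.array([0.3, 0.9])
y  = np.array([1, 1])  # in both pairs, the first input should be ranked higher

# Pair 1 is ordered correctly (0.8 > 0.3) -> 0; pair 2 is not (0.2 < 0.9) -> 0.7
print(margin_ranking_loss(x1, x2, y))  # (0 + 0.7) / 2 = 0.35
```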

Triplet Margin Loss

Triplet Margin Loss is a popular margin loss function used in tasks like face recognition, image similarity, and recommendation systems. It uses carefully selected triplets of data points, each consisting of an anchor, a positive, and a negative example. The goal is to train a model to minimize the distance or dissimilarity between the anchor and positive while maximizing the distance or dissimilarity between the anchor and negative. This encourages the model to better distinguish between similar and dissimilar data points, enhancing its performance in similarity-based tasks. Triplet selection focuses on challenging triplets to accelerate learning, ultimately improving the model’s ability to make accurate similarity judgments.

Formula:

L(a, p, n) = max(d(a, p) − d(a, n) + margin, 0)

Here a, p, and n are the anchor, positive, and negative examples, and d is a distance measure such as the Euclidean distance between their embeddings.
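A minimal NumPy sketch for a single (anchor, positive, negative) triplet, assuming Euclidean distance as d (real systems batch this and mine hard triplets):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """loss = max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.1])  # embedding close to the anchor
negative = np.array([2.0, 2.0])  # embedding far from the anchor

# d(a, p) ≈ 0.14, d(a, n) ≈ 2.83 -> 0.14 - 2.83 + 1.0 < 0 -> loss = 0
print(triplet_margin_loss(anchor, positive, negative))
```

When the negative is not at least the margin farther from the anchor than the positive, the loss becomes positive and pushes the embeddings apart.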

Kullback-Leibler Divergence:

Kullback-Leibler Divergence (KL Divergence) measures the difference between two probability distributions. It quantifies the information lost (in bits with base-2 logarithms, or nats with natural logarithms) when the predicted distribution is used to approximate the true distribution. A higher KL Divergence indicates greater dissimilarity between the distributions, while zero means they are identical. Unlike Cross-Entropy Loss, which also depends on the entropy of the true distribution, KL Divergence focuses solely on the difference between the two distributions.

Formula:

D_KL(P ∥ Q) = Σₓ P(x) · log(P(x) / Q(x))

Here P is the true distribution and Q is the predicted distribution used to approximate it.

Source: https://jessicastringham.net/2018/12/27/KL-Divergence/
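A minimal NumPy sketch of KL Divergence between two discrete distributions, following the formula above (base-2 logarithms give the answer in bits; natural logarithms would give nats):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)), measured in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log2((p + eps) / (q + eps)))

p = [0.5, 0.3, 0.2]  # "true" distribution
q = [0.4, 0.4, 0.2]  # predicted approximation

print(kl_divergence(p, q))  # ≈ 0.04 bits: the distributions differ slightly
print(kl_divergence(p, p))  # 0.0: identical distributions
```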

These are a few examples of the various loss functions employed in deep learning. Each loss function has its strengths and weaknesses, and the choice depends on the specific task, data characteristics, and desired model behavior. Experimentation and careful consideration of the problem at hand are key to selecting an appropriate loss function for a successful deep learning model.

If you enjoy reading stories like these and want to support my writing, please consider following and liking. I’ll cover most deep learning topics in this series. These posts are also shared on X.

Thank you!
