Loss Functions Unraveled

om pramod
Aug 15, 2023


Part 3: Classification Loss Functions

Classification loss functions are used for classification tasks, where the goal is to predict a class label.

Binary Classification Loss Functions

These loss functions are used in classification problems where an example belongs to one of two classes.

Binary Cross-Entropy:

Cross-Entropy Loss is also known as Log Loss. Entropy measures the uncertainty in a distribution, and cross-entropy measures the mismatch between the target distribution and the predicted distribution. Log loss therefore quantifies how far the predicted probabilities are from the true class labels. For a true label y and a predicted probability t of the positive class, the binary cross-entropy is defined as:

CE(y, t) = -( y * log(t) + (1 - y) * log(1 - t) )

An example of log loss calculation would be:

Suppose we have a binary classification problem where the positive class is “cancer” and the negative class is “no cancer.” We have a sample with true label y=1 (cancer) and the model predicts a probability of p=0.9 that the sample belongs to the positive class. Then, the log loss for this sample would be:

CE(y, t) = -(1 * log(0.9) + (1 - 1) * log(1 - 0.9)) = -log(0.9) ≈ 0.105

Here the model assigns a high probability (0.9) to the correct class, so the prediction is close to the true label and the log loss is small. On the other hand, if the model predicted a probability of only p=0.1 for this sample, the log loss would be much larger:

CE(y, t) = -(1 * log(0.1) + (1 - 1) * log(1 - 0.1)) = -log(0.1) ≈ 2.303

This means that the model’s prediction is far from the true label, so the log loss is much larger. The goal of training a model with log loss as the loss function is to minimize the average log loss over all samples, so that the model’s predictions are as close as possible to the true class labels.

The ideal value of binary cross-entropy is zero; the closer the value is to zero, the better the model is performing.
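To make the calculation concrete, here is a minimal NumPy sketch of binary cross-entropy (the function name binary_cross_entropy is just illustrative, not a library function):

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so the logarithm never blows up
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(binary_cross_entropy(1, 0.9))  # ~0.105 (high probability on the correct class)
print(binary_cross_entropy(1, 0.1))  # ~2.303 (high probability on the wrong class)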

Hinge Loss: It is a popular loss function in machine learning, particularly for binary classification tasks, and it is notably used for training Support Vector Machines (SVMs). In an SVM, data points are categorized into either a positive class (+1) or a negative class (-1). The primary goal of an SVM is to find a decision boundary (hyperplane) that maximizes the margin between the two classes while minimizing the misclassification of data points. It's important to note that SVMs strive for a clear separation between classes. Unlike methods such as linear regression, where we try to find a line that minimizes the distance to the data points, an SVM tries to maximize that distance. The distance from the hyperplane can be regarded as a measure of confidence: the further an observation lies from the plane, the more confident the classification. Hinge loss is the mathematical foundation for achieving this objective. It is defined as:

loss(y, y_pred) = max(0, 1 - y * y_pred)

Here, y is the true label of the data point (+1 or -1), and y_pred is the raw output (score) that the SVM produces for that point. The hinge loss has the following conditions:

  1. If the prediction is on the correct side of the decision boundary and outside the margin (i.e., y * y_pred ≥ 1), the loss is 0. The SVM does not penalize points that are correctly and confidently classified.
  2. Otherwise (i.e., y * y_pred < 1), the loss is 1 - y * y_pred and grows linearly with how far the point falls short of the margin: a correctly classified point that lies inside the margin incurs a small loss between 0 and 1, while a misclassified point incurs a loss greater than 1. The further a point sits on the wrong side of the boundary, the larger the penalty.

Let’s consider some examples to understand how hinge loss works:

Example 1:

  • True output (y): +1
  • Predicted output (f(x)): 0.5

The hinge loss is max(0, 1 - 1 * 0.5) = 0.5. The prediction has the correct sign, so the data point lies on the correct side of the decision boundary, but it falls inside the margin and therefore incurs a small penalty.

Example 2:

  • True output (y): -1
  • Predicted output (f(x)): 2.5

Since the true output and the predicted output have different signs, the data point is misclassified and the loss grows with its distance from the margin. In this case, the loss is max(0, 1 - (-1 * 2.5)) = 3.5.

Example 3:

  • True output (y): +1
  • Predicted output (f(x)): -1.8

Again, the true output and the predicted output have different signs, so the data point is misclassified and the loss grows with its distance from the margin. In this case, the loss is max(0, 1 - (1 * (-1.8))) = 2.8.
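These three examples can be checked with a couple of lines of NumPy (a minimal sketch; hinge is an illustrative name, not a library function):

import numpy as np

def hinge(y_true, y_pred):
    # Element-wise hinge loss: max(0, 1 - y * y_pred)
    return np.maximum(0, 1 - y_true * y_pred)

y_true = np.array([1, -1, 1])
y_pred = np.array([0.5, 2.5, -1.8])
print(hinge(y_true, y_pred))  # [0.5 3.5 2.8]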

By minimizing the hinge loss, an SVM aims to find the decision boundary that separates the classes with the largest possible margin while correctly classifying the data points. Let's break down the different conditions of the hinge loss.

Condition 1: Correct Classification with a Large Margin

When an observation is correctly classified by the SVM and is situated far away from the decision boundary (hyperplane), no penalty is incurred. This reflects high confidence in the classification.

Let’s go through an example:

Actual Outcome: +1

Predicted Output: +3

Margin: 1

Hinge Loss: max(0, 1 - y * y_pred) = max(0, 1 - 1 * 3) = max(0, -2) = 0

Condition 2: Observation on the Decision Boundary.

Data points that lie directly on the decision boundary are treated as a special case. Regardless of the actual outcome (+1 or -1), these observations incur a loss of 1.

Example:

Actual Outcome: +1

Margin: 1

Predicted Output: 0 (On the decision boundary)

Hinge Loss: max(0, 1 - y * y_pred) = max(0, 1 - 1 * 0) = max(0, 1) = 1

Condition 3: Correct Classification within the Margin

Observations that fall on the correct side of the decision boundary but within the margin incur a cost between 0 and 1. This represents a certain level of uncertainty.

Example:

Actual Outcome: +1

Predicted Output: +0.2

Margin: 1

Hinge Loss: max(0, 1 - y * y_pred) = max(0, 1 - 1 * 0.2) = max(0, 0.8) = 0.8

Condition 4: Incorrect Classification

Data points that end up on the wrong side of the decision boundary experience a hinge loss greater than 1, which increases linearly. The larger the distance from the hyperplane, the larger the loss.

Example:

Actual Outcome: +1

Predicted Output: -2.5

Hinge Loss: max(0, 1 - y * y_pred) = max(0, 1 - 1 * (-2.5)) = max(0, 1 + 2.5) = 3.5

Here, the actual outcome is +1, indicating that the data point belongs to the positive class. The predicted output is -2.5, which means the data point is misclassified and falls on the wrong side of the decision boundary. In this case, the hinge loss is 3.5, indicating a significant loss due to the misclassification of the data point. The larger the distance from the correct side of the decision boundary, the larger the hinge loss.

In summary, an observation incurs no penalty only if it is classified correctly and its distance from the hyperplane is at least the margin.

Python Implementation:

import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score, hinge_loss
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate example data
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train SVM model with hinge loss
svm_model = svm.SVC(kernel='linear', C=1.0)
svm_model.fit(X_train, y_train)

# Make predictions and compute the real-valued decision scores
y_pred = svm_model.predict(X_test)
decision_scores = svm_model.decision_function(X_test)

# Calculate accuracy and hinge loss
# (hinge_loss expects decision-function scores, not the hard 0/1 predictions)
accuracy = accuracy_score(y_test, y_pred)
h_loss = hinge_loss(y_test, decision_scores)

print(f"Accuracy: {accuracy:.2f}")
print(f"Hinge Loss: {h_loss:.2f}")

On this split the model reaches an accuracy of 0.95. The printed hinge loss is the average of max(0, 1 - y * f(x)) over the test points; a value close to zero indicates that the test points are classified correctly with a comfortable margin.
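To connect scikit-learn's hinge_loss to the formula max(0, 1 - y * f(x)), here is a small self-contained check on made-up decision scores (the numbers below are hypothetical, chosen only for illustration):

import numpy as np
from sklearn.metrics import hinge_loss

y_true = np.array([1, 1, 0, 0])            # binary labels
scores = np.array([2.0, 0.3, -1.5, 0.8])   # hypothetical decision_function outputs
y_signed = 2 * y_true - 1                  # map {0, 1} labels to {-1, +1}
manual = np.mean(np.maximum(0, 1 - y_signed * scores))
print(manual)                              # 0.625
print(hinge_loss(y_true, scores))          # 0.625, matching the manual computation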

In the context of Support Vector Machines (SVMs), the ideal hinge loss is as low as possible, reaching zero when every data point is correctly classified with a comfortable margin. It's important to note that the focus of SVMs is not solely on minimizing the hinge loss; rather, it's about finding the decision boundary (hyperplane) that maximizes the margin while keeping the hinge loss under control. SVMs aim to strike a balance between correctly classifying data points and maximizing the margin, with the hinge loss serving as the guiding principle for this balance.

Multi-Class Classification Loss Functions

These loss functions are used in classification problems where an example belongs to one of more than two classes.

Categorical cross-entropy or multiclass log loss:

These are the multi-class counterparts of binary cross-entropy, used for classification problems with more than two classes.

For example, here’s the formula for Categorical Cross-Entropy (CCE) loss in a multi-class classification problem with three classes:

Loss = -(1/N) * Σ [ y1 * log(p1) + y2 * log(p2) + y3 * log(p3) ]

where N is the number of samples, the sum runs over the samples, y1, y2, and y3 are the true class labels of a sample (encoded as a one-hot vector), and p1, p2, and p3 are the predicted probabilities for the three classes for that sample.

Sparse categorical cross-entropy:

Sparse categorical cross-entropy is an extension of the categorical cross-entropy loss function. It is appropriate when the ground truth labels are provided as integers rather than one-hot encoded vectors. Consider a classification problem with five classes: “cat,” “dog,” “bird,” “fish,” and “rabbit.”

In traditional one-hot encoding, the labels would be represented as follows:

  1. “cat” -> [1, 0, 0, 0, 0]
  2. “bird” -> [0, 0, 1, 0, 0]
  3. “dog” -> [0, 1, 0, 0, 0] ….

However, with sparse categorical cross-entropy, we don’t need to one-hot encode the labels. Instead, the labels are provided as integers representing the class index directly. For example:

  1. “cat” -> Label: 0
  2. “bird” -> Label: 2
  3. “dog” -> Label: 1 ….

The “sparse” in Sparse Categorical Cross Entropy refers to the fact that the target classes are represented as integers, rather than one-hot vectors. In a one-hot vector, only one element is “hot” or marked as true (1), while the rest are “cold” or marked as false (0). For example, for five classes, the third class would be represented as [0, 0, 1, 0, 0]. If we represent this in a sparse format, we would simply use the integer ‘2’ (considering a zero-based index).
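The relationship between the two label formats is easy to see in code (a small sketch; the class count of five matches the example above):

import numpy as np

sparse_labels = np.array([0, 2, 1])   # "cat", "bird", "dog" as class indices
one_hot = np.eye(5)[sparse_labels]    # the same labels as one-hot vectors
print(one_hot)
# [[1. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 0.]]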

Let’s derive the formula for sparse categorical cross-entropy from the formula for categorical cross-entropy. As discussed, the formula for categorical cross-entropy is:

Categorical Cross-Entropy = -Σ [ y_true * log(y_pred) ]

Where:

  • Σ represents summation over the classes of a sample (the per-sample losses are then summed or averaged over all samples).
  • y_true is the true one-hot encoded label vector representing the class of each sample.
  • y_pred is the predicted probability vector over all classes for each sample.

Example:

Sample 1:

  • True label: “red” -> Label: [1, 0, 0]
  • Predicted probabilities: [0.9, 0.05, 0.05]

Sample 2:

  • True label: “blue” -> Label: [0, 0, 1]
  • Predicted probabilities: [0.1, 0.1, 0.8]

Sample 3:

  • True label: “green” -> Label: [0, 1, 0]
  • Predicted probabilities: [0.3, 0.4, 0.3]

Now, let’s calculate the loss for each sample:

Sample 1:

  • Loss: −(1⋅log(0.9) + 0⋅log(0.05) + 0⋅log(0.05)) = −log(0.9) ≈ 0.105

Sample 2:

  • Loss: −(0⋅log(0.1) + 0⋅log(0.1) + 1⋅log(0.8)) = −log(0.8) ≈ 0.223

Sample 3:

  • Loss: −(0⋅log(0.3) + 1⋅log(0.4) + 0⋅log(0.3)) = −log(0.4) ≈ 0.916
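A quick NumPy check of these three per-sample losses (a minimal sketch; like most deep-learning frameworks, it uses the natural logarithm):

import numpy as np

y_true = np.array([[1, 0, 0],   # "red"
                   [0, 0, 1],   # "blue"
                   [0, 1, 0]])  # "green"
y_pred = np.array([[0.9, 0.05, 0.05],
                   [0.1, 0.1, 0.8],
                   [0.3, 0.4, 0.3]])

# Per-sample categorical cross-entropy: -sum over classes of y_true * log(y_pred)
per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
print(per_sample)  # approximately [0.105 0.223 0.916]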

Now, let’s consider the case of sparse categorical cross-entropy. In sparse categorical cross-entropy, instead of providing the true probabilities (y_true) for each class, we provide the true class labels directly as integers. The formula for sparse categorical cross-entropy is:

Loss = -Σ [ log(p_i,y_i) ]

Where:

  • the sum runs over all N samples.
  • p_i,y_i is the predicted probability assigned to the true class label y_i of the i-th sample.

To break it down further, let’s consider an example with three samples and their corresponding true labels and predicted probabilities:

Sample 1:

  • True label: “red” -> Label: 0
  • Predicted probabilities: [0.9, 0.05, 0.05]

Sample 2:

  • True label: “blue” -> Label: 2
  • Predicted probabilities: [0.1, 0.1, 0.8]

Sample 3:

  • True label: “green” -> Label: 1
  • Predicted probabilities: [0.3, 0.4, 0.3]

Now, let’s calculate the sparse categorical cross-entropy loss for these three samples using the formula:

Sample 1:

  • Loss: −log(0.9) ≈ 0.105

Sample 2:

  • Loss: −log(0.8) ≈ 0.223

Sample 3:

  • Loss: −log(0.4) ≈ 0.916

As you can see, the two loss functions give exactly the same value for each sample. This is no coincidence: in categorical cross-entropy the true labels are one-hot vectors, so when the loss is calculated for a sample, only the predicted probability of the true class contributes (every other element is multiplied by 0), and that is precisely the quantity sparse categorical cross-entropy picks out directly.
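The same equivalence is easy to verify in code: with integer labels we simply index into the predicted probabilities instead of multiplying by a one-hot vector (a minimal sketch reusing the example above):

import numpy as np

y_pred = np.array([[0.9, 0.05, 0.05],
                   [0.1, 0.1, 0.8],
                   [0.3, 0.4, 0.3]])
sparse_labels = np.array([0, 2, 1])   # "red", "blue", "green" as class indices
one_hot = np.eye(3)[sparse_labels]    # the same labels one-hot encoded

categorical = -np.sum(one_hot * np.log(y_pred), axis=1)
sparse = -np.log(y_pred[np.arange(3), sparse_labels])
print(categorical)  # approximately [0.105 0.223 0.916]
print(sparse)       # identical values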

In summary, sparse categorical cross-entropy is a variant of the categorical cross-entropy loss function used in multiclass classification tasks. The key distinction is that it accepts integer labels directly, eliminating the need for one-hot encoding the target labels. This can be particularly advantageous when dealing with datasets that have a large number of classes, as one-hot encoding could lead to memory inefficiencies.

To sum up:

  • Sparse categorical cross-entropy: Used when ground truth labels are provided as integers representing the class indices.
  • Categorical cross-entropy: Used when ground truth labels are provided as one-hot encoded vectors.

Final Note: Thanks for reading! I hope you find this article informative.

But our journey doesn’t end here; a new horizon beckons. In “Loss Functions Unraveled | Part 4: Python Walkthrough of Loss Functions,” we’ll embark on a hands-on adventure. So, gear up for the next chapter, where code and concepts converge, and the path to mastery becomes even more tangible.

Stay curious, stay engaged, and let the coding journey flourish!
