Focal loss for handling the issue of class imbalance

Uman Niyaz
Data Science @ Ecom Express
Jun 12, 2023

Text classification is widely used across industries to tackle business challenges by analyzing text data for valuable insights. However, the complex structure of text data often makes it challenging to extract meaningful information efficiently. Natural language processing (NLP) techniques are employed for this purpose: they focus on how computers interact with human language and enable the analysis of large volumes of natural language data.

Within NLP, text classification is a common task that aims to categorize and predict the class of unseen text data using supervised machine learning. However, class imbalance is one of the challenges that arises when training a text classifier. It refers to situations where certain classes have significantly more data samples than others, which leads to biased model performance. The focal loss technique is employed to address class imbalance in text classification. It assigns higher weights to misclassified samples, especially those from underrepresented classes, allowing the model to focus more on learning from challenging examples and to improve its predictions for classes with few training samples.

This article discusses the benefits of using focal loss over traditional cross-entropy for text classification, taking a product categorization problem from the logistics industry as a case study. We also demonstrate why cross-entropy fails in certain scenarios and explain how focal loss overcomes its limitations, providing improved results in text classification tasks.

Problem Statement

Logistics enables the transportation of different types of products from sellers to consignees, and these products are grouped into different categories. Class imbalance can occur in product categorization when there is a significant disparity in the number of products per category. This leads to underrepresented categories and a biased model that tends to predict the overrepresented categories. This poses challenges for inventory management, order fulfilment, and shipping, which rely on accurate product categorization.

When training on imbalanced datasets, models are more likely to classify products into the primary (overrepresented) category, even if they should belong to a secondary or tertiary (underrepresented) category.

To address class imbalance, techniques like oversampling, under-sampling, and weighted loss functions are often used. Focal loss is one such approach that aims to balance the contribution of each class to the overall loss function. By assigning higher weights to underrepresented classes, focal loss encourages the model to prioritize learning these classes, reducing the dominance of the overrepresented classes. This helps improve the model’s ability to categorize products accurately, as in the logistics case described here.

We discuss this issue in the context of product classification of a given product description/title into its respective category. Table 1 below shows some examples from our dataset.

Table 1. Examples of product descriptions and their respective categories

Below are some of the majority and minority sampled classes in our dataset.

Majority sampled classes: Apparel and Accessories, Bags and Luggage, Beauty, Cosmetics and Toiletries, Footwear, Home and Furniture, Jewellery, Kitchen, Mobiles, Tablets and Accessories.

Minority sampled classes: Sports and Fitness, Hazardous and Other Products, Building Supplies, Hardware and Tools (including automotive), Books and Stationery, Automotive and Accessories (Spares), Groceries, Laptops and Electronic Peripherals, Baby, Kids and Toys Store.

Fig 1. Distribution of Level 1 classes of product categorization engine

Loss Functions

Loss functions are mathematical functions that quantify the deviation between the actual and predicted values. They evaluate the performance of an algorithm on a dataset: the higher the loss value, the larger the error. The aim of training is to minimize the loss function, which drives the learning of the trainable parameters, the weights and biases.

Cross Entropy Loss

Cross entropy loss, also called logarithmic loss or logistic loss, is a widely used loss function in classification tasks. It measures the difference between two probability distributions, typically the true label distribution and the predicted distribution. Cross-entropy loss is used to update model weights during training, with the goal of minimizing the loss: a smaller loss indicates better model performance, and a cross-entropy loss of 0 represents a perfect model.

Fig 2. Expression for Cross Entropy Loss, Image Source: Author
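For readers viewing the text without the figure, the binary cross-entropy expression from [1] (which also introduces the shorthand p_t used throughout this article) is:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$

and, with the shorthand $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise,

$$\mathrm{CE}(p_t) = -\log(p_t).$$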

Below are some instances in which the Cross-Entropy loss does not perform well:

· Class imbalance can introduce bias in the process. When the majority class examples dominate the loss function and gradient descent, the model tends to become more confident in predicting the majority class while neglecting the minority classes. To address this issue, Balanced Cross-Entropy loss can be used.

· Cross-Entropy loss fails to differentiate between easy and hard examples. Hard examples are those in which the model makes significant errors, while easy examples are straightforward to classify. As a result, Cross-Entropy loss does not allocate more attention to hard samples.

Balanced Cross Entropy Loss

To mitigate the challenges posed by class imbalance, balanced cross-entropy adds a weighting factor α ∈ [0, 1] for each class. In practice, α is set to the inverse class frequency or treated as a hyperparameter tuned by cross-validation, and it multiplies each class’s term in the cross-entropy equation. To keep the notation clear, we define α_t analogously to how we defined p_t (α for the positive class, 1 − α otherwise). We then denote the α-balanced cross-entropy loss as follows:

Fig 3. Expression for α-balanced Cross Entropy Loss, Image Source: [1]
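In symbols, the α-balanced cross-entropy loss shown in Fig 3 is:

$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t).$$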

Balanced cross-entropy mitigates the class imbalance problem, but it still cannot distinguish between hard and easy examples. Focal loss addresses this limitation.

Focal loss

Focal loss aims to improve the model’s performance on hard examples by focusing on its mistakes rather than relying on its confidence when predicting easy examples. This is achieved through down-weighting, a technique that reduces the influence of easy examples on the loss function, thereby shifting attention to hard examples. Down-weighting is applied by introducing a modulating factor (1 − p_t)^γ to the cross-entropy loss, with a tunable focusing parameter γ ≥ 0.

Fig 4. Expression for Focal Loss, Image Source: [1]
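Written out, the focal loss of Fig 4 is:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$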

where γ (Gamma) is the focusing parameter or the relaxation parameter to be tuned using cross-validation. It controls the degree of focus on hard, misclassified examples during the training of a neural network. A larger value of γ emphasizes the misclassified examples, while a smaller value of γ results in a more balanced focus on easy and hard examples.

The image below shows how focal loss behaves for different values of γ.

Fig 5. Down weighting increases with an increase in γ, Image Source: [1]

The focal loss is visualized for several values of γ ∈ [0, 5] in Figure 5 above. The authors of [1] note two properties of the focal loss.

(1) When an example is misclassified and p_t is small, the modulating factor is near 1 and the loss is unaffected. As p_t → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.

(2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, focal loss is equivalent to cross-entropy, and as γ is increased, the effect of the modulating factor is likewise increased.
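To make this concrete, below is a minimal sketch of a multi-class focal loss in PyTorch. It is not the exact code used in our experiments; the class name and defaults are illustrative, and it assumes per-class α weights are passed in as a 1-D tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Multi-class focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""

    def __init__(self, gamma=2.0, alpha=None):
        # alpha: optional 1-D tensor of per-class weights (e.g. inverse class frequency)
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        # Per-sample cross-entropy, i.e. -log(p_t)
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)                          # p_t, probability of the true class
        loss = (1.0 - pt) ** self.gamma * ce         # down-weight easy examples
        if self.alpha is not None:
            loss = self.alpha.to(logits.device)[targets] * loss  # alpha_t weighting
        return loss.mean()
```

With γ = 0 and alpha = None this reduces to ordinary cross-entropy, so swapping it in for nn.CrossEntropyLoss() in an existing training loop is typically a one-line change.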

α-Balanced Focal Loss

This variant combines the weighting factor α (from balanced cross-entropy loss) with the focusing parameter γ, which further improves accuracy over the non-balanced form. α-balanced focal loss handles class imbalance through two components: the focal term and the weighting factor α. The focal term down-weights the loss contribution from well-classified examples, which allows the model to focus on hard-to-classify examples, and the focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. α adjusts the weights assigned to the different classes; in practice, it may be set by inverse class frequency or treated as a hyperparameter chosen by cross-validation. Scaling and balancing the loss with α yields slightly improved accuracy over the non-α-balanced form.

By leveraging the qualities of both α and γ, the α-balanced focal loss exhibited superior performance in our study.

Fig 6. Expression for α-balanced Focal Loss, Image Source: [1]
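Combining the two ingredients, the α-balanced focal loss of Fig 6 is:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$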

Intuitive Understanding of Focal Loss

In this section, we illustrate how focal loss gives more weight to underrepresented, hard-to-classify samples and less weight to easy ones. We take one sample from each kind of class and calculate the values of the different loss functions for it.

CASE 1: Easy and majority class samples

An easy example (from a majority sampled class) is one where we assume the sample is correctly classified: the actual class is 1 and the predicted class is also 1.

Here, the predicted probability of the true class is p = 0.9. Since this is a majority-class example, its weighting factor is α = 0.25. Throughout this section we fix α = 0.25 and γ = 2 and use base-10 logarithms.

CE = -(1) log(0.9) ≈ 0.045

FL = -(0.25) × (1 − 0.9)² × log(0.9) ≈ 0.00011

CASE 2: Hard and minority class samples

A hard example (from a minority sampled class) is one where we assume the sample is misclassified: the actual class is 1, but the predicted class is 0.

Here, the predicted probability of the true class is p = 0.1. Since this is a minority-class example, its weighting factor is 1 − α = 0.75.

CE = -(1) log(0.1) = 1

FL = -(1 − 0.25) × (1 − 0.1)² × log(0.1) = 0.6075

Using the two cases above, we compare the ratio of the two loss values.

1. Loss ratio in CASE 1: CE/FL ≈ 400

2. Loss ratio in CASE 2: CE/FL ≈ 1.6

In Case 1, the cross-entropy loss is roughly 400 times larger than the focal loss. Cross-entropy lets the easy, correctly classified example contribute its full loss, whereas focal loss down-weights this contribution to almost nothing.

In Case 2, the cross-entropy loss is only about 1.6 times larger than the focal loss, so the hard, misclassified example keeps most of its loss contribution. Comparing the two cases, focal loss shifts the relative weighting strongly towards the hard, minority-class example: under cross-entropy the hard example’s loss is about 22 times that of the easy example, while under focal loss it is several thousand times larger.
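As a quick sanity check, the two cases and their loss ratios can be reproduced with a few lines of Python (base-10 logarithms are used to match the numbers above):

```python
import math

def ce(p):
    # Cross-entropy for a positive example: -log10(p)
    return -math.log10(p)

def fl(p, alpha, gamma=2.0):
    # Alpha-balanced focal loss: alpha * (1 - p)^gamma * (-log10(p))
    return alpha * (1 - p) ** gamma * ce(p)

# Case 1: easy, majority-class example (p = 0.9, weight alpha = 0.25)
ce1, fl1 = ce(0.9), fl(0.9, alpha=0.25)
print(ce1, fl1, ce1 / fl1)   # ~0.046, ~0.00011, ~400

# Case 2: hard, minority-class example (p = 0.1, weight 1 - alpha = 0.75)
ce2, fl2 = ce(0.1), fl(0.1, alpha=0.75)
print(ce2, fl2, ce2 / fl2)   # 1.0, ~0.61, ~1.6
```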

Experiments and Evaluation

Here we run experiments on the product classification problem described earlier. We train a neural network-based classifier three times, with cross-entropy, balanced cross-entropy, and α-balanced focal loss, and compute the performance metrics. We use the same model architecture and the same training and testing datasets in all three runs; only the loss function differs. The following table demonstrates the efficacy of focal loss over the cross-entropy and balanced cross-entropy loss functions.

Table 2. Accuracy matrix and precision matrix of minority classes (in our dataset) before and after applying Focal loss (FL = Focal loss, BCE = Balanced Cross Entropy, CE = Cross entropy)

In these experiments, we employed the α-balanced variant of the focal loss and found its hyperparameters experimentally: α was set by taking the inverse frequency of the minority classes, and γ = 2 gave the best-performing metrics.
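As an illustration of this kind of setup (a sketch of the general recipe, not our exact production code), per-class α weights can be derived from the label counts roughly as follows:

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class alpha weights proportional to inverse class frequency, normalised so the average weight is 1."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = counts.sum() / np.maximum(counts, 1.0)   # inverse frequency; guard against empty classes
    return weights * num_classes / weights.sum()

# Example: a heavily imbalanced label vector with 3 classes
labels = np.array([0] * 90 + [1] * 8 + [2] * 2)
print(inverse_frequency_weights(labels, num_classes=3))  # rare classes get the largest weights
```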

Overall, this study demonstrated the effectiveness of focal loss in addressing class imbalance, improving the performance of classes with limited training samples, providing flexibility to adjust the learning process, and mitigating the impact of noisy data. By leveraging focal loss, businesses can build more accurate and reliable text classifiers to extract valuable insights and make informed decisions from text data more efficiently.

References

[1] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal Loss for Dense Object Detection”, ICCV 2017.

Authors

Uman Niyaz — Data Scientist @Ecom Express Limited
Asmita Bhardwaj — Associate Data Scientist @Ecom Express Limited
