How Focal Loss fixes the Class Imbalance problem in Object Detection

Focal Loss for Object Detection by Facebook AI Research

Yash Marathe
Analytics Vidhya
6 min read · Jun 11, 2020


Table of Contents

  1. Introduction
  2. What is the Foreground-Background Class Imbalance?
  3. Issues with previous State-of-the-art Object Detectors
  4. Cross-Entropy Loss
  5. Focal Loss Trick
  6. Conclusion
  7. References

Introduction

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler but have trailed the accuracy of two-stage detectors thus far.

It has been discovered that the extreme foreground-background class imbalance encountered during the training of dense detectors is the central cause.

What is the Foreground-Background Class Imbalance?

[Figure: the bounding box matching and labeling module assigns most candidate boxes to the background class]

In foreground-background class imbalance, the over-represented class is the background and the under-represented class is the foreground. This type of problem is inevitable because most bounding boxes are labeled as the background (i.e., negative) class by the bounding box matching and labeling module, as illustrated in the figure above.

The foreground-background imbalance problem arises during training and does not depend on the number of examples per class in the dataset, since datasets contain no annotations for the background class. A sketch of how such a matching step produces mostly background labels is shown below.
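
To make this concrete, here is a minimal NumPy sketch (not the exact module of any particular detector) of how an IoU-based matching step typically labels a dense grid of candidate boxes; the 0.5 threshold, anchor sizes, and function names are illustrative assumptions:

```python
import numpy as np

def iou(anchors, gt_boxes):
    # Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4), boxes as (x1, y1, x2, y2).
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, fg_iou=0.5):
    # Foreground (1) if an anchor overlaps any ground-truth box above the threshold, else background (0).
    best_iou = iou(anchors, gt_boxes).max(axis=1)
    return (best_iou >= fg_iou).astype(int)

# A dense grid of 64x64 anchors over a 640x640 image vs. a single 80x80 ground-truth object.
anchors = np.array([[x, y, x + 64, y + 64]
                    for y in range(0, 640, 16) for x in range(0, 640, 16)], dtype=float)
gt_boxes = np.array([[300.0, 300.0, 380.0, 380.0]])
labels = label_anchors(anchors, gt_boxes)
print(f"foreground anchors: {labels.sum()}, background anchors: {len(labels) - labels.sum()}")
```

Almost all anchors end up labeled as background, no matter how the dataset itself is balanced.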

Issues with previous State-of-the-art Object Detectors

In object detection tasks, the imbalanced training set problem is even more significant. Given an image, object detection algorithms usually have to propose a large number of regions in which potential objects might sit. In R-CNN and Fast R-CNN, the number of proposed regions is intentionally limited to a few thousand, while in Faster R-CNN and other models with CNN-based region proposal mechanisms, the number of candidate locations considered can reach several hundred thousand.

Of course, most of the proposed regions are negative examples with no object inside, so this class imbalance problem definitely needs to be addressed in object detection.

Both classic one-stage object detection methods, like boosted detectors and DPMs, and more recent methods, like SSD, face a large class imbalance during training. These detectors evaluate on the order of 10⁴–10⁵ candidate locations per image, but only a few of those locations contain objects.

This imbalance causes two problems:

(1) Training is inefficient as most locations are easy negatives that contribute no useful learning signal.

(2) The easy negatives can overwhelm training and lead to degenerate models.

Cross-Entropy Loss

First of all, let’s discuss the common Cross-Entropy loss and how it performs on data with imbalanced classes.

Normally, we use the sigmoid function for binary classification and the softmax function for multi-class classification to calculate the probability of a sample belonging to a certain class. The loss function used, regardless of whether the task is binary or multi-class classification, is usually the cross-entropy loss.

The mathematical expression for the discrete version of cross-entropy is:

H(p, q) = −Σ pi log(qi), with the sum running over i = 1, …, n

where n is the number of possible discretized distribution bins, pi is the probability of bin i under distribution p, and qi is the probability of bin i under distribution q. It should also be noted that cross-entropy is not symmetric, i.e., H(p, q) ≠ H(q, p).
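
A quick sanity check of the formula and of the asymmetry claim, using arbitrary toy distributions:

```python
import numpy as np

def cross_entropy(p, q):
    # Discrete cross-entropy H(p, q) = -sum_i p_i * log(q_i).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # predicted distribution
print(round(cross_entropy(p, q), 3))  # 0.887
print(round(cross_entropy(q, p), 3))  # 1.122 -> H(p, q) != H(q, p)
```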

Example of Cross-Entropy loss showing contribution from Negative and Positive Examples

Suppose we have 1 million negative examples with p=0.99 and 10 positive examples with p=0.01. (Source)

[Figure: calculations for Cross-Entropy loss]

We can see that negative examples account for about 99.54% of the total loss, whereas positive examples account for only about 0.46%.
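
These percentages are easy to reproduce. A small sketch, assuming the per-example loss is −log(pt), where pt is the probability the model assigns to the correct class:

```python
import numpy as np

# 1 million easy negatives (correct-class probability 0.99) and 10 hard positives (0.01).
n_neg, p_neg = 1_000_000, 0.99
n_pos, p_pos = 10, 0.01

ce_neg = n_neg * -np.log(p_neg)  # total CE loss from negatives, ~10050.3
ce_pos = n_pos * -np.log(p_pos)  # total CE loss from positives, ~46.1
total = ce_neg + ce_pos

print(f"negatives: {100 * ce_neg / total:.2f}% of the loss")  # 99.54%
print(f"positives: {100 * ce_pos / total:.2f}% of the loss")  # 0.46%
```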

Focal Loss Trick

The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000).

The focal loss function acts as a more effective alternative to previous approaches for dealing with class imbalance. It is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and rapidly focuses the model on hard examples. Experiments show that focal loss enables us to train a high-accuracy one-stage detector that significantly outperforms the alternatives of training with sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors.

The focal loss is defined as:

FL(pt) = −(1 − pt)^γ log(pt)

where pt is the model's estimated probability for the ground-truth class and γ ≥ 0 is a tunable focusing parameter.

The two properties of the focal loss can be noted as:

(1) When an example is misclassified and pt is small, the modulating factor is near 1 and the loss is unaffected. As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.

(2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ is increased the effect of the modulating factor is likewise increased.

Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared with CE, and with pt ≈ 0.968 it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most 4× for pt ≤ .5 and γ = 2).

In practice, an α-balanced variant of the focal loss is used:

FL(pt) = −αt (1 − pt)^γ log(pt)

where αt = α for the foreground class and 1 − α for the background class.
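
Below is a minimal NumPy sketch of the α-balanced focal loss; the function name is illustrative, and γ = 2 is used only to check the down-weighting factors quoted above:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    # where p_t is the predicted probability of the ground-truth class.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# The modulating factor (1 - p_t)^gamma is what scales the loss relative to plain CE.
for p_t in (0.5, 0.9, 0.968):
    print(f"p_t = {p_t}: FL = {focal_loss(p_t):.2e}, modulating factor = {(1.0 - p_t) ** 2:.2g}")
# Modulating factors: 0.25, 0.01 and ~0.001, i.e. the ~4x, 100x and 1000x down-weighting quoted above.
```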

Example of Focal loss showing contribution from Negative and Positive Examples

Suppose we have 1 million negative examples with p=0.99 and 10 positive examples with p=0.01. (Source)

[Figure: calculations for Focal loss]

We can see that now 93.74% of the loss contribution comes from positive examples and only 6.26% comes from negative examples.

Hence positive examples are dominating the total loss now!
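
This split can be reproduced with the same toy numbers; the sketch below assumes the paper's default settings γ = 2 and α = 0.25 (so negatives are weighted by 1 − α = 0.75), which is what yields the figures above:

```python
import numpy as np

gamma, alpha = 2.0, 0.25
n_neg, p_neg = 1_000_000, 0.99  # well-classified negatives: p_t = 0.99
n_pos, p_pos = 10, 0.01         # misclassified positives:   p_t = 0.01

fl_neg = n_neg * (1 - alpha) * (1 - p_neg) ** gamma * -np.log(p_neg)  # ~0.75
fl_pos = n_pos * alpha * (1 - p_pos) ** gamma * -np.log(p_pos)        # ~11.28
total = fl_neg + fl_pos

print(f"positives: {100 * fl_pos / total:.2f}% of the loss")  # 93.74%
print(f"negatives: {100 * fl_neg / total:.2f}% of the loss")  # 6.26%
```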

Setting γ > 0 reduces the relative loss for well-classified examples (pt > 0.5), putting more focus on hard, misclassified examples.

Conclusion

Focal loss is very useful for training on imbalanced datasets, especially in object detection tasks. It applies a modulating term to the cross-entropy loss in order to focus learning on hard, misclassified examples while down-weighting the vast number of easy negatives. This approach is simple and highly effective.

References

  1. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár. "Focal Loss for Dense Object Detection." ICCV 2017. https://arxiv.org/abs/1708.02002

Thanks for reading! If you have any suggestions, let me know in the comments below.

