Understanding Focal Loss: A Quick Read

Implications of Focal Loss for solving the class imbalance problem

Shreejal Trivedi
VisionWizard
7 min read · May 2, 2020


Yes, you might already have an idea of what I will be discussing in this blog ;p. But before getting to the main topic, I want to cover a few prerequisites.

  1. In the case of object detection:
  • Positive examples: target-class (foreground) samples, such as anchors matched to ground-truth boxes.
  • Negative examples: non-target-class (background) samples, such as anchors whose IoU with every ground-truth box is below a given threshold.
  • Easy positives/negatives: positive/negative samples that the network classifies correctly.
  • Hard positives/negatives: positive/negative samples that the network misclassifies (a hard positive is predicted as background, a hard negative as foreground).

2. Class Imbalance Problem

  • This is observed when one class in a dataset (or in the mini-batches used during training) is over-represented compared to the other classes.
  • Training a network on an imbalanced dataset biases it towards learning more representations of the dominant class, while the other classes are overlooked.
  • In object detection, two confidence values are predicted: an objectness score (whether an object is present in a box or not) and a class score (which class the detected object belongs to).
  • So it becomes important to maintain a balance between foreground-background and foreground-foreground classes while training. If not handled, the former creates box confidence errors (whether or not an object is present) and the latter creates class confidence errors (if an object is present in the box, which class it represents) during training.
Fig. 1: (Left) Imbalance between background/foreground. (Right) Imbalance between foreground/foreground. The numbers are from RetinaNet [5] on the COCO dataset (80 classes) [5].
  • Two-stage detectors have region proposal extractors that output many proposals (both positive and negative), which are then filtered by sampling techniques mentioned in [1], such as Online Hard Example Mining (OHEM) and IoU/objectness thresholding (a minimal sketch of such a heuristic is given after this list).
  • On the contrary, one-stage (single-shot) detectors do not contain a region proposal stage. They perform regression/classification densely over the feature maps and generate a fixed number of anchors per location, which makes it difficult to apply these sampling heuristics to get rid of unwanted negative samples.
  • Even if such heuristics are applied, the network still becomes biased towards learning background information, which is of little use. This is illustrated in the snippet below.
Fig. 2: (Left) Grid locations. (Right) Proposals. Blue: ground-truth boxes, Red: negative proposals, Green: positive proposals.
# Configuration for a YOLO-style network (IoU thresholding applied)
Image Size = (400, 400)              # Image input dimensions
Output Feature Map Size = (50, 50)   # Stride of 8
Anchor Boxes = [(45, 90), (90, 90), (90, 45)]
Ground Truth Boxes = [[50, 100, 150, 150], [250, 210, 300, 310]]

1. Total anchors over all locations: 7500 (50 x 50 x 3)
2. After removing invalid anchors, total remaining anchors: 5301
3. Total number of positive RoIs (IoU > 0.5): 42
4. Total number of negative RoIs (IoU <= 0.5): 5259
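The following is a rough, hypothetical sketch (mine, not the original script) of how a count like the one above could be produced: tile the three anchor shapes over the 50 x 50 grid, drop anchors that fall outside the 400 x 400 image, and label each remaining anchor by its best IoU with the two ground-truth boxes. The exact totals depend on anchor-centering and validity conventions, so they may differ slightly from the numbers quoted above.

import numpy as np

STRIDE, IMG = 8, 400
ANCHOR_SHAPES = [(45, 90), (90, 90), (90, 45)]              # (width, height)
GT = np.array([[50, 100, 150, 150], [250, 210, 300, 310]])  # x1, y1, x2, y2

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# One anchor of each shape per grid cell: 50 * 50 * 3 = 7500 anchors in total.
anchors = []
for cy in range(STRIDE // 2, IMG, STRIDE):
    for cx in range(STRIDE // 2, IMG, STRIDE):
        for w, h in ANCHOR_SHAPES:
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
anchors = np.array(anchors)

# Keep only anchors lying fully inside the image ("valid" anchors).
valid = anchors[(anchors[:, :2] >= 0).all(1) & (anchors[:, 2:] <= IMG).all(1)]

# Label each valid anchor by its best IoU against the ground-truth boxes.
best_iou = np.array([max(iou(a, g) for g in GT) for a in valid])
print(f"valid anchors: {len(valid)}, "
      f"positives (IoU > 0.5): {int((best_iou > 0.5).sum())}, "
      f"negatives (IoU <= 0.5): {int((best_iou <= 0.5).sum())}")

Whatever the exact convention, the positives come out in the tens while the negatives number in the thousands, matching the imbalance described above.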
  • As seen from Fig. 2 and the snippet above, there is a vast difference between the number of positive and negative samples, showing a severe imbalance between foreground and background in these kinds of detectors (YOLOv3 [2] uses 3 anchors per location on a feature map at a particular scale).
  • Some of the challenges faced during training due to this imbalance are (as stated in [5]):
  1. Training becomes inefficient, as most of the samples are easy negatives that contribute no useful learning signal. This kind of bias makes it difficult for the network to learn rich semantic relationships from the images.
  2. The cumulative loss from easy negatives overwhelms the total loss, which degenerates the model.
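Before moving to the fix, here is a minimal, hypothetical sketch (assumptions mine, not taken from [1]) of the kind of sampling heuristic two-stage detectors rely on: keep every positive proposal, and instead of using all negatives, keep only the hardest ones (highest loss) up to a fixed negative:positive ratio.

import numpy as np

def sample_hard_negatives(losses, labels, neg_pos_ratio=3):
    """losses: per-proposal loss values; labels: 1 = positive, 0 = negative."""
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]
    num_neg = min(len(neg_idx), neg_pos_ratio * max(len(pos_idx), 1))
    # Sort negatives by loss (descending) and keep only the hardest ones.
    hardest_neg = neg_idx[np.argsort(-losses[neg_idx])][:num_neg]
    return np.concatenate([pos_idx, hardest_neg])

# Toy usage: 42 positives vs. 5259 negatives collapses to 42 + 126 = 168 samples.
rng = np.random.default_rng(0)
labels = np.array([1] * 42 + [0] * 5259)
losses = rng.random(labels.shape[0])
print(sample_hard_negatives(losses, labels).shape[0])  # 168

One-stage detectors skip this filtering step, which is exactly why they need a loss function that handles the imbalance by itself.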

To solve the above problems, Facebook AI came up with a modified approach that adds a modulating weight to the cross-entropy loss. Let's now get straight into the topic this article is meant for. Brush up ;).

Focal Loss

What is focal loss in layman's terms?

Let’s consider a soccer match. You are watching your favorite rivalry, Real Madrid vs. Barcelona, on television. The stadium is packed with 70,000 people chanting for their respective teams. Suddenly Real Madrid scores, and the whole stadium bursts into chants of "Goaalll..!!". No one can hear anything but chants, chants, and chants. Three commentators are analyzing the match, and the broadcasters want viewers at home to hear that analysis. So what do they do? They give more importance to the commentary and less weight to the chants, so that people watching on television can follow the commentary and the live match analysis. Yes, you will still hear the chants, but the commentary is amplified more. Remember this analogy for now; we will see how it relates to Focal Loss.

Let’s devise the equations of Focal Loss step-by-step:

Eq. 1: CE(p, y) = -log(p) if y = 1, and -log(1 - p) otherwise (the standard binary cross-entropy, where p is the predicted probability of the positive class)
  • Rewriting the above loss function in a more compact form, we get:
Eq. 2: p_t = p if y = 1, and 1 - p otherwise
Eq. 3: CE(p, y) = CE(p_t) = -log(p_t)
  • A weighting term α is added to Eq. 3 to handle the class imbalance problem. It is a hyperparameter that can be set by cross-validation when using the CE loss. Analogous to p_t, α_t is a weighting term whose value is α for the positive (foreground) class and 1 - α for the negative (background) class.
Eq. 4: α_t = α if y = 1, and 1 - α otherwise
Eq. 5: CE(p_t) = -α_t · log(p_t)  (α-balanced cross-entropy)
  • Eq. 5 only handles and controls the weight of positive vs. negative samples; it does not take easy and hard samples into consideration. Focal Loss was therefore designed in such a way that it handles both conditions. Its two forms are given below.
Eq. 6 (non-alpha form): FL(p_t) = -(1 - p_t)^γ · log(p_t)
Eq. 7 (alpha form): FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t)
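To make Eqs. 1-7 concrete, here is a minimal NumPy transcription (a sketch of mine, using the natural log, applied element-wise over a batch of predictions; this is not the RetinaNet implementation from [5]).

import numpy as np

def cross_entropy(p, y, eps=1e-7):
    """Eq. 3: p is the predicted probability of the positive class, y is 0/1."""
    p = np.clip(p, eps, 1.0 - eps)                   # avoid log(0)
    p_t = np.where(y == 1, p, 1.0 - p)               # Eq. 2
    return -np.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Eq. 7 (alpha form)."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)               # Eq. 2
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # Eq. 4
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.1, 0.9, 0.1])   # predicted foreground probability
y = np.array([1, 1, 0, 0])           # easy pos, hard pos, hard neg, easy neg
print(cross_entropy(p, y))           # ~[0.105, 2.303, 2.303, 0.105]
print(focal_loss(p, y))              # the easy examples are scaled down the most

Dropping alpha_t recovers the non-alpha form of Eq. 6, and setting gamma = 0 falls back to the (alpha-balanced) cross-entropy of Eq. 5.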

Visualization and Results of Focal Loss

NOTE: In the case of class imbalance (BG >> FG):

— hard (misclassified) examples are false negatives (predicted as background, but the query box is foreground)

— easy (correctly classified) examples are true negatives (predicted as background, and the query box is background)

Now recall the analogy stated earlier and compare it with Focal Loss. All the hard (misclassified, false negative) samples are given more weight during learning than the easy (correctly classified, true negative) examples. So even when the imbalance problem is prevalent, this condition is taken care of by Eq. 8.

Eq. 8 (final Focal Loss, alpha form): FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t)
  • Given below are the graphs of the Cross-Entropy and Focal Loss (alpha form) for α = 0.25 and γ = 4, with the predicted probability in the range [0, 1].
Graph of Focal Loss (Eq. 8): y = 1 (left) and y = 0 (right)
Graph of Cross-Entropy Loss (Eq. 1): y = 1 (left) and y = 0 (right)
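For readers who want to regenerate graphs like these, here is a small matplotlib sketch (assumptions mine: natural log, the y = 1 case only) plotting cross-entropy against the alpha-form focal loss with α = 0.25 and γ = 4.

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 0.99, 200)          # predicted probability for y = 1
ce = -np.log(p)                           # Eq. 1 with y = 1
fl = -0.25 * (1 - p) ** 4 * np.log(p)     # Eq. 8 with y = 1, alpha = 0.25, gamma = 4

plt.plot(p, ce, label="Cross-Entropy")
plt.plot(p, fl, label="Focal Loss (alpha=0.25, gamma=4)")
plt.xlabel("predicted probability p (y = 1)")
plt.ylabel("loss")
plt.legend()
plt.show()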
  • As we can see from the graphs above, far less loss is propagated for easy examples. For a correctly classified example at p = 0.5, the cross-entropy loss is about 0.62, while the focal loss is only about 0.00034. Focal Loss flattens the curve for well-classified examples, weighing down the loss that gets backpropagated.
  • α and γ are hyperparameters that can be tuned for further calibration. In layman's terms, γ can be thought of as a relaxation parameter.
  • The higher the value of γ, the more importance is given to misclassified examples and the less loss is propagated from easy examples (the short sketch after the snippet below makes this concrete). According to the study in [5], γ = 2 gives the best results.
  • Numbers are always satisfying when justifying something, so let's play with some numbers and focal loss. The snippet below shows that easy examples are given far less weight than misclassified examples.
# Snippet comparing Focal and Cross-Entropy Loss to show the down-weighting of easy examples.

# Cross-Entropy Loss (log base 10 used throughout)
-log(x)        # For positives
-log(1 - x)    # For negatives

# Focal Loss (alpha form)
alpha = 0.25
gamma = 2
-alpha * (1 - x)^gamma * log(x)        # For positives
-(1 - alpha) * x^gamma * log(1 - x)    # For negatives

# Case 1 ::: Easy example (correctly classified, y = 1, x ~ 1). Foreground with x = 0.9
Cross-Entropy Loss = 0.046
Focal Loss = 0.000114

# Case 2 ::: Hard example (misclassified, y = 1, x ~ 0). Foreground with x = 0.1
Cross-Entropy Loss = 1
Focal Loss = 0.2025

# Loss ratio in Case 1 ::: CE/FL = 400
# Loss ratio in Case 2 ::: CE/FL = ~5
  • From the above snippet, the CE/FL ratio for an easy classification is about 400, while for a hard classification it is only about 5. The loss of the easy example is suppressed far more aggressively, which justifies the stated premise.
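If you would rather let the interpreter do the arithmetic, the short sketch below (mine) recomputes both cases with the base-10 log used in the snippet, and then sweeps γ to show how quickly an easy example gets damped.

import math

def ce(p):                                    # cross-entropy, positive example
    return -math.log10(p)

def fl(p, alpha=0.25, gamma=2):               # focal loss (alpha form), positive example
    return alpha * (1 - p) ** gamma * -math.log10(p)

for p in (0.9, 0.1):
    print(f"p={p}: CE={ce(p):.5f}  FL={fl(p):.6f}  CE/FL={ce(p) / fl(p):.0f}")
# p=0.9: CE=0.04576  FL=0.000114  CE/FL=400   (easy example, heavily damped)
# p=0.1: CE=1.00000  FL=0.202500  CE/FL=5     (hard example, barely damped)

# How the modulating factor for an easy example (p = 0.9) shrinks as gamma grows:
for gamma in (0, 1, 2, 5):
    print(f"gamma={gamma}: (1 - p)^gamma = {(1 - 0.9) ** gamma:.5f}")
# gamma=0 recovers plain (alpha-balanced) cross-entropy; at gamma=2 the easy
# example's loss is already scaled by a factor of 0.01.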

Conclusion

Focal loss is used to address the class imbalance problem. A modulating term applied to the Cross-Entropy loss function lets the network learn efficiently from hard examples even when easy background examples dominate, as is typical in one-stage object detectors.

If you have managed to reach this point, then I believe you are part of an elite group with a thorough enough understanding to get started on the captivating problem of class imbalance and focal loss.

Please feel free to share your thoughts and ideas in the comment section below.

If you think this article was helpful, please do share it; a clap (or a few) would hurt no one.

References

[1] Imbalance Problems in Object Detection: A Review, Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Akbas

[2] YOLOv3: An Incremental Improvement, Joseph Redmon, Ali Farhadi

[3] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

[4] Fast R-CNN, Ross Girshick

[5] Focal Loss for Dense Object Detection, Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar
