Focal Loss Demystified

Divakar Kapil
Escapades in Machine Learning
Aug 10, 2018 · 7 min read

I came across the term ‘Focal Loss’ while reading about the iterations of the object detector YOLO. At first I brushed past it, treating it as just another loss function extending the cross-entropy loss. However, when I tried to study it in more depth, I found a shortage of articles explaining why it is needed and how it works. So, the purpose of this post is to consolidate all the knowledge I was able to gather about this concept in one place. I will start by explaining why it is needed, followed by how it works. I hope this post will serve as a good starting point for those interested in understanding ‘Focal Loss’.

The Problem

Object detectors can be broadly classified into two categories: two-stage and one-stage. One-stage detectors have a single convolutional network responsible for detection. They usually divide the input image into grid cells which are then used to generate bounding boxes. This is unlike two-stage detectors, which use region proposal networks (the RPN in Faster R-CNN). The problem with grid cells is that they create far more box proposals than proposal networks do, since each grid cell is responsible for producing a specified number of boxes; for example, YOLOv2 predicts 5 boxes per cell.

The number of objects to be detected in an image is usually small. So, the more box proposals created for an image, the more of them will contain background rather than objects. For example, say 1000 boxes are created by the detector and the number of objects to be detected is only 3. Out of the 1000 boxes only 3 will contain objects and the remaining 997 will contain background. This leads to a data imbalance problem, as too many background examples bias the classifier to emphasize the background in order to minimise the loss. This means that in the overall loss the actual objects (the things that need to be detected) will be down-weighted. This is the primary reason why one-stage detectors are unable to achieve accuracy as high as two-stage detectors.

Hence, data imbalance causes the following problems:

  1. Training is inefficient, as most locations are easy negatives (background examples that the classifier already classifies correctly), so they contribute no useful learning signal.
  2. The easy negatives overwhelm the training, which leads to degenerate models as they favor the majority class (which in this case is the background).

Solution

The data imbalance problem can be solved in three ways:

  1. Downsample the dominant cases
  2. Upsample the minority cases
  3. Change the weights in the loss function

The first two points ensure that the numbers of majority and minority cases are balanced in a way that doesn’t bias the classifier. These are the usual methods employed to deal with the problem, since the third point means defining a new loss function. ‘Focal Loss’ is essentially the third way of dealing with the problem. Before we dive into the details of ‘Focal Loss’, let’s understand how two-stage detectors are able to avoid the data imbalance problem.

The first stage in a two-stage detector is responsible for creating a sparse set of regions of interest. It narrows down the number of regions that need to be classified by the classifier. Unlike the grid-cell concept used in one-stage detectors, this produces a considerably smaller number of regions of interest (box proposals). The second stage runs the classifier on each region proposed by the first stage. It also maintains a 1:3 foreground-to-background sampling ratio to maintain data balance. So, the two stages ensure that the background examples never dominate the foreground ones.
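As a rough sketch of this second-stage sampling step (the function name and proposal format here are hypothetical illustrations, not taken from any particular detector's code), a 1:3 foreground-to-background ratio can be enforced like this:

```python
import random

def sample_minibatch(proposals, batch_size=256, fg_fraction=0.25):
    """Sample proposals at roughly a 1:3 foreground:background ratio.

    `proposals` is a list of (box, label) pairs where label == 1 marks
    foreground and label == 0 marks background.
    """
    fg = [p for p in proposals if p[1] == 1]
    bg = [p for p in proposals if p[1] == 0]
    n_fg = min(len(fg), int(batch_size * fg_fraction))  # at most 25% foreground
    n_bg = min(len(bg), batch_size - n_fg)              # fill the rest with background
    return random.sample(fg, n_fg) + random.sample(bg, n_bg)

# 100 foreground boxes among 900 background proposals
proposals = [("box", 1)] * 100 + [("box", 0)] * 900
batch = sample_minibatch(proposals)
n_fg = sum(1 for _, y in batch if y == 1)
n_bg = sum(1 for _, y in batch if y == 0)
print(n_bg / n_fg)  # → 3.0
```

When foreground boxes are scarce, `n_fg` shrinks and the background still fills the batch, which is why sampling alone is not a complete fix for extreme imbalance.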

Focal Loss

This is a new loss function created specifically to deal with the data imbalance problem of one-stage detectors. It improves training on the imbalanced data created by the grid cells. It reshapes the cross-entropy loss in such a way that it down-weights the loss assigned to well-classified examples. This is because the number of background examples is so high that the classifier will almost always classify the background correctly, so the loss calculated on these examples should be reduced.

Most machine learning models, like decision trees, logistic regression etc., tend to predict only the majority class; the features of the minority class are treated as noise and ignored. The unmodified cross-entropy loss does the same: it treats the background examples as the main class to be predicted, treats the objects as noise, and hence pays less attention to the loss computed on them.

Focal loss is a dynamically scaled cross-entropy loss, where the scaling factor automatically decays to 0 as the confidence in the correct class increases [1].

Working of the Function

The cross entropy loss is defined as :

CE(p, y) = −log(p) if y = 1, and −log(1 − p) otherwise

Fig 1: Binary Cross Entropy Loss [1]

In the above formula, ‘y’ represents the ground truth, which takes the value 1 for the positive class and −1 for the negative class, and ‘p’ is the model’s estimated probability for the class y = 1. Even examples that are easily classified (p ≫ 0.5) incur a loss of non-trivial magnitude under this function. Although the loss on each such example seems insignificant, it can completely overwhelm the minority class when summed over a large number of easy examples.

For better notation we will use:

pt = p if y = 1, and pt = 1 − p otherwise

Fig 2: Better Notation [1]

Thus,

CE(p, y) = CE(pt) = −log(pt)

Fig 3: Cross Entropy Loss, alternate definition [1]
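In code, the cross-entropy loss and the pt convention can be sketched as follows (plain Python, my own illustration rather than the paper's implementation):

```python
import math

def p_t(p, y):
    """pt = p for the positive class (y = +1), 1 - p otherwise."""
    return p if y == 1 else 1.0 - p

def cross_entropy(p, y):
    """CE(p, y) = -log(pt)."""
    return -math.log(p_t(p, y))

# A well-classified positive (p = 0.9) still incurs a non-trivial loss...
print(round(cross_entropy(0.9, 1), 4))   # → 0.1054
# ...while a misclassified positive (p = 0.1) is penalised much more.
print(round(cross_entropy(0.1, 1), 4))   # → 2.3026
```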

Let’s consider the case I mentioned earlier. Say the loss assigned to every well-classified (easy) example is 1 and to every misclassified example is 10. Say all 997 boxes containing background are well classified: their total loss is 997 × 1 = 997. Now assume that all 3 object boxes are misclassified: their total loss is 3 × 10 = 30. Clearly, the easy (background) examples overwhelm the minority case. This is not desirable for the training of the detector.
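The same point holds with actual cross-entropy values rather than the made-up costs of 1 and 10 (the pt values of 0.95 and 0.1 below are my own illustrative choices):

```python
import math

ce = lambda pt: -math.log(pt)  # CE(pt) = -log(pt)

# 997 easy background boxes classified confidently (pt = 0.95),
# 3 object boxes badly misclassified (pt = 0.1)
easy_total = 997 * ce(0.95)
hard_total = 3 * ce(0.10)
print(round(easy_total, 1), round(hard_total, 1))  # → 51.1 6.9
```

Even though each easy example contributes only ≈0.05 to the loss, the 997 of them together dwarf the 3 hard examples.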

The graph below shows the loss assigned by various loss functions across probabilities of the ground-truth class. It is evident that when γ = 0 we recover CE(pt), which still assigns a small, non-zero loss to well-classified examples.

Fig 4: Loss assigned to classified examples [1]

A common method of dealing with the data imbalance problem in the cross-entropy loss is to introduce a weighting factor ‘α’, which takes values in the range 0 to 1. For notational convenience, ‘αt’ is defined analogously to ‘pt’. Hence, the augmented cross-entropy function is defined as follows:

CE(pt) = −αt log(pt)

Fig 5: Augmented Cross Entropy Loss [1]

‘α’ balances the importance of positive and negative examples, but it doesn’t differentiate between easy and hard examples. Positive examples are the ones assigned ground-truth label 1, and negative examples are the ones assigned ground-truth label −1. Easy examples are the ones the classifier labels correctly; the mislabelled examples are called hard examples.

Focal loss down-weights the easy examples and lays emphasis on the hard examples. It adds a modulating factor (1 − pt)^γ to the cross-entropy loss. By using the predicted probability ‘pt’, the loss function is able to distinguish the hard and easy examples, which the augmented cross entropy was unable to do. Focal Loss is defined as:

FL(pt) = −(1 − pt)^γ log(pt)

Fig 6: Focal Loss [1]

‘γ’ is called the focusing parameter. It is a tunable hyperparameter rather than a learned weight; the paper found γ = 2 to work best in practice [1]. The purpose of the focusing parameter is to lessen the contribution of the easy examples. It does so in the following ways:

  1. When an example is misclassified and ‘pt’ is small, the modulating factor (1 − pt)^γ is near 1 and the loss is unaffected
  2. As ‘pt’ approaches 1, the modulating factor (1 − pt)^γ approaches 0 and the loss of well-classified examples is down-weighted
  3. ‘γ’ smoothly adjusts the rate at which the easy examples are down-weighted. For example, when ‘γ’ = 0 then focal loss becomes normal cross entropy loss
  4. In practice, the following α-balanced hybrid is used:

FL(pt) = −αt (1 − pt)^γ log(pt)

Fig 7: Augmented Focal Loss [1]
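Putting the pieces together, here is a minimal sketch of the α-balanced focal loss (my own illustration, using the paper's suggested γ = 2 and α = 0.25 [1]):

```python
import math

def cross_entropy(pt):
    """Plain cross entropy: CE(pt) = -log(pt)."""
    return -math.log(pt)

def focal_loss(pt, gamma=2.0, alpha_t=0.25):
    """Augmented focal loss: FL(pt) = -alpha_t * (1 - pt)**gamma * log(pt)."""
    return -alpha_t * (1.0 - pt) ** gamma * math.log(pt)

# Compare CE and FL for an easy example (pt = 0.9), a borderline one
# (pt = 0.5) and a hard one (pt = 0.1).
for pt in (0.9, 0.5, 0.1):
    print(pt, round(cross_entropy(pt), 4), round(focal_loss(pt), 4))
# → 0.9 0.1054 0.0003
# → 0.5 0.6931 0.0433
# → 0.1 2.3026 0.4663
```

Note how the easy example's loss is scaled down by a factor of roughly 400 (the (1 − 0.9)² = 0.01 modulating factor times α = 0.25), while the hard example keeps about a fifth of its cross-entropy loss, so hard examples dominate the gradient.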

Side Note

Another way of dealing with the data imbalance problem is to alter the model initialisation. Every model has weights, which control the steepness (slope) of the function, and a bias, which is needed to learn horizontal shifts. A model that has to deal with imbalanced data is initialised with a pre-defined bias called the “prior” for the rare class (foreground) at the start of training.

This means that at the start every box is labelled as foreground with a probability equal to the prior; for example, every box is labelled as foreground with a confidence of, say, prior = 0.01. This prevents the large number of background instances from dominating and destabilising the network in the early iterations.
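A small sketch of that bias computation (the formula b = −log((1 − π)/π) comes from [1]; the code itself is my own illustration):

```python
import math

pi = 0.01  # the "prior" probability assigned to the rare foreground class

# Initialise the final classifier bias so that sigmoid(b) = pi at the
# start of training: b = -log((1 - pi) / pi).
b = -math.log((1.0 - pi) / pi)

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
print(round(sigmoid(b), 4))  # → 0.01
```

With this initialisation, every location starts out predicting foreground with probability 0.01, so the loss from the huge number of background anchors stays bounded in the first iterations.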

This concludes the post on Focal Loss.

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References

[1] Lin et al., “Focal Loss for Dense Object Detection”, https://arxiv.org/pdf/1708.02002.pdf
