# The intuition behind RetinaNet

## The end goal of this blog post is to give readers an intuitive understanding of the inner workings of RetinaNet.

FAIR released two papers, in 2017 and 2018, on its state-of-the-art object detection frameworks. We will see how the various layers come together to form a robust object detection pipeline. As usual, I will take a math-first approach to explain everything. In short, we will discuss the two papers listed in the references at the end of this post: Feature Pyramid Networks and Focal Loss.

Note: This post does not include a full RetinaNet implementation.

#### The following topics will be discussed

- Anchor boxes
- How RPN works
- Problems with RPN
- Building Feature Pyramid networks
- Focal Loss
- RetinaNet for Object detection
- Training RetinaNet
- Inference on RetinaNet

#### Anchor boxes:

Anchor boxes were first introduced in the Faster R-CNN paper and later became a common element in the detectors that followed, such as YOLOv2, SSD and RetinaNet. Previously, selective search and edge boxes were used to generate region proposals of various sizes and shapes depending on the objects in the image. With standard convolutions it is impractical to generate region proposals of arbitrary shapes directly, so anchor boxes come to our rescue: a fixed set of reference boxes with predefined scales and aspect ratios is placed at every feature-map location, and the network predicts offsets relative to them.
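To make the idea concrete, here is a minimal NumPy sketch of anchor generation. The function names, the base size of 128, the three scales, the three ratios and the stride of 16 are assumptions chosen for illustration (they give 9 anchors per location on a 50*50 feature map); they are not taken from a specific implementation.

```python
import numpy as np

def make_anchors(base_size=128, ratios=(0.5, 1.0, 2.0), scales=(1.0, 2.0, 4.0)):
    """Return (len(ratios) * len(scales), 4) anchors as [x1, y1, x2, y2], centred on the origin."""
    anchors = []
    for ratio in ratios:              # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)  # keep the area fixed while changing the shape
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride):
    """Replicate the base anchors at the centre of every feature-map cell."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + anchors[None, :, :]).reshape(-1, 4)

base = make_anchors()                                              # 9 anchors per location
all_anchors = shift_anchors(base, feat_h=50, feat_w=50, stride=16)  # 50 * 50 * 9 anchors
```

Every location shares the same 9 shapes; only the centre changes, which is what lets a plain convolution predict offsets for all of them.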

The diagram below shows the valid anchor boxes on an image.

#### How RPN works

We attach two heads to the network above, one for regression and the other for classification. The design of each head is described below.

**Regression head:** The output of the Faster R-CNN backbone, as discussed above, is a 50*50 feature map. A conv layer [kernel 3*3] slides over this feature map. At each location it predicts 4 values [x1, y1, h1, w1] for each of the 9 anchor boxes. In total, the output layer has 50*50*9*4 regression outputs. In numpy this is usually represented as an array of shape (2500, 36).

**Classification head:** Similar to the regression head, this predicts the probability that an object is present at each location for each anchor box. In numpy this is represented as an array of shape (2500, 9).
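The shapes above can be checked with a small NumPy sketch. The arrays below are random stand-ins for the conv outputs of the two heads, not real predictions, and all variable names are hypothetical.

```python
import numpy as np

feat_h, feat_w, num_anchors = 50, 50, 9
rng = np.random.default_rng(0)

# Random stand-ins for the raw conv outputs of the two heads.
reg_map = rng.normal(size=(feat_h, feat_w, num_anchors * 4))
cls_map = rng.normal(size=(feat_h, feat_w, num_anchors))

# Flatten the spatial grid: one row per feature-map location.
reg_out = reg_map.reshape(-1, num_anchors * 4)                      # shape (2500, 36)
cls_out = 1.0 / (1.0 + np.exp(-cls_map.reshape(-1, num_anchors)))   # shape (2500, 9), sigmoid scores

# One row per anchor, which is the layout used when matching against ground truth.
boxes = reg_out.reshape(-1, 4)    # shape (22500, 4)
scores = cls_out.reshape(-1)      # shape (22500,)
```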

**Problems with RPN**

- The feature map produced after a lot of subsampling loses much of the fine, low-level detail, so the network struggles to detect small objects in the image. [Feature Pyramid Networks solve this]
- The loss function relies on sampling 128 positive and 128 negative anchors, because using all the anchors hampers training: the labels are highly imbalanced and there are many easily classified examples. [Focal loss solves this]

**How Feature Pyramid Networks work**

In RPN, we built anchor boxes using only the top, high-level feature map. Though convnets are robust to variance in scale, all the top entries in ImageNet and COCO have used multi-scale testing on featurized image pyramids. Imagine taking an 800*800 image and detecting bounding boxes on it. If you use image pyramids, you have to take the image at different sizes, say 256*256, 300*300, 500*500 and 800*800, compute feature maps for each of these images, and then apply non-maximum suppression over all the detected positive anchor boxes. This is a very costly operation and inference time becomes high.

The authors of the FPN paper observed that a deep convnet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. For example, take a ResNet architecture and, instead of using only the final feature map as in the RPN, take the feature maps before every pooling (subsampling) layer. Perform the same operations as for RPN on each of these feature maps and finally combine the results using non-maximum suppression. This is the crude way of building a feature pyramid network. But there is a problem with this approach: there are large semantic gaps caused by the different depths. The high-resolution maps (earlier layers) have low-level features that harm their representational capacity for object detection.

The goal of the authors is to naturally leverage the pyramidal shape of a convnet's feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, the authors relied on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections, as shown in the diagram below.

These blocks could be designed with more sophistication, but the authors found that this brings only marginal improvements, so they chose to keep the network simple.

The predictions are made on each level independently.
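The top-down pathway with lateral connections can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: `conv1x1` is a hypothetical stand-in with random weights, nearest-neighbour upsampling is used, each backbone map is assumed to be exactly half the size of the previous one, and the 3x3 smoothing conv that the paper applies to each merged map is omitted.

```python
import numpy as np

def conv1x1(x, out_channels, rng):
    """Hypothetical 1x1 conv with random weights: (H, W, C_in) -> (H, W, out_channels)."""
    w = rng.normal(scale=0.01, size=(x.shape[-1], out_channels))
    return x @ w

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 along height and width."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_top_down(c_maps, out_channels=256, seed=0):
    """c_maps: backbone feature maps ordered fine -> coarse. Returns merged maps in the same order."""
    rng = np.random.default_rng(seed)
    # Lateral connections bring every backbone map to the same channel count.
    laterals = [conv1x1(c, out_channels, rng) for c in c_maps]
    pyramid = [laterals[-1]]                      # start from the coarsest, most semantic map
    for lateral in reversed(laterals[:-1]):
        # Upsample the coarser merged map and add it to the finer lateral map.
        pyramid.insert(0, lateral + upsample2x(pyramid[0]))
    return pyramid

# Toy backbone outputs: resolution halves and channels double at each stage.
c3, c4, c5 = np.ones((8, 8, 16)), np.ones((4, 4, 32)), np.ones((2, 2, 64))
p3, p4, p5 = build_top_down([c3, c4, c5], out_channels=8)
```

After merging, every level has the same number of channels, so the same prediction heads can be applied to each level independently.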

**Important points while designing anchor boxes:**

- Since the pyramid levels are of different scales, there is no need for multi-scale anchors on a specific level. We define the anchors to have sizes of [32, 64, 128, 256, 512] on P3, P4, P5, P6, P7 respectively, and use multiple aspect ratios [1:1, 1:2, 2:1] at each level. So in total there are 15 anchors over the pyramid at each location.
- Anchor boxes that fall outside the image boundaries are ignored.
- An anchor is labelled positive if it has the highest IoU with a ground-truth box or if its IoU with any ground-truth box is greater than 0.7. It is labelled negative if its IoU is less than 0.3 for all ground-truth boxes.
- The scales of the ground-truth boxes are not used to assign them to levels of the pyramid. Instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. This statement is very important to understand. I had two options in mind here: either assign ground-truth boxes to each level separately, or compute all the anchor boxes across levels and then label each anchor by IoU (highest IoU with a ground-truth box, or IoU greater than 0.7). In the end I chose the second option to assign labels.
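The labelling rule in the list above can be sketched as follows. This is a minimal NumPy illustration, assuming boxes in [x1, y1, x2, y2] format and anchors from all pyramid levels already concatenated into one array; the function names and the -1 "ignore" label are hypothetical conventions.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)                  # best overlap of each anchor with any box
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thr] = 0
    labels[max_iou > pos_thr] = 1
    labels[iou.argmax(axis=0)] = 1             # the best anchor for each ground-truth box
    return labels

anchors = np.array([[0, 0, 10, 10], [0, 0, 10, 5], [100, 100, 110, 110], [0, 0, 10, 8]], dtype=float)
gt_boxes = np.array([[0, 0, 10, 10]], dtype=float)
labels = label_anchors(anchors, gt_boxes)
```

Anchors whose best IoU falls between the two thresholds keep the label -1 and contribute nothing to the loss.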

**Implementation details on COCO:**

- The input image is resized such that its shorter side has 800 pixels. A mini-batch involves 2 images per GPU and 256 anchors per image.
- A weight decay of 0.0001 and a momentum of 0.9 are used. The learning rate is 0.02 for the first 30k iterations and 0.002 for the next 10k iterations.
- Training RPN with FPN on 8 GPUs takes 8 hours on COCO.

The authors also ran many ablation studies to justify their FPN design choices.

### Focal Loss

One-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors because of extreme class imbalance encountered during training.

Focal loss is the reshaping of cross entropy loss such that it down-weights the loss assigned to well-classified examples. The novel focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.

Let's look at how focal loss is designed. We will first look at the binary cross entropy loss for single-object classification.

CE Loss
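As a reference point, the standard binary cross entropy can be written as below. The symbols are assumptions introduced here for illustration: $p$ is the predicted probability of the foreground class and $y \in \{0, 1\}$ is the ground-truth label.

$$
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
$$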

For high class imbalance, we add a weighting parameter. Usually this is the inverse class frequency, or it is treated as a hyper-parameter set by cross-validation. Here we will call it alpha, the **balancing param**.
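One common way to write the alpha-balanced cross entropy, following the notation of the focal loss paper, first folds the label into the probability and the weight. Here $p$ is the predicted foreground probability and $y$ the label, both symbols introduced for illustration.

$$
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
\qquad
\alpha_t =
\begin{cases}
\alpha & \text{if } y = 1 \\
1 - \alpha & \text{otherwise}
\end{cases}
$$

$$
\mathrm{CE}(p_t) = -\alpha_t \log(p_t)
$$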

As mentioned in the paper, easily classified negatives comprise the majority of the loss and dominate the gradient. While alpha balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. So the authors reshaped the cross entropy function and came up with focal loss.
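In the paper's notation, the alpha-balanced focal loss reads as follows, where $p_t$ is the predicted probability of the true class (the foreground probability for a foreground example, one minus it for a background example) and $\alpha_t$ is $\alpha$ for foreground and $1 - \alpha$ for background:

$$
\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)
$$

Setting $\gamma = 0$ recovers the alpha-balanced cross entropy, and the factor $(1 - p_t)^{\gamma}$ shrinks towards zero as an example becomes well classified.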

Here gamma is called the focusing param and alpha is called the balancing param.

Let's build an intuition for how this helps us:

#### Scenario-1: Easy correctly classified example

Say we have an easily classified foreground object with p=0.9. The usual cross entropy loss for this example is

CE(foreground) = -log(0.9) = 0.1053

Now consider an easily classified background object with p=0.1. The usual cross entropy loss for this example is the same

CE(background) = -log(1 - 0.1) = 0.1053

Now consider focal loss for both cases. We will use alpha=0.25 and gamma=2. For simplicity the same weight alpha is applied to both cases here; in the paper, background examples are weighted by 1 - alpha instead.

FL(foreground) = -1 x 0.25 x (1 - 0.9)**2 x log(0.9) = 0.00026

FL(background) = -1 x 0.25 x (1 - (1 - 0.1))**2 x log(1 - 0.1) = 0.00026

#### Scenario-2: Misclassified example

Say we have a misclassified foreground object with p=0.1. The usual cross entropy loss for this example is

CE(foreground) = -log(0.1) = 2.3026

Now consider a misclassified background object with p=0.9. The usual cross entropy loss for this example is the same

CE(background) = -log(1 - 0.9) = 2.3026

Now consider focal loss for both cases. We will use alpha=0.25 and gamma=2

FL(foreground) = -1 x 0.25 x (1 - 0.1)**2 x log(0.1) = 0.4663

FL(background) = -1 x 0.25 x (1 - (1 - 0.9))**2 x log(1 - 0.9) = 0.4663

#### Scenario-3: Very easily classified example

Say we have a very easily classified foreground object with p=0.99. The usual cross entropy loss for this example is

CE(foreground) = -log(0.99) = 0.01

Now consider a very easily classified background object with p=0.01. The usual cross entropy loss for this example is the same

CE(background) = -log(1 - 0.01) = 0.01

Now consider focal loss for both cases. We will use alpha=0.25 and gamma=2

FL(foreground) = -1 x 0.25 x (1 - 0.99)**2 x log(0.99) = 2.5*10^(-7)

FL(background) = -1 x 0.25 x (1 - (1 - 0.01))**2 x log(1 - 0.01) = 2.5*10^(-7)

**Conclusion:**

scenario-1: 0.1053/0.000263 ≈ 400 times smaller (exactly 1/(0.25 x 0.1**2) = 400)

scenario-2: 2.3/0.466 ≈ 5 times smaller

scenario-3: 0.01/0.00000025 = 40,000 times smaller

These three scenarios clearly show that focal loss gives very little weight to well-classified examples and a comparatively large weight to misclassified or hard examples.
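The arithmetic in the three scenarios can be reproduced with a short script. This is a sketch that follows the post's simplification of applying the same alpha = 0.25 to foreground and background; `p_t` is a name introduced here for the predicted probability of the true class (p for foreground, 1 - p for background).

```python
import math

def cross_entropy(p_t):
    """Cross entropy for an example whose true class has predicted probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss with one shared alpha, as used in the scenarios of this post."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

for name, p_t in [("scenario-1", 0.9), ("scenario-2", 0.1), ("scenario-3", 0.99)]:
    ce, fl = cross_entropy(p_t), focal_loss(p_t)
    print(f"{name}: CE={ce:.4f}  FL={fl:.3g}  CE/FL={ce / fl:.1f}")
```

The ratio CE/FL simplifies to 1 / (alpha * (1 - p_t)**gamma), so it depends only on how confident the prediction is, not on the log term.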

This is the basic intuition behind the design of focal loss. The authors tested different values of alpha and gamma and finally settled on the values mentioned above.

**Important points to note:**

- When training for object detection, the focal loss is applied to all ~100k anchors in each sampled image.
- The total focal loss of an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box.
- gamma = 2 and alpha = 0.25 work best, and in general alpha should be decreased slightly as gamma is increased.
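The normalization described in the list above can be sketched as follows. This is a single-class simplification with hypothetical names: `probs` holds one foreground probability per anchor and `labels` is 1 for anchors assigned to a ground-truth box and 0 for background. Unlike the hand-worked scenarios, it uses the paper's weighting of alpha for foreground and 1 - alpha for background.

```python
import numpy as np

def image_focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Sum of focal loss over all anchors, divided by the number of positive anchors."""
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)            # avoid log(0)
    p_t = np.where(labels == 1, probs, 1.0 - probs)      # probability of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    num_pos = max(int((labels == 1).sum()), 1)           # guard against images with no positives
    return loss.sum() / num_pos

probs = np.array([0.9, 0.1])
labels = np.array([1, 0])
total = image_focal_loss(probs, labels)
```

Dividing by the number of positive anchors rather than by all anchors keeps the loss scale stable, since the many easy negatives contribute almost nothing to the sum.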

### RetinaNet for object detection

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a conv feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs classification on the backbone's output; the second subnet performs convolutional bounding box regression.

**Backbone:** A Feature Pyramid Network built on top of ResNet50 or ResNet101. However, you can use any classifier of your choice; just follow the instructions given in the FPN section when designing the network.

**Classification subnet:** It predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. Taking an input feature map with C channels from a pyramid level, the subnet applies four 3x3 conv layers, each with C filters and each followed by ReLU activations, and then a final 3x3 conv layer with K*A filters. Finally, sigmoid activations are attached to the outputs. Focal loss is applied as the loss function.

**Box regression subnet:** Similar to the classification subnet, but the parameters are not shared. It outputs the object location relative to the anchor box if an object exists. smooth_l1_loss with sigma equal to 3 is applied as the loss function for this part of the network.
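The post does not spell out what sigma means in smooth_l1_loss. One widely used convention, which is an assumption here rather than something the post confirms, switches from a quadratic to a linear penalty at |x| = 1/sigma**2. A minimal NumPy sketch under that assumption:

```python
import numpy as np

def smooth_l1(pred, target, sigma=3.0):
    """Element-wise smooth L1 loss; quadratic below 1 / sigma**2, linear above it (assumed convention)."""
    sigma_sq = sigma ** 2
    diff = np.abs(pred - target)
    return np.where(
        diff < 1.0 / sigma_sq,
        0.5 * sigma_sq * diff ** 2,     # small errors: quadratic
        diff - 0.5 / sigma_sq,          # large errors: linear, so outliers do not dominate
    )

errors = smooth_l1(np.array([0.0, 0.05, 1.0]), np.zeros(3))
```

With sigma = 3 the quadratic zone is narrow (|x| < 1/9), so the loss behaves almost like plain L1 for all but very small regression errors.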

#### Training RetinaNet

- Initialization of the network is very important. The authors assume a prior probability of 0.01 for all the anchor boxes and encode it in the bias of the last conv layer of the classification subnet. I overlooked this when implementing the net; the loss blows up if you don’t take care of it. The intuition behind this prior is that the ratio of foreground (all positive anchor boxes) to background (all negative anchor boxes) in an image is roughly 1000/100000 = 0.01.
- A weight decay of 0.0001 and a momentum of 0.9 are used, with an initial learning rate of 0.01 for the first 60k iterations. The learning rate is then divided by 10 at 60k iterations and again at 80k iterations.
- Achieved an mAP of 40.8 using a ResNeXt-101-FPN backbone on the MS COCO dataset.
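The prior-probability initialization from the first bullet can be made concrete. In the focal loss paper the bias of the final classification conv layer is set to b = -log((1 - pi) / pi), so that the sigmoid output starts at pi for every anchor. The helper names below are hypothetical; this is only a sketch of the arithmetic.

```python
import math

def prior_bias(pi=0.01):
    """Bias value for which sigmoid(bias) equals the prior probability pi."""
    return -math.log((1.0 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

b = prior_bias(0.01)          # about -4.6
initial_score = sigmoid(b)    # every anchor starts at roughly 0.01 foreground probability
```

With a zero bias every anchor would start at 0.5, and the roughly 100k background anchors would each contribute a large loss in the first iterations, which is why training becomes unstable without this step.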

#### Inference on RetinaNet

- To improve speed, box predictions are decoded from at most the 1k top-scoring predictions per FPN level, after thresholding the detector confidence at 0.05.
- The top predictions from all levels are merged and non-maximum suppression with a threshold of 0.5 is applied to yield the final decisions.
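The two inference steps above can be sketched in NumPy. This is a simplified, single-class illustration with hypothetical function names; it assumes the boxes have already been decoded into [x1, y1, x2, y2] image coordinates, whereas a real pipeline would decode them from the regression outputs after the top-k selection.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes, highest score first."""
    order = np.argsort(-scores)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]    # drop boxes that overlap the kept box too much
    return keep

def select_level(boxes, scores, score_thr=0.05, top_k=1000):
    """Keep at most top_k predictions of one pyramid level with score above score_thr."""
    idx = np.flatnonzero(scores > score_thr)
    idx = idx[np.argsort(-scores[idx])[:top_k]]
    return boxes[idx], scores[idx]

def detect(levels, iou_thr=0.5):
    """levels: list of (boxes, scores) pairs, one per pyramid level."""
    picked = [select_level(b, s) for b, s in levels]
    boxes = np.concatenate([b for b, _ in picked])
    scores = np.concatenate([s for _, s in picked])
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]

levels = [
    (np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float), np.array([0.9, 0.8, 0.01])),
    (np.array([[100, 100, 120, 120]], dtype=float), np.array([0.6])),
]
final_boxes, final_scores = detect(levels)
```

In the toy input, the third box of the first level is removed by the 0.05 threshold and the second box is suppressed because it overlaps the higher-scoring first box.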

#### References:

**[1708.02002] Focal Loss for Dense Object Detection** (arxiv.org)

**[1612.03144] Feature Pyramid Networks for Object Detection** (arxiv.org)


A big thanks to Fractal Analytics and my colleagues here. Special thanks to Soumendra and Prato_s [The AIJournal Team].