Digging into Detectron 2 — part 5

ROI (Box) Head

Hiroto Honda
8 min read · Jun 4, 2020
Figure 1. Inference result of Faster (Base) R-CNN with Feature Pyramid Network.

Hi I’m Hiroto Honda, a computer vision researcher¹. [homepage] [linkedin]

In this article I would like to share my learnings about Detectron 2 — repo structure, building and training a network, handling a data set and so on.

Detectron 2 ² is a next-generation open-source object detection system from Facebook AI Research.

In part 1, part 2, part 3 and part 4, we have seen the overview of the Base-RCNN-FPN, feature pyramid network, ground truth preparation and region proposal network, respectively.
This time, we go deep into the final part of the pipeline: the ROI (Box) Head³ (see Fig. 2).

Figure 2. Detailed architecture of Base-RCNN-FPN. Blue labels represent class names.

At the ROI (Box) Head, we take a) feature maps from FPN, b) proposal boxes, and c) ground truth boxes as input.

a) feature maps from FPN

As we have seen in part 2, the output feature maps from FPN are:

output["p2"].shape -> torch.Size([1, 256, 200, 320]) # stride = 4
output["p3"].shape -> torch.Size([1, 256, 100, 160]) # stride = 8
output["p4"].shape -> torch.Size([1, 256, 50, 80])   # stride = 16
output["p5"].shape -> torch.Size([1, 256, 25, 40])   # stride = 32
output["p6"].shape -> torch.Size([1, 256, 13, 20])   # stride = 64

Each tensor size stands for (batch, channels, height, width). We use the feature dimensions above throughout the blog series. The P2-P5 features are fed to the box head and P6 is not used.

b) proposal boxes are included in the output instances from RPN (see Part 4), which contain 1,000 ‘proposal_boxes’ and 1,000 ‘objectness_logits’. In the ROI Heads, only the proposal boxes are used, to crop the feature maps and handle the ROIs; the objectness_logits are not used.

{'proposal_boxes':
 Boxes(tensor([[675.1985, 469.0636, 936.3209, 695.8753],
               [301.7026, 513.4204, 324.4264, 572.4883],
               [314.1965, 448.9897, 381.7842, 491.7808],
               ...,
 'objectness_logits':
 tensor([9.1980, 8.0897, 8.0897, ...])
}

c) ground truth boxes have been loaded from the dataset (see Part 3):

'gt_boxes': Boxes(tensor([
[100.55, 180.24, 114.63, 103.01],
[180.58, 162.66, 204.78, 180.95]
])),
'gt_classes': tensor([9, 9])

Fig. 3 shows the detailed schematic of the ROI Heads. All the computation is performed on GPU in Detectron 2.

Figure 3. Schematic of ROI Heads. Blue and red labels represent class names and chapter titles respectively.

1. Proposal Box Sampling

(only during training)

In RPN, we have obtained 1,000 proposal boxes from the five levels of FPN features (P2 to P6).

The proposal boxes are used to crop the regions of interest (ROIs) from the feature maps, which are fed to the Box Head. To accelerate the training, ground-truth boxes are added to the predicted proposals. For example, if the image has two ground truth boxes, the total number of proposals will be 1002.

During training, the foreground and background proposal boxes are first re-sampled to balance the training objective.

The proposals whose IoU with a ground-truth box is higher than the threshold are labeled as foreground and the others as background, using the Matcher (see Fig. 4). Note that unlike in RPN, there are no ‘ignored’ boxes in the ROI Heads. The added ground-truth boxes perfectly match themselves and are therefore counted as foreground.

Figure 4. Matcher determines assignment of anchors to ground-truth boxes. The table shows the IoU matrix whose shape is (number of GT boxes, number of anchors).

Next, we balance the numbers of foreground and background boxes. Let N be the target number of (foreground + background) boxes and F be the target number of foreground boxes. N and F / N are defined by the following config parameters. The boxes are sampled as shown in Fig. 5 so that the number of foreground boxes does not exceed F.

N: MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE (typically 512)
F/N: MODEL.ROI_HEADS.POSITIVE_FRACTION (typically 0.25)
Figure 5. Re-sampling the foreground and background proposal boxes.
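The matching and re-sampling steps above can be sketched in a few lines. This is a simplified stand-in for the logic in modeling/sampling.py, not the actual implementation; sample_proposals and its arguments are hypothetical names.

```python
import torch

def sample_proposals(best_ious, iou_thresh=0.5, num_samples=512, positive_fraction=0.25):
    """Simplified sketch: label proposals fg/bg by IoU, then sample a balanced subset."""
    # A proposal is foreground if its best IoU with any ground-truth box exceeds the threshold.
    is_fg = best_ious >= iou_thresh
    fg_idx = torch.nonzero(is_fg).flatten()
    bg_idx = torch.nonzero(~is_fg).flatten()
    # Cap the foreground count at F = N * positive_fraction (128 for the typical config).
    num_fg = min(int(num_samples * positive_fraction), fg_idx.numel())
    num_bg = min(num_samples - num_fg, bg_idx.numel())
    fg_sample = fg_idx[torch.randperm(fg_idx.numel())[:num_fg]]
    bg_sample = bg_idx[torch.randperm(bg_idx.numel())[:num_bg]]
    return fg_sample, bg_sample

# 1,002 proposals (1,000 from RPN + 2 ground-truth boxes), 30 of which match a GT box well
best_ious = torch.cat([torch.full((30,), 0.8), torch.full((972,), 0.1)])
fg, bg = sample_proposals(best_ious)
# all 30 foregrounds are kept (30 < 128) and 482 backgrounds are sampled, 512 in total
```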

2. ROI Pooling

The ROI pooling process crops (or pools) the rectangle regions of the feature maps that are specified by the proposal boxes.

level assignment

Assume that we have two proposal boxes (gray and blue rectangles in Fig. 6) and the feature maps P2 to P5.

Which feature map should each box crop an ROI from? If you assign the small gray box to the P5 feature, only one or two feature pixels would be contained inside the box, which is not informative.

There is a rule to assign a proposal box to the appropriate feature map:

assigned feature level: floor(4 + log2(sqrt(box_area) / 224))

where 224 is the canonical box size. For example, if the size of the proposal box is 224×224, it is assigned to the 4th level (P4).
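The rule can be checked with a short sketch (assign_level is a hypothetical helper written for illustration; the clamping to the available levels P2–P5 mirrors what the real function does):

```python
import math

def assign_level(box_area, canonical_size=224, canonical_level=4,
                 min_level=2, max_level=5):
    # floor(4 + log2(sqrt(box_area) / 224)), clamped to the available levels P2-P5
    level = math.floor(canonical_level + math.log2(math.sqrt(box_area) / canonical_size))
    return max(min_level, min(max_level, level))

assign_level(224 * 224)  # -> 4: a canonical-size box goes to P4
assign_level(64 * 64)    # -> 2: floor(4 + log2(64 / 224)) = floor(2.19) = 2, i.e. P2
assign_level(900 * 900)  # -> 5: floor(4 + log2(900 / 224)) = 6, clamped to P5
```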

In case of Fig. 6, the gray box is assigned to the P2 level and the blue one to the P5. The level assignment is carried out at the assign_boxes_to_levels function.

Figure 6. feature level assignment of proposal boxes for ROI pooling.

ROIAlignV2

In order to accurately crop the ROI by the proposal boxes which have floating-point coordinates, a method called ROIAlign has been proposed in the Mask R-CNN paper⁴. In Detectron 2, the default pooling method is called ROIAlignV2, which is the slightly modified version of ROIAlign.

In Fig. 7, both ROIAlignV2 and ROIAlign are depicted. Each large rectangle represents one bin (or pixel) of an ROI. To pool the feature value inside the rectangle, four sampling points are placed to interpolate the four neighboring pixel values. The final bin value is calculated by averaging the four sampling point values. The difference between ROIAlignV2 and ROIAlign is simple: the half-pixel offset (0.5) is subtracted from the ROI coordinates to compute neighboring pixel indices more accurately. Please look at Fig. 7 for the details.

Figure 7. ROIAlignv2. Compared with ROIAlign(v1), the half-pixel offset (0.5) is subtracted from ROI coordinates to compute neighboring pixel indices more accurately. ROIAlignV2 employs the pixel model in which pixel coordinates represent the centers of pixels.

Now the ROIs are cropped from the corresponding levels (P2-P5) by ROIAlignV2. The resulting tensor has the size of:

[B, C, H, W] = [N × batch size, 256, 7, 7]

where B, C, H and W stand for the number of ROIs across the batch, channel number, height and width respectively. By default the number of ROIs per image N is 512 and the ROI size is 7×7. The tensor is the collection of cropped instance features which include balanced foreground and background ROIs.

3. Box Head

After ROI Pooling, the cropped features are fed to the head networks. As for Mask R-CNN⁴, there are two types of heads: the box head and the mask head. However Base R-CNN FPN only has the box head called FastRCNNConvFCHead which classifies the object within the ROI and fine-tunes the box position and shape.

The layers of the Box Head by default are as follows:

(box_head): FastRCNNConvFCHead(
  (fc1): Linear(in_features=12544, out_features=1024, bias=True)
  (fc2): Linear(in_features=1024, out_features=1024, bias=True)
)
(box_predictor): FastRCNNOutputLayers(
  (cls_score): Linear(in_features=1024, out_features=81, bias=True)
  (bbox_pred): Linear(in_features=1024, out_features=320, bias=True)
)
As you can see, no convolution layers are included in the head.

The input tensor whose size is [B, 256, 7, 7] is flattened to [B, 256×7×7 = 12,544 channels] to be fed to the fully-connected (FC) layer 1 (fc1).

After two FC layers the tensor gets to the final box_predictor layers: cls_score (Linear) and bbox_pred (Linear).
The output tensors from the final layers are:

cls_score -> scores             # shape:(B, 80+1)
bbox_pred -> prediction_deltas # shape:(B, 80×4)
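The whole head can be sketched with plain PyTorch modules. This is a simplified stand-in for FastRCNNConvFCHead and FastRCNNOutputLayers, not the actual classes:

```python
import torch
from torch import nn

num_classes = 80  # COCO

box_head = nn.Sequential(
    nn.Flatten(),                              # [B, 256, 7, 7] -> [B, 12544]
    nn.Linear(256 * 7 * 7, 1024), nn.ReLU(),   # fc1
    nn.Linear(1024, 1024), nn.ReLU(),          # fc2
)
cls_score = nn.Linear(1024, num_classes + 1)   # 81 scores (classes + background)
bbox_pred = nn.Linear(1024, num_classes * 4)   # 320 class-specific box deltas

x = torch.randn(512, 256, 7, 7)                # one image's worth of pooled ROIs
feats = box_head(x)
scores, deltas = cls_score(feats), bbox_pred(feats)
# scores: (512, 81), deltas: (512, 320)
```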

Next we see how to calculate the loss for the outputs during training.

4. Loss Calculation

(only during training)

Two loss functions are applied to the final output tensors.

localization loss (loss_box_reg)

  • l1 loss⁵.
  • foreground predictions are picked from the pred_proposal_deltas tensor whose shape is (N samples × batch size, 80×4). For example, if the 15th sample is a foreground with class index = 17, row index 14 (= 15 − 1) and column indices [68 (= 17×4), 69, 70, 71] are selected.
  • foreground ground truth targets are picked from gt_proposal_deltas whose shape is (B, 4). The tensor values are the relative sizes of the ground truth boxes compared with the proposal boxes, which are calculated by the Box2BoxTransform.get_deltas function (see section 3–3 of Part 4). The rows with foreground indices are sampled from gt_proposal_deltas.
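The index selection above can be sketched with toy tensors (the real logic lives in modeling/roi_heads/fast_rcnn.py; the shapes and class indices match the example in the text):

```python
import torch
import torch.nn.functional as F

B, num_classes = 512, 80
pred_deltas = torch.randn(B, num_classes * 4)        # bbox_pred output, (B, 320)
gt_classes = torch.full((B,), 80, dtype=torch.long)  # index 80 = background
gt_classes[14] = 17                                  # the 15th sample is foreground, class 17

# Foreground rows, then the four columns belonging to each row's GT class
fg_rows = torch.nonzero(gt_classes < num_classes).flatten()    # tensor([14])
cols = 4 * gt_classes[fg_rows][:, None] + torch.arange(4)      # [[68, 69, 70, 71]]
fg_pred = pred_deltas[fg_rows[:, None], cols]                  # shape (1, 4)

gt_deltas = torch.randn(B, 4)  # stand-in for Box2BoxTransform.get_deltas targets
loss_box_reg = F.l1_loss(fg_pred, gt_deltas[fg_rows], reduction="sum") / B
```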

classification loss (loss_cls)

  • Softmax cross entropy loss.
  • Calculated for all the foreground and background prediction scores [B, K classes] vs. the ground-truth class indices [B].
  • The classification targets include both foreground and background classes, so K = number of classes + 1 (the background class index is 80 for the COCO dataset).
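As a minimal sketch with toy tensors (the class distribution below is made up for illustration):

```python
import torch
import torch.nn.functional as F

scores = torch.randn(512, 81)                          # (B, 80 classes + 1 background)
gt_classes = torch.full((512,), 80, dtype=torch.long)  # mostly background (index 80)
gt_classes[:128] = torch.randint(0, 80, (128,))        # the re-sampled foregrounds

# Softmax cross entropy over all B rows, foreground and background alike
loss_cls = F.cross_entropy(scores, gt_classes)         # a scalar tensor
```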

The loss results below are added to the losses calculated in RPN — ‘loss_rpn_cls’ and ‘loss_rpn_loc’ — and summed up to be the pipeline’s total loss.

{
'loss_cls': tensor(4.3722, device='cuda:0', grad_fn=<NllLossBackward>),
'loss_box_reg': tensor(0.0533, device='cuda:0', grad_fn=<DivBackward0>)
}

5. Inference

(only during test)

As we saw in Section 3, we have scores whose shape is (B, 80+1) and prediction_deltas whose shape is (B, 80×4) as output from the Box Head.

1. apply prediction deltas to proposal boxes

To calculate the final box coordinates from the prediction deltas⁶ : Δx, Δy, Δw, and Δh, Box2BoxTransform.apply_deltas function is used (Fig. 8). This is the same procedure as the step 1 in the section 5 of Part 4.
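A minimal sketch of this step follows. apply_deltas here is a simplified re-implementation written for illustration, assuming the (10, 10, 5, 5) delta weights typically used by the box head's Box2BoxTransform:

```python
import torch

def apply_deltas(deltas, boxes, weights=(10.0, 10.0, 5.0, 5.0)):
    """Turn (dx, dy, dw, dh) deltas into final (x1, y1, x2, y2) boxes."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    wx, wy, ww, wh = weights
    dx, dy = deltas[:, 0] / wx, deltas[:, 1] / wy
    dw, dh = deltas[:, 2] / ww, deltas[:, 3] / wh

    pred_ctr_x = dx * widths + ctr_x    # shift the center by dx * width
    pred_ctr_y = dy * heights + ctr_y
    pred_w = torch.exp(dw) * widths     # scale the size by exp(dw)
    pred_h = torch.exp(dh) * heights

    return torch.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                        pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], dim=1)

box = torch.tensor([[0., 0., 100., 100.]])
apply_deltas(torch.zeros(1, 4), box)  # zero deltas -> the box is unchanged
```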

Figure 8. applying prediction deltas to a proposal box to calculate the coordinates of the final prediction box.

2. filter the boxes by scores

We firstly filter out the low-scored bounding boxes as shown in Fig. 9 (left to center). Each box has a corresponding score, so it’s quite easy to do that.

Figure 9. Post-processing at the inference stage. left: visualization of all the ROIs before post-processing. center: after score thresholding. right: after non-maximum suppression.

3. non-maximum suppression

To remove the overlapping boxes, non-maximum suppression (NMS) is applied (Fig. 9, center to right). The parameter of NMS is defined here.

4. choose top-k results

Lastly we choose the top-k results when the number of remaining boxes exceeds the pre-defined maximum.

Thanks for reading!

So, that’s it!!
If you find something wrong or have questions, feel free to leave a response!

part 1: Introduction — Basic Network Architecture and Repo Structure
part 2 : Feature Pyramid Network
part 3 : Data Loader and Ground Truth Instances
part 4 : Region Proposal Network
part 5: (you are here) Box Head

[1] This is a personal article and the opinions expressed here are my own and not those of my employer.
[2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo and Ross Girshick, Detectron2. https://github.com/facebookresearch/detectron2, 2019. The file, directory, and class names are cited from the repository ( Copyright 2019, Facebook, Inc. )
[3] the files used for ROI Heads are: modeling/roi_heads/roi_heads.py, modeling/roi_heads/box_head.py, modeling/roi_heads/fast_rcnn.py, modeling/box_regression.py, modeling/matcher.py and modeling/sampling.py .
[4] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[5] Implemented as smooth-l1 loss, but it’s actually used as pure-l1 loss unlike Detectron1 or MMDetection. (see : link).
[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. (link)
