A guide to Two-stage Object Detection: R-CNN, FPN, Mask R-CNN

Sieun Park

Multi-stage (Two-stage) object detection

One of the most fundamental and widely researched challenges in computer vision is object detection. The task aims to draw bounding boxes around every object in a given image, which is important in many fields including autonomous driving. Generally, object detection algorithms can be classified into two categories: single-stage models and multi-stage (two-stage) models. In this post, we will dive into the key insights of multi-stage pipelines for object detection by reviewing some of the most significant papers in the field.

One branch of object detectors is based on multi-stage models. Deriving from the work of R-CNN, one model is used to extract regions of objects, and a second model is used to classify and further refine the localization of the object. Such methods are known to be relatively slow but very powerful, and recent progress such as feature sharing has brought two-stage detectors to a computational cost similar to that of single-stage detectors. These works are highly interdependent and mostly build on the previous pipeline as a baseline, so it is important to understand all the main algorithms in two-stage detectors.

The selection of papers in this post is mostly based on the survey [8].

R-CNN[1]

The 2014 paper proposes the naive version of the CNN-based two-stage detection algorithm, which is improved and accelerated in the following papers. As described in the figure below, the overall pipeline is composed of three stages:

  1. Generate region proposals: the model must draw candidate bounding boxes of objects in the image, independent of the category.
  2. The second stage is a convolutional neural network that computes a feature vector from each candidate region.
  3. The final stage scores each feature vector, implemented as class-specific linear SVMs in the paper.
Overview of R-CNN pipeline

Region proposals can be generated using various methods, and the paper chooses to use selective search for comparison with prior work. Still, the pipeline is compatible with most region proposal methods. A detailed explanation of selective search is provided here and in this presentation.

To summarize selective search, a segmentation algorithm is applied to the image, and region proposals (bounding boxes) are drawn based on the segmentation map. The segmentation map is iteratively merged, and larger region proposals are drawn from the refined map, as depicted in the figure below. A detailed explanation of how merging and box drawing work is elaborated here.

http://vision.stanford.edu/teaching/cs231b_spring1415/slides/ssearch_schuyler.pdf
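To make the merging loop concrete, below is a heavily simplified sketch of selective-search-style proposal generation. The initial segmentation uses the Felzenszwalb-Huttenlocher algorithm from scikit-image (the same algorithm selective search starts from), but the merge criterion here is a size-based stand-in; the real method scores neighboring regions by color, texture, size, and fill similarity.

```python
# Heavily simplified sketch of selective-search-style proposals, assuming
# scikit-image is available. Not the original algorithm: only size is used
# as the merge criterion, instead of color/texture/size/fill similarity.
import numpy as np
from skimage.segmentation import felzenszwalb

def bbox(mask):
    ys, xs = np.nonzero(mask)
    return (xs.min(), ys.min(), xs.max(), ys.max())

def neighbor_pairs(labels):
    # Region pairs that touch horizontally or vertically in the label map.
    h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    pairs = set()
    for a, b in np.concatenate([h, v]):
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    return pairs

def proposals(image, n_merges=50):
    labels = felzenszwalb(image, scale=100)          # initial segmentation
    boxes = [bbox(labels == l) for l in np.unique(labels)]
    for _ in range(n_merges):
        pairs = neighbor_pairs(labels)
        if not pairs:
            break
        # Merge the adjacent pair with the smallest combined area
        # (a crude stand-in for the paper's similarity measures).
        a, b = min(pairs, key=lambda p: np.sum(labels == p[0]) + np.sum(labels == p[1]))
        labels[labels == b] = a
        boxes.append(bbox(labels == a))              # box from the merged region
    return boxes
```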

The second and third stages together can be regarded as a conventional CNN that operates on the cropped region proposals. The paper uses the convolutional portion of AlexNet as the second stage, though any other CNN architecture can be used. Since the region proposals have different sizes, the paper takes the most naive approach: warping and resizing every bounding box to the desired input size.

The authors also train a bounding-box regressor to further refine the bounding box estimates produced by selective search. Another fully connected network takes the feature maps and regresses bounding box offsets as 4-tuples (r, c, h, w) representing relative translations and log-scale width/height scaling factors. This technique showed performance gains in the ablation study as R-CNN BB.
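A sketch of this parameterization, assuming boxes are given as (x, y, w, h) with (x, y) the box center; the targets follow the transform described in the paper's appendix.

```python
# Regression targets for box refinement: relative translations scaled by the
# proposal size, and log-scale width/height factors.
import numpy as np

def bbox_targets(proposal, gt):
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw           # translation, relative to proposal width
    ty = (gy - py) / ph           # translation, relative to proposal height
    tw = np.log(gw / pw)          # log-scale width factor
    th = np.log(gh / ph)          # log-scale height factor
    return tx, ty, tw, th

def apply_offsets(proposal, t):
    # Invert the parameterization to refine a proposal at inference time.
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * np.exp(tw), ph * np.exp(th))
```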

To reject overlapping region proposals at inference time, where two or more bounding boxes point to the same object, the authors use a greedy algorithm that rejects a region if it has a high intersection-over-union (IoU) with another region that has a more confident prediction.
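This procedure is the now-standard non-maximum suppression (NMS); a minimal NumPy sketch:

```python
# Greedy non-maximum suppression over (x1, y1, x2, y2) boxes.
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # most confident first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```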

Since the domain of the input changes to warped window crops, the classifier model is further fine-tuned on the warped images with new labels. While training the classifier, a region with ≥0.5 IoU to a ground truth (GT) box is considered a positive of that class and trained to output the class of the GT box. When the region doesn't overlap any GT box with at least 0.5 IoU, the classifier must assign it to the background class. To resolve the potential class imbalance, 32 positive regions and 96 background regions are sampled to form a mini-batch of size 128.
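A sketch of this sampling scheme, assuming per-region labels with 0 reserved for background:

```python
# Build a class-balanced fine-tuning mini-batch: 32 positives + 96 background.
import numpy as np

def sample_minibatch(labels, n_pos=32, n_bg=96):
    pos = np.flatnonzero(labels > 0)
    bg = np.flatnonzero(labels == 0)
    pos = np.random.choice(pos, min(n_pos, len(pos)), replace=False)
    bg = np.random.choice(bg, min(n_bg, len(bg)), replace=False)
    return np.concatenate([pos, bg])          # indices of the sampled regions
```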

While regions with ≥0.5 IoU are treated as full positives during fine-tuning, the paper treats regions with 0.3 < IoU < 0.5 as only partially overlapping. When training the SVMs, these grey-zone regions are handled specially: they are simply ignored, with only regions below 0.3 IoU used as background negatives.

The performance supremacy of R-CNN over prior methods comes from combining bottom-up selective search with CNN features for localization, together with the techniques used to fine-tune the network on object detection data. The work bridges classical CV and deep learning for object detection. But R-CNN is very time-intensive, because the CNN runs on roughly 2,000 warped selective-search regions per image.

SUMMARY

  • Proposes the baseline pipeline for two-stage object detection: generating region proposals and classifying them.
  • Region proposals are generated using selective search.
  • The classification network resizes each region proposal and predicts class probabilities (including background) and bounding box refinements.

SPP-Net[2]

The paper proposes spatial pyramid pooling (SPP) layers, designed to work with any image size without resizing to a fixed size, which can lose information and distort the image. The convolutional layers of a CNN, which act as feature extractors, do not constrain the input size; the fixed-size requirement comes from the fully connected classification layers.

Top: traditional CNN pipeline, Bottom: SPP-net pipeline

Therefore, the authors propose a special pooling layer that transforms features of different sizes into a fixed-length representation before feeding them to the fully connected layers, removing the fixed-size constraint of the network, as described in the figure above.

Basically, the SPP layer max-pools the feature map at various scales proportional to the input size. The SPP layer uses spatial bins whose sizes are proportional to the image size, allowing images of any shape to be mapped to a single output size. Each spatial bin max-pools the values in its region, so coarse spatial information is preserved through the process. This is described in the figure below: the feature map is pooled at several bin resolutions, each bin covering a fixed proportion of the image, and the results are concatenated. In the figure, 256 is the number of filters in the feature map.
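As a concrete illustration, here is a minimal PyTorch sketch of an SPP layer; the {4×4, 2×2, 1×1} pyramid is one possible configuration, chosen here for brevity.

```python
# Minimal SPP layer: pool the feature map into n×n bins at several levels,
# flatten, and concatenate into a fixed-length vector.
import torch
import torch.nn.functional as F

class SpatialPyramidPooling(torch.nn.Module):
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                      # x: (N, C, H, W), any H and W
        outs = []
        for n in self.levels:
            # n×n bins whose sizes scale with the input resolution.
            pooled = F.adaptive_max_pool2d(x, output_size=n)
            outs.append(pooled.flatten(start_dim=1))    # (N, C*n*n)
        return torch.cat(outs, dim=1)          # fixed length: C*(16+4+1)

# Both inputs below map to the same output length, (1, 256*21):
# SpatialPyramidPooling()(torch.randn(1, 256, 13, 13))
# SpatialPyramidPooling()(torch.randn(1, 256, 24, 17))
```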

Although the SPP layer itself wasn't invented by the authors, they were the first to consider using SPP layers in CNNs. SPP has the following properties:

  1. Generates a fixed-length output regardless of input size
  2. Known to be robust to object deformations (regularization)
  3. Can extract information from various scales (resolutions)

The paper focuses on image classification and shows object detection results as proof of generalization, but the pipeline has some interesting properties that differ from the R-CNN algorithm when applied to object detection.

The object detection pipeline of SPP-Net is described in the figure above. The CNN is executed once on the full image, and the output features are cropped based on the regions found by selective search. SPP is applied to each crop, and the class is predicted from the output of the SPP layer. This way, the convolution layers are applied to the image only once, and only the lighter FC layers run once per detected region.

The convolutional feature extractor is pretrained on image classification and not further trained on object detection. The classifier FC layers are trained separately based on the ground truth windows. Scale invariance is achieved using two image preprocessing methods explained in the paper. Many techniques from R-CNN are also applied when fine-tuning the FC network.

The contribution of this paper is truly amazing, since it reduced training and inference times by orders of magnitude while even improving performance by not having to resize and warp images. However, I am skeptical of whether feature maps trained on image classification truly preserve the spatial information needed for cropping. This might be a big problem with deep networks, where the receptive field becomes large, and could limit the SPP-Net pipeline from using deeper feature extractors. Additional losses could be used to fine-tune the feature extractor on the object detection dataset as well.

SUMMARY

  • Proposes spatial pyramid pooling to output fixed-length features for arbitrary input sizes.
  • Improves the training/inference procedure by reducing the number of CNN forward passes from one per region (~2,000 regions per image) to a single pass on the full image.

Fast R-CNN[3]

Previous object detection algorithms, namely R-CNN, typically learn the localization and classification stages separately, which makes training more expensive. Furthermore, these algorithms are very slow at test time, discouraging real-time applications. Fast R-CNN jointly learns to detect the spatial locations of objects and to classify them.

R-CNN is slow because a forward pass is done for each object proposal. While SPP-Net addressed this issue and accelerated R-CNN by 100× at test time, training remains a multi-stage process with many computation-heavy steps and is only about 3× faster than R-CNN. Also, the frozen convolution layers limit the accuracy of the network.

The figure above illustrates the Fast R-CNN pipeline. A CNN processes the full image, and the feature map is cropped according to the object proposals. A region of interest (RoI) pooling layer then extracts a fixed-length vector, which is processed by fully connected layers that predict class probabilities and refine the bounding box.

The RoI pooling layer is a special case of the SPP layer with a single pyramid level. The h × w RoI window is divided into an H × W grid of cells of approximate size h/H × w/W, and max-pooling is applied within each cell. The output is always an H × W feature map. Apart from this, the Fast R-CNN process is very similar to the SPP-Net pipeline, with only small modifications.
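A simplified PyTorch sketch of RoI pooling, assuming a single shared feature map and a stride-16 backbone (`spatial_scale` maps image coordinates onto the feature map); adaptive max-pooling implements the H × W grid of cells described above.

```python
# Simplified RoI pooling: crop the shared feature map at the scaled proposal
# and adaptively max-pool the crop to a fixed H × W grid.
import torch
import torch.nn.functional as F

def roi_pool(features, roi, output_size=(7, 7), spatial_scale=1.0 / 16):
    # features: (N, C, H, W) shared conv features; roi: (x1, y1, x2, y2)
    # in image coordinates; spatial_scale maps image coords to feature coords.
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
    crop = features[:, :, y1:y2 + 1, x1:x2 + 1]
    # Adaptive pooling divides the h × w crop into an H × W grid of cells.
    return F.adaptive_max_pool2d(crop, output_size)
```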

Previously, in SPP-Net, backpropagating through the convolution layers was inefficient because each RoI's receptive field potentially spans the full image, which is very large. Fast R-CNN counters this issue by training on multiple RoI samples from the same image in one mini-batch: N images are sampled first, then R/N RoIs from each image, so the convolution features are shared across RoIs during training. This speeds up training and dispenses with caching features. This trick is named hierarchical sampling. Additionally, Fast R-CNN jointly optimizes the classifier and bounding-box regressors with a multi-task loss instead of training them separately.

Joint training of the classification loss (L_cls) and localization loss (L_loc)
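For reference, the multi-task loss from the paper has the following form, where p is the predicted class distribution, u the ground-truth class, t^u the predicted box offsets for class u, and v the regression targets:

```latex
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v),
\qquad L_{cls}(p, u) = -\log p_u
```

The indicator [u ≥ 1] disables the localization term for background RoIs (u = 0), and L_loc is a smooth L1 loss over the four box offsets, discussed below.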

Some additional improvements to the R-CNN algorithm are also made. For example, Fast R-CNN uses a robust smooth L1 loss instead of the L2 loss for regression, and there are modifications to the hyperparameters. The paper also combines techniques from R-CNN and SPP-Net; detailed explanations are provided in the paper. Fast R-CNN achieved state-of-the-art accuracy while being orders of magnitude faster at both training and testing.

SUMMARY

  • Modifies SPP into RoI pooling.
  • Efficient training by sampling multiple RoIs from one image, so only one forward/backward pass is run on the convolution layers.
  • This enabled training the convolutional feature extractor through backpropagation.

Faster R-CNN[4]

The paper points out that the object proposal stage is the computational bottleneck for real-time object detection. As a solution, Faster R-CNN introduces a Region Proposal Network (RPN) that shares convolution layers with the feature extractor, making the marginal cost of computing object proposals very small. The pipeline is consistent with Fast R-CNN except that the proposals come from the jointly trained RPN, as described in the figure below.

The RPN receives the feature map computed by the feature extractor and outputs a list of object proposals by sliding a small network over the feature map. At each sliding-window location, the network predicts proposals relative to k reference boxes (anchors); each proposal consists of 4 coordinates and a score estimating the probability that the box contains an object. The RPN model is described in the figure below.
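A minimal PyTorch sketch of such an RPN head is shown below; the 512 hidden channels and k = 9 anchors follow the paper's VGG-16 setting, and the sibling 1×1 convolutions emit 2k objectness scores and 4k box offsets per location.

```python
# RPN head: a 3×3 conv slides over the feature map, and two sibling 1×1
# convs output objectness scores and box offsets for k anchors per location.
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)   # object vs. not, per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # box offsets, per anchor

    def forward(self, feature_map):
        h = self.conv(feature_map).relu()
        return self.cls(h), self.reg(h)
```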

The RPN is trained separately from the Fast R-CNN classification pipeline. The Fast R-CNN model is trained much as before, including the image-centric sampling strategy. One difference is that the RoI shapes now come from a fixed set of anchor shapes instead of being arbitrarily sized. Therefore, k bounding-box regressors, each responsible for refining one anchor type, are trained, benefiting from the anchor design.

In training the RPN, a binary label is assigned to each anchor based on its IoU with the ground-truth bounding boxes: positive, negative, or neutral. The RPN is trained on both the score and coordinate estimates. The paper discusses three ways of jointly training the two models through gradient descent, and adopts alternating training, where the RPN is trained first and the proposals computed during the process are used to train the Fast R-CNN.
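Concretely, the paper labels an anchor positive if its IoU with some ground-truth box exceeds 0.7 (or if it is the highest-IoU anchor for a box), negative if its IoU with every box is below 0.3, and ignores the rest. A small NumPy sketch of this rule:

```python
# Label anchors from a pairwise IoU matrix: 1 = positive, 0 = negative,
# -1 = ignored during RPN training.
import numpy as np

def label_anchors(iou, hi=0.7, lo=0.3):
    # iou: (num_anchors, num_gt) pairwise IoU matrix.
    labels = np.full(iou.shape[0], -1)        # start with everything ignored
    max_iou = iou.max(axis=1)
    labels[max_iou < lo] = 0                  # negative: low IoU with every box
    labels[max_iou > hi] = 1                  # positive: high IoU with some box
    labels[iou.argmax(axis=0)] = 1            # best anchor for each GT box
    return labels
```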

SUMMARY

  • Instead of slow selective search, proposes an RPN to learn the bounding box proposal process.
  • The RPN predicts the objectness probability and location of an object for each anchor.
  • Compares various training methodologies for efficiently training the RPN together with the original region-based detection network.

Feature Pyramid Networks (FPN) [5]

Featurized image pyramids (figure a) provide multi-scale feature representations that are handy in object detection because they support scale invariance: the model must detect objects of all scales in the image, and moving between levels of the pyramid easily offsets the scale variance of objects. But it obviously takes considerably more time to compute features at multiple levels, so this approach is not used in pipelines such as Fast/Faster R-CNN (figure b).

Convolutional neural networks inherently compute multi-scale feature representations, since the layer hierarchy computes feature maps at different resolutions. However, previous works utilizing this hierarchical property of CNNs to build a featurized pyramid with minimal extra computation (figure c) were incomplete: intermediate CNN features naturally convey different semantics depending on their depth in the network. To fully leverage CNNs for multi-scale feature representations, the layers must have strong semantics at all scales.

The proposed FPN (figure d) is described as

an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections.

Designed to provide rich semantics to high-resolution features, the FPN is a U-Net-like architecture. The bottom-up pathway (red) is the feedforward CNN; each resolution is a stage, and one pyramid level is defined per stage. The top-down pathway (blue) produces higher-resolution features by upsampling the semantically stronger feature maps from higher pyramid levels, so the feature maps at every scale carry rich semantics. These features are further enhanced with features from the bottom-up pathway, projected through lateral connections.
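A minimal PyTorch sketch of this architecture, assuming ResNet-style bottom-up features C2..C5 with the channel widths below: 1×1 lateral convolutions project each stage to a common width, the top-down pathway upsamples and adds, and 3×3 convolutions smooth the merged maps into P2..P5.

```python
# Minimal FPN: lateral 1×1 projections, top-down upsample-and-add merging,
# and 3×3 smoothing convs that produce the output pyramid.
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                  # feats: [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the coarser map and add it to the next lateral.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]   # [P2, P3, P4, P5]
```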

The FPN provides a general solution for generating multi-scale feature maps with rich semantic content. When applied to the Faster R-CNN pipeline, FPN is used both in the RPN for generating bounding box proposals and in the Fast R-CNN region-based classifier backbone. FPN is adopted in the RPN by replacing the single backbone feature map with the FPN outputs. Anchors of each scale are applied on a different level of the pyramid, e.g. anchors of size {32², 64², 128², 256², 512²} on feature maps {P2, P3, P4, P5, P6} respectively. The Faster R-CNN detection head is applied to one of the feature maps, chosen according to the size of the bounding box.
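The choice of pyramid level for the detection head follows the assignment rule from the paper: an RoI of width w and height h is pooled from level P_k, with k_0 = 4 so that RoIs of the canonical 224² ImageNet size map to P4:

```latex
k = \left\lfloor k_0 + \log_2\left(\sqrt{wh} \, / \, 224\right) \right\rfloor,
\qquad k_0 = 4
```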

SUMMARY

  • Proposes the new FPN architecture to compute semantically rich multi-scale feature representations.
  • Uses intermediate CNN layers as a multi-scale feature pyramid and trains the RPN and backbone network on these features.

Mask R-CNN[6]

Instance segmentation

Mask R-CNN is proposed to solve a slightly different problem of instance segmentation. Briefly, this problem is a combination of object detection and semantic segmentation. As illustrated above, the task aims to generate pixel-wise boundaries dividing objects.

Mask R-CNN is based on the Faster R-CNN pipeline but has three outputs for each object proposal instead of two. The additional branch predicts K (the number of classes) binary object masks, one segmenting the object for each class. The final instance segmentation map is selected using the result of the classification branch; this is referred to as decoupling mask and class prediction.

A fully convolutional network (FCN) is used to draw an m × m mask from each RoI. Unlike drawing bounding boxes, generating pixel-level masks requires pixel-wise spatial information. The mask branch therefore splits off before the features are collapsed into a vector, as described in the figure below.

RoIs are small feature maps computed by the RoI pooling operation, which rigidly slices the feature map into bins. This introduces misalignments between the RoI and the extracted features, which are negligible for classification but can hurt pixel-level masks, which are sensitive to small translations. A RoIAlign layer is proposed to smooth the hard slicing of RoIPool: RoIAlign essentially performs bilinear interpolation of the larger map into the smaller map. The results show great performance gains, and the authors present further evidence that the problem was indeed the inconsistent alignment.
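RoIAlign is available in torchvision; a usage sketch, assuming a stride-16 backbone and boxes given in image coordinates as (batch_index, x1, y1, x2, y2):

```python
# RoIAlign via torchvision: bilinear sampling instead of hard bin slicing,
# so box coordinates are never rounded to the feature grid.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)          # (N, C, H, W) feature map
# One box: (batch_index, x1, y1, x2, y2) in image coordinates.
rois = torch.tensor([[0, 64.0, 32.0, 320.0, 288.0]])
crops = roi_align(features, rois, output_size=(14, 14),
                  spatial_scale=1.0 / 16, sampling_ratio=2)
print(crops.shape)                              # torch.Size([1, 256, 14, 14])
```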

To train the mask branch, a loss term L_mask is added to the original classification and bounding box regression losses. The mask loss is the average per-pixel binary cross-entropy between the ground-truth segmentation map of class k and the k-th predicted mask.
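Putting the three branches together, the per-RoI loss has the form below; for an RoI whose ground-truth class is k, only the k-th of the K predicted m × m masks contributes:

```latex
L = L_{cls} + L_{box} + L_{mask},
\quad
L_{mask} = -\frac{1}{m^2} \sum_{i,j} \left[ y_{ij} \log \hat{y}^{\,k}_{ij}
         + (1 - y_{ij}) \log\left(1 - \hat{y}^{\,k}_{ij}\right) \right]
```

where y is the ground-truth binary mask and ŷ^k is the predicted mask for class k.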

Not only did this paper enable high-performance instance segmentation, the results were also surprisingly strong on regular bounding-box object detection and other tasks like pose estimation. The table above shows the results for bounding box object detection, where Mask R-CNN outperforms Faster R-CNN. The row "Faster R-CNN, RoIAlign" shows the results when the mask loss is not used during training. The results demonstrate that the object detection pipeline learns richer, more generalizable features when trained with the mask prediction objective.

SUMMARY

  • Proposes a general framework for instance segmentation based on Faster R-CNN by introducing the mask branch.
  • Fixes the RoIPooling layer by addressing the misalignments caused by hard slicing.
  • Simple, yet amazing paper :)

Cascade R-CNN[7]

In training a detector, a region is considered a positive example of a class if its IoU with a ground-truth box is above a threshold u, and background otherwise. Bounding box predictions become noisy when trained with a loose IoU threshold like u = 0.5. But simply increasing the IoU threshold doesn't solve the problem, because of the mismatch between the optimal IoU for training and inference. It also dramatically decreases the number of positive samples, introducing data imbalance problems; this is illustrated by the low performance of the red curve in the right plot. Discriminating "close but not correct" bounding boxes is important but wasn't studied in previous work.

The plots illustrate three detectors trained with IoU thresholds u = 0.5, 0.6, 0.7. As illustrated in the left figure, each model performs best in a different IoU range. The paper provides further reasoning on why it is difficult for a single classifier to perform uniformly well over all IoU levels. Based on the assumption that a single detector is optimal for a single quality level, Cascade R-CNN trains a sequence of detectors with increasing IoU thresholds.

In Faster R-CNN (figure a), the RPN provides the RoIs for box refinement and classification. In Cascade R-CNN, each head in a sequence receives the bounding box estimates of the previous head instead of the RPN RoIs, which can be interpreted as iteratively refining the bounding boxes (figures b, d). In theory each head should progressively improve the bounding box locations, but a cascade of regressors trained with a single small IoU threshold stops improving the IoU beyond a certain value (plot c above). Therefore, Cascade R-CNN is designed as a cascade of differently specialized regressors trained with increasing thresholds (figure d). Consequently, deeper stages progressively improve towards higher IoU levels, as described in the IoU histograms below.
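A sketch of the resulting inference loop; all names here (`heads`, `roi_features`) are hypothetical placeholders, not the paper's code. Each stage re-pools features at the boxes produced by the previous stage:

```python
# Cascade inference sketch: a sequence of (classifier, regressor) stages,
# each trained with a higher IoU threshold (e.g. u = 0.5, 0.6, 0.7),
# progressively refines the boxes from the previous stage.
def cascade_forward(features, proposals, heads, roi_features):
    boxes = proposals                        # start from the RPN proposals
    scores = None
    for classifier, regressor in heads:
        feats = roi_features(features, boxes)    # re-pool at refined boxes
        scores = classifier(feats)               # per-stage class scores
        boxes = regressor(feats, boxes)          # progressively refined boxes
    return boxes, scores
```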

SUMMARY

  • Points out the impact of IoU thresholds in object detection, and the problems with simply raising the threshold.
  • Observes that different models perform best in different IoU ranges.
  • Cascades bounding box regressors to ensure high-quality bounding box outputs without introducing additional problems.

Conclusion

We reviewed the main approaches to multi-stage object detection. The pace of progress in these algorithms is truly amazing. The original R-CNN algorithm was slow and inefficient. Many of the key insights of the later algorithms come from sharing features (e.g. SPP-Net, Fast R-CNN, Mask R-CNN) and enabling gradient-based training of previously fixed components of the pipeline (e.g. Fast R-CNN, Faster R-CNN, Cascade R-CNN) to learn richer features efficiently. Object detection is a crucial field in computer vision, and multi-stage detection remains a mainstream approach.

A recent work in multi-stage object detection is DetectoRS, which improves the backbone of the network by proposing a Recursive Feature Pyramid. While the recent focus in object detection has shifted towards Transformer-based approaches, these papers on multi-stage object detection provide great insights into deep learning in general. The selection of papers introduced in this post was mostly based on [8].

https://paperswithcode.com/sota/object-detection-on-coco

References

[1] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587).

[2] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9), 1904–1916.

[3] Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1448).

[4] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 91–99.

[5] Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125).

[6] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

[7] Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6154–6162).

[8] Jiao, L., Zhang, F., Liu, F., Yang, S., Li, L., Feng, Z., & Qu, R. (2019). A survey of deep learning-based object detection. IEEE access, 7, 128837–128868.
