Paper Summary: Faster R-CNN

Mike Plotz Sage
4 min read · Nov 23, 2018

--

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/06.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015) Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

The sensibly titled Faster R-CNN paper is the follow-on work to yesterday’s Fast R-CNN, taking the next logical step of bringing region proposals — the current performance bottleneck — into the network. The authors call this the region proposal network (RPN) and they point out that it can be thought of as an attention mechanism: “the RPN module tells the Fast R-CNN module where to look.” Convolutional feature maps are shared by the parts of the network that produce region proposals and bounding box predictions, and the whole network can be trained end to end.

One key to getting this to work well was the introduction of anchor boxes (screenshot below right), which I think of as a set of reference boxes of different sizes and aspect ratios applied at each sliding-window position of the convolutional feature map. One might imagine an architecture where differently sized images are passed into the network (“image pyramids”) or a network that computes different scaling and squashing operations. This is not how anchor boxes work: since they’re applied at the very end of the RPN, they allow efficient feature sharing earlier in the network. Each anchor predicts its own bounding box coordinates, and the regressors do not share weights across anchors, so in effect there’s an ensemble of regressors at every position.
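To make the anchor idea concrete, here’s a minimal sketch of generating the paper’s k = 9 anchors (3 scales × 3 aspect ratios) centered at one position; the base size and scale values match the paper’s defaults, but the exact function shape is my own construction:

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors centered at the origin,
    as (x1, y1, x2, y2). Sliding the RPN over the feature map then just means
    translating these boxes to each position."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Hold the anchor's area fixed at (base_size * scale)^2 while
            # varying the aspect ratio h/w = ratio.
            area = (base_size * scale) ** 2
            h = np.sqrt(area * ratio)
            w = area / h
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 4): the paper's k = 9 anchors per position
```

The base_size of 16 corresponds to the effective stride of the shared conv layers, so a scale of 8 gives the paper’s 128² anchor area.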

The architecture consists of two modules, the RPN and a Fast R-CNN detector (screenshot above left). We already covered the latter yesterday, so here’s a little more detail on the former (screenshot above right): the authors describe this as a sliding window over 3x3 regions of the convolutional feature map, but I think it’s clearer to think of it as a 3x3 convolution (the intermediate layer) followed by two parallel 1x1 convolutions that produce class scores (the cls layer: object vs. background) and bounding box coordinates (the reg layer, for box regression). These are the region proposals that are further refined by the Fast R-CNN module. The authors also discuss which convolutional layers can be shared, which depends on the specific network architecture used.
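The 2k and 4k output channel counts (for k anchors per position) are from the paper; the rest of this numpy sketch — the naive convolution, tiny feature map, and random weights — is just a placeholder to show how the 3x3 + two 1x1 convs view works out shape-wise:

```python
import numpy as np

def conv2d(x, w, pad):
    """Naive convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def rpn_head(feat, k=9, mid=64, rng=np.random.default_rng(0)):
    """RPN head: 3x3 conv ('sliding window') + ReLU, then two 1x1 convs."""
    c = feat.shape[0]
    w_mid = rng.normal(0, 0.01, (mid, c, 3, 3))
    w_cls = rng.normal(0, 0.01, (2 * k, mid, 1, 1))
    w_reg = rng.normal(0, 0.01, (4 * k, mid, 1, 1))
    h = np.maximum(conv2d(feat, w_mid, pad=1), 0)
    cls = conv2d(h, w_cls, pad=0)  # 2k objectness scores per position
    reg = conv2d(h, w_reg, pad=0)  # 4k box coordinates per position
    return cls, reg

feat = np.ones((32, 5, 5))  # toy conv feature map (C, H, W)
cls, reg = rpn_head(feat)
```

Every spatial position of the feature map gets its own 2k scores and 4k box offsets, which is exactly the "anchors at each sliding-window location" picture.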

The loss function: the cls layer is trained with anchor boxes that have been labeled as positive or negative based on intersection-over-union (IoU) heuristics. Interestingly, they throw out anchor boxes that are ambiguous (roughly between 0.3 and 0.7 IoU, but see the paper for details), which is the opposite of the hard example mining approach in the Fast R-CNN paper, so I’m not sure what’s going on here. In any case, the loss function is a linear combination of the cls loss (log loss over object vs. not-object) and the reg loss, the latter computed for positive anchor boxes only. For reg they use a smooth L1 loss, as in Fast R-CNN. There’s also normalization and a weighting factor λ, but the specific values of these don’t seem to matter much.
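A sketch of this combined loss in numpy. The structure (log loss + λ-weighted smooth L1 over positives only, ambiguous anchors ignored) follows the paper; the normalization here is simplified relative to the paper’s Ncls/Nreg terms, since as noted those details don’t seem to matter much:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: quadratic near zero, linear for large errors."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def rpn_loss(cls_prob, labels, reg_pred, reg_target, lam=10.0):
    """labels: 1 = positive anchor, 0 = negative, -1 = ignored (ambiguous IoU).
    cls_prob: predicted probability of 'object' for each anchor.
    The reg term only counts positive anchors, as in the paper."""
    keep = labels >= 0
    p, y = cls_prob[keep], labels[keep]
    cls_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # log loss
    pos = labels == 1
    reg_loss = smooth_l1(reg_pred[pos] - reg_target[pos]).sum() / max(pos.sum(), 1)
    return cls_loss + lam * reg_loss
```

Note how the `labels == -1` anchors simply drop out of both terms — that’s the "throw out the ambiguous anchors" behavior.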

In training they use mini-batches of 256 anchors drawn from a single image, sampled to keep up to a 1:1 ratio of positive to negative anchors. The weights are initialized from networks pre-trained on ImageNet where possible. There’s a detail about how the network is trained, which must have been a pain to figure out, but basically they alternate between training the RPN and the Fast R-CNN detector (freezing the one not being trained). They also try training the modules jointly, which basically just works but is strictly speaking incorrect, since the RoI pooling layer isn’t differentiable with respect to the proposed box coordinates (Dai et al. 2015 do this the “right” way with “RoI warping,” but that’s out of scope for this paper).
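The alternating scheme is easiest to see written out as data. This is my paraphrase of the paper’s 4-step schedule, not executable training code — the dict keys and labels are my own shorthand:

```python
# The 4-step alternating training schedule, written out as data.
# "shared_conv" refers to the backbone conv layers both modules use.
SCHEDULE = [
    ("rpn",      {"init": "ImageNet", "shared_conv": "trainable"}),
    ("detector", {"init": "ImageNet", "shared_conv": "trainable"}),  # uses step-1 proposals
    ("rpn",      {"init": "detector", "shared_conv": "frozen"}),     # conv layers now shared
    ("detector", {"init": "shared",   "shared_conv": "frozen"}),     # fine-tune fc layers only
]

for step, (module, cfg) in enumerate(SCHEDULE, 1):
    print(step, module, cfg["shared_conv"])
```

The key point is that sharing only kicks in from step 3 onward, once the shared conv layers are frozen — that’s what lets the two modules train without fighting over the same weights.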

There are lots more details here: handling anchor boxes that extend past the input image’s borders, running non-maximum suppression on the proposal regions (I’m not quite clear on how this fits into their architecture), and a bunch of ablation studies. They were also able to gain 5.6 percentage points on the PASCAL VOC dataset by additionally training on the much larger MS COCO. These algorithms do like their data.
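For reference, the NMS step on proposals is just standard greedy non-maximum suppression (the paper uses an IoU threshold of 0.7 on RPN proposals); a minimal sketch:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS: keep the highest-scoring proposal, drop everything that
    overlaps it by more than `thresh` IoU, repeat on what's left."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```

This is what cuts the ~20k raw anchor-based proposals per image down to a few thousand (and then top-N) before they’re handed to the Fast R-CNN module.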

Jifeng Dai et al. 2015, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”: https://arxiv.org/abs/1512.04412
