Hao Gao
Sep 27, 2017 · 5 min read

Faster R-CNN Explained

Faster R-CNN has two networks: a region proposal network (RPN) that generates region proposals, and a detection network that uses these proposals to detect objects. The main difference from Fast R-CNN is that the latter uses selective search to generate region proposals. The time cost of generating region proposals is much smaller with the RPN than with selective search, because the RPN shares most of its computation with the object detection network. Briefly, the RPN ranks region boxes (called anchors) and proposes the ones most likely to contain objects. The architecture is as follows.

The Architecture of Faster R-CNN


Anchors play an important role in Faster R-CNN. An anchor is a box. In the default configuration of Faster R-CNN, there are 9 anchors at every position of an image. The following graph shows 9 anchors at the position (320, 320) of an image of size (600, 800).

Anchors at (320, 320)

Let’s look closer:

  1. The three colors represent three scales or sizes: 128x128, 256x256, 512x512.
  2. Let’s single out the red boxes/anchors. The three boxes have height-to-width ratios of 1:1, 1:2, and 2:1 respectively.
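As a sketch, the 9 default anchors at a single position can be generated like this. This is a minimal NumPy version; `anchors_at` is a hypothetical helper name, and each ratio here means height/width, chosen so the box area stays at scale².

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the 9 anchor boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each ratio r is height/width; width and height are chosen so the
    box area stays equal to scale**2.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)        # wider when the box is shorter
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

anchors = anchors_at(320, 320)
print(anchors.shape)  # (9, 4)
```

Note that every anchor of a given scale has the same area; only its aspect ratio changes.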

If we choose one position at every stride of 16, there will be 1989 (39x51) positions, which leads to 17901 (1989 x 9) boxes to consider. This sheer number is hardly smaller than the combination of sliding window and pyramid; put another way, it explains why the coverage is as good as other state-of-the-art methods. The bright side is that the region proposal network can significantly and cheaply reduce this number.
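Enumerating those boxes is just tiling a set of base anchors across the stride-16 grid. A minimal NumPy sketch, with a single hypothetical base anchor for brevity:

```python
import numpy as np

def shift_anchors(base, feat_h, feat_w, stride=16):
    """Tile `base` anchors (N, 4) over a feat_h x feat_w grid.

    Returns (feat_h * feat_w * N, 4) boxes in image coordinates.
    """
    xs = np.arange(feat_w) * stride
    ys = np.arange(feat_h) * stride
    shift_x, shift_y = np.meshgrid(xs, ys)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)
    # Broadcast: every position gets a shifted copy of every base anchor.
    return (base[None, :, :] + shifts[:, None, :]).reshape(-1, 4)

# One hypothetical square base anchor centred at the origin.
base = np.array([[-64.0, -64.0, 64.0, 64.0]])
all_boxes = shift_anchors(base, 39, 51)
print(all_boxes.shape)   # (1989, 4) -- one box per position
print(39 * 51 * 9)       # 17901 with the full set of 9 anchors
```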

These anchors work well for the Pascal VOC dataset as well as the COCO dataset. However, you have the freedom to design different kinds of anchors/boxes. For example, if you are designing a network to count passengers or pedestrians, you may not need to consider very short, very big, or square boxes. A neat set of anchors may increase both speed and accuracy.

Region Proposal Network

The output of a region proposal network (RPN) is a bunch of boxes/proposals that will be examined by a classifier and a regressor to eventually check for the occurrence of objects. To be more precise, the RPN predicts the probability of an anchor being background or foreground, and refines the anchors.

Region Proposal Network in Training

The Classifier of Background and Foreground

The first step of training a classifier is to make a training dataset. The training data is the anchors we get from the above process and the ground-truth boxes. The problem we need to solve here is how to use the ground-truth boxes to label the anchors. The basic idea is that we want to label the anchors having higher overlaps with ground-truth boxes as foreground, and the ones with lower overlaps as background. Apparently, it needs some tweaks and compromises to separate foreground and background; you can check the details in the py-faster-rcnn implementation (reference 3 below). Now we have labels for the anchors.
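As a sketch of that labelling step: the 0.7/0.3 thresholds below follow the paper's defaults, and the paper additionally marks the highest-overlap anchor for each ground-truth box as foreground, which is omitted here for brevity.

```python
import numpy as np

def iou(boxes, gt):
    """IoU between each anchor (N, 4) and each ground-truth box (M, 4)."""
    x1 = np.maximum(boxes[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(boxes, gt, hi=0.7, lo=0.3):
    """1 = foreground, 0 = background, -1 = ignored during training."""
    overlaps = iou(boxes, gt).max(axis=1)   # best overlap per anchor
    labels = np.full(len(boxes), -1)
    labels[overlaps >= hi] = 1
    labels[overlaps < lo] = 0
    return labels

anchors = np.array([[0, 0, 100, 100],       # perfect match -> foreground
                    [200, 200, 300, 300],   # no overlap    -> background
                    [0, 0, 100, 60]],       # IoU 0.6       -> ignored
                   dtype=float)
gt = np.array([[0, 0, 100, 100]], dtype=float)
print(label_anchors(anchors, gt).tolist())  # [1, 0, -1]
```

The anchors labelled -1 fall between the two thresholds and contribute to neither loss.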

The second question is what the features of the anchors are.

Let’s say the 600x800 image shrinks 16 times to a 39x51 feature map after applying CNNs. Every position in the feature map has 9 anchors, and every anchor has two possible labels (background, foreground). If we make the depth of the feature map 18 (9 anchors x 2 labels), every anchor gets a vector with two values (normally called a logit) representing foreground and background. If we feed the logits into a softmax/logistic-regression activation function, it predicts the labels. Now the training data is complete with features and labels.
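To make the shapes concrete, here is a minimal NumPy sketch where random values stand in for the real 1x1-conv output: the depth-18 map is reshaped into one 2-value logit per anchor, then a softmax turns each logit into background/foreground probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_h, feat_w = 39, 51

# Stand-in for the classification branch output:
# depth 18 = 9 anchors x 2 labels (background, foreground).
cls_map = rng.standard_normal((feat_h, feat_w, 18))

# One 2-value logit per anchor.
logits = cls_map.reshape(-1, 2)                        # (17901, 2)

# Numerically stable softmax over the two labels.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(logits.shape, probs.shape)  # (17901, 2) (17901, 2)
```

Each row of `probs` sums to 1; the second column is the foreground score used to rank proposals.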

Another thing to pay attention to is the receptive field, if you want to re-use a trained network as the CNN in the process. Make sure the receptive field of every position on the feature map covers all the anchors it represents; otherwise the feature vectors of the anchors won’t have enough information to make predictions. Reference 4 below is a good explanation of receptive fields, if you want to know more.
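The receptive field of a stack of layers can be checked with the standard recurrence. This is a generic sketch; the `stage` layer list below is a hypothetical VGG-style block (two 3x3 convs plus a 2x2 pool), not the exact backbone.

```python
def receptive_field(layers):
    """Receptive field of one output position.

    `layers` is a list of (kernel, stride) pairs; the classic recurrence is
    rf += (kernel - 1) * jump, then jump *= stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Hypothetical stack: four stages of (3x3 conv, 3x3 conv, 2x2 pool),
# giving an overall stride of 16.
stage = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(stage * 4))  # 76
```

If the result is smaller than your largest anchor, the feature vector cannot "see" the whole box.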

The architecture of Overfeat only uses non-overlapping convolutional and pooling filters, to make sure every position in the feature map covers its own receptive field without overlapping others. In Faster R-CNN, the receptive fields of different anchors often overlap each other, as you can see from the above graph, which leaves the RPN position-aware. If you want to know the ideas in Overfeat (the first paper about using a CNN to do object detection), please check out my previous post about it.

The Regressor of Bounding Box

If you follow the process of labelling anchors, you can also pick out anchors based on similar criteria for the regressor to refine. One point here is that anchors labelled as background shouldn’t be included in the regression, as we don’t have ground-truth boxes for them. The depth of the feature map is 36 (9 anchors x 4 coordinates).
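The four regression values per anchor are the usual parameterization: centre offsets normalised by the anchor size, plus log ratios of width and height. A NumPy sketch, where `bbox_targets` is a hypothetical helper name:

```python
import numpy as np

def bbox_targets(anchors, gt):
    """Regression targets (tx, ty, tw, th) from anchors to matched GT boxes.

    Boxes are (x1, y1, x2, y2); targets are centre offsets normalised by
    the anchor size plus log width/height ratios, as in the paper.
    """
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = anchors[:, 0] + 0.5 * aw          # anchor centres
    ay = anchors[:, 1] + 0.5 * ah
    gw = gt[:, 2] - gt[:, 0]
    gh = gt[:, 3] - gt[:, 1]
    gx = gt[:, 0] + 0.5 * gw               # ground-truth centres
    gy = gt[:, 1] + 0.5 * gh
    return np.stack([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)], axis=1)

a = np.array([[0.0, 0.0, 100.0, 100.0]])
print(bbox_targets(a, a))  # all zeros: a perfect anchor needs no refinement
```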

The paper uses a smooth-L1 loss on the centre position (x, y) of the box and on the logarithm of the height and width, the same as in Fast R-CNN.
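The smooth-L1 function itself is easy to write down (a minimal NumPy version): it is quadratic for small errors and linear for large ones, so outliers don't dominate the gradient.

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1: 0.5 * x**2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

print(smooth_l1(np.array([-2.0, 0.5, 3.0])))  # [1.5   0.125 2.5  ]
```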

Loss Function of the Regressor

The overall loss of the RPN is a combination of the classification loss and the regression loss.

ROI Pooling

After the RPN, we get proposed regions of different sizes. Different-sized regions mean different-sized CNN feature maps, and it’s not easy to build an efficient structure that works on features of varying size. Region of Interest (ROI) Pooling simplifies the problem by reducing the feature maps to the same size. Unlike max-pooling, which has a fixed filter size, ROI Pooling splits the input feature map into a fixed number (let’s say k) of roughly equal regions, and then applies max-pooling to every region. Therefore the output of ROI Pooling is always k values, regardless of the size of the input. Reference 5 below gives a good explanation of ROI Pooling.
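A minimal single-channel NumPy sketch of the idea follows; `roi_pool` is a hypothetical helper, and real implementations work on multi-channel batches and handle sub-pixel ROI coordinates.

```python
import numpy as np

def roi_pool(feat, roi, out=2):
    """Max-pool the roi (x1, y1, x2, y2) of `feat` (H, W) into out x out bins."""
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, out + 1).astype(int)   # bin edges along x
    ys = np.linspace(y1, y2, out + 1).astype(int)   # bin edges along y
    pooled = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            # Guard against empty bins when the ROI is tiny.
            cell = feat[ys[i]:max(ys[i + 1], ys[i] + 1),
                        xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max()
    return pooled

feat = np.arange(100.0).reshape(10, 10)
print(roi_pool(feat, (0, 0, 10, 10), out=2))
# [[44. 49.]
#  [94. 99.]]
```

Whatever the ROI size, the output is always out x out, so the downstream classifier sees a fixed-length feature.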

With the fixed ROI Pooling outputs as inputs, we have lots of choices for the architecture of the final classifier and regressor.


The paper mentions two training schemes: alternately train the RPN and the final classifier/regressor, or train them jointly at the same time. The latter is 1.5 times faster with similar accuracy. The gradients back-propagate to the shared CNN either way.


References

  1. Fast R-CNN: https://arxiv.org/pdf/1504.08083.pdf
  2. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: https://arxiv.org/pdf/1506.01497.pdf
  3. py-faster-rcnn: https://github.com/rbgirshick/py-faster-rcnn
  4. A guide to receptive field arithmetic for Convolutional Neural Networks: https://medium.com/@nikasa1889/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807
  5. Region of interest pooling explained: https://blog.deepsense.ai/region-of-interest-pooling-explained/
