Faster R-CNN: Using Region Proposal Network for Object Detection

Saurabh Bagalkar
Alegion
8 min read · Feb 4, 2020

Introduction

Object detection is a cornerstone of computer vision. It is connected to both image recognition and image segmentation. Where image recognition outputs a classification label for an identified object and image segmentation creates a pixel level understanding of objects in the scene, object detection locates objects within images or videos, allowing them to be tracked and counted. This allows for many of the most popular ML and AI applications such as face detection, autonomous vehicles, video surveillance, and anomaly detection.

One of the most popular object detection methods is the R-CNN series, developed by Ross Girshick et al. in 2014, improved upon with Fast R-CNN and then finally with Faster R-CNN. The differentiating approach that makes Faster R-CNN better and faster is the introduction of the Region Proposal Network (RPN). The RPN is a fully convolutional network, trained end-to-end, that simultaneously predicts object boundaries and objectness scores at each position. With the RPN being so important to Faster R-CNN, which continues to be one of the best object detection frameworks available to researchers, the bulk of this piece will focus on the RPN design and the concepts of anchor boxes and non-maximum suppression.

Need for Speed

R-CNN was introduced in 2014 and gained a lot of interest in the computer vision community. The idea of R-CNN was to use a Selective Search (SS) approach to propose around 2000 Regions of Interest (ROIs), which were then fed into a Convolutional Neural Network (CNN) to extract features. These features were used to classify the objects and refine their boundaries using Support Vector Machines (SVMs) and regression methods. For a more detailed explanation, see this piece. This was quickly followed in early 2015 by Fast R-CNN, a faster and better approach to object detection. Fast R-CNN used an ROI pooling approach, which shares features across the whole image and uses a modified form of the spatial pyramid pooling method to extract features in a computationally efficient way. For a more in-depth explanation, see our last piece. The problem with Fast R-CNN is that it still relies on SS, which is computationally expensive. Although Fast R-CNN takes only 0.32 seconds at test time to make a prediction (versus R-CNN’s 47 seconds), it still takes about 2 seconds to generate the 2000 ROIs, adding up to roughly 2.3 seconds per image.

We still spend 2 seconds on each image with selective search. Source

This shortcoming led researchers to come up with Faster R-CNN, where the test time per image is only 0.2 seconds including region proposals. This is possible because the new approach is a fully differentiable model trained end-to-end. Let’s see how it works in detail.

Faster R-CNN

A Faster R-CNN object detection network is composed of a feature extraction network, typically a pretrained CNN, similar to what we had used for its predecessor. This is followed by two trainable subnetworks. The first is a Region Proposal Network (RPN), which, as its name suggests, is used to generate object proposals, and the second is used to predict the actual class of the object. The primary differentiator for Faster R-CNN is therefore the RPN, which is inserted after the last convolutional layer and trained to produce region proposals directly, without the need for any external mechanism like Selective Search. After this we use ROI pooling and an upstream classifier and bounding box regressor, similar to Fast R-CNN.

Unified Network of Faster R-CNN. Image Source

Architecture and Design

As feature extraction, ROI pooling and the classifier are the same as in the previous versions, we will focus the bulk of this piece on the RPN design and the concepts of anchor boxes and non-maximum suppression. The rest of the details are the same as in the previous versions.

The different layers of Faster R-CNN. Image source

Region Proposal Network

RPN architecture. Image Source

The goal of the RPN is to output a set of proposals, each of which has a score for its probability of being an object; the actual class label is predicted later by the detection head. The RPN can take an input of any size to achieve this task. These proposals are further refined by feeding them to two sibling fully connected layers: one for bounding box regression and the other for box classification, i.e., whether the proposal is foreground or background.

There are 2k classification scores and 4k bounding-box coordinates generated by the RPN at each sliding-window location, where k is the number of anchors per location. Image source

To generate the proposals, the RPN slides a small network over the last shared convolutional feature map. This network takes an n×n spatial window of the feature map as input, and each sliding window is mapped to a lower-dimensional feature. The position of the sliding window provides coarse localization information with reference to the image, while the regression provides finer localization.
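To make this concrete, here is a minimal sketch of the RPN head in Keras-style TensorFlow, assuming k = 9 anchors per location and a 512-channel intermediate feature as in the VGG-based model from the paper. The function name rpn_head and the exact layer sizes are illustrative, not the code of the official implementation.

```python
import tensorflow as tf

def rpn_head(feature_map, num_anchors=9, mid_channels=512):
    """Sketch of the RPN head: a 3x3 'sliding-window' conv followed by two
    sibling 1x1 convs, one for objectness scores and one for box offsets."""
    # The n x n sliding window (n = 3 in the paper), mapped to a lower-dim feature
    shared = tf.keras.layers.Conv2D(mid_channels, 3, padding="same",
                                    activation="relu")(feature_map)
    # 2k scores per spatial location: object vs. background for each of k anchors
    objectness = tf.keras.layers.Conv2D(2 * num_anchors, 1)(shared)
    # 4k values per spatial location: box-regression offsets for each of k anchors
    box_deltas = tf.keras.layers.Conv2D(4 * num_anchors, 1)(shared)
    return objectness, box_deltas

# Example: a VGG-style feature map of shape (batch, height, width, channels)
feats = tf.random.normal([1, 38, 50, 512])
scores, deltas = rpn_head(feats)
print(scores.shape, deltas.shape)   # (1, 38, 50, 18) (1, 38, 50, 36)
```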

Anchor Boxes

The anchor box is one of the most important concepts in Faster R-CNN. Anchor boxes provide a predefined set of bounding boxes of different sizes and aspect ratios that are used as references when the RPN first predicts object locations. These boxes are defined to capture the scale and aspect ratio of the specific object classes you want to detect and are typically chosen based on object sizes in the training dataset. Anchor boxes are typically centered at the sliding window.

The original implementation uses 3 scales and 3 aspect ratios, which means k = 9. If the final feature map from the feature extraction layer has width W and height H, then the total number of anchors generated will be W * H * k.
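As an illustration (a sketch, not the official implementation), generating anchors for every feature-map cell with 3 scales and 3 aspect ratios might look like the following. The scales and ratios below are the ones reported in the paper; the stride and feature-map size are just example numbers.

```python
import numpy as np

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_w * feat_h * k, 4) anchor boxes in image coordinates,
    where k = len(scales) * len(ratios). Boxes are (x1, y1, x2, y2)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this feature-map cell, mapped back to the image
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the anchor area close to scale**2 while varying aspect ratio
                    w = scale * np.sqrt(ratio)
                    h = scale / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# Example: a 50 x 38 feature map with k = 9 gives 50 * 38 * 9 = 17100 anchors
print(generate_anchors(50, 38).shape)   # (17100, 4)
```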

Different scales and aspect ratios of anchor boxes. Image source

Need of Anchor Boxes

The main reason to use anchor boxes is that they allow us to evaluate all object predictions at once. They help speed up and improve the efficiency of the detection portion of a deep learning framework. Anchor boxes also help detect multiple objects, objects of different scales, and overlapping objects without scanning an image with a sliding window that computes a separate prediction at every potential position, as we saw in previous versions of R-CNN. This makes real-time object detection possible.

The improvement in speed is possible because anchor boxes are translation invariant, so the same ones can be used at every location. As we mentioned before, the information from these anchor boxes is relayed to the regression and classification layers, where regression gives offsets from the anchor boxes and classification gives the probability that each regressed anchor contains an object.

How do Anchor Boxes help in identifying an object?

Although anchors take the final feature map as input, the final anchors refer back to the original image. This is possible because of the convolutional correspondence property of CNNs, which enables extracted features to be associated back to their location in the image. For an input image of width W and height H and a downsampling ratio d, the feature map will have dimensions W/d * H/d. In other words, assuming one anchor point at each spatial location of the feature map, each anchor point in the image will be separated by d pixels. A value of 4 or 16 is common for d, which also corresponds to the stride between tiled anchor boxes. This is a tunable parameter in the configuration of Faster R-CNN; making it too low or too high can give rise to localization errors. One way to mitigate these localization errors is to learn the offsets applied to each anchor box, which is the goal of the regression layer we discussed above.
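As a small sketch of that last point, here is how the learned offsets can be applied to an anchor box using the standard (tx, ty, tw, th) parameterization from the R-CNN family; the function name decode_offsets is made up for illustration.

```python
import numpy as np

def decode_offsets(anchors, deltas):
    """Apply learned regression offsets (tx, ty, tw, th) to anchor boxes.

    anchors: (N, 4) boxes as (x1, y1, x2, y2) in image coordinates
    deltas:  (N, 4) predicted offsets for each anchor
    Returns refined (N, 4) boxes using the standard R-CNN parameterization:
        x = xa + tx * wa,  y = ya + ty * ha,  w = wa * exp(tw),  h = ha * exp(th)
    """
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa           # anchor centres
    ya = anchors[:, 1] + 0.5 * ha

    tx, ty, tw, th = deltas.T
    x = xa + tx * wa                        # shift the centre
    y = ya + ty * ha
    w = wa * np.exp(tw)                     # rescale width and height
    h = ha * np.exp(th)

    return np.stack([x - 0.5 * w, y - 0.5 * h,
                     x + 0.5 * w, y + 0.5 * h], axis=1)
```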

Downsampling ratios of CNN feature maps. Image source

Anchor boxes at each spatial location mark an object as foreground or background depending on their IOU with the ground truth. All the anchors are placed in a mini-batch and trained using softmax cross entropy for the classification loss and smooth L1 loss for regression. We use smooth L1 loss because the regular L1 loss function is not differentiable at 0.
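For reference, here is a minimal NumPy sketch of the smooth L1 loss (with beta = 1, as used in the Fast/Faster R-CNN papers): it is quadratic near zero, so the gradient is well defined at 0, and linear for large errors, so it is less sensitive to outliers than L2.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) loss used for box regression:
    quadratic for |x| < beta (differentiable at 0), linear otherwise."""
    abs_x = np.abs(x)
    return np.where(abs_x < beta, 0.5 * abs_x ** 2 / beta, abs_x - 0.5 * beta)
```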

Non-Maximum Suppression (NMS)

NMS is a second stage of filtering used to get rid of overlapping boxes: even after filtering by thresholding on the class scores, we still end up with a lot of overlapping boxes.

NMS pseudocode.

Overview of NMS:

  1. Select the box that has the highest score.
  2. Compute its overlap with all other boxes, and remove the boxes that overlap it more than the IOU threshold.
  3. Go back to step 1 and iterate over the remaining boxes until none are left to process.

This will remove all boxes that have a large overlap with the selected boxes, so only the “best” boxes remain. There are further improvements to this method, known as Soft-NMS.
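The steps above map almost directly to code. Here is a minimal NumPy sketch of greedy NMS (not any particular library’s implementation), operating on boxes given as (x1, y1, x2, y2):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]          # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]                        # step 1: pick the best remaining box
        keep.append(i)

        # step 2: IOU of the picked box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)

        # step 3: keep only boxes below the overlap threshold, then repeat
        order = order[1:][iou <= iou_threshold]
    return keep
```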

Conclusion

We covered the primary concepts of the RPN and anchor boxes in this piece, which are the core ideas that make Faster R-CNN better and faster than its predecessors. We skipped over feature extraction and ROI pooling because these concepts are the same as those used in Fast R-CNN and have been explained in depth in our last article. Faster R-CNN is still widely used today and remains one of the best object detection frameworks available to researchers. For a full implementation and TensorFlow code, refer to this official GitHub repo.

Learn More

Interested in learning more about machine learning? Check out No BiAS, a podcast about the emerging and ever-shifting terrain of artificial intelligence and machine learning hosted by Melody Travers and Saurabh Bagalkar.
