Faster R-CNN Explained by A.J

Apr 12, 2024

Paper: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks — Ren et al.

Faster R-CNN set out to do four things: remove the region-proposal bottleneck that slowed down earlier detectors, raise overall detection speed toward real-time use, improve accuracy by using deep learning for both region proposal and object detection, and provide a flexible, scalable architecture that is easy to adapt and improve.

Faster R-CNN Architecture — Source: Papers with Code

FEATURE EXTRACTOR

In this stage, an image is passed through a deep CNN architecture designed for image classification, such as ZF, VGG, or ResNet. For example, if VGG-16 is used for feature extraction, the features from the final convolutional layer (before flattening and passing to the fully connected layers in the original VGG-16 architecture) are used. The intuition behind this approach is that the output of the final convolutional layer in VGG is a deep feature map that represents the input image. The map keeps a coarse spatial layout, so each location summarizes the content of the corresponding region of the image, which is what the later stages need to locate objects.
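
To make the size of that feature map concrete, here is a minimal sketch of the arithmetic for VGG-16. It assumes the standard VGG-16 layout (3x3 convolutions with padding 1, which keep height and width unchanged, and five 2x2 max-pooling layers with stride 2). The function name `vgg16_feature_size` is made up for this illustration.

```python
def vgg16_feature_size(input_size, num_pools=5):
    """Spatial size of the VGG-16 feature map after `num_pools` max-pool layers.

    Assumes the standard layout: 3x3 convs with padding 1 preserve the size,
    so only the 2x2, stride-2 max pools change height and width.
    """
    size = input_size
    for _ in range(num_pools):
        size //= 2  # each pool halves the spatial size
    return size


print(vgg16_feature_size(224))     # 7  (after the fifth pool)
print(vgg16_feature_size(224, 4))  # 14 (last conv layer, before the fifth pool)
```

In both cases the map has 512 channels, so a 224x224x3 image becomes 14x14x512 at the last convolutional layer and 7x7x512 after the final pool.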

REGION PROPOSAL NETWORK (RPN) — The Small Network

A small network slides a 3x3 window across the feature map generated by the CNN backbone. At each position it reads a 3x3xC patch of the feature map (where C is the number of channels) and maps it to a lower-dimensional vector (512-d for VGG in the paper). For instance, VGG-16's original classification setup takes a 224x224x3 input, and after its series of convolution and max-pooling layers the feature map is 7x7x512 (14x14x512 at the last convolutional layer, just before the final pool). On the 7x7 map, a 3x3 window with a stride of 1 and no padding visits 25 positions (5 steps in both the x and y directions), giving 25 different 3x3x512 patches. With a padding of 1 it would visit all 49 positions.

For each of these 3x3 windows, nine predefined anchor boxes are proposed: 3 scales times 3 aspect ratios, all sharing the same center, which coincides with the center of the window. Because the same set of anchors is used at every position, the scheme is translation invariant. Additionally, at each position on the feature map, and for each anchor box, the network generates a score indicating the likelihood that the anchor box contains any object (regardless of its class) versus being part of the background. It also generates a set of values suggesting how to adjust the dimensions and position of the anchor box to better fit the object it is predicted to contain, if any.
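
The counting above and the nine anchors per position can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the helper names `sliding_positions` and `make_anchors` are hypothetical, the scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the paper, and anchors are expressed as (x1, y1, x2, y2) in image pixels.

```python
import math


def sliding_positions(feature_size, window=3, stride=1, padding=0):
    """Number of window positions along one axis of the feature map."""
    return (feature_size + 2 * padding - window) // stride + 1


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (x1, y1, x2, y2) that all share the center (cx, cy).

    Each anchor has area scale**2 and height/width equal to `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


print(sliding_positions(7) ** 2)             # 25 positions without padding
print(sliding_positions(7, padding=1) ** 2)  # 49 positions with padding 1
print(len(make_anchors(112, 112)))           # 9 anchors per position
```

With 25 positions and 9 anchors each, the RPN in this toy setting would score 225 candidate boxes.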

REGION PROPOSAL NETWORK (RPN) — 1x1 Convolutional Layers

Regression Layer (Region Calibration): For each of the k anchors at a position (k = 9 here), this layer outputs 4 values (4k in total) that describe how to shift and rescale the anchor. Applying these adjustments to the original anchors gives the proposed boxes.
Classification Layer (Objectness): For each anchor box, this layer produces two scores (2k in total) indicating whether it is an object or background.
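
As a rough sketch of what the two outputs mean for a single anchor, the code below applies four regression values to an anchor and turns the two classification scores into an objectness probability. It assumes the box parameterization from the paper (center offsets scaled by the anchor's width and height, log-scale width and height factors) and a two-class softmax. The function names and the (background, object) score order are choices made for this example.

```python
import math


def decode_box(anchor, deltas):
    """Apply regression values (tx, ty, tw, th) to an anchor (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1            # anchor width and height
    xa, ya = x1 + wa / 2, y1 + ha / 2    # anchor center
    tx, ty, tw, th = deltas
    x, y = xa + tx * wa, ya + ty * ha    # shifted center
    w, h = wa * math.exp(tw), ha * math.exp(th)  # rescaled size
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def objectness(score_bg, score_obj):
    """Two-class softmax: probability that the anchor contains an object."""
    m = max(score_bg, score_obj)  # subtract the max for numerical stability
    e_bg, e_obj = math.exp(score_bg - m), math.exp(score_obj - m)
    return e_obj / (e_bg + e_obj)


# Zero deltas leave the anchor unchanged.
print(decode_box((0, 0, 10, 10), (0.0, 0.0, 0.0, 0.0)))  # (0.0, 0.0, 10.0, 10.0)
print(objectness(0.0, 0.0))                              # 0.5
```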

Region of Interest (RoI) pooling then converts the features inside each RPN proposal into a fixed-size feature map, so that proposals of any shape can be passed to the fully connected layers.
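
A simplified sketch of RoI max pooling on a single-channel feature map may help here. It assumes the RoI is already given in feature-map cell indices (inclusive) and splits it into an `out_size` x `out_size` grid, taking the maximum in each bin. The function `roi_max_pool` and the tiny 2x2 output are for illustration only. Real detectors pool every channel, use a larger output grid, and first map image coordinates onto the feature map.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature, roi, out_size=2):
    """Max-pool the cells inside roi=(x1, y1, x2, y2) into out_size x out_size."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_size):
        # Rows covered by bin i: floor for the start, ceiling for the end,
        # so neighbouring bins may share a row when h is not divisible.
        r0 = y1 + (i * h) // out_size
        r1 = y1 + ceil_div((i + 1) * h, out_size)
        row = []
        for j in range(out_size):
            c0 = x1 + (j * w) // out_size
            c1 = x1 + ceil_div((j + 1) * w, out_size)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
print(roi_max_pool(feature, (0, 0, 3, 3)))  # [[5, 7], [13, 15]]
print(roi_max_pool(feature, (1, 1, 3, 3)))  # [[10, 11], [14, 15]]
```

Both RoIs come out as 2x2 even though one covers 4x4 cells and the other 3x3, which is the point of the layer.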

EVALUATION

After the anchor boxes have been adjusted and their objectness scores determined, many of the resulting proposals overlap heavily, since anchors at the same position share a center and neighboring positions cover nearly the same region. Non-maximum suppression (NMS) ranks the proposals by objectness score (from highest to lowest) and rejects a region if its Intersection-over-Union (IoU) overlap with a higher-scoring selected region exceeds a predetermined threshold (0.7 in the paper).
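
The suppression step can be sketched as a simple greedy loop. This is a plain-Python illustration with hypothetical helper names `iou` and `nms`; boxes are (x1, y1, x2, y2) tuples, and the default threshold of 0.7 follows the value the paper uses for RPN proposals.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
print(nms(boxes, scores))                     # [0, 1, 2]
```

The first two boxes have an IoU of 81/119 (about 0.68), so the second one is dropped at a threshold of 0.5 but survives at 0.7.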

LIMITATIONS

Faster R-CNN is sensitive to the choice of predefined anchor boxes: objects whose size or aspect ratio is poorly matched by the anchor scales and ratios are harder to detect.
While Faster R-CNN improves efficiency over its predecessors, processing high-resolution images remains computationally intensive due to the dense calculation of convolutional features.

IMPACT

Although Faster R-CNN is a two-stage model, it presents a unified framework in which region proposal and object detection share a single convolutional network. This contrasts with the disjointed approach of the earlier R-CNN, which relied on the selective search algorithm to extract regions, a CNN for feature extraction, and SVMs for per-class classification.

Reference

Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arxiv.org

“May The Data Be With You” — Anonymous