Faster RCNNs Explained

Jul 2, 2023

Faster R-CNN (Faster Region-based Convolutional Neural Network) is a popular object detection algorithm. It is a two-stage framework: a region proposal network (RPN) first proposes candidate boxes, and a region-based detection head (R-CNN) then classifies and refines them. The module names used below (GeneralizedRCNNTransform, BackboneWithFPN, and so on) are the ones used in the torchvision implementation.

Let’s delve deeper into each module of the Faster R-CNN architecture, assuming a batch size of 8 and 1000 proposals per image:

1. GeneralizedRCNNTransform:
This module normalizes and resizes the input images so that they can be batched for the rest of the network. In this walkthrough the batch has shape [8, 3, 800, 800], where 8 is the batch size, 3 is the number of channels (RGB), and 800x800 is the image resolution. During training or inference the input images may have different sizes, so GeneralizedRCNNTransform rescales each one (in torchvision, by default so that the shorter side is 800 pixels, subject to a cap on the longer side) and pads the results to a common size. Images that are already 800x800 stay 800x800.
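The resize-and-pad arithmetic can be sketched in a few lines of plain Python. This is only an approximation of what the transform does: the function names are hypothetical, the defaults (min_size=800, max_size=1333, padding to a multiple of 32) are assumed from torchvision's usual configuration, and the exact rounding in the library may differ.

```python
import math
from fractions import Fraction


def resized_shape(height, width, min_size=800, max_size=1333):
    """Scale so the shorter side reaches min_size without the longer side exceeding max_size."""
    # Fractions keep the arithmetic exact; flooring the result is an assumption.
    scale = min(Fraction(min_size, min(height, width)),
                Fraction(max_size, max(height, width)))
    return math.floor(height * scale), math.floor(width * scale)


def padded_batch_shape(shapes, divisor=32):
    """Common (height, width) that fits every image, rounded up to a multiple of divisor."""
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)
    return (math.ceil(max_h / divisor) * divisor,
            math.ceil(max_w / divisor) * divisor)


shapes = [resized_shape(800, 800), resized_shape(600, 800)]
print(shapes, padded_batch_shape(shapes))
```

A batch made only of 800x800 images needs no padding, which is why the rest of this walkthrough can use [8, 3, 800, 800] throughout.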

2. BackboneWithFPN:
The BackboneWithFPN module is the backbone network that extracts features from the transformed images and produces hierarchical feature maps at several resolutions. Here the backbone is a ResNet-style network built from convolutional layers, batch normalization, and ReLU activations. The coarsest output feature map has shape [8, 256, 13, 13], where 8 is the batch size, 256 is the number of channels, and 13x13 is the spatial resolution; the finer pyramid levels have the same 256 channels at larger spatial resolutions.
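To see where 13x13 comes from, here is a small sketch of the stride arithmetic. It assumes the five pyramid levels of a typical ResNet + FPN setup with strides 4, 8, 16, 32, and 64 relative to the input (an assumption, since the strides are not stated above); the helper name is hypothetical.

```python
import math


def feature_map_sizes(height, width, strides=(4, 8, 16, 32, 64)):
    """Spatial size of each pyramid level for a given input size.

    Each stride-2 stage in a ResNet-style network halves the resolution
    and rounds up, so the size at total stride s is ceil(size / s).
    """
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


for stride, size in zip((4, 8, 16, 32, 64), feature_map_sizes(800, 800)):
    print(f"stride {stride:2d}: [8, 256, {size[0]}, {size[1]}]")
```

Under these assumptions, the [8, 256, 13, 13] map is the coarsest level (800 / 64 = 12.5, rounded up to 13).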

3. FeaturePyramidNetwork (FPN):
FPN builds a feature pyramid by combining features from different scales of the backbone network, so that the detector can handle objects of various sizes. It merges backbone feature maps through lateral connections and a top-down pathway: each coarser map is upsampled and added to the next finer one. These merged feature maps form the feature pyramid. Every pyramid level has 256 channels, and the coarsest level has shape [8, 256, 13, 13].
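The merge step can be illustrated with a toy, pure-Python sketch on single-channel grids. It only shows the "upsample the coarser map and add it to the finer one" idea; a real FPN works on tensors and also applies 1x1 lateral convolutions and 3x3 output convolutions, which are left out here. The function names are hypothetical.

```python
def upsample_nearest_2x(grid):
    """Nearest-neighbour upsampling: repeat every value twice along both axes."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(finer, coarser):
    """Add the upsampled coarser map to the finer map, element by element."""
    upsampled = upsample_nearest_2x(coarser)
    return [[a + b for a, b in zip(row_f, row_u)]
            for row_f, row_u in zip(finer, upsampled)]


coarse = [[1, 2],
          [3, 4]]
fine = [[10] * 4 for _ in range(4)]
for row in top_down_merge(fine, coarse):
    print(row)
```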

4. RegionProposalNetwork (RPN):
The RPN generates region proposals (candidate bounding boxes) that are likely to contain objects of interest. It takes the feature pyramid as input and, for each anchor box, predicts an objectness score and bounding box offsets using a small convolutional head (Conv2d layers). After filtering, the RPN outputs 1000 region proposals per image, each a vector of shape [4] holding the box corners (x1, y1, x2, y2), which is the box format torchvision uses. With a batch of 8 images, that is 8000 proposals in total.
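How the RPN narrows many scored anchors down to a fixed number of proposals can be sketched with greedy non-maximum suppression (NMS) followed by a top-n cut. This is a minimal illustration, not the library code: the function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2), and the IoU threshold of 0.7 is an assumed default.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, iou_threshold=0.7, top_n=1000):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return [boxes[i] for i in keep]


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(select_proposals(boxes, scores))
```

In the toy input, the second box overlaps the first with an IoU of 0.81, so it is suppressed and two proposals remain.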

5. RoIHeads (Region of Interest Heads):
RoIHeads process the region proposals generated by the RPN for final object detection. The RoIHeads module consists of the following sub-modules:

— MultiScaleRoIAlign (box_roi_pool):
It extracts fixed-sized feature maps for each region proposal from the feature pyramid. This ensures that the features are aligned correctly and provides a consistent representation for each region, regardless of the initial proposal’s size and aspect ratio.
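Because the features come from a pyramid, each proposal first has to be assigned to one pyramid level before its features are pooled. A common heuristic (from the FPN paper) picks the level from the box size; the sketch below assumes that heuristic, with a canonical scale of 224, a canonical level of 4, and levels 2 to 5. These constants and the function name are assumptions, not values taken from the text above.

```python
import math


def fpn_level(box, k_min=2, k_max=5, canonical_scale=224, canonical_level=4):
    """Pick a pyramid level for an (x1, y1, x2, y2) box: larger boxes go to coarser levels."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    # The small epsilon keeps log2 defined for degenerate (zero-area) boxes.
    level = math.floor(canonical_level + math.log2(scale / canonical_scale + 1e-6))
    return min(max(level, k_min), k_max)


for side in (32, 112, 224, 800):
    print(side, fpn_level((0, 0, side, side)))
```

Small boxes are pooled from the high-resolution levels and large boxes from the coarse ones, and every proposal ends up with a feature map of the same fixed size.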

— TwoMLPHead (box_head):
It is a small fully connected network (two linear layers) that flattens the fixed-sized feature maps from MultiScaleRoIAlign and turns each one into a feature vector; the classification and bounding box regression themselves happen in the next sub-module. The output of the TwoMLPHead is [8000, 1024], where 8000 is the total number of region proposals (8 images × 1000 proposals each) and 1024 is the number of output features.

— FastRCNNPredictor (box_predictor):
This sub-module takes the features from TwoMLPHead and performs the final classification and bounding box regression. It predicts the class scores and refines the bounding box coordinates for each region proposal. The output of the FastRCNNPredictor is [8000, 3] for class scores (with 3 classes) and [8000, 12] for bounding box regression (4 box coordinates for each of the 3 classes).
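The shape bookkeeping through the RoI heads can be traced with a short sketch. The batch size, proposal count, class count, and 1024 features come from the walkthrough above; the 256 channels match the pyramid, while the 7x7 pooled size is an assumed default that the text does not state. The function name is hypothetical.

```python
def roi_head_shapes(batch_size=8, proposals_per_image=1000, num_classes=3,
                    channels=256, pooled_size=7, hidden_features=1024):
    """Return the tensor shape after each RoI-head stage."""
    num_rois = batch_size * proposals_per_image  # proposals from all images are stacked
    return {
        "box_roi_pool": (num_rois, channels, pooled_size, pooled_size),
        "flattened": (num_rois, channels * pooled_size * pooled_size),
        "box_head": (num_rois, hidden_features),
        "cls_score": (num_rois, num_classes),
        "bbox_pred": (num_rois, num_classes * 4),  # 4 coordinates per class
    }


for stage, shape in roi_head_shapes().items():
    print(f"{stage:12s} {list(shape)}")
```

This makes the 12 in the box output explicit: 3 classes times 4 coordinates per class.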

In short, the GeneralizedRCNNTransform module prepares the input images, BackboneWithFPN extracts hierarchical features and combines them into a feature pyramid with its FeaturePyramidNetwork, the RegionProposalNetwork generates region proposals, and the RoIHeads module classifies each proposal and refines its box to detect objects in the image. Together, these modules enable Faster R-CNN to achieve accurate and efficient object detection.

Follow for more :)

Written by Soumyajit Datta