Annotated RPN, ROI Pooling and ROI Align


In this blog post we will implement and understand a few core components of two-stage object detection. Two-stage object detection was made popular by the R-CNN family of models: R-CNN, Fast R-CNN, Faster R-CNN and Mask R-CNN.

All two stage object detectors have a couple of major components:

  • Backbone Network: Base CNN model for feature extraction
  • Region Proposal Network (RPN): Identifying regions in images which have objects, called proposals
  • Region of Interest Pooling and Align: Extracting features from the backbone based on RPN proposals
  • Detection Network: Prediction of final bounding boxes and classes based on a multi-task loss. Mask R-CNN also predicts masks via an additional head using ROI Align output.

Region of Interest (ROI) Pooling and Alignment connect the two stages of detection by extracting features from the backbone network at the locations proposed by the RPN. First let’s look at the region proposal network.

Region Proposal Network

The region proposal network takes as input the feature map from the final convolution layer (or a set of feature maps in case of FPN/U-Net style architectures). To generate region proposals, a 3x3 convolution is first used to generate an intermediate output. This intermediate output is then consumed by a classification head and a regression head. The classification head is a 1x1 convolution, outputting an objectness score for every anchor at each pixel. The regression head is also a 1x1 convolution that outputs the offsets relative to the anchor boxes generated at that pixel.
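
Below is a minimal sketch of such an RPN head in PyTorch. The channel width (256) and the number of anchors per location (9) are illustrative assumptions rather than values taken from any particular model.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        # 3x3 convolution producing the shared intermediate output
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # 1x1 classification head: one objectness score per anchor per pixel
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        # 1x1 regression head: 4 offsets (dx, dy, dw, dh) per anchor per pixel
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        t = torch.relu(self.conv(feature_map))
        return self.cls_logits(t), self.bbox_deltas(t)

# Usage: a single 256-channel feature map of spatial size 50x50
scores, deltas = RPNHead()(torch.randn(1, 256, 50, 50))
print(scores.shape, deltas.shape)  # (1, 9, 50, 50) and (1, 36, 50, 50)
```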

ROI Pooling

Given a feature map and a set of proposals, ROI Pooling returns a pooled feature representation for each proposal. In Faster R-CNN, the region proposal network predicts objectness scores and box regression offsets (w.r.t. the anchors). These offsets are combined with the anchors to generate proposals. These proposals live in the coordinate frame of the input image rather than that of the feature map, so they need to be scaled down to feature-map coordinates.
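
As a rough sketch, decoding the predicted deltas against the anchors and rescaling to feature-map coordinates might look like the following. The standard Faster R-CNN box parameterization is assumed, and the stride of 16 is just an illustrative value.

```python
import torch

def decode_and_scale(anchors, deltas, stride=16):
    """anchors, deltas: (N, 4) tensors; anchors are (x1, y1, x2, y2) in image coords."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    dx, dy, dw, dh = deltas.unbind(dim=1)
    # shift the anchor center and rescale its width/height
    pcx, pcy = cx + dx * w, cy + dy * h
    pw, ph = w * torch.exp(dw), h * torch.exp(dh)
    proposals = torch.stack(
        [pcx - 0.5 * pw, pcy - 0.5 * ph, pcx + 0.5 * pw, pcy + 0.5 * ph], dim=1)
    return proposals / stride  # image coordinates -> feature-map coordinates
```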

Additionally, the proposals can have different widths, heights and aspect ratios. These need to be standardized for a downstream CNN layer to extract features.

ROI Pool aims to solve both these problems. ROI pooling extracts a fixed-length feature vector from the feature map.

ROI max pooling works by dividing the hxw RoI window into an HxW grid of sub-windows, each of approximate size h/H x w/W, and then max-pooling the values in each sub-window. Pooling is applied independently to each feature map channel.
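
Here is a minimal sketch of ROI max pooling for a single proposal, assuming the proposal has already been scaled to feature-map coordinates. Note the two rounds of rounding: once when snapping the ROI corners, and once when computing each sub-window’s boundaries.

```python
import math
import torch

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """feature_map: (C, H, W) tensor; roi: (x1, y1, x2, y2) in feature-map coords."""
    C = feature_map.shape[0]
    H, W = output_size
    # first quantization: snap the ROI corners to integer coordinates
    x1, y1, x2, y2 = [int(round(float(v))) for v in roi]
    h, w = max(y2 - y1, 1), max(x2 - x1, 1)
    out = feature_map.new_zeros((C, H, W))
    for i in range(H):
        for j in range(W):
            # second quantization: floor/ceil the boundaries of each
            # sub-window of approximate size h/H x w/W
            ys, ye = y1 + math.floor(i * h / H), y1 + math.ceil((i + 1) * h / H)
            xs, xe = x1 + math.floor(j * w / W), x1 + math.ceil((j + 1) * w / W)
            out[:, i, j] = feature_map[:, ys:ye, xs:xe].amax(dim=(1, 2))
    return out

# Usage: pool a proposal from a 256-channel feature map into a 7x7 grid
pooled = roi_pool(torch.randn(256, 50, 50), (4.3, 7.9, 20.6, 30.2))
print(pooled.shape)  # torch.Size([256, 7, 7])
```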

ROI Align

As you can see from the implementation of ROIPool, we perform a lot of quantization (i.e. ceil, floor) operations to map the generated proposal to exact x, y indexes (as indexes cannot be floating point). These quantizations introduce misalignments between the ROI and the extracted features. This may not impact detection/classification, which is robust to small perturbations, but it has a large negative effect on predicting pixel-accurate masks. To address this, ROI Align was proposed, which removes all quantization operations. Instead, bi-linear interpolation is used to compute the exact feature values for every proposal.

Similar to ROIPool, the proposal is divided into a fixed number of smaller bins. Within each bin, 4 points are sampled. The feature value at each sampled point is computed with bi-linear interpolation. A max or average operation over the sampled points then gives the final output for that bin.

Bi-Linear Interpolation

Bi-linear interpolation is a common operation in computer vision (especially while resizing images). It works by doing two linear interpolations, one in the x dimension and one in the y dimension, in sequence (the order of x and y does not matter). That is, we first interpolate along the x-axis and then along the y-axis. Wikipedia provides a nice review of this concept.
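
A minimal sketch of bi-linear interpolation at a single floating-point location on one channel of a feature map might look like this:

```python
import math
import torch

def bilinear_interpolate(plane, x, y):
    """plane: (H, W) tensor; returns the interpolated value at float coords (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, plane.shape[1] - 1), min(y0 + 1, plane.shape[0] - 1)
    dx, dy = x - x0, y - y0
    # interpolate along x on the two neighbouring rows, then along y
    top = plane[y0, x0] * (1 - dx) + plane[y0, x1] * dx
    bottom = plane[y1, x0] * (1 - dx) + plane[y1, x1] * dx
    return top * (1 - dy) + bottom * dy
```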

Now that we understand how bi-linear interpolation works, let’s implement ROIAlign.
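
Below is a minimal sketch that reuses the bilinear_interpolate helper above. It samples a 2x2 grid of points per bin (one common way of choosing the 4 sampled points) and averages them; note that no coordinate is ever rounded.

```python
import torch
# reuses bilinear_interpolate from the previous snippet

def roi_align(feature_map, roi, output_size=(7, 7)):
    """feature_map: (C, H, W) tensor; roi: (x1, y1, x2, y2) floats in feature-map coords."""
    C = feature_map.shape[0]
    H, W = output_size
    x1, y1, x2, y2 = [float(v) for v in roi]
    bin_h, bin_w = (y2 - y1) / H, (x2 - x1) / W  # no quantization anywhere
    out = feature_map.new_zeros((C, H, W))
    for i in range(H):
        for j in range(W):
            samples = []
            # 4 regularly spaced sample points per bin (a 2x2 grid)
            for fy in (0.25, 0.75):
                for fx in (0.25, 0.75):
                    y = y1 + (i + fy) * bin_h
                    x = x1 + (j + fx) * bin_w
                    samples.append(torch.stack(
                        [bilinear_interpolate(feature_map[c], x, y) for c in range(C)]))
            # average the sampled values per channel
            out[:, i, j] = torch.stack(samples).mean(dim=0)
    return out
```

Replacing the mean with a max over the four samples gives the max variant mentioned above.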

Summary

In this post, we implemented a few components of modern object detection models and tested them out (see blog). Going through the work of implementing these components helps me better understand the reasoning behind their development. Of course, in actual research work one would always rely on the optimized CUDA implementations. A logical next step would be to implement the remaining components of two-stage object detection and test them out.

Originally published at https://kaushikpatnaik.github.io on July 4, 2020.
