From R-CNN to Mask R-CNN

Umer Farooq
6 min read · Feb 15, 2018


Region-Based Convolutional Neural Network

The purpose of R-CNNs (Region-Based Convolutional Neural Networks) is to solve the problem of object detection: given an image, we want to be able to draw bounding boxes over all of the objects in it. The process can be split into two general components, the region proposal step and the classification step.

The authors note that any class-agnostic region proposal method should fit; selective search is used in particular for R-CNN. Selective search generates roughly 2,000 regions that have the highest probability of containing an object. After we’ve come up with a set of region proposals, these proposals are “warped” into an image size that can be fed into a trained CNN (AlexNet in this case) that extracts a feature vector for each region. This vector is then used as the input to a set of linear SVMs that are trained for each class and output a classification. The vector also gets fed into a bounding-box regressor to obtain more accurate coordinates.
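For a concrete feel of the proposal step, here is a minimal sketch using the selective search implementation that ships with OpenCV’s contrib package (this assumes opencv-contrib-python is installed, and the image path is just a placeholder):

```python
# A minimal sketch of the region-proposal step using OpenCV's selective
# search implementation (requires the opencv-contrib-python package).
import cv2

img = cv2.imread("example.jpg")  # placeholder: any test image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # the faster, lower-recall variant

rects = ss.process()               # N x 4 array of (x, y, w, h) boxes
proposals = rects[:2000]           # R-CNN keeps roughly 2,000 proposals
print(f"{len(proposals)} region proposals")
```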

Non-maximum suppression is then used to discard bounding boxes that have a significant overlap with each other, keeping only the highest-scoring box in each cluster.
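Greedy non-maximum suppression is simple enough to sketch in a few lines of NumPy. This is a generic version, not the papers’ exact code:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of (x1, y1, x2, y2)
    scores: (N,) detection confidences
    Returns the indices of the boxes to keep.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                      # keep the best remaining box
        # Intersection of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```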

Object detection with R-CNN

It consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.

Model Design

Region Proposals

R-CNN is agnostic to the particular region proposal method; the authors use selective search to enable a controlled comparison with prior detection work.

Feature Extraction

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN’s fixed input size.
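As a rough sketch of that warping step: R-CNN warps every proposal to AlexNet’s 227×227 input, regardless of aspect ratio (the paper also pads the box with some surrounding image context, which is omitted here for brevity):

```python
import cv2

def warp_region(img, box, size=227):
    """Crop a proposal (x, y, w, h) and warp it to the CNN's fixed
    input size, ignoring the original aspect ratio, as R-CNN does.
    (The paper additionally pads the crop with context; omitted here.)"""
    x, y, w, h = box
    crop = img[y:y + h, x:x + w]
    return cv2.resize(crop, (size, size))  # 227x227 is AlexNet's input in R-CNN
```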

Training

Supervised pre-training

The CNN is pre-trained on a large auxiliary dataset (ImageNet classification) using only image-level annotations.

Domain-specific fine-tuning.

To adapt this CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals.
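A minimal sketch of this fine-tuning step, using modern PyTorch/torchvision as a stand-in for the original Caffe setup (the batch of warped proposals and their labels are assumed to come from your own data pipeline):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from the ImageNet-pre-trained network and replace the classifier
# head with (num_classes + 1) outputs: the object classes plus background.
num_classes = 20  # e.g. PASCAL VOC
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(4096, num_classes + 1)

optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# warped_batch: (B, 3, 227, 227) warped proposals; labels: (B,) class ids
def fine_tune_step(warped_batch, labels):
    optimizer.zero_grad()
    loss = criterion(net(warped_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```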

Object category classifiers.

After feature extraction and the assignment of training labels, one linear SVM is optimized per class, using the standard hard negative mining method.
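A rough scikit-learn sketch of one per-class SVM with a simple hard-negative loop; the pos_feats/neg_feats arrays are assumed to be CNN features extracted as above, and the paper’s exact mining schedule differs:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_class_svm(pos_feats, neg_feats, rounds=3):
    """One per-class linear SVM with a simplified hard-negative loop:
    each round retrains on the positives plus the negatives that the
    current model fails to push outside the margin."""
    hard_neg = neg_feats[:1000]                 # initial negative pool
    svm = LinearSVC(C=0.001)
    for _ in range(rounds):
        X = np.vstack([pos_feats, hard_neg])
        y = np.concatenate([np.ones(len(pos_feats)),
                            np.zeros(len(hard_neg))])
        svm.fit(X, y)
        scores = svm.decision_function(neg_feats)
        hard_neg = neg_feats[scores > -1.0]     # negatives violating the margin
    return svm
```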

Fast R-CNN

Deep ConvNets have significantly improved image classification and object detection accuracy. The region-based convolutional network method achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

1. Training is a multi-stage pipeline: ConvNet, then SVMs, then bounding-box regressors.

2. Training is expensive in space and time.

3. Object detection is slow.

Fast R-CNN was able to solve the problem of speed by sharing the computation of the conv layers between different proposals and swapping the order of generating region proposals and running the CNN. In this model, the image is first fed through a ConvNet, features of the region proposals are obtained from the last feature map of the ConvNet (see section 2.1 of the paper for more details), and lastly we have fully connected layers as well as our regression and classification heads.

Architecture and Training

An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
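The RoI pooling step is available off the shelf in torchvision, which makes it easy to see how RoIs of different sizes all come out as fixed-size tensors the FC heads can consume. A small sketch with dummy data:

```python
import torch
from torchvision.ops import roi_pool

# Stand-in for the shared ConvNet's last feature map, e.g. stride-16 backbone
feature_map = torch.randn(1, 256, 32, 32)

# RoIs in (batch_index, x1, y1, x2, y2) format, in input-image coordinates
rois = torch.tensor([[0, 10.0, 20.0, 200.0, 180.0],
                     [0, 50.0, 60.0, 120.0, 150.0]])

# Pool every RoI to a fixed 7x7 grid; spatial_scale maps image coordinates
# onto the downsampled feature map (1/16 for a stride-16 backbone).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # (2, 256, 7, 7): one fixed-size tensor per RoI
```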

Faster R-CNN

Faster R-CNN works to combat the somewhat complex training pipeline that both R-CNN and Fast R-CNN exhibited.

The authors insert a region proposal network (RPN) after the last convolutional layer. This network is able to just look at the last convolutional feature map and produce region proposals from that. From that stage, the same pipeline as Fast R-CNN is used (RoI pooling, FCs, and then classification and regression heads).

Faster R-CNN is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions. The entire system is a single, unified network for object detection.
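The RPN works by scoring and regressing a fixed set of anchor boxes at every feature-map position (3 scales × 3 aspect ratios = 9 anchors per cell in the paper). A NumPy sketch of the anchor generation, with details simplified:

```python
import numpy as np

def make_anchors(fm_h, fm_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors (9 here) centered
    at every cell of an fm_h x fm_w feature map with the given stride."""
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx, cy = x * stride + stride // 2, y * stride + stride // 2
            for s in scales:
                for r in ratios:
                    # width/height chosen so the anchor area stays s*s
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (fm_h * fm_w * 9, 4) boxes as (x1, y1, x2, y2)

print(make_anchors(32, 32).shape)  # (9216, 4)
```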

Mask R-CNN — Extending Faster R-CNN for Pixel Level Segmentation

So far, we’ve seen how we’ve been able to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes.

Can we extend such techniques to go one step further and locate the exact pixels of each object instead of just bounding boxes? This problem, known as instance segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch, as before, is just a fully convolutional network on top of a CNN-based feature map. Here are its inputs and outputs:

  • Inputs: CNN Feature Map.
  • Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).
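In practice the mask branch emits per-pixel logits (a 28×28 grid per RoI in the paper), which become the binary mask described above via a sigmoid and a threshold. A tiny sketch:

```python
import torch

# Stand-in for the mask branch's raw output for one RoI and one class;
# Mask R-CNN predicts a 28x28 logit grid per RoI.
mask_logits = torch.randn(28, 28)

# A per-pixel sigmoid gives object probabilities; thresholding at 0.5
# yields the 0/1 matrix described in the bullets above.
binary_mask = (torch.sigmoid(mask_logits) > 0.5).to(torch.uint8)
print(binary_mask.shape)  # torch.Size([28, 28])
```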

But the Mask R-CNN authors had to make one small adjustment to make this pipeline work as expected.

When run without modifications on the original Faster R-CNN architecture, the Mask R-CNN authors realized that the regions of the feature map selected by RoIPool were slightly misaligned from the regions of the original image. Since image segmentation requires pixel level specificity, unlike bounding boxes, this naturally led to inaccuracies.

The authors solved this problem by replacing RoIPool with a more precisely aligned operation called RoIAlign. Suppose a region boundary, once mapped onto the feature map, falls at pixel 2.93. RoIPool would round this down to 2, causing a slight misalignment. RoIAlign avoids such rounding: it uses bilinear interpolation to estimate what the feature values would be at pixel 2.93 exactly. This, at a high level, is what allows us to avoid the misalignments caused by RoIPool.
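Both operations ship in torchvision, which makes the difference easy to try out; the box below maps to exactly the fractional coordinate from the example (46.9 / 16 ≈ 2.93):

```python
import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.randn(1, 256, 32, 32)
# A box whose scaled coordinates land between pixels (46.9 / 16 = 2.93)
rois = torch.tensor([[0, 46.9, 46.9, 150.0, 150.0]])

# RoIPool quantizes the box to integer feature-map coordinates...
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)

# ...while RoIAlign samples at the exact fractional positions using
# bilinear interpolation, so no rounding misalignment is introduced.
aligned = roi_align(feature_map, rois, output_size=(7, 7),
                    spatial_scale=1 / 16, sampling_ratio=2)
print(pooled.shape, aligned.shape)  # both (1, 256, 7, 7)
```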

Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to produce remarkably precise instance segmentations.
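Beyond the research code linked below, torchvision also ships a COCO-pre-trained Mask R-CNN that runs end to end in a few lines (the score and mask thresholds here are just illustrative defaults):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Off-the-shelf, COCO-pre-trained Mask R-CNN from torchvision
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)      # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    out = model([image])[0]          # dict of boxes, labels, scores, masks

keep = out["scores"] > 0.5           # drop low-confidence detections
masks = out["masks"][keep] > 0.5     # (K, 1, H, W) binary instance masks
print(out["boxes"][keep].shape, masks.shape)
```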

Implementation

R-CNN

Code: https://github.com/rbgirshick/rcnn

Paper: https://arxiv.org/abs/1311.2524

Fast R-CNN

Code: https://github.com/rbgirshick/fast-rcnn

Paper: https://arxiv.org/abs/1504.08083

Faster R-CNN

Code: https://github.com/rbgirshick/py-faster-rcnn

Paper: https://arxiv.org/abs/1506.01497

Mask R-CNN

Code: https://github.com/CharlesShang/FastMaskRCNN

Paper: https://arxiv.org/abs/1703.06870
