Review: Mask R-CNN (Instance Segmentation & Human Pose Estimation)

Outperforms MNC and FCIS in instance segmentation, as well as CMU-Pose and G-RMI in human pose estimation

Sik-Ho Tsang
Analytics Vidhya
8 min read · Apr 6, 2020


In this story, the very famous Mask R-CNN, by Facebook AI Research (FAIR), is reviewed. Mask R-CNN generalizes easily to many tasks, such as instance segmentation, bounding-box object detection, and person keypoint detection. It is implemented in Detectron, FAIR's software system that powers numerous research projects.

Mask R-CNN for instance segmentation (Image from Authors’ Paper)

Mask R-CNN is one of the most important deep-learning-based computer vision papers in the literature. At the time of publication, it outperformed all existing single-model entries on every task, including the COCO 2016 challenge winners. In the COCO 2017 challenge, the winning networks were also based on Mask R-CNN.

This is a 2017 ICCV paper with over 5000 citations, and it won the Marr Prize at ICCV 2017. (Sik-Ho Tsang @ Medium)

Outline

  1. What is Instance Segmentation?
  2. From R-CNN, Fast R-CNN, Faster R-CNN, to Mask R-CNN
  3. Mask R-CNN Network Overview & Loss Function
  4. RoIAlign
  5. Head Structure Details
  6. Instance Segmentation Results
  7. Human Pose Estimation (Keypoint Detection) Results

1. What is Instance Segmentation?

(Image from Authors’ PPT)
  • Classification: Just classify the main object in the image.
  • Classification + Localization: We also want to know the bounding box of the main object.
  • Object Detection: There are multiple objects in the image, and we want to know the class and the bounding box of each object for all known classes.
  • Instance Segmentation: Detect all objects in the image, and segment each individual object instance with a pixel-level mask, not just a bounding box.

2. From R-CNN, Fast R-CNN, Faster R-CNN, to Mask R-CNN

  • To understand the Mask R-CNN network architecture well, it is better to start from R-CNN.
  • (It is better to have a basic understanding of R-CNN, Fast R-CNN, and Faster R-CNN first. Here's just a quick recap.)

2.1. R-CNN

R-CNN (Image from Authors’ PPT)
  • In R-CNN, at the bottom of the pipeline, the non-deep-learning-based Selective Search (SS) is used to generate about 2k region proposals.
  • Each region proposal is warped and passed through a Convolutional Neural Network (CNN), with a Support Vector Machine (SVM) at the end, to output the classification and bounding box.
  • (If interested, please read R-CNN for more details.)

2.2. Fast R-CNN

Fast R-CNN (Image from Authors’ PPT)
  • In Fast R-CNN, SS is still used for generating 2k region proposals.
  • But, different from R-CNN, the whole input image goes through the CNN for feature extraction to generate the feature maps. These feature maps are then shared for RoI pooling on each region proposal.
  • For each region proposal, RoI pooling is performed so that the pooled features can go through the rest of the network, i.e. the fully connected (FC) layers. No SVM is used anymore.
  • Finally, the classification and bounding box are output by the fully connected (FC) layers.
  • But the region proposal part still uses the non-deep-learning-based SS approach.
  • (If interested, please read Fast R-CNN for more details.)

2.3. Faster R-CNN

Faster R-CNN (Image from Authors’ PPT)
  • In Faster R-CNN, the input image goes through the CNN. The resulting feature maps are used by the Region Proposal Network (RPN) to generate region proposals, and are also shared for RoI pooling later on.
  • In this case, SS is no longer used; instead, a single CNN does everything. Thus, the whole network is an end-to-end deep learning network, which is essential for gradient propagation and improves object detection accuracy.
  • Similar to Fast R-CNN, for each region proposal, RoI pooling is performed so that the pooled features can go through the rest of the network, i.e. the fully connected layers. Finally, the classification and bounding box are output.
  • (If interested, please read Faster R-CNN for more details.)

2.4. Mask R-CNN

Mask R-CNN (Image from Authors’ PPT)
  • In Mask R-CNN, the architecture is very close to that of Faster R-CNN. The main difference is that, at the end of the network, there is an additional head, i.e. the mask branch in the figure above, which generates the mask for instance segmentation. (A quick usage sketch follows below.)
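
For a quick feel of what the model produces, here is a minimal inference sketch using torchvision's off-the-shelf Mask R-CNN implementation (not the authors' original Detectron code; assumes torchvision ≥ 0.13 for the weights API):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN with a ResNet-50-FPN backbone, pre-trained on COCO.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# The model takes a list of 3xHxW float tensors with values in [0, 1].
image = torch.rand(3, 480, 640)  # stand-in for a real image
with torch.no_grad():
    outputs = model([image])

# One dict per image: boxes (N, 4), labels (N,), scores (N,),
# and masks (N, 1, H, W) holding per-instance soft masks.
print(outputs[0]["boxes"].shape, outputs[0]["masks"].shape)
```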

3. Mask R-CNN Network Overview & Loss Function

3.1. Two-Stage Architecture

  • A two-stage architecture is used, just like in Faster R-CNN.
  • First stage: a Region Proposal Network (RPN) generates the region proposals (candidates). Each region proposal then goes through the second stage.
  • Second stage: for each region proposal, the feature maps computed in the first stage are RoI-pooled according to the region (using RoIAlign, see Section 4) and go through the remaining network, which outputs the class, the bounding box, and a binary mask.
  • A more detailed network architecture is given in Section 5.

3.2. Loss Function

  • Therefore, the loss function is a multi-task loss: L = Lcls + Lbox + Lmask.
  • Lcls: The classification loss, same as Faster R-CNN.
  • Lbox: The bounding box loss, same as Faster R-CNN.
  • Lmask: The binary mask loss. The mask branch outputs Km² values for each RoI, i.e. K binary masks of resolution m×m, one for each of the K classes.

3.3. Lmask

  • Per-pixel sigmoid is applied.
  • Average binary cross-entropy loss is used for Lmask.
  • Lmask is only defined on the k-th mask for an RoI whose ground-truth class is k; the other K−1 mask outputs do not contribute to the loss.
  • Thus, it is different from the Fully Convolutional Network (FCN) for semantic segmentation, where a per-pixel softmax and a multinomial cross-entropy loss are used. Here, in contrast, mask and class prediction are decoupled into Lmask and Lcls.
  • To preserve the spatial structure of the mask, the m×m mask for each RoI is predicted using an FCN. The advantage over using FC layers is that fewer parameters are required. (A sketch of this loss follows below.)
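
As a concrete illustration, here is a minimal PyTorch sketch of this decoupled mask loss; the function and tensor names are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """A minimal sketch of Lmask.

    mask_logits: (N, K, m, m) raw outputs of the mask branch, one
                 m x m mask per class for each of N RoIs.
    gt_masks:    (N, m, m) binary ground-truth masks.
    gt_classes:  (N,) ground-truth class index k for each RoI.
    """
    n = mask_logits.shape[0]
    # Only the k-th mask (k = ground-truth class) contributes to the
    # loss; the other K-1 masks for the RoI are ignored.
    selected = mask_logits[torch.arange(n), gt_classes]  # (N, m, m)
    # Per-pixel sigmoid + average binary cross-entropy.
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())

# Example: 8 RoIs, 80 classes, 28x28 masks.
logits = torch.randn(8, 80, 28, 28)
masks = torch.randint(0, 2, (8, 28, 28))
classes = torch.randint(0, 80, (8,))
print(mask_loss(logits, masks, classes))
```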

4. RoIAlign

4.1. RoIPool in Faster R-CNN

RoIPool in Faster R-CNN (Image from Authors’ PPT)
  • An example of RoIPool in Faster R-CNN is shown above.
  • First, we have the input feature map, shown at the left of the figure.
  • According to the region proposal, a 7×5 region is used as input to RoIPool to output a 2×2 feature map.
  • Each black rectangle is rounded (quantized) to integer coordinates for the later pooling.
  • Each value of the output feature map corresponds to the maximum value within one black rectangle, i.e. max pooling.

4.2. RoIAlign in Mask R-CNN

RoIAlign in Mask R-CNN (Image from Authors’ PPT)
  • An example of RoIAlign in Mask R-CNN is shown above.
  • Instead of rounding the black rectangles to integer coordinates, black rectangles of equal size are used.
  • Based on how much each rectangle overlaps the feature map cells, bilinear interpolation is used to obtain an intermediate pooled feature map, as shown at the bottom right of the figure.
  • Then, max pooling is performed on this intermediate pooled feature map. (Both operations are sketched in code below.)
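
Both operations are available in torchvision.ops, which makes the difference easy to see side by side. A small sketch (the feature map and RoI values are made up):

```python
import torch
from torchvision.ops import roi_pool, roi_align

# A single-channel feature map and one RoI in (x1, y1, x2, y2) format,
# prefixed by the batch index as torchvision expects.
features = torch.arange(64, dtype=torch.float32).reshape(1, 1, 8, 8)
rois = torch.tensor([[0, 1.3, 0.7, 6.2, 5.1]])  # deliberately non-integer

# RoIPool quantizes the RoI to integer coordinates before max pooling.
pooled = roi_pool(features, rois, output_size=(2, 2), spatial_scale=1.0)

# RoIAlign keeps the floating-point coordinates and samples each bin
# with bilinear interpolation, so nothing is lost to rounding.
aligned = roi_align(features, rois, output_size=(2, 2),
                    spatial_scale=1.0, sampling_ratio=2)
print(pooled.squeeze(), aligned.squeeze(), sep="\n")
```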

5. Head Structure Details

Network Architecture Variants (Image from Authors’ PPT)
  • Three backbone variants are tried: ResNet, ResNeXt, and their FPN versions.
  • Left: When ResNet/ResNeXt is used without FPN, further convolution is performed first before the network splits into two heads: one head for classification and bounding box, and one head for the mask.
  • Right: When ResNet/ResNeXt is used with FPN, the network splits directly into the two heads: one head for classification and bounding box, and one head for the mask. (A sketch of this mask head follows below.)
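
For concreteness, here is a rough PyTorch sketch of the FPN-style mask head on the right, following the four 3×3 convs, 2×2 stride-2 deconv, and 1×1 conv described in the paper; initialization and other training details are omitted:

```python
import torch
import torch.nn as nn

K = 80  # number of classes (COCO)

layers = []
for _ in range(4):  # four 3x3, 256-d convs
    layers += [nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU()]
# 2x2 stride-2 deconv upsamples the mask resolution by 2x.
layers += [nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2), nn.ReLU()]
# 1x1 conv outputs K class-specific masks (per-pixel sigmoid is in the loss).
layers += [nn.Conv2d(256, K, kernel_size=1)]
mask_head = nn.Sequential(*layers)

roi_features = torch.randn(8, 256, 14, 14)  # 8 RoIs after RoIAlign
print(mask_head(roi_features).shape)        # -> (8, 80, 28, 28)
```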

6. Instance Segmentation Results

  • Dataset: MS COCO with 80 classes. Training uses the 80k train images plus a 35k subset of the val images (trainval35k); ablation experiments are reported on the remaining 5k val images (minival).

6.1. Ablation Study

Backbone Architecture
  • Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
Multinomial vs. Independent Masks
  • Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
RoIAlign vs. RoIPool & RoIWarp
  • RoIWarp was originally used in MNC, while RoIPool was already used in Fast R-CNN.
  • The RoIAlign layer improves AP by about 3 points and AP75 by about 5 points. Proper alignment is the only factor that contributes to the large gap between the RoI layers.
RoIPool vs. RoIAlign with Stride-32 Features
  • ResNet-50-C5, with a stride of 32, is used.
  • Misalignments are more severe than with stride-16 features, resulting in a massive accuracy gap between RoIPool and RoIAlign.
Mask Branch Variants Using ResNet-50-FPN
  • FCNs improve results compared to an MLP (Multi-Layer Perceptron, i.e. FC layers), as they take advantage of explicitly encoding the spatial layout.

6.2. Qualitative Results

Mask R-CNN on COCO test images, using ResNet-101-FPN (Image from Authors’ paper)

6.3. SOTA Approaches Comparison

Instance segmentation mask AP on COCO test-dev
  • MNC and FCIS are the winners of the COCO 2015 and 2016 segmentation challenges, respectively.
  • Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM. All entries are single-model results.

7. Human Pose Estimation (Keypoint Detection) Results

  • The Mask R-CNN framework can easily be extended to human pose estimation.
  • A keypoint's location is modeled as a one-hot mask: Mask R-CNN predicts K masks, one for each of K keypoint types (e.g., left shoulder, right elbow). (A loss sketch follows below.)
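
Concretely, the paper trains this branch with an m²-way softmax cross-entropy over each keypoint heatmap, which encourages a single peak per keypoint. A minimal sketch (tensor names are illustrative; a real implementation would also mask out invisible keypoints):

```python
import torch
import torch.nn.functional as F

def keypoint_loss(kp_logits, gt_xy):
    """Sketch of the keypoint loss.

    kp_logits: (N, K, m, m) one m x m heatmap per keypoint type.
    gt_xy:     (N, K, 2) integer (x, y) of each ground-truth keypoint.
    """
    n, k, m, _ = kp_logits.shape
    # Each keypoint location is a one-hot target over the m*m grid, so
    # an m^2-way softmax cross-entropy is used (one "class" per pixel).
    flat_logits = kp_logits.reshape(n * k, m * m)
    target = (gt_xy[..., 1] * m + gt_xy[..., 0]).reshape(n * k)
    return F.cross_entropy(flat_logits, target)

# Example: 4 person RoIs, 17 COCO keypoint types, 56x56 heatmaps.
logits = torch.randn(4, 17, 56, 56)
gt = torch.randint(0, 56, (4, 17, 2))
print(keypoint_loss(logits, gt))
```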

7.1. Ablation Study

RoIPool vs RoIAlign
  • Again, using proper alignment is the only factor that contributes to the large gap between RoI layers.
Multi-task learning of box, mask, and keypoint for the person category, evaluated on minival
  • Adding the keypoint branch reduces the box/mask AP slightly, suggesting that while keypoint detection benefits from multitask training, it does not in turn help the other tasks.
  • Nevertheless, learning all three tasks jointly enables a unified system to efficiently predict all outputs simultaneously.

7.2. Qualitative Results

Keypoint detection results on COCO test using Mask R-CNN (ResNet-50-FPN) (Image from Authors’ paper)

7.3. SOTA Approaches Comparison

Keypoint detection AP on COCO test-dev.
  • CMU-Pose+++ is the 2016 competition winner, which uses multi-scale testing and post-processing, while G-RMI was the first runner-up in 2016.
  • Mask R-CNN (62.7 APkp) is 0.9 points higher than the COCO 2016 keypoint detection winner.
  • Using mask labels during training also helps to increase the keypoint detection AP.

There are also results for semantic segmentation on the Cityscapes dataset in the paper, as well as enhanced results on COCO in the appendix. If interested, please read the paper for more details.
