Mask R-CNN

Tiba Razmi
3 min read · Jan 8, 2019


Mask R-CNN is a deep neural network aimed at solving the instance segmentation problem in machine learning and computer vision. In other words, it can separate different objects in an image or a video: you give it an image, and it gives you the object bounding boxes, class labels, and masks. So the question is, what is instance segmentation?

Instance segmentation is the task of identifying object outlines at the pixel level. It is one of the hardest computer vision tasks, which becomes clear when you compare it with related tasks:

  • Classification: There is a balloon in this image.
  • Semantic Segmentation: These are all the balloon pixels.
  • Object Detection: There are 7 balloons in this image at these locations. We’re starting to account for objects that overlap.
  • Instance Segmentation: There are 7 balloons at these locations, and these are the pixels that belong to each one.
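The difference between the last two mask-producing tasks can be made concrete with a toy example (shapes and labels here are illustrative, not from any particular implementation): semantic segmentation produces a single label map for the whole image, while instance segmentation produces one binary mask per detected object.

```python
import numpy as np

h, w, num_instances = 4, 4, 2

# Semantic segmentation: one H x W label map; both balloons share label 1.
semantic = np.zeros((h, w), dtype=int)   # 0 = background, 1 = "balloon"
semantic[0:2, 0:2] = 1
semantic[2:4, 2:4] = 1

# Instance segmentation: one H x W binary mask per object.
instances = np.zeros((h, w, num_instances), dtype=bool)
instances[0:2, 0:2, 0] = True            # pixels of balloon #1
instances[2:4, 2:4, 1] = True            # pixels of balloon #2

print(semantic.max(), instances.shape[-1])  # 1 class label vs. 2 instances
```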

Mask R-CNN (Mask Region-based Convolutional Neural Network) is a two-stage framework: the first stage scans the image and generates proposals (areas likely to contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Both stages are connected to the backbone structure.

What is the backbone? It is a standard convolutional neural network (typically ResNet50 or ResNet101) that serves as a feature extractor. The early layers detect low-level features (edges and corners), and later layers successively detect higher-level features (car, person, sky).

Passing through the backbone network, a 1024x1024x3 (RGB) image is converted to a feature map of shape 32x32x2048. This feature map becomes the input for the following stages.
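The spatial shrink comes from the backbone's total stride: a ResNet-style backbone halves the resolution five times (stride 32), which is a quick sanity check to sketch (the stride and channel count below are assumptions matching the 1024x1024 example above):

```python
# Minimal sketch of backbone downsampling: a stride-32, 2048-channel
# ResNet-style feature extractor, as in the example in the text.
def backbone_output_shape(h, w, total_stride=32, channels=2048):
    """Spatial size shrinks by the total stride; depth grows to `channels`."""
    return (h // total_stride, w // total_stride, channels)

print(backbone_output_shape(1024, 1024))  # (32, 32, 2048)
```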

While the backbone described above works great, it can be improved upon. The Feature Pyramid Network (FPN) was introduced by the same authors of Mask R-CNN as an extension that can better represent objects at multiple scales.

FPN improves the standard feature extraction pyramid by adding a second pyramid that takes the high-level features from the first pyramid and passes them down to lower layers. By doing so, it allows features at every level to have access to both lower- and higher-level features.
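The core top-down step can be sketched in a few lines (a simplified NumPy illustration, not the full FPN: the lateral feature is assumed to have already been projected to the shared channel depth by a 1x1 convolution):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of an (H, W, C) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(top_feature, lateral_feature):
    # Top-down pathway: upsample the coarser, higher-level map and add the
    # lateral (lower-level) map of matching channel depth.
    return upsample2x(top_feature) + lateral_feature

# Toy maps: P5 is 4x4, the lateral projection of C4 is 8x8, both 256 channels.
p5 = np.random.rand(4, 4, 256)
c4_lateral = np.random.rand(8, 8, 256)
p4 = fpn_merge(p5, c4_lateral)
print(p4.shape)  # (8, 8, 256)
```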

First stage: a lightweight neural network called the region proposal network (RPN) scans the FPN feature maps top to bottom and proposes regions that may contain objects. While scanning the feature map is efficient, it requires a way to bind features to their locations in the raw image. The solution is anchors: a set of boxes with predefined locations and scales relative to the image. Ground-truth classes (a binary object-vs-background classification at this stage) and bounding boxes are assigned to individual anchors. Since anchors of different scales bind to different levels of the feature pyramid, the RPN uses them to figure out where in the feature map an object should be and what size its bounding box is.
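Anchor generation can be sketched as follows (the scales, ratios, and stride below are illustrative placeholders, not the values used by any particular implementation): each feature-map cell is mapped back to its image-space centre, and a small set of boxes is placed there.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales=(32, 64), ratios=(0.5, 1, 2)):
    """Return (y1, x1, y2, x2) anchor boxes centred on each feature-map cell."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Centre of this cell in raw-image coordinates.
            cy, cx = (i + 0.5) * stride, (j + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the box area ~s^2 while varying the aspect ratio.
                    h, w = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)

a = generate_anchors(4, 4, stride=16)
print(a.shape)  # (96, 4): 4*4 cells x 2 scales x 3 ratios
```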

Second stage: the procedure looks similar to the RPN. The differences are that, without the help of anchors, this stage uses a trick called ROIAlign to locate the relevant areas of the feature map, and that it has a branch generating a pixel-level mask for each object. In ROIAlign, the feature map is sampled at fractional points and bilinear interpolation is applied, instead of rounding coordinates to the nearest cell.
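The bilinear sampling at the heart of ROIAlign can be sketched for a single point on a single-channel map (a simplified illustration; real implementations sample several points per output bin and pool them):

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Read a feature map at a fractional (y, x) position via bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    # Weighted average of the four surrounding cells.
    return ((1 - dy) * (1 - dx) * feature_map[y0, x0]
            + (1 - dy) * dx * feature_map[y0, x1]
            + dy * (1 - dx) * feature_map[y1, x0]
            + dy * dx * feature_map[y1, x1])

fm = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(fm, 1.5, 1.5))  # 7.5: average of cells 5, 6, 9, 10
```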

The most interesting thing I found about Mask R-CNN is that we can actually force different layers in a neural network to learn features at different scales, just as the anchors and ROIAlign do, instead of treating the layers as a black box.

Here are the results of my implementation using Mask R-CNN in TensorFlow:

