Understanding Mask R-CNN

Aparna Singh · Published in DeveLearn · Feb 19, 2021

The slightly slower but far better brother to Faster R-CNN

Mask R-CNN is an instance segmentation model that lets us identify the pixel-wise location of each object of a class. "Instance segmentation" means segmenting individual objects within a scene, regardless of whether they are of the same type, i.e. identifying individual cars, people, etc. Check out a Mask R-CNN model trained on the COCO dataset below. As you can see, we can classify the pixels of each object's location.

Spectacular result of instance-level segmentation by Mask R-CNN in the wild.

Mask R-CNN differs from classical object detection models such as Faster R-CNN in that, besides predicting the class and its bounding-box position, it can also color the pixels in the bounding box that correspond to that class. When might we need this extra detail? I can think of examples such as:

Self-driving cars need to know the exact position of the road, and probably of other cars, to avoid collisions.

Robots may need to locate the pixels of the objects they want to pick up.

How does Mask R-CNN work?

Let's understand how it actually works:

A good way to think about Mask R-CNN is as the combination of a Faster R-CNN, which detects objects (class + bounding box), with an FCN (Fully Convolutional Network), which produces pixel-wise boundaries. See the figure below:

Mask R-CNN is conceptually simple: for each candidate object, Faster R-CNN has two outputs, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask, a binary mask indicating which pixels in the bounding box belong to the object.

But the additional mask output is distinct from the class and box outputs: it requires extracting an object's much finer spatial structure. Mask R-CNN does this using the Fully Convolutional Network (FCN) described below.

The FCN is a common architecture for semantic segmentation. It uses a series of convolution blocks and max-pooling layers to first downsample an image to 1/32nd of its original size. At that coarse granularity, it then makes a per-pixel class prediction.

Eventually, it resizes the prediction back to the original image dimensions using upsampling and deconvolution (transposed convolution) layers.
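To make that downsample-then-upsample idea concrete, here is a minimal toy sketch in PyTorch. The layer sizes and class count are illustrative assumptions, not the exact FCN from the paper:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy FCN: downsample with conv + pooling, classify, then upsample back."""
    def __init__(self, num_classes=21):
        super().__init__()
        # Encoder: each pooling stage halves the spatial size.
        # Five stages -> 1/32 of the input resolution, as described above.
        layers, in_ch = [], 3
        for out_ch in (32, 64, 128, 256, 512):
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers)
        # Per-pixel class prediction at 1/32 resolution.
        self.classifier = nn.Conv2d(512, num_classes, 1)
        # Decoder: one transposed convolution upsamples 32x back to input size.
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32, padding=16)

    def forward(self, x):
        x = self.encoder(x)
        x = self.classifier(x)
        return self.upsample(x)

fcn = TinyFCN()
out = fcn(torch.rand(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 21, 224, 224]): back to input resolution
```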

So, in short, we can think of Mask R-CNN as combining the two networks, Faster R-CNN and FCN, in one mega architecture. The model's loss function is the sum of the losses for classification, bounding-box generation, and mask generation (sketched right after the list below).

Mainly, it consists of three parts:

1. Classification
2. Bounding-box generation
3. Mask generation
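As a rough sketch of how these three losses add up (the helper names and tensor shapes here are illustrative, not the paper's actual training code), note that per the Mask R-CNN paper the mask loss is a per-pixel binary cross-entropy applied only to the mask of the ground-truth class:

```python
import torch
import torch.nn.functional as F

def mask_rcnn_loss(cls_logits, box_deltas, mask_logits,
                   gt_classes, gt_box_deltas, gt_masks):
    """Illustrative multi-task loss: L = L_cls + L_box + L_mask.

    mask_logits: (N, num_classes, H, W); gt_masks: (N, H, W) float in {0, 1}.
    """
    # 1. Classification: cross-entropy over the class logits.
    loss_cls = F.cross_entropy(cls_logits, gt_classes)

    # 2. Bounding box: smooth L1 on the predicted box offsets.
    loss_box = F.smooth_l1_loss(box_deltas, gt_box_deltas)

    # 3. Mask: per-pixel binary cross-entropy, computed only on the
    #    mask channel of each RoI's ground-truth class.
    n = mask_logits.shape[0]
    per_class_masks = mask_logits[torch.arange(n), gt_classes]  # (N, H, W)
    loss_mask = F.binary_cross_entropy_with_logits(per_class_masks, gt_masks)

    return loss_cls + loss_box + loss_mask
```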

Here, you can see that every individual object (the cells, in this particular image) has been segmented. This is how image segmentation works.

We also covered the two forms of image segmentation: semantic segmentation and instance segmentation. Let's take an example once more to grasp these two types:

All five objects in the picture on the left are people, so semantic segmentation would identify all of them as a single class: person. The picture on the right also has five objects (all of which are people), yet here individual objects of the same class are assigned as separate instances. This is an example of instance segmentation.
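One concrete way to see the difference is in how the outputs are represented. In this minimal sketch (the array shapes and pixel regions are made up for illustration), semantic segmentation yields one class label per pixel, while instance segmentation yields a separate binary mask per detected object:

```python
import numpy as np

H, W = 4, 6  # tiny illustrative image size

# Semantic segmentation: one class id per pixel.
# All five people share the same id (e.g. 1 = "person").
semantic_map = np.zeros((H, W), dtype=np.int64)
semantic_map[1:3, 0:2] = 1  # a "person" blob: individuals are indistinguishable

# Instance segmentation: one binary mask per object instance.
# Five people -> five separate (H, W) masks.
instance_masks = np.zeros((5, H, W), dtype=bool)
instance_masks[0, 1:3, 0:1] = True  # person #1
instance_masks[1, 1:3, 1:2] = True  # person #2
# ... one mask per remaining person

print(semantic_map.shape)    # (4, 6): a single label map
print(instance_masks.shape)  # (5, 4, 6): one mask per instance
```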

Understanding Mask R-CNN

Mask R-CNN is basically an extension of Faster R-CNN, which is commonly used for object detection tasks. For a given image, Faster R-CNN returns the class label and bounding-box coordinates for each object in the image. So, let's assume you pass the picture below to a Faster R-CNN network:

The Mask R-CNN system is built on top of Faster R-CNN. So, for a given image, Mask R-CNN will also return the object mask, in addition to the class label and bounding-box coordinates for each object.

First, let's quickly understand how Faster R-CNN works. This will also help us understand the intuition behind Mask R-CNN.

1. Faster R-CNN first uses a ConvNet to extract feature maps from the image.

2. These feature maps are then passed through a Region Proposal Network (RPN), which returns candidate bounding boxes.

3. An RoI pooling layer is then applied to these candidate boxes to bring all the candidates to the same size.

4. Finally, the proposals are passed to a fully connected layer to classify the objects and output their bounding boxes.
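To see these four steps end to end without reimplementing them, here is a minimal inference sketch using torchvision's built-in Faster R-CNN. This is a stand-in, not the exact model this article describes, and on older torchvision versions `weights="DEFAULT"` may need to be `pretrained=True`:

```python
import torch
import torchvision

# Pre-trained Faster R-CNN: ConvNet backbone + RPN + RoI pooling + FC heads.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy RGB image (3 x H x W, values in [0, 1]) stands in for a real photo.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    prediction = model([image])[0]

# Class labels and bounding-box coordinates, as described in steps 1-4.
print(prediction["boxes"].shape)   # (num_detections, 4)
print(prediction["labels"].shape)  # (num_detections,)
print(prediction["scores"].shape)  # (num_detections,)
```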

Once you understand how Faster R-CNN operates, it's very easy to understand Mask R-CNN. So, let's walk through it step by step, from the input to the class label, bounding box, and object mask predictions.

Backbone Model

Similar to the ConvNet we use in Faster R-CNN to extract feature maps from the image, Mask R-CNN uses a ResNet-101 architecture to extract features. So, the first step is to take an image and extract features from it using ResNet-101. These features serve as input to the next layer.
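A quick way to see what these backbone features look like is to chop the classification head off a pre-trained ResNet. This sketch uses torchvision's ResNet-101; the exact backbone and weights in any given Mask R-CNN implementation may differ:

```python
import torch
import torch.nn as nn
import torchvision

# Dropping ResNet-101's average-pool and fully connected layers leaves a
# feature extractor that maps an image to spatial feature maps.
resnet = torchvision.models.resnet101(weights="DEFAULT")
backbone = nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

with torch.no_grad():
    features = backbone(torch.rand(1, 3, 224, 224))

print(features.shape)  # torch.Size([1, 2048, 7, 7]): 1/32-resolution features
```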

Region Proposal Network (RPN)

Now, we take the feature maps obtained in the previous step and apply a Region Proposal Network (RPN), which basically predicts whether or not an object is present in each region. In this step, we get the regions of the feature maps that the model predicts contain some object.
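Conceptually, the RPN is just a small convolutional head slid over the feature map: a shared 3x3 conv, then two sibling 1x1 convs that predict, for each of `k` anchors at every location, an objectness score and box offsets. A minimal sketch, where the anchor count and channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyRPNHead(nn.Module):
    """Sketch of an RPN head: objectness + box deltas per anchor location."""
    def __init__(self, in_channels=2048, num_anchors=9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.objectness = nn.Conv2d(256, num_anchors, 1)      # object vs. not
        self.box_deltas = nn.Conv2d(256, num_anchors * 4, 1)  # (dx, dy, dw, dh)

    def forward(self, features):
        x = torch.relu(self.shared(features))
        return self.objectness(x), self.box_deltas(x)

rpn = TinyRPNHead()
scores, deltas = rpn(torch.rand(1, 2048, 7, 7))
print(scores.shape, deltas.shape)  # (1, 9, 7, 7) (1, 36, 7, 7)
```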

Region of Interest (RoI)

The regions obtained from the RPN can be of various shapes and sizes, right? Therefore, a pooling layer is applied to convert all regions to the same shape, as shown in the sketch below. These regions are then passed through a fully connected network to predict the class labels and bounding boxes.
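torchvision ships an `roi_pool` op that does exactly this resizing. In this minimal sketch, the feature map and the box coordinates are made up for illustration:

```python
import torch
from torchvision.ops import roi_pool

features = torch.rand(1, 256, 50, 50)  # (batch, channels, H, W) feature map

# Boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
boxes = torch.tensor([
    [0,  10.0,  10.0, 200.0, 150.0],
    [0, 120.0,  40.0, 380.0, 320.0],
])

# Regions of different sizes all come out as fixed 7x7 feature patches.
# spatial_scale maps image coordinates onto the 50x50 feature map (here 1/8).
pooled = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1.0 / 8)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```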

Up to this point, the steps are nearly identical to how Faster R-CNN works. Now comes the difference between the two systems: in addition to all of this, Mask R-CNN also generates the segmentation mask.
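That extra mask branch is itself a small FCN applied to each RoI's pooled features, ending in one binary mask per class. A rough sketch, with layer counts simplified relative to the paper's head and all sizes illustrative:

```python
import torch
import torch.nn as nn

class TinyMaskHead(nn.Module):
    """Sketch of the mask branch: a small FCN over pooled RoI features."""
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Upsample 14x14 -> 28x28, then predict one binary mask per class.
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):       # (num_rois, C, 14, 14)
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))     # (num_rois, 256, 28, 28)
        return self.mask_logits(x)         # (num_rois, num_classes, 28, 28)

head = TinyMaskHead()
print(head(torch.rand(4, 256, 14, 14)).shape)  # torch.Size([4, 81, 28, 28])
```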

Steps to implement Mask R-CNN:

Step 1: Clone the repository from GitHub

Step 2: Install the dependencies

Step 3: Download the pre-trained weights (trained on MS COCO)

Step 4: Predict on our image (see the sketch below)
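The article does not name the repository, so the exact commands depend on which implementation you pick (a popular one is matterport/Mask_RCNN). As a self-contained alternative, torchvision bundles a COCO-pre-trained Mask R-CNN, so steps 3 and 4 collapse into a few lines. Again, on older torchvision versions `weights="DEFAULT"` may need to be `pretrained=True`:

```python
import torch
import torchvision

# Step 3 equivalent: COCO-pre-trained weights are downloaded automatically.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Step 4: predict on an image (a dummy tensor stands in for a real photo).
image = torch.rand(3, 480, 640)
with torch.no_grad():
    prediction = model([image])[0]

# Per detection: class label, bounding box, confidence score, and soft mask.
print(prediction["labels"].shape)  # (num_detections,)
print(prediction["boxes"].shape)   # (num_detections, 4)
print(prediction["masks"].shape)   # (num_detections, 1, 480, 640)
```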

Image segmentation has a wide range of applications, ranging from the healthcare industry to the manufacturing industry.

The original paper can be found on arXiv.
