Mask R-CNN Unmasked
Introduced in 2017 by Kaiming He and his team at FAIR, Mask R-CNN is one of the most powerful algorithms for instance segmentation. At Fractal.ai, we use the Mask R-CNN framework to solve various business problems. This blog post explains how it works.
Mask R-CNN has become one of the most powerful object recognition algorithms in our stack, and its variants (with some modifications to the original paper) have been used extensively by the Fractal image team in various use cases. Mask R-CNN is as complex as it is powerful. The above diagrams give a glimpse of how data flows through the Mask R-CNN algorithm. We believe a proper understanding of Mask R-CNN is essential for tuning its parameters and for knowing where to use this algorithm and where not to.
What Does Mask R-CNN do?
Mask R-CNN extends the object detection algorithm Faster R-CNN with an extra mask head. The mask head lets us segment each object pixel by pixel and extract each object separately, without any background (which is not possible with semantic segmentation).
How Successful is Mask R-CNN?
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
- Mask R-CNN outperformed all existing, single model entries on every task, including the COCO 2016 challenge winners.
- It produces a bounding box and a segmentation mask for every object, which amounts to instance segmentation.
Mask R-CNN evolved from a line of earlier algorithms. In order:
- R-CNN used selective search to select proposals in an image. Each proposal is sent through the deep learning model and a 4096-dimensional feature vector is extracted. Independent SVM classifiers are trained for each class on these vectors to classify the objects.
- Fast R-CNN replaced the separately trained SVM classifiers with a classification layer and a regression layer. It also projects the selective search proposals onto the feature map instead of cropping them from the image, which removes the need to send each proposal through the entire network.
- Faster R-CNN removed selective search and used a deep convolutional network called the RPN (region proposal network) to generate proposals, making it possible to train a single end-to-end neural network.
- Mask R-CNN is an extension of Faster R-CNN with an additional module that generates high-quality segmentation masks for each object.
The purpose of this blog post is to give an in-depth explanation of the model and to investigate what changed from Faster R-CNN and how those changes brought improvements and new features to the algorithm.
A detailed analysis on evolution is written here
Faster R-CNN is a two-stage detector. In the first stage, it generates a set of proposals that have a high probability of containing an object. Those proposals are then passed through a detection network, which predicts the classes and the bounding box regression offsets. This blog assumes that readers understand Faster R-CNN. If you are not clear about its concepts, kindly go through the following blogs first.
- Guide to build Faster RCNN in PyTorch: Understanding and implementing Faster RCNN from scratch.
- Object Detection and Classification using R-CNNs: In this post, I'll describe in detail how R-CNN (Regions with CNN features), a recently introduced deep learning based…
According to the research paper, Mask R-CNN is, like its predecessor Faster R-CNN, a two-stage framework: the first stage generates object proposals, while the second stage classifies the proposals and generates bounding boxes and masks. The detection branch (classification and bounding box offsets) and the mask branch run in parallel. The classification into a particular class does not depend on the mask predictions. We, however, believe that it is really a three-stage framework, with the generation of boxes and the generation of masks being two different stages, because we don't generate masks on all the proposals of the RPN, but only on the detections we get from the box head.
There are three major changes from Faster R-CNN to Mask R-CNN:
1. Use of FPNs
2. Replacement of ROI Pool with ROI Align
3. Introduction of an additional branch to generate masks
Feature pyramid networks (FPNs) have been immensely effective backbones and have improved accuracy in many object detection frameworks (e.g. RetinaNet). They are integrated into Mask R-CNN to generate ROI (region of interest) features.
Another major change from Faster R-CNN is the replacement of ROI Pool with ROI Align. ROI Pool worked quite well for object detection, but it involves coarse quantization steps, which prove disastrous when generating mask predictions.
We will discuss FPN and ROI Align in detail later in this blog. First, let's understand how the image is pre-processed before it is sent into the network.
Some pre-processing steps are applied to the image before it is sent to the network.
- Subtraction of mean: The mean vector (3 x 1, one value per color channel), computed as the mean of pixel values across all the training and test images, is subtracted from the input image.
- Re-scale: Two parameters are used (target size and maximum size). The shorter side is resized to the target size and the longer side is resized by the same factor, preserving the aspect ratio. However, if the new length of the longer side exceeds the maximum size, the longer side is resized to the maximum size instead and the shorter side is scaled by that factor, again preserving the aspect ratio. The default values for target size and maximum size are 800 and 1333 respectively.
- Padding: This is necessary when feature pyramid networks (FPNs) are involved (explained in the next section). All the padding is added to the right and bottom edges only, so the coordinates of the targets do not change, as the coordinate system has its origin at the top-left corner. This step isn't necessary if we are not using FPNs.
Let's understand this with an example:
The shorter of the two sides (600) is re-scaled to 800 here and the other dimension is resized by the same scale. Since the newly scaled side (1200) is not a multiple of 32, it is padded up to the next multiple (1216/32 = 38).
NOTE: The image height and width used in the anchor generation and filtering steps are those of the resized image, not of the padded one.
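The re-scale and padding rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular library: the function names are our own, rounding to the nearest integer is an assumption (implementations may truncate instead), and the 600 x 900 input is a hypothetical image chosen to match the example.

```python
import math

def rescaled_size(height, width, target_size=800, max_size=1333):
    """Return (height, width) after the re-scale step."""
    scale = target_size / min(height, width)
    # If the longer side would overshoot max_size, scale by the longer side instead.
    if max(height, width) * scale > max_size:
        scale = max_size / max(height, width)
    # Assumption: round to the nearest integer pixel size.
    return round(height * scale), round(width * scale)

def padded_size(height, width, divisor=32):
    """Round each side up to the next multiple of `divisor` (padding goes bottom/right)."""
    return (math.ceil(height / divisor) * divisor,
            math.ceil(width / divisor) * divisor)

resized = rescaled_size(600, 900)   # hypothetical 600 x 900 input image
padded = padded_size(*resized)
print(resized, padded)
```

For the 600 x 900 input this reproduces the example: the resized image is 800 x 1200 and the padded tensor is 800 x 1216.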
Feature pyramid networks (FPN) backbone
Faster R-CNN uses a standard ResNet-style architecture to encode the image. At every stage the feature map size is halved and the number of feature maps is doubled. We extract 4 feature maps from the ResNet-50 architecture (the outputs of layer-1, layer-2, layer-3 and layer-4), as shown in the diagram. To generate the final feature maps, we use an approach called the top-down pathway: we start from the smallest feature map and work our way down to the bigger ones through up-sampling operations. In the diagram we can see that the feature map generated by layer 2 is first passed through a 1 x 1 convolution to bring the number of channels down to 256. It is then added element-wise to the up-sampled output from the previous iteration. All the outputs of this process are passed through a 3 x 3 convolution layer to create the final 4 feature maps (P2, P3, P4, P5). The 5th feature map (P6) is generated from P5 by a max pooling operation.
Notice the dimensions here: the smallest feature map involved in the up-sampling operation has size (w/32, h/32). This makes it important to ensure that the input tensor has dimensions divisible by 32. Let's take an example with w = 800 and h = 1080. Then w/32 = 25 and h/32 = 33.75, which implies the feature map will be of size (25, 33). On up-sampling, its dimensions become (50, 66), and it is supposed to be added element-wise to another feature map of size (50, 67). This throws an error. For this reason, if we are using FPNs, the input tensor dimensions have to be multiples of 32 in this case.
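The size mismatch in this example can be checked with a small sketch. It assumes, as the example above does, that each feature map size is the input size divided by the stride and rounded down; the function names are ours and purely illustrative.

```python
def level_sizes(w, h, strides=(4, 8, 16, 32)):
    """(width, height) of each backbone feature map, assuming floor division."""
    return [(w // s, h // s) for s in strides]

def top_down_is_consistent(w, h):
    """True if every map, up-sampled by 2, matches the next larger map."""
    sizes = level_sizes(w, h)
    # Compare each map with the next smaller one in the pyramid.
    for (big_w, big_h), (small_w, small_h) in zip(sizes, sizes[1:]):
        if (2 * small_w, 2 * small_h) != (big_w, big_h):
            return False
    return True

print(level_sizes(800, 1080))             # includes (50, 67) and (25, 33)
print(top_down_is_consistent(800, 1080))  # 1080 is not a multiple of 32
print(top_down_is_consistent(800, 1088))  # 1088 = 34 * 32
```

Padding the height from 1080 to 1088 makes every up-sampled map line up with the map it is added to.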
Region proposal network (RPN)
Each of the feature maps generated is passed through a 3 x 3 convolution layer. The resulting output is passed to two branches, one for objectness scores and the other for bounding box regressors. We use only one anchor scale per level of the feature pyramid (as the pyramid already provides features of different sizes to handle objects of different sizes) and 3 anchor ratios. Hence there are 3 channels in the objectness branch and 3*4 channels in the bounding box regressor branch.
Please note that all anchor ratios, strides etc. must be exactly the same for training and testing, as the objectness score and bounding box regressor channels correspond to the anchor ratios used during training.
The values of the RPN bounding box predictions are independent of the dimensions of the feature map and need to be multiplied by the height and width of the image during the decoding step.
Generation of proposals
1. Pre-NMS top k: Select the top k anchors (default k = 12000 for training, 6000 for testing) based on their objectness scores.
2. Box decoding: The anchor box coordinates are modified according to the rpn_bbox_pred values obtained from the RPN head. In this diagram, width and height are the dimensions of the resized image (read the NOTE in the image pre-processing section).
3. Removal of invalid boxes: Remove the bounding boxes in which any coordinate lies outside the image, as well as boxes with negative height or width.
4. NMS: Non-maximum suppression is applied to remove boxes whose overlap with a higher-scoring box exceeds a threshold (RPN NMS threshold, default value = 0.7).
5. Concatenate: Steps 1-4 are performed separately for each feature map of the pyramid generated by the FPN backbone. In this step, all the boxes (already in the coordinate system of the resized image) are grouped together. We do NOT store the information of which ROI came from which FPN layer.
6. FPN post NMS top N: This step is different for training and testing. In training, we select the top N (default = 2000) proposals, based on their objectness scores, from all proposals of all images in the entire batch. In testing, N (default = 2000) proposals are selected for each image in the batch and kept separately.
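The proposal generation steps above can be sketched in plain Python as follows. This is a simplified illustration under our own assumptions: boxes are (x1, y1, x2, y2) tuples in continuous coordinates, the decoding uses the standard Faster R-CNN parameterisation (centre shifts scaled by the anchor size, log-space width and height), and all function names are hypothetical. Real implementations are vectorised and differ in details.

```python
import math

def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) regression deltas to an (x1, y1, x2, y2) anchor."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    wa, ha = x2 - x1, y2 - y1
    cx = x1 + 0.5 * wa + dx * wa
    cy = y1 + 0.5 * ha + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def iou(a, b):
    """Intersection over union of two boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, threshold):
    """Greedy NMS: keep a box unless it overlaps a kept, higher-scoring box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

def level_proposals(anchors, deltas, scores, img_w, img_h,
                    pre_nms_top_k=6000, nms_threshold=0.7):
    """Steps 1-4 for a single pyramid level; returns a list of (box, score)."""
    # 1. Pre-NMS top k by objectness score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_k]
    # 2. Box decoding.
    boxes = [decode_box(anchors[i], deltas[i]) for i in order]
    kept = [scores[i] for i in order]
    # 3. Removal of boxes outside the image or with non-positive size.
    valid = [k for k, (x1, y1, x2, y2) in enumerate(boxes)
             if x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h
             and x2 > x1 and y2 > y1]
    boxes = [boxes[k] for k in valid]
    kept = [kept[k] for k in valid]
    # 4. NMS.
    return [(boxes[k], kept[k]) for k in nms(boxes, kept, nms_threshold)]

def merge_levels(per_level, post_nms_top_n=2000):
    """Steps 5-6: concatenate all levels and keep the top N by score."""
    merged = [p for level in per_level for p in level]
    merged.sort(key=lambda p: p[1], reverse=True)
    return merged[:post_nms_top_n]
```

Note that `merge_levels` discards the level each box came from, which is why the later FPN-ROI mapping has to re-assign a level from the box area.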
1. FPN-ROI mapping: The purpose of FPN-ROI mapping is to associate each ROI with the appropriate FPN feature map based on the ROI's area. Because the pooler scales are specified explicitly, we are free to use fewer pooler scales if applying ROI Align on all feature maps doesn't suit our purpose.
The ResNet-50-FPN example (which is also the official implementation of resnet-50-fpn at facebook_maskrcnn_benchmark) generates 5 feature maps. All 5 of them are used in the RPN to generate proposals; however, only 4 (P2, P3, P4, P5) are used when associating feature maps with ROIs.
From equation 1 (taken from the feature pyramid networks research paper) we get an integer level for each ROI. In this example lvl_min is 2 and lvl_max is 5. If the target_lvl value is, let's say, 3, then the ROI is associated with P3. If the target_lvl value is less than 2 or greater than 5, it is clamped to 2 or 5 respectively.
2. ROI Align: One of the most significant changes from Faster R-CNN is the replacement of ROI pooling with ROI Align. Both operations generate a uniform P x P matrix for every ROI. ROI Pool works well for object detection but fails terribly for instance segmentation, as it has too many quantization steps, which hurt the generation of masks, where pixel-to-pixel correspondence matters. Please refer to the appendix for a detailed explanation. For now, let's consider that ROI Align/Pool gives a constant P x P output irrespective of the proposal size.
3. Fully connected layers: The output for each ROI is flattened and passed through a fully connected layer, whose number of output channels is a hyper-parameter called representation size (default: 1024). This is passed through another FC layer with the same number of input and output channels. We get a separate vector of length 1024 for each ROI.
The feature extractor shown here is for ResNet-50 FPN. It will change if we use a different backbone, but overall, all of them use ROI Align and generate an output vector of the same size for each ROI.
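The FPN-ROI mapping step described above can be sketched as follows. The rule from the feature pyramid networks paper is level = floor(k0 + log2(sqrt(w * h) / 224)) with k0 = 4, clamped to [lvl_min, lvl_max]. The function name and the small epsilon (added to guard against zero-area boxes) are our own choices for this illustration.

```python
import math

def map_roi_to_level(box, lvl_min=2, lvl_max=5, k0=4, canonical_size=224, eps=1e-6):
    """Pick the FPN level for an (x1, y1, x2, y2) ROI from its area."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    # A 224 x 224 ROI maps to level k0; each halving of the side goes one level down.
    target_lvl = math.floor(k0 + math.log2(scale / canonical_size + eps))
    # Clamp to the levels that are actually pooled from (P2..P5 here).
    return max(lvl_min, min(lvl_max, target_lvl))

for box in [(0, 0, 10, 10), (0, 0, 112, 112), (0, 0, 224, 224), (0, 0, 1000, 1000)]:
    print(box, "-> P%d" % map_roi_to_level(box))
```

Small ROIs are pooled from the high-resolution P2 map and large ROIs from the coarse P5 map.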
The ROI vectors are passed through a predictor containing 2 branches, each with an FC layer: one predicts the object class and the other the bounding box regression values. We also have the option of class-agnostic bounding box regression, where num_bbox_classes = 2 (i.e. foreground and background); otherwise num_bbox_classes is the same as the number of classes in the classification branch.
- Pre-NMS threshold: Detections are filtered before NMS based on their class probability scores (default = 0.05).
- Decoding: Similar to what was done in the RPN. Please note that we have separate box_bbox_reg values for each class for each proposal (e.g. proposal size = [1000, 4], box_bbox_reg size = [1000, 81 (classes) * 4]). In this example, the decoding step is applied 81 times per proposal.
- Removal of invalid boxes: Same as in the RPN.
- NMS: We define an NMS threshold (default = 0.5). The resulting proposals are separated by class and NMS is performed separately for each class.
- Filtering top detections: The maximum number of detections, combined over all classes, in an image, i.e. there will never be more detections than this number in an image. For the COCO dataset, the value is set to 100.
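The filtering steps above can be sketched as below, assuming the per-class boxes have already been decoded and invalid boxes removed. The data layout (nested lists indexed by proposal and class, with class 0 as background) and the function names are assumptions made for this illustration; the thresholds are passed in explicitly rather than hard-coded.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, threshold):
    """Greedy NMS returning the indices of the boxes that survive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

def postprocess_detections(class_probs, class_boxes,
                           score_threshold, nms_threshold, max_detections):
    """class_probs[i][c]: probability of proposal i for class c (c = 0 is background).
    class_boxes[i][c]: decoded box of proposal i for class c.
    Returns a list of (box, score, class_id), best first."""
    detections = []
    num_classes = len(class_probs[0])
    for c in range(1, num_classes):              # skip the background class
        # Pre-NMS threshold on the class probability.
        cand = [i for i in range(len(class_probs))
                if class_probs[i][c] > score_threshold]
        boxes = [class_boxes[i][c] for i in cand]
        scores = [class_probs[i][c] for i in cand]
        # NMS is run separately for each class.
        for k in nms(boxes, scores, nms_threshold):
            detections.append((boxes[k], scores[k], c))
    # Keep at most max_detections over all classes combined.
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections[:max_detections]
```

Because NMS runs per class, two heavily overlapping boxes can both survive if they belong to different classes.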
The process of feature extraction for masks is similar to that for boxes. One key difference is the absence of the fully connected layers that were present in the box head, because the flattening before the fully connected layers loses the spatial structure needed to generate masks. Just as we used the ROI proposals as input to the box head feature extractor, we use the detected bounding boxes as input here. The FPN-ROI mapping is again similar to that of the box head.
The ROI Align outputs are passed through a series of 3 x 3 convolutional layers, each followed by ReLU. The number of layers and the number of output channels of each layer are hyper-parameters (for ResNet-50-FPN they are set to (256, 256, 256, 256)). For each detection, we get an output tensor with the same dimensions as the ROI Align output.
The extracted features are passed through the predictor, which involves a deconv operation. Its output is passed through a 1 x 1 convolutional layer whose number of output channels equals the number of classes, giving one mask per class for each detection.
Since we have already predicted the class of each detection, we select the channel corresponding to that class, reducing the dimensions to [D, 1, 2P, 2P].
Post-processing of masks:
The masks obtained can be resized to match the input image. The mask tensors are first padded to avoid boundary effects caused by the up-sampling (deconv) operation we had earlier. The bounding box coordinates are re-scaled to account for the padded mask and rounded to the nearest integer (the coordinates are with respect to the input image). The mask is also re-scaled according to the image size, using bi-linear interpolation. Mask threshold is a hyper-parameter (default: 0.5): for each pixel, if the value is above 0.5, the object is assumed to be present in that pixel, otherwise absent. We finally get an [image_height, image_width] mask of the object, and similarly for all the objects in the image.
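Two of the mask steps, selecting the channel of the predicted class and thresholding, are simple enough to sketch with nested lists. The padding and bi-linear resizing are deliberately left out, and the function names and data layout are hypothetical.

```python
def select_class_masks(mask_probs, labels):
    """mask_probs[d][c] is the 2-D mask of detection d for class c.
    Keep only the channel of the class predicted for each detection."""
    return [mask_probs[d][labels[d]] for d in range(len(labels))]

def binarize(mask, threshold=0.5):
    """1 where the object is assumed present (value above threshold), else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in mask]

# One detection, two classes, a tiny 2 x 2 mask per class.
mask_probs = [[
    [[0.1, 0.2], [0.3, 0.4]],   # class 0
    [[0.9, 0.6], [0.4, 0.5]],   # class 1
]]
selected = select_class_masks(mask_probs, labels=[1])
print(binarize(selected[0]))
```

In a full pipeline the thresholding would be applied after the mask has been resized to its box in the image.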
Appendix — ROI Pool and ROI Align
Let's understand it with an example, looking at ROI Pool first:
Let's assume a 5 x 5 feature map with the ROI plotted on it. Notice that the boundaries of the ROI don't coincide with the grid of the feature map (as ROIs have coordinates with respect to the original image, and the feature map has a lower resolution than the input image).
To resolve this, the boundaries of the ROI are rounded off to match the granularity of the feature map, so that they now align with its grid.
Depending on the output size m that we specify, we get an m x m matrix: the ROI is divided into m x m bins. In this example, we have taken m = 2. Notice that the boundaries of the bins again don't align with those of the feature map.
We perform the quantization step again, as in step 1, and then apply a max-pool operation in each bin separately. The final four values in this example come out to be (0.32, 0.64, 0.16, 0.25).
We get such a final m x m matrix for each of the ROIs.
This type of pooling works well for object detection but fails terribly for instance segmentation, as it has too many quantization steps, which hurt the generation of masks, where pixel-to-pixel values matter. Let's now have a look at ROI Align with a similar example.
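To make the two quantization steps of ROI Pool concrete, here is a minimal sketch. It assumes the ROI is already expressed in feature-map units and lies inside the map, treats the rounded right and bottom edges as exclusive, and uses floor/ceil for the bin edges; these conventions and the function name are ours, and real implementations differ in such details.

```python
import math

def roi_pool(feature, roi, m):
    """feature: 2-D list indexed [row][col]; roi: (x1, y1, x2, y2) in feature-map units."""
    # Quantization 1: snap the ROI boundaries to the feature-map grid.
    x1, y1, x2, y2 = (math.floor(v + 0.5) for v in roi)
    w, h = max(x2 - x1, 1), max(y2 - y1, 1)
    out = []
    for by in range(m):
        # Quantization 2: snap the bin boundaries to the grid as well.
        ys = y1 + math.floor(by * h / m)
        ye = y1 + math.ceil((by + 1) * h / m)
        row = []
        for bx in range(m):
            xs = x1 + math.floor(bx * w / m)
            xe = x1 + math.ceil((bx + 1) * w / m)
            # Max-pool over the cells that fall in this bin.
            row.append(max(feature[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        out.append(row)
    return out

feature = [[5 * y + x for x in range(5)] for y in range(5)]   # toy 5 x 5 map
print(roi_pool(feature, (0.6, 0.7, 3.4, 3.8), m=2))
```

Both rounding steps move the sampled region away from where the ROI really is, which is the misalignment that hurts mask quality.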
We start from the same initial ROI. Notice that we DON'T perform any quantization, either on the ROI or on the bins.
We take 4 (2*2) sample points in each bin. The Mask R-CNN research paper says that the position and number of sample points don't matter much. The equation below explains how the sample points are selected in the facebook maskrcnn_benchmark repo.
The value of num_samples is taken to be 2. X_low and X_high are the lowest and highest x coordinates of the bin. Using this equation we get the coordinates of 4 points (S1, S2, S3, S4).
Let's pick S1 as an example. We identify the 4 nearest points on the feature map grid (whose coordinates are integers), in this case P(2,1), Q(3,1), R(2,2) and S(3,2), with their respective values. Using these 4 points and the equation below, we employ bi-linear interpolation to calculate the feature value at that sample point. For S1, we get a value of 0.21778.
We repeat the process for the other sample points and get a different value for each point. We then perform average pooling over all the sample points in the bin. According to the research paper, max-pool can be used instead without any significant change in results, but the official implementation uses average pooling.
We repeat the process for all the bins and get a final output with the same dimensions for every ROI.
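The ROI Align procedure can be sketched in the same style. We assume here that the sample points are spread evenly inside each bin, at low + (i + 0.5) * (high - low) / num_samples for i = 0 .. num_samples - 1, that feature values sit at integer grid coordinates, and that the ROI lies inside the feature map. The function names are ours; this is an illustration rather than the reference implementation.

```python
import math

def bilinear(feature, x, y):
    """Interpolate feature (indexed [row][col]) at the continuous point (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    fx, fy = x - x0, y - y0
    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def sample_coords(low, high, num_samples):
    """Evenly spaced sample coordinates strictly inside [low, high]."""
    step = (high - low) / num_samples
    return [low + (i + 0.5) * step for i in range(num_samples)]

def roi_align(feature, roi, m, num_samples=2):
    """Average of bi-linearly interpolated samples in each of the m x m bins."""
    x1, y1, x2, y2 = roi                      # no rounding of the ROI
    bin_w, bin_h = (x2 - x1) / m, (y2 - y1) / m
    out = []
    for by in range(m):
        ys = sample_coords(y1 + by * bin_h, y1 + (by + 1) * bin_h, num_samples)
        row = []
        for bx in range(m):
            xs = sample_coords(x1 + bx * bin_w, x1 + (bx + 1) * bin_w, num_samples)
            values = [bilinear(feature, x, y) for y in ys for x in xs]
            row.append(sum(values) / len(values))   # average pooling
        out.append(row)
    return out

feature = [[x + 10 * y for x in range(5)] for y in range(5)]   # toy 5 x 5 map
print(roi_align(feature, (0.5, 0.5, 3.5, 2.5), m=2))
```

Since nothing is rounded, a small shift of the ROI produces a correspondingly small change in the output, which is what preserves pixel-to-pixel alignment for the mask head.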
This is it. We hope that the readers have understood the internal workings of Mask R-CNN. Written by Sarang Pande. Special thanks to Fractal Image Team Suraj Amonkar, Prakash Jay, Sachin Chandra, Vikash Challa, Rajneesh Kumar, thanuj raju pilli, saksham jindal, Abhishek Chopde, Sindhura K, praneeth chandra, Samreen Khan.