Taming the Hyper-Parameters of Mask RCNN

Ravikiran Bobba
Published in Analytics Vidhya · 7 min read · Dec 14, 2019

This article briefly covers the evolution of Mask R-CNN and explains the different hyper-parameters involved. It also highlights techniques that help in tuning the hyper-parameters of a Mask R-CNN model.

This article describes the learnings from a project built by Anisha Alluru, Elizabeth Reid Heath, Manas Rai, Ravikiran Bobba, and Vishal Ramachandran. See this GitHub Repository for our full code implementing Mask R-CNN for instance segmentation in the iMaterialist Fashion Challenge 2019.

Introduction:

Mask R-CNN is one of the more recent additions to the region-based CNN (R-CNN) family, introduced by Kaiming He and his team at Facebook AI Research (FAIR) in 2017. Here is a link to the official paper published by the team. Mask R-CNN achieved top accuracy on the COCO segmentation challenge, and since its release it has been used extensively in instance segmentation competitions. It extends Faster R-CNN by creating a pixel-level mask for each detected object. This has several industrial applications, such as counting distinct objects or precisely locating objects for operations with robotic manipulators. Matterport has implemented Mask R-CNN for a variety of projects and open-sourced its extensive work in its GitHub Repository. Along with this enhanced performance, Mask R-CNN involves several hyper-parameters that must be tuned carefully for each application. Because the architecture is so recent, very limited literature is available on these hyper-parameters, and this article aims to give an overview of the ones specific to Mask R-CNN.

Evolution of Mask R-CNN:

Mask R-CNN builds instance segmentation on top of Faster R-CNN. The article “Computer Vision — A journey from CNN to Mask R-CNN and YOLO -Part” gives a detailed explanation of the evolution of Mask R-CNN. An overview of this evolution is important for understanding the hyper-parameters, since many of them are tied to these underlying architectures. A brief summary is provided below.

Summary of Evolution of Mask R-CNN

Mask R-CNN Architecture:

Mask R-CNN Architecture

The block diagram above represents the Mask R-CNN architecture. A brief description of each of the steps is given below:

  1. The image is passed through a convolutional network (the backbone).
  2. The backbone’s output feature map is passed to a Region Proposal Network (RPN), which proposes anchor boxes (regions of interest) wherever any of the target objects may be present.
  3. The anchor boxes are sent to the ROI Align stage (one of the key features of Mask R-CNN, preserving spatial alignment), which converts ROIs of different sizes to the fixed size required for further processing.
  4. This output is sent to fully connected layers, which produce the class of the object in each region and the location of its bounding box.
  5. The output of the ROI Align stage is also sent, in parallel, to convolutional layers that generate a pixel-level mask of the object.
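The pipeline above can be sketched in code. The stage functions below are illustrative stand-ins that only reproduce the tensor shapes flowing between stages, not the real layers; the ROI count, class count, and mask resolution are assumptions chosen for the sketch.

```python
import numpy as np

def backbone(image):
    # Stand-in for the ConvNet backbone: downsample to a feature map (step 1)
    return np.zeros((image.shape[0] // 32, image.shape[1] // 32, 256))

def rpn(features, num_rois=200):
    # Stand-in RPN: propose candidate boxes (y1, x1, y2, x2) (step 2)
    return np.random.rand(num_rois, 4)

def roi_align(features, rois, pool_size=7):
    # Stand-in ROI Align: every proposal becomes a fixed-size crop (step 3)
    return np.zeros((len(rois), pool_size, pool_size, features.shape[-1]))

def heads(aligned):
    # Stand-in heads: class scores and box refinements (step 4), plus
    # per-ROI masks (step 5), all computed from the same aligned features
    n = len(aligned)
    return np.zeros((n, 81)), np.zeros((n, 4)), np.zeros((n, 28, 28))

image = np.zeros((1024, 1024, 3))
features = backbone(image)
rois = rpn(features)
aligned = roi_align(features, rois)
classes, boxes, masks = heads(aligned)
```

Note that the classification/box head (step 4) and the mask head (step 5) both consume the same ROI Align output, which is why they run in parallel.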

For detailed explanation of steps in Mask R-CNN refer to this article.

Hyper-parameters:

The following are a few hyper-parameters specific to Mask R-CNN:

  • Backbone
  • Train_ROIs_Per_Image
  • Max_GT_Instances
  • Detection_Min_Confidence
  • Image_Min_Dim and Image_Max_Dim
  • Loss Weights : rpn_class_loss
  • Loss Weights : rpn_bbox_loss
  • Loss Weights : mrcnn_class_loss
  • Loss Weights : mrcnn_bbox_loss
  • Loss Weights : mrcnn_mask_loss

Backbone:

The backbone is the ConvNet architecture used in the first step of Mask R-CNN. The available choices include ResNet-50, ResNet-101, and ResNeXt-101. The choice should be based on the trade-off between training time and accuracy. ResNet-50 takes relatively less time than the other two, and several open-source weights pre-trained on large datasets such as COCO are available for it, which can considerably reduce training time for instance segmentation projects. ResNet-101 and ResNeXt-101 take longer to train (because of the number of layers), but they tend to be more accurate when no pre-trained weights are involved and basic parameters like the learning rate and number of epochs are well tuned.

An ideal approach is to start with available pre-trained weights, such as COCO weights with ResNet-50, and evaluate the performance of the model. This works faster, and better, for models that detect real-world objects present in the COCO dataset. If accuracy is of utmost importance and high computational power is available, ResNet-101 and ResNeXt-101 can be explored.
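In Matterport's implementation, the backbone is selected via the `BACKBONE` attribute of a `mrcnn.config.Config` subclass. The sketch below mirrors that pattern on a plain class so it runs standalone; the class name is an assumption for this article's fashion project.

```python
# Matterport-style config sketch; the real base class is mrcnn.config.Config,
# so this plain class only mirrors the attribute convention.
class FashionConfig:
    NAME = "fashion"
    # "resnet50" trains faster and has widely available COCO weights;
    # "resnet101" is slower but tends to be more accurate from scratch.
    BACKBONE = "resnet101"  # switch to "resnet50" for quicker iteration
```

Note that Matterport's code ships with `"resnet50"` and `"resnet101"` backbones; using ResNeXt would require extending the model code.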

Train_ROIs_Per_Image

This is the maximum number of ROIs the Region Proposal Network will generate per image, which are then processed for classification and masking in the next stage. The ideal approach is to start with the default value if the number of instances per image is unknown. If the number of instances is limited, this can be lowered to reduce training time.

Max_GT_Instances:

This is the maximum number of instances that can be detected in one image. If the number of instances per image is limited, set this to the maximum number of instances that can occur in an image. This helps reduce false positives and shortens training time.
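Both of these knobs are plain attributes on a Matterport-style config. The sketch below shows overrides for a dataset with few objects per image; the specific values (100 and 20) are illustrative assumptions, and the defaults noted in the comments are the ones shipped in Matterport's `config.py`.

```python
# Matperport-style overrides shown on a plain class so the sketch runs
# standalone; in practice these would go on a mrcnn.config.Config subclass.
class LimitedInstancesConfig:
    # Fewer ROIs per image when each image holds only a handful of objects
    TRAIN_ROIS_PER_IMAGE = 100   # Matterport default is 200
    # Cap ground-truth instances to the most you expect in one image
    MAX_GT_INSTANCES = 20        # Matterport default is 100
```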

Detection_Min_Confidence:

This is the confidence threshold above which a detected instance is kept. Initialize it at the default, then lower or raise it based on the number of instances the model detects. If detecting everything is important and false positives are acceptable, reduce the threshold to identify every possible instance. If precision of detection matters more, increase the threshold so the model keeps only the instances it predicts with very high confidence, minimizing false positives.

Image_Min_Dim and Image_Max_Dim:

These settings control the image size. The default settings resize images to 1024x1024 squares. Smaller images (e.g., 512x512) can be used to reduce memory requirements and training time. The ideal approach is to train the initial models on smaller image sizes for faster weight updates, then use larger sizes during the final stage to fine-tune the model parameters.
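The interaction of the two dimensions can be seen in the scale computation, simplified here from Matterport's `resize_image` utility in "square" mode: the short side is scaled up toward `IMAGE_MIN_DIM`, but never so far that the long side exceeds `IMAGE_MAX_DIM`, and the result is then padded to a square.

```python
def resize_scale(h, w, min_dim=800, max_dim=1024):
    # Grow the short side to at least min_dim...
    scale = max(1.0, min_dim / min(h, w))
    # ...but cap the scale so the long side never exceeds max_dim
    if round(max(h, w) * scale) > max_dim:
        scale = max_dim / max(h, w)
    return scale

print(resize_scale(600, 400))    # short side 400 -> upscaled, capped by max_dim
print(resize_scale(2000, 1500))  # long side would exceed 1024 -> downscaled
print(resize_scale(1000, 800))   # already within both limits -> unchanged
```

The defaults (800 and 1024) match Matterport's shipped values; lowering both is what shrinks memory use and speeds up the early training rounds described above.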

Loss weights:

Mask R-CNN uses a composite loss function calculated as the weighted sum of the losses at each stage of the model. The loss-weight hyper-parameters correspond to the weight the model assigns to the loss of each stage.

  • Rpn_class_loss: This corresponds to the loss assigned to improper classification of anchor boxes (presence/absence of any object) by the Region Proposal Network. Increase this when multiple objects are going undetected in the final output; a higher weight pushes the RPN to capture them.
  • Rpn_bbox_loss: This corresponds to the localization accuracy of the RPN. This is the weight to tune when objects are being detected but their bounding boxes need correction.
  • Mrcnn_class_loss: This corresponds to the loss assigned to improper classification of the object present in a region proposal. Increase this when objects are being detected but misclassified.
  • Mrcnn_bbox_loss: This is the loss assigned to the localization of the bounding box of the identified class. Increase this when classification is correct but localization is not precise.
  • Mrcnn_mask_loss: This corresponds to the masks created on the identified objects. Increase this weight when pixel-level identification is important.
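The weighted sum itself is straightforward to compute. The keys below mirror Matterport's `LOSS_WEIGHTS` dict (default weight 1.0 for every stage); the per-stage loss values are illustrative numbers, not real training output.

```python
# Default weights, mirroring Matterport's LOSS_WEIGHTS convention
loss_weights = {
    "rpn_class_loss": 1.0,
    "rpn_bbox_loss": 1.0,
    "mrcnn_class_loss": 1.0,
    "mrcnn_bbox_loss": 1.0,
    "mrcnn_mask_loss": 1.0,
}

# Illustrative per-stage losses from one hypothetical training step
losses = {
    "rpn_class_loss": 0.20,
    "rpn_bbox_loss": 0.35,
    "mrcnn_class_loss": 0.10,
    "mrcnn_bbox_loss": 0.25,
    "mrcnn_mask_loss": 0.40,
}

# Total loss is the weighted sum across the five stages
total = sum(loss_weights[k] * losses[k] for k in losses)
print(total)  # ~1.30 with all weights at 1.0
```

Doubling, say, `mrcnn_mask_loss` would make mask errors contribute twice as much to this total, steering gradient updates toward better masks.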

The above Hyper-parameters are represented on the block diagram in the following figure.

Mask R-CNN Architecture with Hyper-Parameters

An ideal approach for tuning the loss weights of Mask R-CNN is to start with a base model using the default weight of 1 for each, then evaluate it on the validation set: visualize the model's predictions on different images and examine the number of objects detected, the accuracy of classification, the localization of the identified objects, and the localization of the masks. The corresponding weight can then be tuned based on where the model falls short.
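As a concrete (hypothetical) tuning step: if validation images show crisp boxes but ragged masks, the mask weight can be raised by overriding `LOSS_WEIGHTS` on the config, following Matterport's convention. The plain class below stands in for a `mrcnn.config.Config` subclass so the sketch runs standalone, and the value 2.0 is an assumed starting point, not a recommended constant.

```python
# Up-weight the mask loss relative to the other four stages
class MaskTunedConfig:
    LOSS_WEIGHTS = {
        "rpn_class_loss": 1.0,
        "rpn_bbox_loss": 1.0,
        "mrcnn_class_loss": 1.0,
        "mrcnn_bbox_loss": 1.0,
        "mrcnn_mask_loss": 2.0,  # emphasize pixel-level mask quality
    }
```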

Conclusion:

In conclusion, Mask R-CNN is a great architecture for instance segmentation, but proper tuning of its hyper-parameters is important to achieve its potential. Methods like grid search with cross-validation are rarely practical for CNNs because of their huge computational requirements, so it is important to understand the hyper-parameters and their effect on the overall prediction. This article explained the most important hyper-parameters specific to Mask R-CNN and how to tune them.

Let us know if you have any comments or suggestions. We hope this article helps you understand these hyper-parameters and aids you in your projects.

We would like to thank our professor Dr. Joydeep Ghosh who motivated us to share our learnings in a blog.
