Instance Segmentation

Sudhir Kumar Suman
7 min read · Nov 29, 2019


This blog is a report for the IIT Bombay EE 782 Advanced Machine Learning course project.

Instance Segmentation mask prediction by our model

Introduction:

The vision community has developed various techniques that have improved Classification, Object detection, Semantic segmentation, and Instance segmentation. Classification tells us that an image belongs to a particular class; it doesn't consider the detailed pixel-level structure of the image. Object detection provides the class of each object present in the image and also indicates the spatial location of each detected object. Semantic segmentation detects all objects present in a frame at the pixel level: it makes a dense prediction, inferring a label for each pixel so that every pixel in the image is labeled with the class of its enclosing object. Instance segmentation identifies the boundaries of objects at the detailed pixel level and assigns a label to each pixel of the image. Instance segmentation is challenging because it requires the correct detection of all the objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each with a bounding box, and semantic segmentation.

Our Work:

We have extended instance segmentation with Mask R-CNN to the Google Open Images dataset. Pre-trained Mask R-CNN weights are available only for 80 classes, trained on the COCO dataset; we extended this to 300 classes using the Google Open Images dataset. This project is part of the Google Open Images Challenge 2019, which provided exceptionally large and diverse training datasets to inspire research into more advanced instance segmentation models. The highly accurate ground-truth masks provided for this challenge aim at encouraging the development of higher-quality models that deliver precise object boundaries.

Datasets:

As mentioned above, we trained Mask R-CNN on the Open Images V5 dataset made available for the 2019 edition of the Open Images Challenge. It consists of 300 classes annotated with segmentation masks. The training set contains 2.1M segmentation masks for the 300 classes, produced by an interactive segmentation process.

Additionally, Open Images V5 also contains a validation set with 23K masks for these 300 classes. These masks were annotated manually with a strong focus on quality; they are near-perfect and capture even fine details of complex boundaries.

Approach:

The two main problems that needed to be solved to use the Open Images V5 dataset are Class Imbalance and Class Hierarchy. Class Imbalance refers to the large disparity in the number of images available per class, and Class Hierarchy refers to labels being organized into categories and subcategories.

Number of images present per class

We tackled the class-imbalance issue while creating the dataset by using a fixed number of training images for each class. For classes with many images, this reduces the number of training images used. The maximum depth of the hierarchy in the given training dataset is 2 (counting from 0), and there are only five depth-two classes. To handle the class hierarchy, we divided all the classes into two groups, Layer0 and Layer1.
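The per-class cap described above can be sketched roughly as follows; the function name, the `(image_id, class_label)` pair format, and the cap value are illustrative, not the project's actual code:

```python
from collections import defaultdict

def cap_images_per_class(annotations, max_per_class):
    """Keep at most `max_per_class` image IDs for each class.

    `annotations` is an iterable of (image_id, class_label) pairs;
    this format is an assumption made purely for illustration.
    """
    kept = defaultdict(list)
    for image_id, label in annotations:
        if len(kept[label]) < max_per_class:
            kept[label].append(image_id)
    return dict(kept)
```

With a cap of 2, a class with three images keeps only the first two encountered, while rarer classes are untouched.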

Classes having a maximum depth

We grouped all depth-zero classes into Layer0 and all depth-one and depth-two classes into Layer1. The main idea behind splitting the classes into two groups is to train a separate model for each. This grouping reduced our original problem to training two models: one for 213 classes and one for 87 classes. While training these two models, we used a dedicated dataset for each that includes only instances of the target group's classes, so there is no need to worry about the class hierarchy. However, it was practically impossible to build such a dataset from only those training images that include target classes and no non-target classes. Therefore, we removed the non-target class instances from the training images.
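Assuming the hierarchy is available as a simple child-to-parent map (a simplification: the real Open Images hierarchy allows a class to have multiple parents), the split into Layer0 and Layer1 could look like this sketch:

```python
def class_depth(label, parent_of):
    """Depth of `label` in the hierarchy; root-level classes have depth 0."""
    depth = 0
    while label in parent_of:
        label = parent_of[label]
        depth += 1
    return depth

def split_into_layers(labels, parent_of):
    """Layer0: depth-zero classes. Layer1: depth-one and depth-two classes."""
    layer0 = [l for l in labels if class_depth(l, parent_of) == 0]
    layer1 = [l for l in labels if class_depth(l, parent_of) >= 1]
    return layer0, layer1
```

On the full 300-class label set this would yield the 213-class Layer0 group and the 87-class Layer1 group described above.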

Layer0 group dataset:

  • Converted every non-target class to its parent class, so that it becomes a target class. For example, we converted Teddy bear to Toy, not Carnivore.
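This remapping can be sketched as walking up the hierarchy until a target (Layer0) class is reached, again assuming a single-parent map chosen so that, for example, Teddy bear resolves to Toy rather than Carnivore:

```python
def remap_to_target(label, parent_of, target_classes):
    """Replace a non-target label with its nearest target-class ancestor.

    `parent_of` is an assumed child-to-parent map; labels with no
    target ancestor are returned unchanged.
    """
    while label not in target_classes and label in parent_of:
        label = parent_of[label]
    return label
```

For instance, `remap_to_target("Teddy bear", {"Teddy bear": "Toy"}, {"Toy"})` returns `"Toy"`, while a label already in the target set passes through unchanged.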

Layer1 group dataset:

  • Removed all non-target class annotations that don't have any child classes
  • Removed non-target class annotations that do have child classes and filled their bounding boxes with gray pixels in the training image, since only removing the annotation would give a false-positive signal to the model
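The gray-fill step might look like the following sketch, assuming boxes are given in `(x1, y1, x2, y2)` pixel coordinates and images are NumPy arrays:

```python
import numpy as np

def gray_out_boxes(image, boxes, gray=127):
    """Fill each (x1, y1, x2, y2) box with flat gray so that removed
    instances no longer look like unlabeled objects to the model.

    Returns a copy; the original image is left untouched.
    """
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = gray
    return out
```

The gray value 127 is an arbitrary mid-range choice for illustration; the blog does not state the exact fill value used.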

After creating two different datasets corresponding to both models using the Open Image V5 dataset, we trained our model on it.

Training:

Mask R-CNN is a fairly large model, as it uses ResNet-101 and a Feature Pyramid Network (FPN). We trained our models on 4x Tesla T4 GPUs in parallel, and training both models took seven days. The Layer0 training dataset contains around 840,000 images, and its validation dataset contains 13,000 images; the Layer1 training dataset contains 360,000 images, and its validation dataset contains 4,500 images. Layer0 classifies 213 classes, whereas Layer1 classifies 87, and the former model accordingly took more time to train than the latter.

Layer0 Training:

  • We trained our Layer0 model for 150 epochs; in each epoch, we took 700 gradient steps with a batch size of 8 images per step.
  • After each training epoch, we computed the average loss over 50 iterations with a batch size of 8, so after every epoch the model is validated on 400 images drawn from the 13,000-image validation dataset. Since we used different images for validation after each epoch, there are some fluctuations in the validation plot.
Training loss and validation loss plot for Layer0

Layer1 Training:

  • We trained our Layer1 model for 160 epochs; in each epoch, we took 280 gradient steps with a batch size of 8 images per step.
  • After each training epoch, we computed the average loss over 50 iterations with a batch size of 8, so after every epoch the model is validated on 400 images drawn from the 4,500-image validation dataset. The fluctuations in the validation plot have the same cause as for Layer0.
Training loss and validation loss plot for Layer1

Results:

The figure below shows the final result of our ensembled Mask R-CNN. For example, our Layer0 model predicts every person in this image, whether boy, girl, or man, as Person; when we pass the same image through the Layer1 model, it detects only two men and one boy, but not the girls. Therefore, the final prediction of the ensembled model is (Person, Men), (Person, Boy), (Person, Men) for the instances detected by Layer1; every other object keeps the class detected by Layer0.

Flow chart of the ensembled model prediction

Here is a detailed description of the classes predicted for the above image by the Layer0 and Layer1 models, together with the final prediction:

  • Layer0 Prediction: Person, Person, Person, Person, Person, Person, Trouser
  • Layer1 Prediction: Men, Nothing, Nothing, Boy, Nothing, Men, Jeans
  • Final Prediction: (Person, Men), (Person), (Person), (Person, Boy), (Person), (Person, Men), (Trouser, Jeans)
Table describing the final prediction class of ensembled model
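The post does not spell out how Layer0 and Layer1 detections are paired; one plausible sketch matches boxes by intersection-over-union (an assumption, not the project's documented method) and attaches the Layer1 sub-class label where a match exists:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def merge_predictions(layer0_dets, layer1_dets, iou_thresh=0.5):
    """Pair each Layer0 detection (box, label) with the first Layer1
    detection whose box overlaps strongly; unmatched detections keep
    only their Layer0 label, as in the table above."""
    merged = []
    for box0, label0 in layer0_dets:
        labels = (label0,)
        for box1, label1 in layer1_dets:
            if iou(box0, box1) >= iou_thresh:
                labels = (label0, label1)
                break
        merged.append((box0, labels))
    return merged
```

A Person box that overlaps a Men box becomes (Person, Men), while a Person with no Layer1 match stays (Person), reproducing the pattern in the table.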

Future Work

  • In this project, we addressed the class-imbalance issue by taking a fixed number of images per class. This is not the ideal way to handle class imbalance, because we discarded too much data that might have increased the accuracy of our model. Alternatively, we could address the issue by up-sampling the classes with fewer images: assign a higher probability of selecting data from classes with few images and a slightly lower probability to classes whose data are abundant.
  • Instead of training two different models, we could build an end-to-end model with a suitable training procedure that learns to distinguish parent, child, and grandchild class labels in a single network. This would remove the ensemble, decrease the model size, and make the overall process faster.
  • Mask R-CNN is a large model with a slow processing time of around five frames per second. The model size and processing time would need to be reduced to make it useful in applications like self-driving cars and other real-time systems.
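The up-sampling idea in the first bullet could, for instance, use inverse-frequency sampling probabilities; this is an illustrative scheme, not a tuned one:

```python
def sampling_weights(counts):
    """Inverse-frequency sampling probabilities from a map of
    class -> number of available images: classes with fewer images
    get a proportionally higher chance of being drawn."""
    inv = {c: 1.0 / n for c, n in counts.items()}
    total = sum(inv.values())
    return {c: w / total for c, w in inv.items()}
```

A class with 1 image versus a class with 4 images would be sampled with probabilities 0.8 and 0.2 respectively, so rare classes are seen far more often per image they own.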

Here is the link to my implementation

References:

  1. Open Images Dataset: https://storage.googleapis.com/openimages/web/index.html
  2. Instance Segmentation using Mask R-CNN (Towards Data Science): https://towardsdatascience.com/instance-segmentation-using-mask-r-cnn-7f77bdd46abd
  3. matterport/Mask_RCNN (GitHub): https://github.com/matterport/Mask_RCNN
  4. Review: DeepMask, Instance Segmentation (Towards Data Science): https://towardsdatascience.com/review-deepmask-instance-segmentation-30327a072339
  5. He et al., Mask R-CNN: https://arxiv.org/abs/1703.06870
  6. Splash of Color: Instance Segmentation with Mask R-CNN and TensorFlow (Matterport Engineering): https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46
  7. Open Images 2019 Instance Segmentation (Kaggle): https://www.kaggle.com/c/open-images-2019-instance-segmentation
