Real Time Instance Segmentation with Detectron2

Dec 12, 2022

Introduction:

In computer vision, segmentation deals with two kinds of scene content:

Thing: Any countable entity, such as a person, bird, flower or car, is termed a thing.
Stuff: An uncountable, amorphous region of uniform texture, such as the sky, is termed stuff.

The study of things falls under instance segmentation, where each instance is given its own annotation. Semantic segmentation tries to classify every pixel in the image, whereas instance segmentation tries to classify only the pixels that fall inside the detected boxes, producing one mask per object.
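To make the difference concrete, here is a minimal sketch on a toy 3x6 image. The masks, class ids and the helper `to_semantic_map` are made up for illustration; they are not taken from any dataset or library.

```python
PERSON = 1  # class id for a "thing"; 0 stands for background "stuff" such as sky

# Instance segmentation output: one binary mask per detected object.
instance_masks = [
    [[1, 1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    [[0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 0, 1, 1]],
]
instance_classes = [PERSON, PERSON]


def to_semantic_map(masks, classes):
    """Collapse per-instance masks into a single class id per pixel."""
    height, width = len(masks[0]), len(masks[0][0])
    semantic = [[0] * width for _ in range(height)]
    for mask, class_id in zip(masks, classes):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    semantic[y][x] = class_id
    return semantic


# Semantic segmentation output: both persons end up with the same label,
# so the information about which pixel belongs to which person is gone.
semantic_map = to_semantic_map(instance_masks, instance_classes)
for row in semantic_map:
    print(row)
```

The instance output keeps two separate masks, while the semantic map only says "person" or "background" for each pixel.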

Segmentation tasks can be classified as:

Instance Segmentation.
Semantic Segmentation.
Panoptic Segmentation.

Fig.1 Different Types of Segmentation Tasks

Instance Segmentation

Instance segmentation mainly tries to solve two tasks:

Object Detection: Detect all the ROIs (Regions of Interest) where objects are present.
Classification and Mask Prediction: Classify the object and find its mask in every detected ROI.

Object detection algorithms can be classified as single-stage or two-stage.

Fig.2 Two-Stage vs Single-Stage Object detection

Based on the type of object detection used, instance segmentation can be broadly classified into two types:

Two-Stage Instance Segmentation (Detect-then-Segment): Detection and mask generation are performed sequentially. Mask RCNN and MaskLab are examples of two-stage instance segmentation.
Single-Stage Instance Segmentation (Detect and Segment): Detection and mask generation are done simultaneously. Examples are YOLACT, CenterMask etc.

Fig.3 Two-Stage vs Single-Stage Instance Segmentation

Real Time Instance Segmentation:

Instance segmentation performance (i.e. speed) mainly depends on:

Which object detection algorithm is used? Is it anchor-based or anchor-free?

Anchor-based object detectors have the following drawbacks:
- Detection performance depends on the size, aspect ratio and number of anchor boxes.
- Predefined anchor boxes do not generalize well to objects of different shapes.
- Most anchor boxes are labelled as negative samples, which leads to a class imbalance against the positive samples.
- Anchors require IoU computation against the ground-truth boxes, which adds to the cost.
- They have difficulty with large shape variations, especially for small objects.

Is the object detection algorithm single-stage or two-stage?
Does the object detection algorithm use post-processing steps such as NMS (Non-Maximum Suppression)?
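A quick back-of-the-envelope count shows why the number of anchor boxes matters. The sketch below assumes a feature pyramid with strides 8 to 128 (P3 to P7) and 9 anchors per location, which is the RetinaNet-style setup; an anchor-free detector such as FCOS makes one prediction per location instead. The function names and the example input size are illustrative only.

```python
import math


def locations_per_level(height, width, strides):
    """Number of feature-map locations at each pyramid level."""
    return [math.ceil(height / s) * math.ceil(width / s) for s in strides]


def count_predictions(height, width, strides=(8, 16, 32, 64, 128),
                      anchors_per_location=9):
    """Compare how many candidates each detector family has to score.

    strides and anchors_per_location are assumptions for this sketch.
    """
    locations = sum(locations_per_level(height, width, strides))
    return {
        "anchor_free": locations,
        "anchor_based": locations * anchors_per_location,
    }


counts = count_predictions(1536, 2048)
print(counts)  # {'anchor_free': 65472, 'anchor_based': 589248}
```

Under these assumptions the anchor-based detector has to classify, regress and match roughly nine times as many candidates, most of which end up as negatives.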

Fig.4 Post Processing Steps in Instance Segmentation

All of the above factors decide the real-time performance of instance segmentation. To run in real time, the instance segmentation architecture should be single-stage, anchor-free and NMS-free.
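To see where the NMS cost comes from, here is a minimal pure-Python sketch of greedy NMS. It is written for readability, not speed, and is not the implementation used by any particular framework; the boxes, scores and the 0.5 threshold are made-up example values.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Every candidate is compared against all boxes kept so far.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first one and is suppressed
```

The pairwise IoU checks run sequentially over the sorted candidates, which is why architectures that avoid this step have an advantage in latency.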

Here is a list of instance segmentation architectures that have been developed over time.

Fig.5-Different Instance Segmentation Architectures

Which Architecture to Use:

Normally, any instance segmentation architecture has a backbone (to extract features), an object detection branch (for ROI prediction) and a mask prediction branch. Speed and accuracy depend on what you select for each of these components.

Fig.6- Basic Building block of Instance Segmentation

For the backbone, you can use ResNet-50/101, Xception, FPN, VoVNetV2, MobileNet etc. If real-time performance is the objective, use a lightweight architecture such as ResNet-50, MobileNet or VoVNet-19. If the requirement is to identify small or tiny objects in a scene, use a larger architecture such as ResNet-101 or VoVNetV2-99.

For object detection, the best-known state-of-the-art (SOTA) algorithms are FCOS, RetinaNet, DETR (transformer-based) etc., and they are used in many state-of-the-art instance segmentation architectures. Here are some of them:

Fig.7-Different Object Detection Architectures and their use in different Instance Segmentation Architectures

Mask prediction is the most important part of instance segmentation, and different architectures use different techniques to achieve good masks. For example, Mask-RCNN uses only a mask branch to predict the mask, whereas MaskScoring-RCNN uses two branches, a mask branch and a mask-scoring branch, to improve mask prediction.
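The idea behind the extra mask-scoring branch can be sketched in a few lines. In MaskScoring-RCNN, the second branch learns to predict the IoU between the predicted mask and the ground-truth mask, and at inference the classification score is multiplied by that predicted IoU. The helper names and the toy masks below are illustrative assumptions, not code from the original implementation.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


def rescore(cls_score, predicted_mask_iou):
    """Final mask score: classification confidence scaled by mask quality."""
    return cls_score * predicted_mask_iou


pred_mask = [[1, 1, 1, 0],
             [1, 1, 1, 0]]
gt_mask = [[0, 1, 1, 1],
           [0, 1, 1, 1]]

# During training, this IoU is the regression target of the mask-scoring branch.
target_iou = mask_iou(pred_mask, gt_mask)

# At inference there is no ground truth; the branch outputs an IoU estimate.
# Here we simply reuse target_iou to stand in for that estimate.
print(target_iou, rescore(0.9, target_iou))
```

A detection with a confident class but a sloppy mask is therefore ranked lower than it would be with the classification score alone.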

Fig.8- Mask Branch in Mask-RCNN and MaskScoring-RCNN

We now have clarity on what instance segmentation is and how it works. However, as Fig.5 shows, there are many architectures for solving the instance segmentation task, so the question is which one to choose. To answer it, I need to ask myself which problem I am trying to solve and whether it requires 30 FPS.

Let’s say my problem falls into one of the categories below:

Use Case-1: Segment large objects in real time on a low-compute device. For example, autonomous vehicles demand real-time performance while many neural networks run in parallel and compute capability is limited.

Choose a single-stage, anchor-free and post-processing-free architecture, so that inference can be done in real time.

Use Case-2: Segment small objects on a system with high compute capability. In the healthcare domain, the main objective is to detect small objects accurately, and there is no system constraint.

Use transformer-based instance segmentation architectures such as QueryInst, ISTR etc., which offer the best accuracy among the architectures discussed here.

Fig.9- FPS Comparison between different Architectures

From Fig.1.1, it is clear that SOLOv2 is better than YOLACT (You Only Look At CoefficienTs) in terms of both speed and accuracy. From Fig.1.2, QueryInst is better than SOLOv2. Finally, from Fig.1.3, we find that CenterMask and its variants are better than SOLOv2 and YOLACT.

Let’s evaluate it by writing some code.

Implementation of Instance Segmentation with Detectron2 and AdelaiDet Framework.

I used the AdelaiDet and Detectron2 frameworks to perform instance segmentation on the Balloon and Cityscapes datasets. My source code is available at https://github.com/satya15july/instance_segmentation. Please give it a try.
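For readers who have not used Detectron2 before, the following is a minimal inference sketch. It is not the code from the repository above: it assumes Detectron2 and OpenCV are installed and uses the stock Mask R-CNN (ResNet-50 FPN) config from the Detectron2 model zoo as a stand-in. The function names `build_predictor` and `segment_image` are hypothetical helpers for this sketch.

```python
# Assumption: a stock model-zoo config, not one of the AdelaiDet/CenterMask models.
MASK_RCNN_CONFIG = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"


def build_predictor(config_name=MASK_RCNN_CONFIG, score_threshold=0.5, device="cpu"):
    """Create a Detectron2 DefaultPredictor from a model-zoo config."""
    # Imported lazily so this file can be loaded without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_name))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = score_threshold
    cfg.MODEL.DEVICE = device
    return DefaultPredictor(cfg)


def segment_image(predictor, image_path):
    """Run instance segmentation on one image file."""
    import cv2

    image = cv2.imread(image_path)  # BGR, DefaultPredictor's default input format
    instances = predictor(image)["instances"].to("cpu")
    # One class id, one score and one binary mask per detected instance.
    return instances.pred_classes, instances.scores, instances.pred_masks
```

Typical usage would be `predictor = build_predictor()` followed by `segment_image(predictor, "some_image.jpg")`. AdelaiDet and centermask2 models come with their own config extensions and model definitions, so they need the setup from their respective repositories rather than this stock path.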

AdelaiDet (https://github.com/aim-uofa/AdelaiDet) is written on top of Detectron2 and provides implementations of FCOS (Fully Convolutional One-Stage object detection) and of several instance segmentation architectures, for example:

SOLOv2
CondInst
BoxInst etc

(Note: CenterMask is hosted in a different repo (https://github.com/youngwanLEE/centermask2). Although it is FCOS-based, it is not part of the AdelaiDet repo. I made some changes to make it work with the AdelaiDet framework.)

Fig.10-Inference Time with different Architectures

From the above figure, it is clear that the CenterMask variants CenterMaskLite_MV2 (MobileNetV2) and CenterMaskLite_V19* (VoVNetV2) perform best, as suggested by the data from Fig.7. The evaluation here is done with an input size of (1536, 2048, 3); if you reduce the input size, the CenterMask architecture will run considerably faster.
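If you want to reproduce this kind of comparison on your own hardware, a small timing helper is enough. The sketch below is a generic, assumed setup: `run_inference` stands for any callable that processes one frame (for example a Detectron2 predictor), and the dummy workload at the bottom is only there to make the example runnable.

```python
import time


def measure_fps(run_inference, frames, warmup=2):
    """Average latency (ms) and FPS of run_inference over a list of frames."""
    for frame in frames[:warmup]:
        run_inference(frame)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for frame in frames:
        run_inference(frame)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / len(frames)
    return {"latency_ms": latency_ms, "fps": 1000.0 / latency_ms}


def meets_budget(latency_ms, target_fps=30):
    """True if the per-frame latency fits the budget for the target frame rate."""
    return latency_ms <= 1000.0 / target_fps


# Dummy workload standing in for a real model.
stats = measure_fps(lambda frame: sum(range(20000)), frames=list(range(10)))
print(stats, meets_budget(stats["latency_ms"]))
```

A 30 FPS target leaves about 33 ms per frame for the whole pipeline. When timing on a GPU, the device has to be synchronized before the timer is read, otherwise the measured latency is too optimistic.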

Let’s discuss the CenterMask architecture.

CenterMask:

CenterMask introduces a few design improvements for instance segmentation:

Optimized VoVNet (VoVNetV2): VoVNet was proposed in https://arxiv.org/abs/1904.09730, and the authors of CenterMask optimized it with the following strategies:
- Residual connections, to alleviate the optimization problem of larger VoVNets.
- Effective Squeeze-Excitation (eSE), to deal with the channel information loss of the original SE.
FCOS with Adaptive ROI Assignment Function: The ROIs predicted by FCOS are assigned to feature levels according to the ratio between the input image area and the ROI area.
SAG-Mask (Spatial Attention Guided Mask): It mainly helps to focus on important features and suppress unnecessary ones.
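The eSE idea from the list above can be sketched without any deep learning framework. The original SE block squeezes the channels through two fully connected layers with a reduction in between, which is where channel information gets lost; eSE keeps a single fully connected layer with the full channel count. The sketch below uses plain Python lists and a sigmoid gate; the function names, shapes and weights are illustrative assumptions, not the CenterMask code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ese_attention(feature_map, weight, bias):
    """Channel attention with a single fully connected layer (no reduction).

    feature_map: list of C channels, each a flat list of pixel values.
    weight: C x C matrix of the fully connected layer; bias: list of length C.
    """
    # Squeeze: global average pooling gives one number per channel.
    pooled = [sum(channel) / len(channel) for channel in feature_map]
    # Excite: one C -> C layer followed by a gate in (0, 1).
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    gates = [sigmoid(z) for z in logits]
    # Re-weight every channel by its gate.
    refined = [[g * v for v in channel] for g, channel in zip(gates, feature_map)]
    return gates, refined


features = [[1.0, 2.0, 3.0, 4.0],   # channel 0
            [4.0, 4.0, 4.0, 4.0]]   # channel 1
zero_weight = [[0.0, 0.0], [0.0, 0.0]]
gates, refined = ese_attention(features, zero_weight, bias=[0.0, 0.0])
print(gates)  # [0.5, 0.5] because all logits are zero
```

With learned weights, informative channels get gates close to 1 and less useful ones are scaled down.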

Fig.12- Optimized VoVNetV2 Used in CenterMask

Fig. ROIAlign level assignment used in MaskRCNN. k0 is the reference feature level (k0=4 if the feature map is P4); w and h are the width and height of the ROI.

Fig. AdaptiveROIAlign. kmax = last level (P7), Ainput = area of the input image and ARoI = area of the Region of Interest.
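The two assignment rules from the captions above can be compared in code. The first function follows the FPN rule used with Mask R-CNN, k = floor(k0 + log2(sqrt(w*h) / 224)); the second follows the CenterMask rule, k = ceil(kmax - log2(Ainput / ARoI)). The clamping ranges (P2 to P5 for the first, a minimum level of P3 for the second) and the function names are assumptions of this sketch.

```python
import math


def fpn_roi_level(w, h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Level assignment relative to a fixed canonical ROI size of 224 pixels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return min(max(k, k_min), k_max)


def adaptive_roi_level(roi_area, input_area, k_max=7, k_min=3):
    """Level assignment relative to the size of the input image."""
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    return min(max(k, k_min), k_max)


input_area = 1536 * 2048
for w, h in [(224, 224), (768, 1024), (1536, 1024)]:
    print((w, h), fpn_roi_level(w, h), adaptive_roi_level(w * h, input_area))
```

The fixed rule always maps a 224x224 ROI to the same level, no matter how large the input image is, whereas the adaptive rule looks at how much of the image the ROI covers.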

Conclusion:

Here we discussed what instance segmentation is, how it works, how its speed is affected by different design choices and which architectures are available to date. If your objective is to perform instance segmentation in real time on a CPU-based target, then I think CenterMask is the right choice. If your target device is CPU + CUDA, you can still use CenterMask, but it is worth trying YOLACTEdge, which makes better use of CUDA compute capability. I will try to include it in the future.

If your objective is accuracy rather than speed, then it is worth trying transformer-based instance segmentation architectures such as QueryInst, SOTR, SOLQ, ISTR etc.

I hope this short article gives you some insight into instance segmentation and how to use it. I have shared my source code: https://github.com/satya15july/instance_segmentation.

Please do not forget to subscribe to my Medium blog.

Reach me at

LinkedIn: www.linkedin.com/in/satya1507
Mail-Id: satya15july@outlook.com