Panoptic Segmentation Explained

Prakhar Bansal
Dec 10, 2021


By: Vishu Agarwal, Prakhar Bansal, Suchit Das, Serena Wu

Panoptic Segmentation

Abstract

A picture is worth a thousand words

We all know that a single image may convey a message more effectively than many written words. But what does an image consist of? When it comes to image segmentation, a common answer is “things” and “stuff”.

The concept of things and stuff is used when describing image segmentation methods such as instance and semantic segmentation. Instance segmentation identifies countable objects (things), while semantic segmentation labels amorphous regions of similar texture or material (stuff).

The first image shows that instance segmentation lets us identify each car, person, and basically every thing. The second image shows that semantic segmentation marks each stuff class with a different color. It does not differentiate instances of the same object: for example, all cars are masked with dark blue, and the whole ground is marked with purple.

Instance + Semantic = Panoptic Segmentation

What about the last image then? It is panoptic segmentation, a combination of both. When we perform panoptic segmentation, we assign each pixel not only a semantic label but also an instance id. Panoptic segmentation unifies scene-level and subject-level understanding.

Introduction

It is an unending quest for data scientists and AI engineers to replicate, with machine learning models, what we humans exhibit on a daily basis. One such task is panoptic segmentation, which has been widely researched in the labs of prominent companies such as Facebook, Uber, and Tesla. Instance and semantic segmentation enable us to find things and stuff in an image, but each alone fails to give us a comprehensive view of the big picture.

Andrej Karpathy, Senior Director of AI at Tesla, posted a tweet about one of Tesla's panoptic segmentation projects on Nov 30, 2021. For autonomous driving, it is very important to know both the objects around the car and the surface it is driving on in order to navigate streets safely. Based on the video, Tesla seems to be very close to having a reliable computer vision system that can identify roads and the relevant objects on them.

Panoptic Segmentation in Autonomous Driving

Another example of panoptic segmentation in use is Apple's on-device panoptic segmentation for the camera using transformers. The iPhone can have pixel-wise comprehension of the person and of what the background is composed of. As mentioned in their Machine Learning Research blog in October 2021: “Our approach to panoptic segmentation makes it easy to scale the number of elements we predict, for a fully parsed scene, to hundreds of categories. This year we’ve reached an initial milestone of predicting both subject-level and scene-level elements with an on-device panoptic segmentation model that predicts the following categories: sky, person, hair, skin, teeth, and glasses.” This cutting-edge machine learning provides pixel-level understanding for the camera and enables a wide range of features in the camera app.

Evaluating Panoptic Segmentation: PQ Metric

Common ways to measure the performance of image segmentation include intersection over union (IoU) and average precision (AP). IoU is used to evaluate semantic segmentation results and is calculated by dividing the number of pixels in the intersection of the predicted and ground-truth regions by the number of pixels in their union. This method ignores instance ids, so it is not suitable for evaluating panoptic segmentation results. On the other hand, AP requires each detected object to have a confidence/probability score in order to produce a precision/recall curve. Panoptic segmentation combines instance and semantic segmentation, and semantic (stuff) predictions have no such scores, so AP cannot evaluate the joint task either.

Existing metrics thus focus entirely on evaluating either semantic or instance segmentation and cannot be used for the joint task of evaluating both the stuff and thing classes. As a result, the Facebook AI Research team came up with a novel approach to evaluating panoptic segmentation: the Panoptic Quality (PQ) metric. This approach consists of two steps:

1) Segment Matching

2) PQ computation given all the correct matches

To better understand the PQ metric evaluation, we will need to understand what segment matching is and how it flows into computing the PQ metric.

Segment Matching

Segment matching, at a high level, is nothing but a ratio we call intersection over union (IoU): the area of overlap between a ground-truth segment and a predicted segment divided by the total area covered by the two segments together. A predicted segment and a ground-truth segment can match only if their intersection over union is strictly greater than 0.5, a threshold that also guarantees each ground-truth segment matches at most one prediction.

If the ratio is 0.5 or below, there is not enough overlap between the two segments, and hence no match.

Intersection Over Union Evaluation Metric
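To make the matching rule concrete, here is a minimal Python sketch (not from the original paper or our repository; the function names are just illustrative) that computes the IoU of two binary masks and applies the 0.5 threshold:

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / union if union > 0 else 0.0

def is_match(pred_mask: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5) -> bool:
    """A predicted segment matches a ground-truth segment only if IoU > threshold."""
    return mask_iou(pred_mask, gt_mask) > threshold
```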

PQ computation

After segment matching, we take the IoUs of all matched segment pairs (the true positives) and sum them; this tells us how well the algorithm segments the objects it does find. Rather than simply averaging, we then divide this sum by a blend of precision and recall: the number of true positives plus half of the false positives plus half of the false negatives. The denominator penalizes the score for every wrong or missed prediction and yields a single number that better reflects how the algorithm performs on the input images.

We can also split the above metric into two parts by multiplying and dividing the original PQ formula by the number of true positives:

Decomposing PQ Metric

The first part is called segmentation quality (SQ), which tells us how well the algorithm performed on the correct matches (the average IoU of the matched segments), and the second part is called recognition quality (RQ), which penalizes wrong and missed predictions.
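As an illustration, and not the official COCO panoptic evaluation code, PQ, SQ, and RQ can be computed from the matched IoUs and the counts of unmatched segments roughly like this (variable names are hypothetical):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute (PQ, SQ, RQ) from the IoUs of matched (true-positive) segment
    pairs and the counts of unmatched predictions (FP) and unmatched
    ground-truth segments (FN)."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0  # simplification: no matches means zero quality
    sq = sum(matched_ious) / tp                    # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)   # recognition quality
    return sq * rq, sq, rq                         # PQ = SQ * RQ

# Example: three matches with IoUs 0.8, 0.7, 0.9, one false positive, one false negative
pq, sq, rq = panoptic_quality([0.8, 0.7, 0.9], num_fp=1, num_fn=1)
```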

Original Implementation of Panoptic Segmentation

Previous methods of panoptic segmentation had two separate branches designed for semantic and instance segmentation individually.

Individual Implementation of Panoptic Segmentation

The semantic segmentation branch is implemented with a Fully Convolutional Network (FCN). An FCN transforms the intermediate feature maps back to the size of the input image. The network consists of down-sampling, which is used to extract and interpret context, and up-sampling, which is used to recover the location information.

Encoder-Decoder implementation of Semantic Segmentation

A down-sampling layer helps to reduce the dimensionality of the features at the cost of some loss in information, which saves computation. Average pooling and max pooling are common examples of down-sampling layers. The opposite of pooling layers are up-sampling layers, which in their purest form simply resize the feature map (for example, by copying each pixel as many times as needed).
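A toy PyTorch sketch of this encoder-decoder idea (purely illustrative, not the network used by any of the papers discussed here): max pooling halves the spatial resolution on the way down, and bilinear interpolation restores it on the way up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Toy encoder-decoder: down-sample to gather context, up-sample back
    to per-pixel class scores."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.classifier = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        x = F.max_pool2d(self.enc1(x), 2)   # down-sample to 1/2 resolution
        x = F.max_pool2d(self.enc2(x), 2)   # down-sample to 1/4 resolution
        x = self.classifier(x)              # per-pixel class logits at 1/4 resolution
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)  # up-sample back

logits = TinyFCN(num_classes=21)(torch.randn(1, 3, 224, 224))  # shape (1, 21, 224, 224)
```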

Mask R-CNN for Instance Segmentation

FPN based implementation of Semantic Segmentation

For instance segmentation we leverage Mask R-CNN. The input image is passed through a ResNet and a Feature Pyramid Network, which together act as the feature extractor. The resulting feature maps then go through a Region Proposal Network (RPN), which proposes regions of interest on the input image; these regions are passed through further convolutional heads to produce three outputs: the class output, the box output, and the mask output.

Regions of Interest for Instance Segmentation

The first two outputs give each instance in the input image a class label and a bounding box. Initially the algorithm may produce many overlapping bounding boxes for the same instances, but after the region proposals are processed and refined we get an image segmented into distinct instances with proper bounding boxes.

Instance Segmentation output with Bounding Box and Class Label

In parallel, the proposed regions are also fed to a mask head, which gives us an individual mask for each instance; combining these masks produces an image that looks something like the one below:

Final output of Instance Segmentation
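To see these three outputs (classes, boxes, masks) in practice, here is a short sketch using torchvision's pre-trained Mask R-CNN; this is not the code from our repository, just one readily available implementation.

```python
import torch
import torchvision

# Pre-trained Mask R-CNN with a ResNet-50 + FPN backbone
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

image = torch.rand(3, 480, 640)   # stand-in for a real RGB image with values in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]

print(prediction["labels"])        # class output: one label per detected instance
print(prediction["boxes"])         # box output: one bounding box per instance
print(prediction["masks"].shape)   # mask output: one soft mask per instance
```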

Unified Network Implementation for Panoptic Segmentation

FAIR (Facebook AI Research) published a paper proposing a unified Panoptic FPN to perform both semantic and instance segmentation in one shot. Their idea was to leverage the FPN (Feature Pyramid Network) used for feature extraction as the backbone of a combined network capable of performing both semantic and instance segmentation. Before discussing the motivation, logic, and working of this newly proposed network, let's review the key ideas behind FPN.

Proposed Unified FPN Network for Panoptic Segmentation

Key Ideas about FPN

The bottom-up pathway in a Feature Pyramid Network, shown below, is the usual convolutional network leveraging convolutional and max-pooling layers for feature extraction. As we go up the layers, the semantic value of each layer increases while the spatial resolution decreases. To understand this better, think of it this way: as we go up, roughly the same amount of semantic information is spread across fewer pixels, so each pixel carries more semantic information than in the bottom layers.

Bottom-up pathway in FPN

Now that we have semantically rich layers, we want to construct higher-resolution layers that are rich in semantic information as well. To do so, FPN provides a top-down pathway. Up-sampling is performed to reconstruct the layers in this top-down pathway. There is a small caveat, however: the reconstructed layers are semantically rich, but the locations of objects in the image might not be precise due to all the down-sampling and up-sampling performed in the FPN. This is overcome by adding lateral connections between the reconstructed layers and the corresponding bottom-up feature maps, which help detectors predict locations more accurately. Now that we have discussed the key ideas behind FPN, let's discuss the Unified Panoptic FPN proposed by FAIR.

FPN Backbone for Panoptic
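Before moving on to the unified network, here is a minimal sketch of the top-down pathway with lateral connections (illustrative only; the channel counts and module names are arbitrary choices, not those of any specific implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Given bottom-up feature maps C3 (high res) .. C5 (low res), build
    semantically rich maps P3 .. P5 at the same resolutions."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        # lateral connection + up-sampled coarser map
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
```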

Unified Panoptic FPN

Let’s try to understand the motivation behind this network which might help us answer questions like — Why did FAIR decide to choose FPN as the shared backbone? What was the reasoning behind it?

The top-right path (1) of this Unified Panoptic FPN is similar to the Mask R-CNN structure we discussed previously, and hence it yields the expected results for instance segmentation. However, the bottom-right segment (2), the semantic segmentation head, is where things get interesting. Traditionally, semantic segmentation leverages encoder-decoder networks that perform down-sampling and up-sampling to identify similar regions. Since FPN also performs down-sampling and up-sampling by its very nature and is already part of Mask R-CNN, FAIR proposed using FPN as a shared backbone for both semantic and instance segmentation.
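A highly simplified sketch of such a semantic head on top of FPN features (the real Panoptic FPN head uses several conv/up-sampling stages per level; this only illustrates the idea of merging FPN levels into one per-pixel prediction, and all names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSemanticHead(nn.Module):
    """Merge FPN levels into a single semantic segmentation map."""
    def __init__(self, fpn_channels, num_classes):
        super().__init__()
        self.reduce = nn.Conv2d(fpn_channels, 128, 3, padding=1)
        self.predict = nn.Conv2d(128, num_classes, 1)

    def forward(self, fpn_feats, out_size):
        # Up-sample every FPN level to the finest resolution and sum them
        target = fpn_feats[0].shape[-2:]
        merged = sum(F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                     for f in fpn_feats)
        logits = self.predict(F.relu(self.reduce(merged)))
        # Final up-sample to the input image size
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

head = SimpleSemanticHead(fpn_channels=256, num_classes=54)  # e.g. stuff classes + "other"
p2, p3, p4 = (torch.randn(1, 256, s, s) for s in (64, 32, 16))
sem_logits = head([p2, p3, p4], out_size=(256, 256))
```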

Implementation

The Facebook AI Research team has released Detectron2, which contains state-of-the-art computer vision models. The Detectron2 GitHub repository provides open-source model weights for many computer vision algorithms, including:

  1. Object detection
  2. Instance segmentation (Mask R-CNN)
  3. Dense pose estimation
  4. Panoptic segmentation

Dataset

Detectron2's panoptic segmentation models are trained on the COCO (Common Objects in Context) dataset. COCO is a widely used visual detection dataset with a focus on full scene understanding.

To train panoptic segmentation, Detectron2 used 118K images for training and 5K images for evaluation. Because panoptic segmentation requires both things (instances) and stuff (semantic regions) to be labeled in a single image, the annotation required is different from that of other vision tasks.

Sample Example of COCO Dataset annotation for Panoptic Segmentation:

Image annotation example for Panoptic

Through the COCO dataset, we can detect 80 thing classes (person, bicycle, elephant) and 91 stuff classes (grass, sky, road).
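For reference, a single COCO panoptic annotation pairs a PNG, whose pixel colors encode segment ids, with a JSON entry roughly like the one sketched below (abridged, with made-up values for illustration):

```python
annotation = {
    "image_id": 139,
    "file_name": "000000000139.png",  # PNG whose pixels encode segment ids
    "segments_info": [
        {"id": 3226956, "category_id": 1, "iscrowd": 0,    # a "thing" segment
         "bbox": [413, 158, 53, 138], "area": 2840},
        {"id": 6391959, "category_id": 184, "iscrowd": 0,  # a "stuff" segment
         "bbox": [0, 0, 640, 480], "area": 94751},
    ],
}
```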

Pre-Trained Panoptic Segmentation on an Image

Original image which we will use to evaluate Panoptic Segmentation
Panoptic segmentation detects both instances and semantics in an image: it distinguishes individual umbrellas and people while also labeling the grass, road, and sky regions
Final results of Panoptic

The image above shows the final results of panoptic segmentation, where the labels obtained from the model are superimposed on the original image. As we can see, it provides a full-scene understanding of the image.
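The result above can be reproduced with only a few lines of Detectron2. The following is a sketch based on Detectron2's standard model-zoo API; the config name is one of the published panoptic FPN configs and the image path is a placeholder.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml")
# cfg.MODEL.DEVICE = "cpu"  # uncomment to run without a GPU
predictor = DefaultPredictor(cfg)

image = cv2.imread("input.jpg")  # placeholder path to the input image
panoptic_seg, segments_info = predictor(image)["panoptic_seg"]

# Overlay the predicted segments on the original image
viz = Visualizer(image[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]))
out = viz.draw_panoptic_seg_predictions(panoptic_seg.to("cpu"), segments_info)
cv2.imwrite("panoptic_result.jpg", out.get_image()[:, :, ::-1])
```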

Our implementation of the above results is available at the GitHub link below. It relies extensively on Detectron2's implementation of panoptic segmentation.

Panoptic Implementation Code

The GitHub repository contains our panoptic segmentation implementation, which builds on Detectron2.

Link: https://github.com/prakharb13/Panoptic-Segmentation

Next Steps

Panoptic segmentation requires images to be annotated differently than other computer vision tasks do. Hence, it can currently be performed on only a handful of datasets, such as COCO, Cityscapes, the Indian Driving Dataset, Mapillary Vistas, and ADE20K, which have the required annotations. The next step is to create an ecosystem in which many more datasets are annotated accordingly, so that panoptic segmentation can be applied to a wider spectrum of use cases.

Training panoptic segmentation models on local machines without a powerful GPU is not feasible due to the computational bottleneck and the large number of training images. To industrialize its use, the computer vision industry could develop easier-to-use wrappers that produce results on custom datasets and expand its use cases across multiple domains.

Panoptic Segmentation is a full-scene segmentation combining both Instance and Semantic Segmentation and has use cases in many industries. As time progresses, we expect researchers to come up with improved architectures to overcome the current roadblocks and limitations.

References

Kirillov, A., et al. (2019). Panoptic Segmentation.

Xiong, Y., et al. (2019). UPSNet: A Unified Panoptic Segmentation Network.

Apple Machine Learning Research (2021). On-device Panoptic Segmentation for Camera Using Transformers.

Karpathy, A. [@karpathy]. (2021, November 30). Twitter.

Harshit K. (2019). Introduction to Panoptic Segmentation: A Tutorial. Medium.

Stanford University School of Engineering (2017). Lecture 11: Detection and Segmentation. YouTube.

Daniel M. (2019). Panoptic Segmentation: The Panoptic Quality Metric. Medium.
