Panoptic Segmentation Explained
A more holistic understanding of scenes for computer vision
About a quarter of Hasty’s users are part of the research community. We reached out to them to learn more about their research, and we learned so much during these calls that we asked some of the researchers to write a bit about their work for our blog. This time, Juan Lagos Benitez, who’s doing his Ph.D. at Tampere University, shares his thoughts on panoptic segmentation.
If you want to learn more about the project, reach out to Juan on LinkedIn: https://www.linkedin.com/in/juanlagos91/. And definitely don’t forget to check out his GitHub profile, where he publishes his extensions to the most famous architecture for panoptic segmentation: https://github.com/juanb09111/Pynoptorch.
And yes, the dog in the image is Juan’s puppy. It’s super cute, isn’t it?
Computer vision and scene understanding have become game-changers in today’s world. As we move toward giving machines autonomous capabilities to perform tasks in a human-like fashion, understanding the surroundings, the objects around them, and the scene as a whole becomes pivotal. As humans, we do not merely see things as stimuli; we comprehend what we see. We give meaning to what our eyes capture. We also unconsciously assign attributes to the things we see, such as distance, density, number of objects, amount, speed, dangerousness, texture, and even temperature. We may not be very accurate in each of those instinctive measurements. Still, when we see something, we recognize hundreds of patterns that allow us to perform different tasks effectively, e.g., sports, driving, walking, playing video games, etc.
Panoptic segmentation combines instance segmentation and semantic segmentation to provide a more holistic understanding of a given scene than either of the two alone. In this post, I will walk you through the concept of panoptic segmentation and how it is helping machines view the world the way we see it. I will also briefly review a novel approach known as EfficientPS (http://panoptic.cs.uni-freiburg.de/), a deep convolutional neural network for panoptic segmentation that uses EfficientNet as its backbone for feature extraction, and show how to obtain ground truth annotations using Hasty.
Semantic segmentation refers to the task of classifying pixels in an image. It is done by predefining some target classes, e.g., “car”, “vegetation”, “road”, “sky”, “sidewalk”, or “background”, where “background” is in most cases a default class. Then, each pixel in the image is assigned to one of those classes. Here’s an example:
As you can see in the previous example, every pixel in the image was colored depending on its class; hence, every pixel belonging to a car is masked in blue and the same goes for the sidewalk, the vegetation, road, and the sky.
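To make this concrete, here is a minimal sketch in plain Python, assuming a model has already produced one score per class for every pixel. The class list, the toy 2x3 “image”, and the helper name `semantic_map` are made up for illustration and do not come from any particular library.

```python
# Hypothetical class list; index 0 is the default "background" class.
CLASSES = ["background", "car", "road", "sky"]


def semantic_map(scores):
    """Assign each pixel the index of its highest-scoring class.

    scores[row][col] is a list holding one score per entry in CLASSES.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A toy 2x3 image: each pixel holds made-up scores for the four classes.
scores = [
    [[0.1, 0.2, 0.1, 0.9], [0.1, 0.1, 0.2, 0.8], [0.2, 0.1, 0.1, 0.7]],
    [[0.1, 0.8, 0.3, 0.1], [0.1, 0.7, 0.4, 0.1], [0.1, 0.2, 0.9, 0.1]],
]

labels = semantic_map(scores)
label_names = [[CLASSES[i] for i in row] for row in labels]
```

Here `label_names` holds a row of “sky” pixels above two “car” pixels and one “road” pixel. Note that the two neighbouring “car” pixels simply share a label; nothing in this output says whether they belong to one car or to two.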
So far so good. But what if we want to dig deeper into the type of information we can extract here? Say, for example, we want to know how many cars are in one picture. Semantic segmentation is of no help here, as all we can get is a pixel-wise classification. For such a task, we need to introduce the concepts of object detection and instance segmentation.
Object Detection and Instance Segmentation
When we do object detection, we aim to identify bounded regions of interest in the image, each of which contains an object. Such objects are countable things such as cars, people, pets, etc. Object detection doesn’t apply to classes such as “sky” or “vegetation”, since they are usually spread across different regions of the image and you cannot count them one by one: there is only one “sky”, not several.
It is very common to use bounding boxes to indicate the region within which we will find a given object. Here’s an example:
In the previous image, there are three bounding boxes, one for each car in the image. In other words, we are detecting cars, and we can now say how many of them are in the image.
Now, not all the pixels inside those bounding boxes correspond to a car. Some of those pixels are part of the road; others belong to the sidewalk or the vegetation. If we want to obtain richer information from object detection, we can identify which pixels specifically belong to the class assigned to the bounding box. That is what is called instance segmentation. Strictly speaking, we perform pixel-wise segmentation for every instance (bounding box in our case) we detected. This is what it looks like:
So we went from a rough detection with a bounding box to a more accurate detection in which we can also identify instances and therefore count the number of objects of a given class. In addition to that, we know exactly which pixels belong to an object.
That sounds very good, but we still have no information about all the other non-instance classes such as “road”, “vegetation”, or “sidewalk”, as we did with semantic segmentation. That is where panoptic segmentation comes into play!
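The difference between a bounding box and an instance mask can be sketched in a few lines of plain Python. The `Instance` class, its fields, and the toy boxes and masks below are hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    label: str
    box: tuple   # (x_min, y_min, x_max, y_max), max edges exclusive
    mask: list   # box-sized grid of 0/1; 1 marks a pixel of the object


def box_area(inst):
    """Number of pixels covered by the bounding box."""
    x0, y0, x1, y1 = inst.box
    return (x1 - x0) * (y1 - y0)


def mask_area(inst):
    """Number of pixels inside the box that belong to the object."""
    return sum(sum(row) for row in inst.mask)


# Two made-up detections of the same class.
instances = [
    Instance("car", (0, 0, 3, 2), [[0, 1, 1], [1, 1, 1]]),
    Instance("car", (4, 1, 6, 3), [[1, 1], [1, 0]]),
]

car_count = sum(1 for inst in instances if inst.label == "car")
areas = [(box_area(inst), mask_area(inst)) for inst in instances]
```

Because every detection is a separate entry, counting cars is trivial, and comparing the two areas for each instance shows how many pixels of the box are not actually part of the car.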
As mentioned in the introduction of this post, panoptic segmentation is a combination of semantic segmentation and instance segmentation. To put it another way, with panoptic segmentation we can obtain information such as the number of objects for every instance class (countable objects), bounding boxes, and instance masks. We also get to know which class every pixel in the image belongs to, as in semantic segmentation. This certainly provides a more holistic understanding of a scene.
Following our example, panoptic segmentation would look like this:
We have now managed to get a representation of the original image that provides rich information about both semantic and instance classes together.
Now that we’ve covered the basics, let’s put this into practice. How can we do panoptic segmentation with a deep learning model? At this point, I’d like to introduce a model developed at the University of Freiburg called “EfficientPS” (http://panoptic.cs.uni-freiburg.de/).
EfficientPS is a deep learning model that makes panoptic predictions at a low computational cost by using a backbone built upon the EfficientNet architecture. It consists of:
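One simple way to picture such a representation is to give every pixel a pair (class, instance id), where non-countable classes always get instance id 0. The sketch below assumes this pairing convention along with hypothetical class ids and toy maps; real datasets use their own encodings.

```python
# Hypothetical class ids, split into non-countable and countable classes.
STUFF = {0: "road", 1: "sky"}
THINGS = {2: "car"}


def panoptic_map(semantic, instance_ids):
    """Pair each pixel's class with an instance id (0 for stuff classes)."""
    result = []
    for sem_row, inst_row in zip(semantic, instance_ids):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            row.append((cls, inst if cls in THINGS else 0))
        result.append(row)
    return result


# Toy 3x4 scene: sky on top, two cars on a road.
semantic = [
    [1, 1, 1, 1],
    [2, 2, 0, 2],
    [0, 0, 0, 0],
]
instance_ids = [
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 0],
]

pan = panoptic_map(semantic, instance_ids)
segments = {seg for row in pan for seg in row}
car_count = len({inst for cls, inst in segments if cls == 2})
```

From this single map we can read off both kinds of information: every pixel has a class, and the distinct (class, instance id) pairs tell us that the scene holds one road segment, one sky segment, and two separate cars.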
- A backbone network for feature extraction.
- Two output branches: one for semantic segmentation and one for instance segmentation.
- A fusion block that combines the outputs from both output branches.
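The way these three parts connect can be sketched as a small pipeline. This is only a structural illustration: the `PanopticPipeline` class and the dummy stand-in functions are hypothetical and are not the EfficientPS implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PanopticPipeline:
    """Hypothetical container; each field stands in for one module."""
    backbone: Callable[[Any], Any]
    semantic_head: Callable[[Any], Any]
    instance_head: Callable[[Any], Any]
    fusion: Callable[[Any, Any], Any]

    def __call__(self, image):
        features = self.backbone(image)            # shared feature extraction
        semantic = self.semantic_head(features)    # per-pixel class logits
        instances = self.instance_head(features)   # boxes, classes, mask logits
        return self.fusion(semantic, instances)    # single panoptic output


# Dummy stand-ins that only record which stages ran, and in what order.
pipeline = PanopticPipeline(
    backbone=lambda image: f"features({image})",
    semantic_head=lambda feats: f"semantic({feats})",
    instance_head=lambda feats: f"instances({feats})",
    fusion=lambda sem, inst: f"fusion({sem}, {inst})",
)

trace = pipeline("image")
```

Both heads consume the same backbone features, and the fusion step only sees their outputs, so in a layout like this any of the four parts could be swapped without touching the others.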
Here’s a diagram of the entire network:
Let’s take a closer look at its main modules. First of all, there’s the backbone network that produces four different outputs each with a different spatial resolution, thus obtaining global context features as well as localized features.
This backbone has three scaling parameters: width, depth, and input resolution. It scales them uniformly for better efficiency. Width refers to the number of channels used in its building blocks, depth refers to the number of repeated building blocks, and input resolution refers to the size of the image fed to the first layer, a.k.a. the input layer.
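The uniform scaling comes from EfficientNet’s compound scaling rule, in which a single coefficient drives all three dimensions. Below is a minimal sketch of that rule, using the base constants reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution). The function names are my own, and how EfficientPS configures its backbone is not covered here.

```python
# Base constants reported in the EfficientNet paper.
ALPHA = 1.2   # depth base
BETA = 1.1    # width base
GAMMA = 1.15  # resolution base


def compound_scale(phi):
    """Multipliers for depth, width, and resolution at coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi):
    """Approximate growth in compute: depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


baseline = compound_scale(0)   # every multiplier is 1.0
scaled = compound_scale(3)     # all three dimensions grow together
```

The constants are chosen so that `flops_factor(1)` is close to 2, meaning each step of the coefficient roughly doubles the compute while keeping depth, width, and resolution in balance.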
Then there’s the semantic segmentation output branch:
This network is much smaller than the backbone. It attempts to fulfill three requirements: capture fine features efficiently (large-scale), capture long-range context (small-scale), and, finally, mitigate the mismatch between large-scale and small-scale features.
As output, this branch returns N channels of logits, one per class, where N is the number of classes including “background”. In parallel to this branch, we have the instance segmentation output branch:
The instance segmentation branch resembles the Mask R-CNN architecture. It consists of a region proposal network (RPN) that is connected to two sub-networks. One of these sub-networks returns bounding boxes and their corresponding class predictions, while the other sub-network returns the corresponding mask logits.
Finally, there’s the fusion module, which is no longer a parametrized network but rather a heuristic series of blocks. First, it thresholds, filters, scales, and pads the outputs of the instance and semantic segmentation output branches. Second, it combines the logits by computing the Hadamard product, as shown in the following diagram:
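The Hadamard product step can be sketched in plain Python. To the best of my understanding of the EfficientPS paper, the fused score for each pixel is (sigmoid(a) + sigmoid(b)) * (a + b), where a and b are the logits of the two branches for the same instance. Treat the formula, the `fuse` helper, and the toy values as an illustration under that assumption rather than the reference implementation. The thresholding, filtering, scaling, and padding steps are assumed to have happened already, so both grids have the same size.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(semantic_logits, instance_logits):
    """Element-wise (Hadamard) fusion of two same-sized logit grids."""
    return [
        [(sigmoid(a) + sigmoid(b)) * (a + b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(semantic_logits, instance_logits)
    ]


# Made-up 2x2 logits for a single instance from each branch.
semantic_logits = [[2.0, -1.0], [0.0, 3.0]]
instance_logits = [[2.0, 1.0], [0.0, -1.0]]

fused = fuse(semantic_logits, instance_logits)
```

Where both branches are confident and agree (top-left pixel), the fused score is large. Where they disagree with equal strength (top-right pixel), the logits cancel and the fused score is zero.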
Hasty for Image Annotation
Finally, as data scientists, we know that data-driven approaches rely on a lot of data. In our case, we need a lot of images to train our model to do panoptic segmentation. As I was working on this topic, I found a great tool for annotating my images called Hasty (https://hasty.ai/). It is an online tool that not only provides a user-friendly interface to annotate images with well-known manual tools such as polygons and bounding boxes, but also allows for automatic annotations with a single click!
The way it works under the hood is that, as you annotate images, Hasty uses those annotations to train its own models. So you only have to annotate a few images, from which Hasty’s models will learn, and after that you no longer have to do all the annotations yourself. You can then easily correct the annotations proposed by Hasty. It is a great tool that will certainly save you a lot of time. In fact, all the sample images in this post were annotated using Hasty. Here’s a snapshot of one of my annotations done with Hasty:
Panoptic segmentation sets a milestone in scene understanding and computer vision. It gives more meaning and context to what a machine is “seeing” and therefore leads to better decision-making in autonomous machines. EfficientPS is a flexible network thanks to its modularity (backbone, semantic output branch, instance output branch, and fusion module). I’ve been working on a model based on this architecture myself. I aim to extend this network with other possible output branches and to try other backbone networks, such as ResNet, which could work better depending on the case. You can take a look at the repository here: https://github.com/juanb09111/Pynoptorch