What is Panoptic Segmentation and why you should care.

Jan 29, 2019

Image from https://www.pexels.com/@fotorobot

We humans are gifted in many ways, yet we are quite often oblivious to our own magnificence. Our capacity to decode and comprehend sounds, interpret and identify visual stimuli, and reason about situations to reach desirable outcomes is nothing short of amazing.

For years AI engineers have developed a plethora of tasks to challenge their machine learning models and try to replicate the brilliance we humans exhibit on a daily basis.

These research tasks eventually spin off into fantastic technologies:

The efficacy of devices such as Google Home and Amazon Alexa is a testament to the advancements in the natural language processing arena.
Whether it be to gain access to a device or to put a silly filter over our faces in a video conversation, the facial recognition technology in our smartphones is a byproduct of clever computer vision tasks and models.
Self-driving cars, arguably one of the most heavily invested-in emerging technologies today, are direct beneficiaries of many academic AI research tasks.

As marvelous as these technologies are, we are far from the pinnacle when it comes to AI research and subsequently reaping the technological rewards from this arena.

New research tasks emerge often. They drive new machine learning models, which in turn form new technological products, which in turn end up shaping the world we live in.

So today I want to provide you with a light introduction to a new research task, known as Panoptic Segmentation, leaving you with some thoughts about how this task can evolve into emerging technologies.

So, let’s begin!

What is Panoptic Segmentation?

To really understand what panoptic segmentation is, there are a few ingredients that we need first.

The simplest way to explain panoptic segmentation is to say it’s a combination of instance and semantic segmentation, but if those two concepts mean absolutely nothing to you, as they did to me when I first saw them, then let me guide you through those two tasks first. But first, I’ll need to start off with….

Object Detection

Before we get into instance segmentation, it’s important that we briefly cover what object detection is. Let me give you an example with a cute picture of some kitties:

So if we were to run this picture through an object detection machine learning algorithm, we would want our algorithm to detect all three cats, by correctly classifying them and then correctly identifying where these cats are located.

Our ground truth (where we have marked the cats as being located) and the ideal prediction for our algorithm would look something like this:

The task for object detection would then be to accurately predict these cats and the corresponding bounding boxes. In the prediction process, each of these predictions would be accompanied by a confidence score, which is a probability score for how likely our algorithm believes each object is a cat.

This confidence score comes from a probability distribution over all classes, so if you take any one of the predicted instances as an example, we could have a distribution that looks something like this:

classes     = ["cat", "dog", "bicycle", "nothing"]
prediction  = [ 0.8 ,  0.1 ,   0.05,      0.05   ]
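To make that a little more concrete, here is a minimal sketch of how such a distribution could be turned into a single prediction: we pick the class with the highest probability and report that probability as the confidence score. The helper name `top_prediction` is purely illustrative and not taken from any particular library.

```python
classes = ["cat", "dog", "bicycle", "nothing"]
prediction = [0.8, 0.1, 0.05, 0.05]


def top_prediction(classes, probabilities):
    """Return the most likely class and its probability (the confidence score)."""
    # Index of the largest probability in the distribution.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best], probabilities[best]


label, confidence = top_prediction(classes, prediction)
print(label, confidence)  # cat 0.8
```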

The algorithm’s output would also need to include coordinates in order to produce the bounding box around our object. For each of our predictions, that output could look something like this:

legend     = [ "X-Position", "Y-Position", "Length", "Height" ]
prediction = [     130,           285,       100,     185    ]

The X and Y positions above represent the midpoint of the object. The bounding box is then produced from the length and height, which are anchored at that midpoint.
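If you want to actually draw that box, you need its corners. Here is a small sketch of the conversion, assuming the usual image convention where Y grows downward (so "top" has the smaller Y value). The function name `center_to_corners` is my own and only there for illustration.

```python
def center_to_corners(x_center, y_center, length, height):
    """Convert a midpoint-anchored box to (left, top, right, bottom) corners."""
    # Assumption: image coordinates, with y increasing downward.
    left = x_center - length / 2
    top = y_center - height / 2
    right = x_center + length / 2
    bottom = y_center + height / 2
    return left, top, right, bottom


prediction = [130, 285, 100, 185]
print(center_to_corners(*prediction))  # (80.0, 192.5, 180.0, 377.5)
```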

While the probability outputs and the bounding box output are combined for our final output prediction, it’s important to know that they are performing two separate tasks. The probability output is performing classification, while the bounding box is performing regression. To understand the difference between the two, you can check out this article.

Ok! So to summarize:

In an object detection task, we are trying to get an algorithm to predict the class and bounding box location of each instance in our image.

So now that we have that understood, it’s only a small step to instance segmentation. They say a picture is worth a thousand words, so let me show you what it is:

Instance Segmentation

Instance segmentation takes object detection a step further. Rather than simply asking our algorithm to draw a box around our instances, we now want it to identify which pixels belong to that instance too.

So building on top of our object detection task, our instance segmentation algorithm must now predict 3 things:

1. A class label

classes    = ["cat", "dog", "bicycle", "nothing"]
prediction = [ 0.8 ,  0.1 ,   0.05,      0.05   ]

2. A Bounding Box

legend     = [ "X-Position", "Y-Position", "Length", "Height" ]
prediction = [     130,           285,       100,     185    ]

3. A Binary Mask

Each instance we predict comes with its own binary mask: a 2D array with the same pixel width and height as the image, holding one value per pixel.

Each pixel in our mask is labeled either a 1 or 0 (true or false) for whether or not it belongs to the predicted instance.
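Here is a toy sketch of what such a mask could look like for a made-up 4x6 image. The values are invented for illustration, and in practice you would likely store the mask in a NumPy array rather than nested lists.

```python
# Hypothetical tiny image size; the mask must match it exactly.
image_height, image_width = 4, 6

# 1 = pixel belongs to the predicted instance, 0 = it does not.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]

# The mask has one entry per image pixel.
assert len(mask) == image_height
assert all(len(row) == image_width for row in mask)

# (row, column) coordinates of the pixels that belong to the instance.
instance_pixels = [
    (r, c) for r, row in enumerate(mask) for c, value in enumerate(row) if value == 1
]
print(len(instance_pixels))  # 7
```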

Semantic Segmentation

Semantic segmentation is also in the business of assigning pixels to their various classes, but unlike instance segmentation it does not care about the individual instances inside the image, only what class they belong to.

A semantic segmentation of our kitty picture would look something like this:

You can see that we have 3 regions where our pixels have been colored, which in our case would correspond to 3 classes:

["background", "pavement", "cat"]

Unlike instance segmentation, we are concerned with all regions, not just the regions we deem to have instances.

Items in an image that could possess more than 1 countable instance (bicycle, dog, car, person) are called ‘things’ in most academic articles, whereas regions that are harder to quantify (pavement, ground, dirt, wall) are called ‘stuff’.

Naturally there’s a bit of subjectivity to what you can consider to be stuff vs things.

For example, one may classify ‘street-pavement’ as being a stuff region, however there may be rational ways of producing instances for these regions, such as sectioning their instances based on the number of visible streets in the image.

Now going back to semantic segmentation and our kitty picture, we can see that for the semantic segmentation task our algorithm isn’t concerned with identifying instances. Its only focus is labeling all the stuff that it sees inside the image.

That means all cats are treated equally as one stuff region of “cat”. There are no explicit confidence scores for each instance of the cats, as we saw with object detection and instance segmentation.
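As a rough sketch, a semantic segmentation output can be pictured as a single 2D map where every pixel stores a class index. The 4x6 grid below is made up for illustration. Notice that two separate blobs of "cat" pixels end up in the same class, with nothing to tell them apart.

```python
from collections import Counter

classes = ["background", "pavement", "cat"]

# A tiny, invented semantic map: each entry is an index into `classes`.
semantic_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
]

# Count how many pixels were assigned to each class.
pixel_counts = Counter(classes[i] for row in semantic_map for i in row)
print(pixel_counts["cat"])  # 8
```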

Great! So now we are at a position where we have:

Instance Segmentation

A task that requires the identification and segmentation of individual instances in an image.

and

Semantic Segmentation

A task that requires segmenting all the pixels in the image based on their class label.

So the next evolutionary step would appear to be:

What if we want to identify the class label for all pixels, as well as identify all the instances in our image?

Panoptic Segmentation

As mentioned at the very start of this article, panoptic segmentation is a combination of instance and semantic segmentation.

In the panoptic segmentation task we need to classify all the pixels in the image as belonging to a class label, yet also identify what instance of that class they belong to.

The format for doing so is to have each pixel in our image carry two values:

["L", "Z"] => ["Label", "Instance Number"]

As mentioned earlier there is a distinction created between stuff and things for segmentation tasks. This means that pixels that are contained in uncountable “stuff regions” (such as background or pavement), will have a Z index of None or an integer reflecting that categorization.

However, in our example above, you can see that all the cats have their own instance IDs, which enable us to tell them apart.

Whereas a common PNG output would produce three channels for a color image, this labeling and prediction format can be expressed as a two-channel output, where channel 1 holds each pixel’s label and channel 2 holds each pixel’s instance.
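To picture the two channels side by side, here is a toy sketch on an invented 4x6 grid. It assumes that "cat" is the only thing class and that stuff pixels get an instance value of None. The variable names are mine, and this is an illustration of the idea rather than any official file format.

```python
classes = ["background", "pavement", "cat"]
thing_classes = {"cat"}  # countable classes; everything else is "stuff"

# Channel 1: class label index per pixel.
label_channel = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
]

# Channel 2: instance number per pixel; None marks stuff pixels.
instance_channel = [
    [None, None, None, None, None, None],
    [None, 1, 1, None, 2, 2],
    [None, 1, 1, None, 2, 2],
    [None, None, None, None, None, None],
]

# Pair the channels up so every pixel reads as (label, instance number).
panoptic = [
    [(classes[label], instance) for label, instance in zip(label_row, instance_row)]
    for label_row, instance_row in zip(label_channel, instance_channel)
]

# Thing pixels carry an instance number, stuff pixels do not.
for row in panoptic:
    for label, instance in row:
        assert (label in thing_classes) == (instance is not None)

print(panoptic[1][1])  # ('cat', 1)
print(panoptic[1][4])  # ('cat', 2)
print(panoptic[3][0])  # ('pavement', None)
```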

Summary and Final Thoughts.

In this article we have discussed a few of the latest emerging computer vision tasks, ending with Panoptic Segmentation, which can be described as a method of capturing the identity and instance of all pixels in an image.

So where could Panoptic Segmentation be used? That’s really up to your imagination, but some examples are:

Medical imagery, where instances as well as amorphous regions help shape the context.
Self-driving cars and autonomous vehicles, as we need to know what objects are around the vehicle, but also what surface the vehicle is driving on.
Digital image processing. Want a better smartphone camera? Software with pixel-wise comprehension of the people in the image, as well as of what comprises the background, will give you that.

Up Next

In the next article I will be discussing how we can assess our Panoptic Segmentation models, introducing the Panoptic Quality metric.

In the meantime, if you want to skip ahead and get hands-on with Panoptic Segmentation, I would recommend making a late submission to the Panoptic Segmentation Challenge. If you need an example to reference, feel free to check out my submission on GitHub here.

Stay Tuned!