MaskFormer: Per-Pixel Classification is Not All You Need
for Semantic Segmentation

Hanna Mergui

1. MaskFormer: A new era in image segmentation

Object detection and instance segmentation are fundamental tasks in computer vision that play a pivotal role in a myriad of applications, ranging from autonomous driving to medical imaging. Traditional methods often leverage bounding box techniques for object localization followed by per-pixel classification to assign classes to these localized instances. However, these methods often falter when handling overlapping objects of the same class, or in scenarios where the number of objects per image varies.

Classical approaches such as Faster R-CNN, Mask R-CNN, and others, although highly effective, have struggled with these challenges due to their inherently fixed-size output space. They typically predict a fixed number of bounding boxes and classes per image, which may not match the actual number of instances in an image, especially when it varies across images. Furthermore, they may not adequately handle situations where objects of the same class overlap, leading to classification inconsistencies.

Mask R-CNN Overlapping Bounding Boxes Problem | by Buse Yaren Tekin | Towards AI

In this article, we are going to talk about MaskFormer, a method released by Facebook AI Research in 2021 for image segmentation that transcends these limitations.

Let’s jump right in! Hmm… but first, I owe you some explanations so you can follow along:

2. The difference between per-pixel classification and mask classification

Per-pixel classification:

This method refers to assigning a class label to every individual pixel in an image. In this case, every pixel is treated independently, and the model predicts what class that pixel belongs to, based on the input features at that pixel’s location. Per-pixel classification can be highly accurate for well-defined objects with clear boundaries. However, it can struggle in situations where the objects of interest have complex shapes, overlap with each other, or are situated within a cluttered background. This can be explained by the tendency of these models to view objects in terms of their spatial boundaries first.
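To make this concrete, here is a minimal sketch of what per-pixel classification looks like in PyTorch. The logits tensor stands in for the output of a hypothetical segmentation model; the point is that every pixel is labeled independently by an argmax over class scores:

```python
import torch

# Hypothetical per-pixel segmentation model output: for an input image of
# shape [batch, 3, H, W], the model returns one logit per class per pixel.
batch, num_classes, H, W = 1, 21, 256, 256
logits = torch.randn(batch, num_classes, H, W)  # stand-in for model(image)

# Each pixel is classified independently: argmax over the class dimension.
pred = logits.argmax(dim=1)  # [batch, H, W], one class id per pixel
print(pred.shape)  # torch.Size([1, 256, 256])
```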

Consider an image depicting multiple overlapping cars. Traditional per-pixel models might struggle with such a scenario, as you can see below. Where the cars overlap, these models might create a single, merged mask for the entire set of overlapping cars. They could misinterpret the scene as containing one large, oddly shaped car instead of multiple distinct cars.

Per-pixel classification often produces a single mask for several similar objects.

Examples of models using per-pixel classification/segmentation: FCNs, ASPP, OCNet, SETR, Segmenter, ViT…

Mask classification:

Mask classification (used in MaskFormer), on the other hand, takes a different approach. Instead of classifying each pixel independently, a mask classification model predicts a class-specific mask for each object instance in an image. This mask is essentially a binary image that indicates which pixels belong to the object instance and which don’t. In other words, a single mask represents the entire object, not just individual pixels.

In the earlier example, mask classification lets the model recognize that there are multiple instances of the “car” class in the image and assign each one a unique mask, even where they overlap. Each car is treated as a distinct instance and given its own mask, preserving its identity separately from the other cars.
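By contrast, a mask classification model outputs a fixed-size set of N (mask, class) pairs, no matter how many objects the image contains. A minimal sketch of that output format, with random tensors standing in for real predictions (the extra “no object” class slot follows the DETR/MaskFormer convention):

```python
import torch

N, num_classes, H, W = 100, 21, 256, 256

# N binary masks (after sigmoid) and N class distributions
# (num_classes + 1 for a "no object" slot, as in DETR/MaskFormer).
masks = torch.rand(N, H, W)                     # each value in [0, 1]
class_logits = torch.randn(N, num_classes + 1)  # one distribution per mask

# Two overlapping cars can now occupy two different masks with the same
# class label, instead of being fused into one "car" region.
probs = class_logits.softmax(dim=-1)
keep = probs[:, :-1].max(dim=-1).values > 0.5   # drop "no object" predictions
print(masks[keep].shape, probs[keep].shape)
```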

Examples of models using mask classification/segmentation: Mask R-CNN, DETR, MaX-DeepLab…

Now that you understand the difference between per-pixel and mask classification, we can jump to another subject: the DETR model. (I promise you will soon see the link with MaskFormer. Don’t panic!)

3. DETR model

At the core of DETR is a powerful mechanism known as the Transformer, which allows the model to overcome some of the key limitations of traditional per-pixel and mask classification methods.

Consider our busy street scenario with overlapping cars. In a traditional mask classification approach, if two cars overlap, it might still be challenging to separate them as distinct entities, even if this is better than the per-pixel method. DETR offers an elegant solution to such problems. Instead of generating masks for each car, DETR predicts a fixed set of bounding boxes and associated class probabilities. This “set prediction” approach allows DETR to handle complex scenes involving overlapping objects with remarkable efficiency.
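Under the hood, this set prediction relies on bipartite matching: during training, each of the N predictions is matched one-to-one with a ground-truth object via the Hungarian algorithm, and unmatched predictions are assigned to “no object”. Here is a simplified sketch of the matching step using SciPy; the real DETR cost also includes bounding box terms, which are omitted here for brevity:

```python
import torch
from scipy.optimize import linear_sum_assignment

N, num_classes = 100, 91
pred_probs = torch.randn(N, num_classes + 1).softmax(-1)  # N set predictions
gt_labels = torch.tensor([3, 3, 17])  # e.g. two cars and a dog

# Cost of assigning prediction i to ground-truth j: the higher the predicted
# probability of the true class, the lower the cost. (DETR's real cost also
# includes bounding-box terms; MaskFormer swaps those for mask terms.)
cost = -pred_probs[:, gt_labels]  # [N, num_gt]

pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx, gt_idx)))  # one prediction per ground-truth object
```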

Pretty cool, but where does MaskFormer fit into this picture?

While DETR revolutionizes bounding box prediction, it doesn’t directly provide segmentation masks, a detail crucial in many applications. Here, MaskFormer steps in, extending the robust set prediction mechanism of DETR to create class-specific masks for each detected object. MaskFormer thus builds upon the strengths of DETR and augments it with the ability to generate high-quality segmentation masks. In our car scenario, MaskFormer not only recognizes each car as a separate entity (thanks to DETR’s set prediction mechanism) but also generates a precise mask for each car, accurately capturing their boundaries, even in cases of overlap.

This synergy between DETR and MaskFormer opens a world of possibilities for more accurate and efficient instance segmentation, transcending the limitations of traditional per-pixel and mask classification methods. In the next sections, we will delve deeper into the working of MaskFormer and understand its architecture and advantages.

4. How does MaskFormer work exactly?

Here is the architecture of MaskFormer:

MaskFormer architecture (source: https://arxiv.org/pdf/2107.06278.pdf)

Let’s walk through this scheme together:

  1. Feature Extraction via the Backbone: The journey of MaskFormer begins with a backbone network, which is responsible for extracting crucial image features from the input. This backbone could be any popular CNN (Convolutional Neural Network) architecture, like ResNet, that processes the image and extracts a set of features, denoted by F.
  2. Per-Pixel Embedding Generation: These features F are then passed to the Pixel Decoder, which gradually upsamples the image features to generate what we call “per-pixel embeddings” (E pixel). These embeddings capture the local and global context of every pixel in the image.
  3. Per-Segment Embeddings Creation: In parallel, a Transformer Decoder attends to the image features F and generates a set of ’N’ per-segment embeddings, denoted by Q, thanks to a mechanism called “attention”, which assigns different importance weights to different parts of the image. These embeddings essentially represent the potential objects (or segments) in the image that we want to classify and localize.

The term “segment” here refers to potential instances of objects in the image that the model is trying to identify and segment.

Usually, the encoder processes the input data and the decoder uses this processed data to generate the output. The inputs to the encoder and decoder are generally sequences, like sentences in a machine translation task. However, in the context of DETR and MaskFormer, the roles of the encoder and decoder are somewhat different. The ‘encoder’ in this case is the backbone (a ResNet-50 for MaskFormer), which processes the input image and generates a set of feature maps. These feature maps serve the same purpose as the encoder output in a traditional Transformer, providing a rich, high-level representation of the input data.
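Putting steps 1–3 together, here is a heavily simplified sketch of the data flow. Every module below is a toy stand-in (a single conv for the backbone, one transposed conv for the pixel decoder); a real implementation is of course much more involved:

```python
import torch
import torch.nn as nn

C, N = 256, 100  # embedding dim, number of queries/segments

backbone = nn.Conv2d(3, C, kernel_size=16, stride=16)   # stand-in for a ResNet
pixel_decoder = nn.ConvTranspose2d(C, C, 4, stride=4)   # upsamples features
layer = nn.TransformerDecoderLayer(d_model=C, nhead=8, batch_first=True)
transformer = nn.TransformerDecoder(layer, num_layers=2)
queries = nn.Parameter(torch.randn(1, N, C))            # N learned queries

image = torch.randn(1, 3, 1024, 1024)
F_feat = backbone(image)                 # 1. image features F   [1, C, 64, 64]
E_pixel = pixel_decoder(F_feat)          # 2. per-pixel embeddings [1, C, 256, 256]
mem = F_feat.flatten(2).transpose(1, 2)  # [1, 64*64, C] tokens for attention
Q = transformer(queries, mem)            # 3. N per-segment embeddings [1, N, C]
print(E_pixel.shape, Q.shape)
```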

4. Class and Mask Prediction: These embeddings Q are then used to predict N class labels and N corresponding mask embeddings (E mask). This is where MaskFormer really shines. Unlike traditional segmentation models that predict class labels for each pixel, MaskFormer predicts class labels for each potential object segment, along with a corresponding mask embedding.

5. Binary Mask Prediction: After obtaining the mask embeddings, MaskFormer produces N binary masks through a dot product between the pixel embeddings (E pixel) and mask embeddings (E mask), followed by a sigmoid activation. This process results in potentially overlapping binary masks for each object instance.
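In code, this step is just a dot product between each of the N mask embeddings and every pixel embedding, followed by a sigmoid. A minimal sketch with random stand-in tensors:

```python
import torch

N, C, H, W = 100, 256, 256, 256
E_mask = torch.randn(N, C)      # mask embeddings, one per segment
E_pixel = torch.randn(C, H, W)  # per-pixel embeddings from the pixel decoder

# Dot product of every mask embedding with every pixel embedding,
# then a sigmoid: one soft binary mask per segment.
mask_logits = torch.einsum("nc,chw->nhw", E_mask, E_pixel)
binary_masks = mask_logits.sigmoid()  # [N, H, W], values in [0, 1]
print(binary_masks.shape)
```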

6. Final Prediction (for Semantic Segmentation): Lastly, for tasks like semantic segmentation, MaskFormer can compute the final prediction by combining the N binary masks with their corresponding class predictions. This combination is achieved via a straightforward matrix multiplication, giving us the final segmented and classified image.
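That combination can be written as a single matrix multiplication: weight each binary mask by its class probabilities (dropping the “no object” slot), sum over the N segments, and take the per-pixel argmax. A sketch:

```python
import torch

N, K, H, W = 100, 21, 256, 256
class_probs = torch.randn(N, K + 1).softmax(-1)[:, :-1]  # drop "no object"
binary_masks = torch.rand(N, H, W)                       # from the dot-product step

# Per-pixel class scores: sum over segments of (class prob x mask prob).
semseg = torch.einsum("nk,nhw->khw", class_probs, binary_masks)
pred = semseg.argmax(dim=0)  # [H, W] semantic segmentation map
print(pred.shape)
```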

5. MaskFormer for semantic and instance segmentation

Let’s make a quick reminder:

The difference between semantic and instance segmentation is an important distinction in the field of computer vision.

  • Semantic segmentation is concerned with labeling each pixel of an image with a class label (such as ‘car’, ‘dog’, ‘person’, etc.). However, it does not distinguish between different instances of the same class. For example, if there are two humans in an image, semantic segmentation would label all the pixels belonging to both humans as ‘human’, but it wouldn’t differentiate between human 1 and human 2 (you can see an illustration below).
  • On the other hand, instance segmentation not only classifies each pixel but also separates different instances of the same class. So, in the humans example, instance segmentation would label all the pixels belonging to human 1 as ‘human 1’ and all the pixels belonging to human 2 as ‘human 2’.
Difference between three computer vision tasks

Most traditional computer vision models treat semantic and instance segmentation as separate problems and would require different models, loss functions, and training procedures for each.

MaskFormer, however, is designed to handle both tasks in a unified manner, thanks to its mask classification approach, which works by predicting a class label and a binary mask for each object instance in the image. This approach inherently combines aspects of both semantic and instance segmentation.

For the loss, MaskFormer uses a unified loss function designed for this mask classification problem. It evaluates the quality of the predicted masks in a way that is consistent with both semantic and instance segmentation tasks.
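To give a rough idea of its shape: after Hungarian matching (as in DETR), each matched prediction incurs a cross-entropy loss on its class and a binary mask loss on its mask, while unmatched predictions are pushed toward the “no object” class. The sketch below uses plain binary cross-entropy for the mask term to keep things short; the actual paper combines focal and dice losses:

```python
import torch
import torch.nn.functional as F

def maskformer_style_loss(class_logits, mask_logits, gt_labels, gt_masks, match):
    """class_logits: [N, K+1], mask_logits: [N, H, W],
    gt_labels: [M], gt_masks: [M, H, W], match: list of (pred_i, gt_j) pairs."""
    N, no_object = class_logits.shape[0], class_logits.shape[1] - 1
    # Every prediction defaults to "no object"; matched ones get the true class.
    targets = torch.full((N,), no_object, dtype=torch.long)
    mask_loss = 0.0
    for pred_i, gt_j in match:
        targets[pred_i] = gt_labels[gt_j]
        # The paper uses focal + dice here; plain BCE keeps the sketch short.
        mask_loss = mask_loss + F.binary_cross_entropy_with_logits(
            mask_logits[pred_i], gt_masks[gt_j].float())
    class_loss = F.cross_entropy(class_logits, targets)
    return class_loss + mask_loss
```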

Therefore, the same MaskFormer model, trained with the same loss function and training procedure, can be applied to both semantic and instance segmentation tasks without any modifications.

Conclusion

In summary, MaskFormer presents a new approach to image segmentation, integrating the strengths of the DETR model and the Transformer architecture. It uses mask-based prediction, enhancing the handling of complex object interactions within images.

Its ability to tackle both semantic and instance segmentation tasks using the same model, loss, and training procedure demonstrates the effectiveness and flexibility of mask classification. The use of a Transformer decoder allows for variable object predictions, tackling challenges with overlapping and nested instances.

MaskFormer’s unified approach is a substantial step forward in image segmentation, opening new possibilities for advancements in computer vision. It sets the stage for further research, aiming to enhance our capability to comprehend and interpret the visual world.

References

MaskFormer paper: Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. “Per-Pixel Classification is Not All You Need for Semantic Segmentation.” NeurIPS 2021. https://arxiv.org/pdf/2107.06278.pdf
