Label-Efficient Semantic Segmentation

Merantix Momentum Insights · 16 min read · Apr 17, 2023

Author: Alexander Koenig

Introduction

Semantic segmentation is a core task in computer vision with vast applications ranging from autonomous driving (Zheng et al. 2021) and environmental monitoring (Alzu’bi et al. 2022) to medical imaging (Saha et al. 2018). As shown in Figure 1, semantic segmentation assigns pixel-wise class labels to an image, while instance segmentation is a variant in which separate instances of objects are additionally distinguished. In contrast, image classification predicts a single global image label, and object detection simultaneously identifies the class and the coarse location of objects in an image.

Figure 1: Taxonomy of computer vision tasks (Wu et al. 2020).

Label-Scarcity in Semantic Segmentation

Current state-of-the-art deep learning approaches are trained in a supervised fashion, meaning they require large amounts of labeled training data, i.e., explicit pixel-wise target labels, also called segmentation maps. However, annotations for semantic segmentation are especially cumbersome and costly to obtain since every pixel needs to be labeled by a human. In contrast, global image labels and bounding boxes, which are sufficient for image classification and object detection, require less labeling effort. Furthermore, open-source datasets with dense segmentation labels for industrial use cases are scarce.

Label-Efficient Segmentation

The research community is actively developing label-efficient segmentation algorithms to alleviate this label scarcity. Figure 2 shows a taxonomy of the different approaches, which we will discuss in turn. As the diagram indicates, label-efficient semantic segmentation can broadly be divided into unsupervised and weakly supervised methods, and weakly supervised methods can further be split into coarsely supervised and semi-supervised ones. The remainder of this article explores examples of how coarsely supervised, semi-supervised, and unsupervised semantic segmentation work.

Figure 2: A taxonomy of label-efficient semantic segmentation. Graphic inspired by Shen et al. 2023.

Coarse Supervision

Coarsely supervised approaches aim at learning to predict dense segmentation maps from only image-level or box-level supervision (i.e., using only global classification labels or object detection annotations). Let’s investigate how to train segmentation models with image-level annotations only. These approaches typically employ class activation maps (CAM) (Zhou et al. 2016) to bootstrap segmentation models.

Supervision through Class Activation Maps

Figure 3 shows how Zhou et al. extract class activation maps. First, a convolutional neural network (CNN) is trained on a supervised image classification task. A core component of the architecture is a global average pooling (GAP) layer after the last convolutional layer, which considers all activations in a channel (as opposed to global max pooling, which only forwards the largest activation). The GAP layer accumulates the channel-wise activations of the last convolutional layer into a single vector, and a linear layer is trained to map this vector to logits and class probabilities. After training, the class activation map for a class such as “Australian terrier” is extracted by computing a weighted sum of the last convolutional layer’s channel activations, using the linear classifier’s weights for that class, and upsampling the result to the input image size, as indicated in Figure 3.

Figure 3: Extraction of class activation maps (CAM) enabled by global average pooling (GAP) and a linear classifier (Zhou et al. 2016).
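To make this concrete, here is a minimal sketch of CAM extraction in PyTorch. The ResNet-18 backbone is our illustrative choice (Zhou et al. used other CNNs); the hook location, the dummy input, and the normalization are likewise our assumptions, not the authors’ code.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# ResNet-18 conveniently ends in GAP + linear layer, as CAM requires.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

features = {}
def hook(module, inputs, output):
    features["conv"] = output  # last conv block activations, shape (1, C, H, W)

model.layer4.register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed input image
with torch.no_grad():
    logits = model(image)
class_idx = logits.argmax(dim=1).item()  # e.g., the "Australian terrier" index

# CAM = channel activations weighted by the classifier weights of the class,
# then upsampled to the input resolution and normalized for visualization.
w = model.fc.weight[class_idx]                             # (C,)
cam = (w[:, None, None] * features["conv"][0]).sum(dim=0)  # (H, W)
cam = F.relu(cam)
cam = F.interpolate(cam[None, None], size=image.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```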

Looking closely at the class activation map of the “Australian terrier” in the bottom right of Figure 3, we notice that it already strongly resembles a segmentation map like the one in Figure 1d. Zhou et al. show in their paper that generating bounding box predictions from the CAM alone yields competitive scores on object detection benchmarks. Remember that this uses only image-level supervision: object detection is bootstrapped without any additional labels.

How can CAM be used for weakly supervised semantic segmentation? A common practice is to bootstrap CAM for generating pseudo labels, on which a semantic segmentation model is later trained in a supervised fashion.
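Concretely, a common (simplified) recipe thresholds the normalized CAMs into foreground, background, and ignored regions. The two thresholds and the ignore-index value below are illustrative choices rather than values from a specific paper.

```python
import torch

def cam_to_pseudo_label(cams: torch.Tensor, bg_thresh: float = 0.2,
                        fg_thresh: float = 0.5,
                        ignore_index: int = 255) -> torch.Tensor:
    """Turn per-class CAMs of shape (K, H, W), normalized to [0, 1], into a
    pseudo segmentation map. Class ids are shifted by one so that 0 is
    background; pixels between the thresholds are marked as ignored."""
    scores, labels = cams.max(dim=0)           # best class + its score per pixel
    pseudo = labels + 1                        # reserve 0 for background
    pseudo[scores < fg_thresh] = ignore_index  # uncertain band: ignored in loss
    pseudo[scores < bg_thresh] = 0             # very weak activation: background
    return pseudo
```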

Class Activation Map Quality Matters

Figure 4 shows a classification model’s CAM for the target object “horse” and for the image background. All weakly supervised semantic segmentation approaches based on CAM face one core problem: only a small part of the object receives high class activations. For example, in Figure 4b, only the horse’s head is confidently highlighted, while the rest of its body has low class activations. However, suitable pseudo labels for semantic segmentation should include all pixels corresponding to the entity “horse”, i.e., the horse’s body should also be highlighted.

Figure 4: (a) Input image, (b) extracted CAM of the object (brighter means more confident), and (c) CAM of the background (darker means more confident) (Ahn et al. 2018).

Improving Class Activation Maps with AffinityNet

Hence, much research goes into adapting CAM to form pseudo labels that recover the entire object area more accurately. A seminal approach here is AffinityNet (Ahn et al. 2018). The authors frame the problem as a region-growing task aided by a proxy network, AffinityNet, which predicts semantic similarities between a given location in the image’s feature map and its immediate neighborhood. Figure 5 shows how AffinityNet’s training labels are constructed. First, source locations are sampled, and the relation between each location and its neighboring points is categorized as “positive”, “negative”, or “don’t care”: the label is positive if the two coordinates stem from the same class, negative if they stem from different classes, and ignored if at least one of the coordinates lies in a neutral, low-confidence area. These labels can then be used directly in a cross-entropy-based loss to train AffinityNet, as sketched below.

Figure 5: Generation of semantic affinity labels (Ahn et al. 2018). (a) shows confident predictions for “person” in peach, “plant” in green, and “background” in black. Neutral areas are color-coded white. (b) sampling coordinate pairs to build semantic affinity labels.
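The label rule can be written down in a few lines. This is our simplified sketch of the idea, not Ahn et al.’s released code; the neutral marker value is an assumption.

```python
import torch

def affinity_label(cam_labels: torch.Tensor, i: tuple, j: tuple,
                   neutral: int = 255) -> int:
    """Semantic affinity label for one coordinate pair. cam_labels: (H, W)
    map of confident CAM classes, where `neutral` marks low-confidence areas.
    Returns 1 (positive), 0 (negative), or -1 (don't care)."""
    a, b = cam_labels[i].item(), cam_labels[j].item()
    if a == neutral or b == neutral:
        return -1                  # "don't care": ignored by the loss
    return 1 if a == b else 0      # same class -> positive, else negative
```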

AffinityNet’s predictions can be interpreted as a transition probability matrix on a local graph and are used to propagate class activations to nearby areas of the same semantic entity. Concretely, the predicted semantic affinities drive a random walk that stretches the CAM-based seed area out to the object’s semantic boundaries. Figure 6 summarizes Ahn et al.’s three-step approach for weakly supervised semantic segmentation: (1) train AffinityNet to predict semantic affinities in an image’s feature map, (2) use AffinityNet to guide a random walk that extends the class activation maps to cover the entire object, (3) use the resulting pseudo ground truths as labels to train a semantic segmentation model. Note how, in the middle of Figure 6, the random walk guided by AffinityNet’s learned affinity matrix has extended and revised the CAM to better match the horse’s body. These improved pseudo segmentation labels boost the performance of the downstream segmentation model trained in step (3).

Figure 6: Ahn et al.’s three-step pipeline for weakly supervised semantic segmentation.
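The following sketch shows the propagation idea in isolation: interpret the sharpened, row-normalized affinity matrix as transition probabilities and iterate a random walk over the flattened CAM. The hyperparameters beta and n_iters are illustrative; Ahn et al.’s full method includes further details such as background handling.

```python
import torch

def propagate_cam(cam: torch.Tensor, affinity: torch.Tensor,
                  n_iters: int = 8, beta: float = 8.0) -> torch.Tensor:
    """Random-walk propagation of class activations with a predicted affinity
    matrix. cam: (K, N) flattened per-class scores over N feature-map
    locations; affinity: (N, N) pairwise semantic affinities in [0, 1]."""
    trans = affinity ** beta                         # sharpen the affinities
    trans = trans / trans.sum(dim=1, keepdim=True)   # row-normalize: transition probs
    out = cam
    for _ in range(n_iters):
        out = out @ trans.T                          # one random-walk step per class
    return out
```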

Other Approaches Using Class Activation Maps

There are several extensions of and alternatives to Ahn et al.’s work. More recently, Ahn et al. presented IRNet (Ahn et al. 2019), which extends the approach to instance segmentation while relying on similar ideas of growing class activation regions. The SEAM algorithm (Wang et al. 2020) enforces consistency of the CAM under affine transformations and learns a pixel correlation module, also improving the match between the revised CAM and the ground-truth object contour. Finally, class activation maps for weakly supervised semantic segmentation have recently been used in conjunction with the Vision Transformer (ViT) architecture (Dosovitskiy et al. 2021), whereby CAM and attention masks are combined to obtain higher-quality pseudo masks (Huang et al. 2022).

Semi-Supervision

Semi-supervised semantic image segmentation aims at learning segmentation models from a combination of labeled and unlabeled images. Semi-supervision is particularly interesting in industrial applications since expensive labeling is only necessary for a subset of the collected data, while the remaining unlabeled images can still be incorporated into the training process.

The Self-Training Paradigm

Figure 7 shows a high-level overview of the default approach for semi-supervised semantic segmentation, known as self-training. The self-training framework uses the small subset of labeled images to train a teacher model in a supervised fashion. Subsequently, the teacher model generates pseudo masks, i.e., noisy segmentation targets, from the unlabeled data. These noisy pseudo segmentation maps are then used to train another segmentation network, the student, in a supervised fashion.

Figure 7: Self-training: a high-level illustration of the default approach for semi-supervised semantic segmentation (Shen et al. 2023).
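As a sketch of the pseudo-labeling step, the function below lets a trained teacher produce hard pseudo masks for an unlabeled dataset; the model and data loader are placeholders for your own components, not code from any cited paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

@torch.no_grad()
def pseudo_label(teacher: nn.Module, unlabeled_loader: DataLoader):
    """Step 2 of self-training: the trained teacher predicts hard pseudo
    masks for every unlabeled image. The returned (image, mask) pairs can be
    mixed into the student's supervised training set (step 3)."""
    teacher.eval()
    pairs = []
    for images in unlabeled_loader:
        pseudo_masks = teacher(images).argmax(dim=1)  # (B, H, W) class ids
        pairs.append((images, pseudo_masks))
    return pairs
```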

Improving Pseudo-Segmentation Masks

Intuitively, the performance of the student model is greatly influenced by the quality of the pseudo masks. Hence, a lot of research focuses on generating improved pseudo segmentation labels. Some methods assess and threshold the pseudo masks’ reliability: for example, identifying pseudo masks with low confidence and excluding them when training the student model can improve performance (Hung et al. 2018, He et al. 2021). A minimal version of such confidence filtering is sketched below.
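This sketch assumes logits of shape (B, K, H, W); the threshold tau and the ignore-index convention are our illustrative choices, not values from Hung et al. or He et al.

```python
import torch

def filter_confident(logits: torch.Tensor, tau: float = 0.9,
                     ignore_index: int = 255) -> torch.Tensor:
    """Keep only confident pseudo labels; low-confidence pixels are marked
    so a cross-entropy loss with ignore_index skips them."""
    probs = logits.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)       # per-pixel confidence and class, (B, H, W)
    pseudo[conf < tau] = ignore_index     # low confidence -> ignored during training
    return pseudo
```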

Other algorithms introduce self-supervision through consistency regularization to improve pseudo-mask quality. Here, perturbations to which the trained network should be invariant are applied to the architecture or the input images. For example, the CPS algorithm (Chen et al. 2021) trains the student and teacher as Siamese networks: both share the same architecture, but their weights are initialized differently. The labeled images are used in a standard cross-entropy loss. The unlabeled data is leveraged as a source of self-supervision by letting each network’s predictions serve as pseudo labels for the other, incentivizing the networks to predict consistent segmentation masks regardless of the initialization. Other approaches enforce consistency via strong data augmentations on the unlabeled images and combine this learning signal with standard cross-entropy on the labeled images (French et al. 2019). Check out our previous blog posts (part 1, part 2) for more information on self-supervised learning and how data augmentations can help learn image representations.
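As a hedged sketch of the cross-supervision term on unlabeled images (not Chen et al.’s reference implementation; in practice a weighting factor balances it against the supervised loss):

```python
import torch
import torch.nn.functional as F

def cps_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Cross-pseudo supervision: each network's hard prediction supervises
    the other. logits_a, logits_b: (B, K, H, W) outputs of the two
    differently initialized Siamese networks on the same images."""
    pseudo_a = logits_a.argmax(dim=1).detach()  # hard pseudo labels, no gradient
    pseudo_b = logits_b.argmax(dim=1).detach()
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)
```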

No Supervision

Semantic segmentation has also been attempted without using human-annotated labels. A core idea behind unsupervised semantic segmentation is clustering dense visual descriptors of an image into perceptual groups. Note that these groups ideally carry semantic meaning and relationships, but since no labels are available, no human-interpretable classes like “cat” or “dog” can be directly predicted. For evaluation purposes, it is therefore common to use a small labeled dataset to match the discovered semantic groups to human-interpretable classes, e.g., as in (Cho et al. 2021) or (Hamilton et al. 2022).

Graph Cuts and Spectral Methods

How can semantic segmentation be approached without labels? A computer vision classic, the Normalized Cuts algorithm (Shi et al. 2000), frames the task as a graph partitioning problem on the input image. First, the image is represented as a graph by defining a vertex at each pixel and edges between neighboring pixels, where each edge weight measures the similarity between the corresponding vertices, i.e., the likelihood that the two pixels stem from the same object. A so-called “Min-Cut” then partitions the graph into disjoint subsets while minimizing the “cost” of the cut, i.e., the sum of the cut edge weights. To avoid over-fragmentation, the cut cost is additionally normalized by the size of the resulting sub-graphs, yielding the “Normalized Cut”. The solution to this minimization is usually approximated via eigenvector-based methods, also known as spectral methods. Figure 8 shows an image of two baseball players and the seven segments detected through Normalized Cuts. Note that the algorithm is fit to each input image individually, i.e., the solution does not generalize to other images.

Figure 8: Early attempts at unsupervised semantic segmentation using Normalized Cuts (Shi et al. 2000). (a) shows the input image, and (b-h) illustrates seven detected groups corresponding to semantic entities.
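For intuition, here is a compact sketch of the spectral relaxation: threshold the second-smallest eigenvector (the Fiedler vector) of the normalized graph Laplacian to obtain a two-way partition. Shi & Malik’s full algorithm recursively re-partitions the sub-graphs; the epsilon guard and the median threshold are our simplifications.

```python
import numpy as np

def normalized_cut_bipartition(W: np.ndarray) -> np.ndarray:
    """Approximate a two-way Normalized Cut on a pixel graph.
    W: (N, N) symmetric, non-negative affinity matrix over pixels.
    Returns a boolean partition mask over the N pixels."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-8)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)   # eigenvectors, ascending eigenvalues
    fiedler = eigvecs[:, 1]              # skip the trivial first eigenvector
    return fiedler > np.median(fiedler)  # threshold into two segments
```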

Deep Graph Cuts

While the Normalized Cuts algorithm (Shi et al. 2000) used simple image features like brightness and pixel location, the approach has recently inspired work with features from deep neural networks. As noted in our previous blog posts on Vision Transformers (part 1, part 2), self-supervision leads to interesting emergent properties in combination with the ViT architecture. In particular, the DINO pre-training objective (Caron et al. 2021) produces semantically consistent patch-wise image embeddings that can be used effectively for various downstream tasks such as image classification and video instance segmentation.

A DINO-pre-trained ViT backbone has been successfully combined with graph cuts (Melas-Kyriazi et al. 2022). The authors construct a semantic affinity graph from the embedded image patches and low-level color information of the input image. Similar to Normalized Cuts, they decompose the image into soft segments by calculating the eigenvectors of the Laplacian matrix of the affinity graph. As shown in Figure 9, the eigenvectors already correspond to semantically meaningful regions in the input image. The authors note that, generally, the eigenvector with the smallest nonzero eigenvalue (i.e., the 1st eigenvector in Figure 9) corresponds to the main object in the image. In the second rightmost column of Figure 9, Melas-Kyriazi et al. extract a bounding box from the identified segment and achieve state-of-the-art results in unsupervised object localization. Unsupervised semantic segmentation is done by clustering the deep features of the identified regions into semantically related groups.

Figure 9: Spectral methods applied on a combination of deep ViT-features and color information (Melas-Kyriazi et al. 2022). The eigenvectors of the graph’s Laplacian matrix already segment semantic groups, which helps to bootstrap object localization and semantic segmentation.
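A rough sketch of that recipe, under our own simplifications (the fusion weight, affinity construction, and Laplacian variant differ in detail from Melas-Kyriazi et al.’s implementation):

```python
import torch
import torch.nn.functional as F

def deep_spectral_eigvecs(patch_feats: torch.Tensor, color_aff: torch.Tensor,
                          lam: float = 0.5, k: int = 4) -> torch.Tensor:
    """Fuse ViT patch affinities with low-level color affinities, then return
    the k smallest non-trivial eigenvectors of the graph Laplacian.
    patch_feats: (N, D) patch embeddings; color_aff: (N, N) color affinities."""
    f = F.normalize(patch_feats, dim=-1)
    W = (f @ f.T).clamp(min=0) + lam * color_aff  # fused affinity graph
    L = torch.diag(W.sum(dim=1)) - W              # unnormalized graph Laplacian
    _, eigvecs = torch.linalg.eigh(L)             # ascending eigenvalues
    return eigvecs[:, 1:k + 1]                    # each column ~ a soft segment
```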

Clustering and Self-Supervision

Furthermore, standard k-means-style clustering (Amir et al. 2022) and matrix factorization (Collins et al. 2018) of features from pre-trained backbones have successfully been used to detect related semantic groups for unsupervised segmentation. Other methods use self-supervision to learn visual descriptors for unsupervised segmentation. For instance, PiCIE (Cho et al. 2021) clusters pixel-level features by optimizing a self-supervised loss that enforces invariance to photometric and equivariance to geometric transformations. Self-supervision is also often used for feature distillation of pre-trained backbones before clustering. For example, the MaskContrast algorithm (Van Gansbeke et al. 2021) uses a pre-trained network to extract a binary mask of an image’s main object (also referred to as saliency estimation), then contrastively trains a separate network to maximize agreement of the pixel embeddings within the extracted object mask, and finally clusters the pixel embeddings using k-means. MaskDistill (Van Gansbeke et al. 2022) uses object masks from clustered ViT features as pseudo ground-truth labels to train a second network on high-confidence segmentation masks. While these feature distillation approaches introduce a proxy network, the STEGO algorithm (Hamilton et al. 2022) directly utilizes a ViT backbone, distills the produced latent features, and clusters them into distinct groups, promising a simple and efficient training strategy.
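As a baseline illustration of the clustering idea, the snippet below runs k-means over a dense feature map. The feature shape and the choice of k are our assumptions; 27 happens to match the coarse class count of COCO-Stuff.

```python
import torch
from sklearn.cluster import KMeans

def cluster_dense_features(feats: torch.Tensor, k: int = 27) -> torch.Tensor:
    """Cluster a dense (H, W, D) feature map into k groups and return an
    (H, W) map of cluster ids, i.e., an unsupervised segmentation."""
    H, W, D = feats.shape
    ids = KMeans(n_clusters=k, n_init=10).fit_predict(
        feats.reshape(-1, D).cpu().numpy())
    return torch.from_numpy(ids).reshape(H, W)
```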

Unsupervised Segmentation with STEGO

As the last part of this blog post, let’s briefly walk through the STEGO algorithm, a state-of-the-art method for unsupervised semantic segmentation. Figure 10 shows the architecture of the STEGO method. The approach uses an ImageNet-pre-trained DINO backbone for feature extraction. Note that DINO only uses the ImageNet images (not the labels!) for training. The ViT processes the embedded image patches (i.e., tokens) through several transformer layers, and its output tokens are then reshaped into an image-like feature map. The STEGO architecture adds three main modules on top of DINO. First, an optional bilinear upsampling layer upsamples the feature map to regain the original input image resolution. Then, an unsupervised segmentation head projects the DINO features into a lower-dimensional space. Lastly, a cluster probe identifies clusters in the segmentation head’s output with a k-means algorithm. The identified clusters are matched to human-interpretable labels via the Hungarian algorithm (Kuhn 1955) using few labels. A supervised linear probe additionally evaluates the feature quality of the segmentation head’s output. A stop-gradient operation ensures that the simultaneous optimizations of both probes and the segmentation head do not influence each other and that supervised label information propagates neither into the segmentation head nor the backbone. Finally, the output is optionally refined using a Conditional Random Field (CRF) (Krähenbühl et al. 2011).

Figure 10: Architecture of the STEGO pipeline (Hamilton et al. 2022). Graphic from (Koenig et al. 2023).

Hamilton et al. notice that the DINO backbone already exhibits remarkable feature correspondences. In Figure 11, the authors demonstrate high cosine similarity between a source location in one image and all other semantically related locations in the same or different images. For instance, if the ViT source token sits at the red cross on the motorcyclist, then all tokens in the same image belonging to the motorcyclist have high cosine similarity (brighter red means higher cosine similarity). Furthermore, this also holds for motorcyclists in other images, as seen on the right in Figure 11. An identical pattern can be observed for other semantic classes, such as the sky (in blue) or the ground (in green).

Figure 11: Feature correspondences in the DINO backbone, from which STEGO builds a self-supervised learning signal to train its segmentation head (Hamilton et al. 2022).
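The correspondence visualized in Figure 11 is just cosine similarity between patch features; a minimal sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

def feature_correspondence(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity correspondences between the patch features of two
    images (set f2 = f1 for within-image correspondences as in Figure 11).
    f1: (N1, D), f2: (N2, D) -> correspondence volume of shape (N1, N2)."""
    return F.normalize(f1, dim=-1) @ F.normalize(f2, dim=-1).T
```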

As noted before, the segmentation head is a core component of the STEGO method. Hamilton et al. exploit the observed feature correspondences in the DINO output to train it: they construct a contrastive training objective that amplifies the correspondences already present in the backbone and makes them more consistent across the training dataset. The segmentation head produces features whose pairwise correspondences are trained to be high where the DINO feature correspondences are high, and low where they are low. Combined with the cluster probe, segmentation maps can thus be learned without explicit supervision. Figure 12 shows the impressive unsupervised segmentation performance of the STEGO algorithm.

Figure 12: Qualitative results of STEGO’s cluster probe on COCO-Stuff validation dataset (Hamilton et al. 2022). The segmentation head and cluster probe are trained using no labels.
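A simplified, single-pairing version of STEGO’s correspondence loss could look as follows. Hamilton et al. use several feature pairings (within an image, across similar images, and against negatives), each with its own shift b, so this sketch only conveys the core term; the value of b is illustrative.

```python
import torch
import torch.nn.functional as F

def stego_correspondence_loss(backbone_feats: torch.Tensor,
                              seg_feats: torch.Tensor,
                              b: float = 0.2) -> torch.Tensor:
    """backbone_feats: (N, D) frozen DINO features; seg_feats: (N, D')
    segmentation-head outputs; b shifts which correspondences count as
    positive versus negative."""
    f = F.normalize(backbone_feats, dim=-1)
    s = F.normalize(seg_feats, dim=-1)
    corr_f = f @ f.T                      # backbone feature correspondences
    corr_s = s @ s.T                      # segmentation-head correspondences
    # pull head correspondences up where the backbone correspondence exceeds b,
    # push them down (toward zero) where it falls below
    return -((corr_f - b) * corr_s.clamp(min=0)).mean()
```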

Looking Closer at STEGO

Our recent publication (Koenig et al. 2023) provides a deeper understanding of the STEGO architecture and training strategy to ensure its well-informed usage and to motivate further development. First, we reproduced and extended STEGO’s experimental validation, demonstrating a stronger baseline performance of the DINO backbone than reported in the STEGO paper. Second, we conducted ablation studies that were missing from the original work and that investigate the approach’s contributions in more detail. We show that the working mechanism of the segmentation head is two-fold. On the one hand, our experiments indicate that the segmentation head acts as a dimensionality reduction technique, making it easier for the k-means-based cluster probe to identify clusters in its output; the performance of k-means worsens in higher dimensions due to the curse of dimensionality. On the other hand, the segmentation head non-linearly projects the ViT features, allowing STEGO to adapt to the new training data distribution, which is likely different from ImageNet, on which DINO pre-trains.

Conclusion

In this blog post, we covered what label-efficient segmentation is and why it is a relevant problem, especially in industrial settings. We walked through core approaches that use varying degrees of supervision and dove deeper into selected algorithms (e.g., AffinityNet, Normalized Cuts, and STEGO). Coarsely supervised methods can employ global image labels, semi-supervised methods use incompletely labeled datasets, and unsupervised methods require no annotations at all. Undoubtedly, the field is gaining momentum, and advances in self-supervised representation learning are bound to accelerate progress further.

Bibliography

  • Xiongwei Wu, Doyen Sahoo, and Steven C. H. Hoi. Recent advances in deep learning for object detection. Neurocomputing, 2020.
  • Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Conference on Computer Vision and Pattern Recognition, 2021.
  • Ahmad Alzu’bi and Lujain Alsmadi. Monitoring deforestation in Jordan using deep semantic segmentation with satellite imagery. Ecological Informatics, 2022.
  • Monjoy Saha and Chandan Chakraborty. Her2Net: A deep framework for semantic segmentation and classification of cell membranes and nuclei in breast cancer evaluation. IEEE Transactions on Image Processing, 2018.
  • Wei Shen, Zelin Peng, Xuehui Wang, Huayu Wang, Jiazhong Cen, Dongsheng Jiang, Lingxi Xie, Xiaokang Yang, and Q. Tian. A survey on label-efficient deep image segmentation: Bridging the gap between weak supervision and dense prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Conference on Computer Vision and Pattern Recognition, 2018.
  • Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Conference on Computer Vision and Pattern Recognition, 2016.
  • Jianqiang Huang, Jian Wang, Qianru Sun, and Hanwang Zhang. Attention-based class activation diffusion for weakly-supervised semantic segmentation. arXiv preprint, 2022.
  • Jiwoon Ahn, Sunghyun Cho and Suha Kwak. Weakly supervised learning of instance segmentation with inter-pixel relations. In Conference on Computer Vision and Pattern Recognition, 2019.
  • Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Conference on Computer Vision and Pattern Recognition, 2020.
  • Wei-Chih Hung, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu Lin, and Ming-Hsuan Yang. Adversarial learning for semi-supervised semantic segmentation. In British Machine Vision Conference, 2018.
  • Ruifei He, Jihan Yang, and Xiaojuan Qi. Re-distributing biased pseudo labels for semi-supervised semantic segmentation: a baseline investigation. In International Conference on Computer Vision, 2021.
  • Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross-pseudo supervision. In Conference on Computer Vision and Pattern Recognition, 2021.
  • Geoff French, Samuli Laine, Timo Aila, Michal Mackiewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, varied perturbations. In British Machine Vision Conference, 2019.
  • Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
  • Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In Conference on Computer Vision and Pattern Recognition, 2022.
  • Jang Hyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan. PiCIE: Unsupervised semantic segmentation using invariance and equivariance in clustering. In Conference on Computer Vision and Pattern Recognition, 2021.
  • Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and William T. Freeman. Unsupervised semantic segmentation by distilling feature correspondences. In International Conference on Learning Representations, 2022.
  • Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In International Conference on Computer Vision, 2021.
  • Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. European Conference on Computer Vision Workshop, 2022.
  • Edo Collins, Radhakrishna Achanta, and Sabine Susstrunk. Deep feature factorization for concept discovery. In European Conference on Computer Vision, 2018.
  • Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Unsupervised semantic segmentation by contrasting object mask proposals. In International Conference on Computer Vision, 2021.
  • Wouter Van Gansbeke, Simon Vandenhende, and Luc Van Gool. Discovering object masks with transformers for unsupervised semantic segmentation. arXiv preprint, 2022.
  • Alexander Koenig, Maximilian Schambach, and Johannes Otterbach. Uncovering the inner workings of STEGO for safe unsupervised semantic segmentation. In Conference on Computer Vision and Pattern Recognition Workshop, 2023.
  • Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, 2011.
  • Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
