Faster Encoders with Transformers for Segmentation | LeViT U-Net

Golnaz Hosseini · Published in Artificial Corner · Jul 2, 2023

LeViT Transformer + U-Net architecture for fast and accurate semantic segmentation


LeViT-UNet[2] is a new architecture for medical image segmentation that uses a LeViT transformer encoder in place of the usual purely convolutional encoder, which allows it to learn long-range dependencies more efficiently. This makes LeViT-UNet faster than other transformer-based models such as TransUNet while still achieving state-of-the-art segmentation performance.

Contents

  • Overview
  • LeViT-UNet Architecture
  • Experimental Results

Overview

LeViT[1] is used as the encoder of LeViT-UNet[2] because it strikes a better balance between accuracy and efficiency than other transformer blocks. Multi-scale feature maps from the transformer blocks and convolutional blocks of LeViT are passed to the decoder via skip connections, allowing the decoder to effectively reuse the spatial information in those feature maps, which improves the performance of LeViT-UNet.

LeViT-UNet[2] achieved better performance than other competing methods on several challenging medical image segmentation benchmarks, including the Synapse multi-organ segmentation dataset (Synapse) and the Automated Cardiac Diagnosis Challenge dataset (ACDC).

LeViT-UNet Architecture

LeViT-UNet is a new model for 2D medical image segmentation that is inspired by the LeViT transformer. LeViT-UNet aims to create a faster encoder that can still achieve good segmentation performance.

The LeViT-UNet model consists of an encoder, a decoder, and skip connections. The encoder is built using LeViT transformer blocks, which are designed to be efficient and effective at learning global features. The decoder is built using convolutional blocks.

The encoder extracts feature maps from the input image at multiple resolutions. These feature maps are then upsampled, concatenated, and passed into the decoder with skip connections. The skip connections allow the decoder to access the high-resolution local features from the encoder, which helps to improve the segmentation performance.

This design allows the model to combine the strengths of both transformers and CNNs: transformers are good at learning global features, while CNNs excel at learning local features. By integrating the two, LeViT-UNet achieves good segmentation performance while remaining relatively efficient.
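To make the fusion step concrete, here is a minimal PyTorch sketch of the upsample-and-concatenate operation described above. The stage shapes mimic a 224x224 input passing through a LeViT-192-style encoder; the function name and the bilinear interpolation mode are my own choices, not the paper's code.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(features):
    # Upsample every transformer-stage feature map to the resolution of
    # the first (largest) map, then concatenate along the channel axis.
    target = features[0].shape[-2:]
    upsampled = [
        F.interpolate(f, size=target, mode="bilinear", align_corners=False)
        for f in features
    ]
    return torch.cat(upsampled, dim=1)

# Toy stage outputs for a 224x224 input through a LeViT-192-style encoder
feats = [
    torch.randn(1, 192, 14, 14),
    torch.randn(1, 288, 7, 7),
    torch.randn(1, 384, 4, 4),
]
print(fuse_multiscale(feats).shape)  # torch.Size([1, 864, 14, 14])
```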

Figure 1. The Architecture of LeViT UNet

LeViT as Encoder

LeViT[1] is employed as the encoder, which consists of two main components: convolutional blocks and transformer blocks. The convolutional blocks reduce resolution by applying four layers of 3x3 convolutions with stride 2 to the input image; each convolution halves the spatial resolution, downsampling the input by a factor of 16 overall while extracting increasingly abstract features. The transformer blocks then take the feature maps from the convolutional blocks and learn global features.
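A rough PyTorch sketch of that convolutional stem is below. The channel widths (24 → 48 → 96 → 192) match the LeViT-192 configuration, and the BatchNorm/Hardswish pattern follows the LeViT reference implementation rather than anything stated in this article.

```python
import torch
import torch.nn as nn

def stem_layer(c_in, c_out):
    # One stem step: a 3x3 stride-2 convolution that halves the
    # spatial resolution, followed by BatchNorm and Hardswish.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.Hardswish(),
    )

# Four stride-2 layers -> 16x downsampling overall
conv_stem = nn.Sequential(
    stem_layer(3, 24),    # 224 -> 112
    stem_layer(24, 48),   # 112 -> 56
    stem_layer(48, 96),   # 56  -> 28
    stem_layer(96, 192),  # 28  -> 14
)

x = torch.randn(1, 3, 224, 224)
print(conv_stem(x).shape)  # torch.Size([1, 192, 14, 14])
```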

The features from the convolutional blocks and transformer blocks are then concatenated in the last stage of the encoder, enabling it to benefit from both local and global features. Local features are important for identifying small, detailed objects in an image, whereas global features capture its overall structure. By combining the two, the encoder is able to generate more accurate segmentations.

Three LeViT encoder variants are used, named after the number of channels fed into the first transformer block: LeViT-128S, LeViT-192, and LeViT-384.
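For reference, the stage widths of the three variants, as I read them from the tables in the LeViT paper [1]:

```python
# Embedding width of each of the three LeViT transformer stages per
# variant; the variant name is the width of the first stage.
LEVIT_STAGE_WIDTHS = {
    "LeViT-128S": (128, 256, 384),
    "LeViT-192": (192, 288, 384),
    "LeViT-384": (384, 512, 768),
}
```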

Figure 2. Block diagram of LeViT-192 architecture

CNNs as Decoder

LeViT-UNet’s decoder concatenates the features from the encoder with skip connections. This enables the decoder to access the high-resolution local features from the encoder, which helps to improve the segmentation performance. The cascaded upsampling strategy is employed to recover the resolution from the previous layer using CNNs. It consists of a series of upsampling layers, each of which is followed by two 3x3 convolution layers, a batch normalization layer, and a ReLU layer.
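A minimal sketch of one such decoder step in PyTorch, assuming bilinear upsampling and illustrative channel sizes (neither is specified in the article):

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One step of the cascaded upsampling decoder: upsample, concatenate
    the skip feature map, then two 3x3 conv + BN + ReLU layers."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # recover resolution from the previous layer
        x = torch.cat([x, skip], dim=1)  # reuse high-resolution encoder features
        return self.convs(x)

block = UpBlock(in_ch=384, skip_ch=192, out_ch=128)
out = block(torch.randn(1, 384, 7, 7), torch.randn(1, 192, 14, 14))
print(out.shape)  # torch.Size([1, 128, 14, 14])
```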

Experimental Results

Implementation details: data augmentation with random flipping and rotation; Adam optimizer with a learning rate of 1e-5 and a weight decay of 1e-4; input size 224x224; batch size 8; 350 epochs for Synapse and 400 for ACDC.
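In PyTorch, those optimizer settings would look roughly like this. The one-layer model is a stand-in for LeViT-UNet, and the cross-entropy loss and nine Synapse classes (eight organs plus background) are my assumptions, not details from the article:

```python
import torch
from torch import nn, optim

model = nn.Conv2d(3, 9, kernel_size=1)  # placeholder for LeViT-UNet
# Reported settings: Adam, learning rate 1e-5, weight decay 1e-4
optimizer = optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()  # assumed loss, not stated above

images = torch.randn(8, 3, 224, 224)         # batch size 8, 224x224 inputs
labels = torch.randint(0, 9, (8, 224, 224))  # 9 classes assumed for Synapse

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```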

As shown in Table 1, the proposed LeViT-UNet outperformed the existing models and is significantly faster than TransUNet, which also incorporates Transformer blocks into a CNN.

Table 1. Mean DSC and HD of the proposed LeViT-UNet compared to other state-of-the-art semantic segmentation methods on the Synapse dataset

Figure 3 shows qualitative segmentation results from four methods: TransUNet, UNet, DeepLabv3+, and LeViT-UNet. Compared with LeViT-UNet, the other three methods are more likely to under- or over-segment the organs. For example, the stomach is under-segmented by TransUNet and DeepLabv3+ (red arrow in the third panel of the upper row) and over-segmented by UNet (red arrow in the fourth panel of the second row).

The outputs of LeViT-UNet are noticeably smoother than those of the other methods, which suggests that LeViT-UNet has an advantage in boundary prediction.

Figure 3. Qualitative comparison of different methods. From left to right: Ground Truth, LeViT-UNet-384, TransUNet, UNet, and DeepLabv3+.

References

[1] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze, LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, 2021

[2] Guoping Xu, Xingrong Wu, Xuan Zhang, Xinwei He, LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, 2021
