Transformers Make Strong Encoders | TransUNet

Bringing together the locality of CNNs and the attention mechanism of Transformers for medical image segmentation

Golnaz Hosseini
Artificial Corner
3 min read · Jun 23, 2023


Photo by Declan Sun on Unsplash

U-Net[1], a symmetric encoder-decoder network, has emerged as the de-facto standard for medical image segmentation. However, due to the intrinsic locality of convolution operations, such architectures generally demonstrate limitations in modeling global context.

Vision Transformers[2] employ self-attention mechanisms to model global context. However, purely Transformer-based encoders can suffer from limited localization ability.

TransUNet[3] employs a hybrid CNN-Transformer architecture to take advantage of both the detailed high-resolution spatial information from CNN features and the global context captured by the Transformer.

Contents

  • Overview
  • TransUNet Architecture
  • Experimental Results

Overview

Figure 1. Framework overview. (a) Transformer as the encoder; (b) Architecture of TransUNet. (All images and tables come from the original paper[3].)

First, a CNN is used as the feature extractor to generate local features; its feature maps then serve as the input to the Vision Transformer, which acts as the encoder of the U-shaped TransUNet.

Like U-Net, TransUNet includes two main sections:

  • Encoder (Vision Transformer): the Vision Transformer encodes patches of the feature maps generated by a convolutional neural network.
  • Decoder (Upsampling): the decoder upsamples the encoded features, which are then merged with the high-resolution CNN feature maps to achieve precise segmentation.

TransUNet is evaluated on the Synapse multi-organ segmentation dataset and the Automated Cardiac Diagnosis Challenge (ACDC) dataset.

TransUNet Architecture

Transformer as Encoder

First, the input x is partitioned into N 2D patches x_p, each of size P×P (so N = HW/P²). The patches are then flattened and projected into a D-dimensional embedding space by a trainable linear projection. To preserve spatial information, a position embedding is added:

z_0 = [x_p^1 E; x_p^2 E; … ; x_p^N E] + E_pos

where E is the patch embedding projection and E_pos is the position embedding.
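
A minimal PyTorch sketch of this step, assuming standard ViT hyperparameters (P = 16, D = 768, a 224×224 input); the class and parameter names are hypothetical, and the strided convolution is the usual trick that is equivalent to flattening each P×P patch and applying E:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten P x P patches and project them into D dimensions (hypothetical sketch)."""

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768, num_patches=196):
        super().__init__()
        # A Conv2d with kernel = stride = P is equivalent to flattening
        # each P x P patch and applying the linear projection E.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable position embedding E_pos, one D-vector per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):                   # x: (B, C, H, W)
        x = self.proj(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (B, N, D), N = HW / P^2
        return x + self.pos_embed           # z_0
```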

The embedded sequence is then passed through the Transformer encoder to produce the feature representation; each encoder layer consists of a layer normalization (LN) operator, a Multihead Self-Attention (MSA) block, and a Multi-Layer Perceptron (MLP) block.
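
In equation form (the standard ViT formulation the paper follows, with z_0 the embedded sequence above and L the number of layers):

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = MLP(LN(z'_l)) + z'_l,    l = 1, …, L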

TransUNet Decoder

The Cascaded Upsampler (CUP) is employed to recover the full resolution, going from H/P × W/P to H × W. CUP consists of multiple upsampling blocks, where each block comprises a 2× upsampling operator, a 3×3 convolution layer, and a ReLU layer.
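
A minimal PyTorch sketch of one such block; the names are hypothetical, the bilinear interpolation mode is an assumption (the paper only specifies a 2× upsampling operator), and the optional skip argument anticipates the skip-connections of the full TransUNet in Figure 1b:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One CUP block: 2x upsampling -> 3x3 convolution -> ReLU (sketch)."""

    def __init__(self, in_channels, out_channels, skip_channels=0):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_channels + skip_channels, out_channels,
                              kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip=None):
        x = self.up(x)                       # double the spatial resolution
        if skip is not None:                 # fuse a high-res CNN feature map
            x = torch.cat([x, skip], dim=1)
        return self.relu(self.conv(x))
```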

CNN-Transformer Hybrid as Encoder

Rather than using a pure Transformer as the encoder, TransUNet employs a CNN-Transformer hybrid, where a CNN is first used to extract features from the raw image. Patch embedding is then applied to patches of the CNN feature maps instead of the raw image.

The CNN-Transformer hybrid also makes the intermediate high-resolution CNN feature maps available to the decoding section.
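
A rough sketch of this hybrid input stage, assuming torchvision's ResNet-50 as the CNN; the exact truncation point of the backbone is an assumption, while the 1×1 convolution reflects the paper's note that patch embedding is applied to 1×1 patches of the CNN feature map:

```python
import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbedding(nn.Module):
    """CNN-Transformer hybrid input stage (rough sketch, not the official code)."""

    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep the stem and the first three stages; the exact truncation
        # point is an assumption about the official implementation.
        self.cnn = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1, backbone.layer2, backbone.layer3,
        )  # output: (B, 1024, H/16, W/16)
        # Patch embedding on 1x1 patches of the feature map = a 1x1 conv.
        self.proj = nn.Conv2d(1024, embed_dim, kernel_size=1)

    def forward(self, x):
        feat = self.cnn(x)                        # CNN feature map
        tokens = self.proj(feat)                  # (B, D, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D)
```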

Experimental Results

Implementation details: data augmentation (random rotation and flipping), SGD optimizer (learning rate 0.01, momentum 0.9, weight decay 1e-4), and batch size 24.
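
These reported hyperparameters map directly onto a PyTorch optimizer (a sketch; the placeholder model stands in for any TransUNet implementation):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 9, kernel_size=1)  # placeholder for a TransUNet model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```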

  • For the pure Transformer-based encoder, a Vision Transformer[2] with 12 Transformer layers is employed (ViT).
  • For the hybrid encoder design, ResNet-50 combined with the Vision Transformer[2] is used (R50-ViT).

An experiment is conducted to compare TransUNet with the previous state of the art on the Synapse multi-organ segmentation dataset.

Table 1. Comparison on the Synapse multi-organ CT dataset (average Dice Score in %, average Hausdorff Distance in mm, and per-organ Dice Score in %).

As illustrated in Table 1:

  • Compared with ViT-None, ViT-CUP exhibits an improvement, demonstrating the benefit of the CUP decoder.
  • Similarly, compared with ViT-CUP, R50-ViT-CUP brings a further improvement of 3.43% in DSC and 3.24 mm in Hausdorff distance, demonstrating the effectiveness of the hybrid encoder.
  • Built upon R50-ViT-CUP, the proposed TransUNet, which is additionally equipped with skip-connections, achieves the best performance.

References

[1] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation (2015)

[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)

[3] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. Yuille, Y. Zhou, TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation (2021)
