Universal Vision Transformer: Simple Yet Effective (Paper Explained)

Sadman Shakib
3 min read · Jun 15, 2023


In this post, I briefly explain the main idea of the paper “A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation”. For more details, please refer to the original paper.
If you are already familiar with the Vision Transformer, you can skip directly to section 3.

1. An Introduction to the Vision Transformer (ViT):

The Vision Transformer (ViT) is a transformer architecture that operates on images and has shown highly competitive performance against state-of-the-art CNN models. The emergence of ViT models has introduced a new way of handling computer vision tasks. The architecture was introduced in a conference paper at ICLR 2021 titled “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”. It was developed by the Google Research Brain Team and pre-trained on the ImageNet and ImageNet-21k datasets.

2. An Overview of How Vision Transformer (ViT) Works:

In case you need a refresher, here is a brief overview of ViT to set the stage for the concepts that follow. ViT splits an image into fixed-size patches, linearly projects each patch into an embedding, and adds positional embeddings. The resulting sequence of patch embeddings is fed into a transformer encoder consisting of multiple stacked transformer blocks. Each block has two sublayers, a multi-head self-attention layer and a feed-forward layer, which together let the model capture relationships between different patches. The model is usually pre-trained on a large dataset and then fine-tuned on a particular computer vision task using labeled data.

Figure: The architecture of the Vision Transformer (ViT). Taken from “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”.
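To make the pipeline concrete, here is a minimal PyTorch sketch of the steps above: patchify, add positional embeddings, and run a stack of transformer encoder blocks. The class name, hyperparameters, and the omission of the [CLS] token and classification head are my own simplifications for illustration, not values from the paper.

```python
# A minimal sketch of the ViT forward pass described above (illustrative
# hyperparameters; [CLS] token and classification head omitted).
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, hidden=192, depth=6, heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Split the image into patches and linearly embed them with a strided conv.
        self.patch_embed = nn.Conv2d(3, hidden, kernel_size=patch_size, stride=patch_size)
        # Learnable positional embeddings, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden))
        # A stack of standard transformer blocks
        # (multi-head self-attention + feed-forward in each block).
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=hidden * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.patch_embed(x)              # (B, hidden, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, hidden)
        x = x + self.pos_embed               # add positional information
        return self.encoder(x)               # (B, num_patches, hidden)

tokens = MiniViT()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 192])
```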

3. Exploring the Concept of the Universal Vision Transformer:

The idea of the Universal Vision Transformer (UViT) is to keep things simple. Rather than adding extra layers to the ViT architecture and making the design more complex, the authors show that a vanilla ViT with a better depth-width trade-off can achieve high performance. In UViT, image patches plus position embeddings are processed by a stack of vanilla attention blocks with a constant resolution and a constant hidden size. The resulting single-scale feature map is then fed into head modules for detection or segmentation tasks. The computation cost can be reduced by using a constant or progressive attention window.

Figure: The architecture of the Universal Vision Transformer (UViT). Taken from “A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation”.
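Under the same illustrative assumptions as the sketch above, the following snippet shows the UViT idea: a plain stack of attention blocks running at a constant resolution and hidden size, whose single-scale output is reshaped into a feature map and handed to a task head. The layer sizes and the 1×1-conv "head" are placeholders, not the paper's actual detection or segmentation heads.

```python
# A minimal sketch of the UViT backbone: vanilla attention blocks at constant
# resolution and hidden size, producing one single-scale feature map.
# (Positional embeddings omitted for brevity; sizes are illustrative.)
import torch
import torch.nn as nn

hidden, depth, patch = 192, 18, 8           # the paper's UViT uses 18 attention blocks

patch_embed = nn.Conv2d(3, hidden, kernel_size=patch, stride=patch)
block = nn.TransformerEncoderLayer(d_model=hidden, nhead=6, dim_feedforward=hidden * 4,
                                   batch_first=True)
backbone = nn.TransformerEncoder(block, num_layers=depth)
# Stand-in for a detection/segmentation head that consumes one single-scale feature map.
head = nn.Conv2d(hidden, 256, kernel_size=1)

img = torch.randn(1, 3, 256, 256)
feat = patch_embed(img)                     # (1, hidden, 32, 32) -- resolution fixed from here on
B, C, H, W = feat.shape
tokens = feat.flatten(2).transpose(1, 2)    # (1, 1024, hidden)
tokens = backbone(tokens)                   # same shape after every block: no down/up-sampling
feature_map = tokens.transpose(1, 2).reshape(B, C, H, W)
print(head(feature_map).shape)              # torch.Size([1, 256, 32, 32])
```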

By studying various UViT variants, the researchers show that a UViT with 18 attention blocks and an 896×896 input size achieves the best performance-efficiency trade-off.

Self-attention is, in principle, a global operator. In UViT, however, the attention heads in the early layers behave like a local operator, while in the deeper layers, thanks to the enlarged receptive field, they behave like a global operator. We can therefore restrict the attention range of the early layers to reduce computation cost without any loss in performance.

Concretely, the researchers found that the first 14 attention blocks can use a 1/4-scale attention window, the next 2 blocks a 1/2-scale window, and the last 2 blocks a full-scale (1×) window. This is the progressive attention window strategy mentioned above, sketched below.
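The sketch below illustrates this schedule on a square token grid, again with illustrative sizes: attention is computed only inside non-overlapping windows whose size grows from 1/4 of the grid to 1/2 and finally to the full grid. Reusing a single attention module for every block and dropping the feed-forward sublayers are simplifications for brevity, not how the paper builds the model.

```python
# A rough sketch of the progressive attention window schedule:
# 14 blocks at 1/4-scale windows, 2 at 1/2-scale, 2 at full scale.
import torch
import torch.nn as nn

def window_attention(tokens, grid, window, attn):
    """Apply self-attention independently inside each (window x window) region."""
    B, N, C = tokens.shape
    x = tokens.reshape(B, grid, grid, C)
    # Partition the token grid into non-overlapping windows.
    x = x.reshape(B, grid // window, window, grid // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    x, _ = attn(x, x, x)                     # attention restricted to one window
    # Reverse the partition back to the full token sequence.
    x = x.reshape(B, grid // window, grid // window, window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
    return x

grid, hidden = 32, 192                       # 32x32 token grid, illustrative hidden size
attn = nn.MultiheadAttention(hidden, num_heads=6, batch_first=True)  # reused for brevity
schedule = [grid // 4] * 14 + [grid // 2] * 2 + [grid] * 2

tokens = torch.randn(1, grid * grid, hidden)
for window in schedule:
    tokens = window_attention(tokens, grid, window, attn)  # feed-forward sublayers omitted
print(tokens.shape)                          # torch.Size([1, 1024, 192])
```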

4. Why UViT Could Be a Better Approach:

While ViT has shown promising results in image classification, more recent works have tried to customize ViT architectures to be CNN-like in order to solve other vision problems such as object detection and semantic segmentation. However, these design conventions are largely adopted as black boxes, without a clear understanding of what each component actually contributes, and design conventions and scaling laws established for CNNs may not carry over to ViTs. Unlike CNNs, whose convolutions provide local features, ViT’s self-attention layers provide global features, which is also part of the reason ViTs are data-hungry models.

The takeaway is that the simple UViT architecture is strong and efficient without introducing any additional design overhead. For the results of the UViT model on various datasets, please refer to the original paper.


Sadman Shakib

I am currently an undergraduate student in the Department of Computer Science and Engineering at the University of Rajshahi. I am interested in deep learning.