A Comprehensive Look into CageViT

Golnaz Hosseini · Published in Artificial Corner · Jun 11, 2023

Convolutional Activation Guided Efficient Vision Transformer (CageViT): focusing attention on the major tokens


CageViT is a new vision transformer architecture that uses convolutional activations to divide the input tokens into major and minor tokens. The main aim of this technique is to reduce the computational cost by focusing more on the major tokens. It also incorporates a multi-head fusion module and a gated linear spatial reduction attention mechanism to better capture contextual information and improve classification accuracy. CageViT is evaluated on ImageNet-1k and achieves state-of-the-art performance, outperforming previous transformer-based architectures of similar model complexity.

Contents

  1. Overview
  2. CageViT Architecture
    2.1 Convolutional Activation Guided Attention
    2.2 Gated Linear Spatial Reduction Attention (Gated Linear SRA)
  3. Experimental Results

Overview

The proposed CageViT model is developed in three steps:

  • Selection and rearrangement of major and minor tokens:
    A class activation map, Grad-CAM++, is used to generate heat maps that categorize patches into major and minor tokens. This technique enables the model to distinguish the most significant patches, the major tokens, from the less significant ones, the minor tokens. The minor and major tokens are then rearranged, and a linear embedding is applied to each token.
  • Fusion of minor tokens:
    The minor tokens are combined into a small number of fusion tokens to preserve the background knowledge while reducing the computational cost. Positional and fusion embeddings are then added to the tokens, and the outputs are fed into the vision transformer encoder (see the sketch after this list).
  • Redesign of the attention mechanism (Gated Linear SRA):
    To improve the communication between major tokens and fusion tokens, a novel attention mechanism is proposed.
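To make the first two steps concrete, here is a minimal PyTorch sketch. The saliency scores stand in for the Grad-CAM++ heat map (which the paper obtains from a convolutional network), and the group-averaging fusion, split ratio, and fusion-token count are illustrative assumptions, not the paper's exact design.

```python
import torch

def select_and_fuse_tokens(patch_tokens, saliency, num_major, num_fusion):
    """Split patch tokens into major/minor by saliency, then fuse minors.

    patch_tokens: (B, N, D) linearly embedded patches
    saliency:     (B, N) per-patch importance, e.g. from Grad-CAM++
    """
    B, N, D = patch_tokens.shape
    # Rank patches by saliency; the top-k become major tokens.
    order = saliency.argsort(dim=1, descending=True)      # (B, N)
    idx = order.unsqueeze(-1).expand(-1, -1, D)           # (B, N, D)
    rearranged = patch_tokens.gather(1, idx)              # majors first
    major = rearranged[:, :num_major]                     # (B, k, D)
    minor = rearranged[:, num_major:]                     # (B, N-k, D)
    # Fuse minors into a few fusion tokens by group averaging
    # (a simple stand-in for the paper's learned fusion).
    groups = minor.chunk(num_fusion, dim=1)
    fusion = torch.stack([g.mean(dim=1) for g in groups], dim=1)  # (B, f, D)
    return major, fusion

# Example: 196 patches (14x14), keep 98 majors, 8 fusion tokens.
tokens = torch.randn(2, 196, 384)
sal = torch.rand(2, 196)
major, fusion = select_and_fuse_tokens(tokens, sal, num_major=98, num_fusion=8)
print(major.shape, fusion.shape)  # (2, 98, 384) and (2, 8, 384)
```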

CageViT Architecture

Figure 1. CageViT architecture (all images are from the original paper [1])

Convolutional Activation Guided Attention

In the original Vision Transformer, the inner product between Q and K gives a time and space complexity of O(N²), where N is the number of tokens. One strategy for improving computational efficiency is to reduce the number of tokens fed into the transformer encoder.
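As a rough back-of-the-envelope illustration of why reducing the token count matters (the token counts below are hypothetical, not the paper's exact configuration):

```python
# Self-attention cost scales with the square of the sequence length.
n_full = 196            # e.g. a 224x224 image with 16x16 patches
n_reduced = 98 + 8      # hypothetical: 98 major tokens + 8 fusion tokens

cost_full = n_full ** 2
cost_reduced = n_reduced ** 2
print(cost_full / cost_reduced)  # ~3.4x fewer attention interactions
```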

After partitioning the input image into tokens, a salience map is computed to determine each token's importance. The tokens are then divided into the most important ones, the major tokens, and the less important ones, the minor tokens. The minor and major tokens are rearranged and passed through the linear projection layer.

All minor tokens are combined into a few fusion tokens that retain the background knowledge. This greatly reduces the computational complexity while still preserving all the important major tokens for accurate classification. Positional and fusion embeddings are then added to the tokens, and the results are fed into the Efficient Transformer Encoder.
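Continuing the sketch from the overview, attaching positional and fusion embeddings before the encoder might look like the following; the learned-parameter shapes are assumptions:

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Hypothetical layer: adds positional embeddings to major tokens and
    a learned fusion embedding to fusion tokens before the encoder."""
    def __init__(self, num_major, num_fusion, dim):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_major, dim))
        self.fusion_embed = nn.Parameter(torch.zeros(1, num_fusion, dim))

    def forward(self, major, fusion):
        major = major + self.pos_embed
        fusion = fusion + self.fusion_embed
        # Concatenate so the encoder sees majors and fusion tokens together.
        return torch.cat([major, fusion], dim=1)

# Shapes follow the earlier sketch: 98 major and 8 fusion tokens.
major = torch.randn(2, 98, 384)
fusion = torch.randn(2, 8, 384)
embed = TokenEmbedder(num_major=98, num_fusion=8, dim=384)
encoder_input = embed(major, fusion)   # (2, 106, 384)
```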

Gated Linear Spatial Reduction Attention (Gated Linear SRA)

Figure 2. Gated Linear Spatial Reduction Attention

Gated Linear Spatial Reduction Attention (Gated Linear SRA) is a redesigned layer in the CageViT architecture that better fits the Convolutional Activation Guided Attention technique.

As depicted in Figure 2, all the fusion tokens are merged and fed into the gate layer to incorporate the background information stored in the minor tokens. The gate module is a two-layer MLP that maps the merged fusion tokens into a dimension in which they can exchange information with the average-pooled value tokens.

The proposed gating mechanism improves the communication between major tokens and fusion tokens: all major tokens can interact with the minor tokens' background information while being supervised, i.e., gated, by the fusion tokens, which requires each major token to be aware of the background information.
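A sketch of what such a layer could look like in PyTorch, assuming a PVTv2-style linear SRA (average-pooled keys and values) and a sigmoid gate computed from the merged fusion tokens; the pooling size and exact gating form are assumptions, not the paper's verified design:

```python
import torch
import torch.nn as nn

class GatedLinearSRA(nn.Module):
    """Sketch of Gated Linear SRA: pooled keys/values for linear spatial
    reduction, plus a two-layer MLP gate driven by the fusion tokens."""
    def __init__(self, dim, num_heads=8, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AdaptiveAvgPool1d(pool_size)  # linear spatial reduction
        self.gate = nn.Sequential(                   # two-layer MLP gate
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim), nn.Sigmoid()
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, major, fusion):
        B, N, D = major.shape
        q = self.q(major).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Spatially reduce the key/value sequence by average pooling.
        reduced = self.pool(major.transpose(1, 2)).transpose(1, 2)   # (B, P, D)
        k, v = self.kv(reduced).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        # Merge fusion tokens and gate the output, injecting background info.
        g = self.gate(fusion.mean(dim=1, keepdim=True))              # (B, 1, D)
        return self.proj(out * g)

major = torch.randn(2, 98, 384)
fusion = torch.randn(2, 8, 384)
out = GatedLinearSRA(dim=384)(major, fusion)   # (2, 98, 384)
```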

Experimental Results

CageViT model variants: CageViT-T, CageViT-S, CageViT-B, CageViT-L (Tiny, Small, Base, and Large)

Configuration: Adam optimizer, cosine decay learning rate scheduler with a 5-epoch linear warm-up, batch size of 2048, initial learning rate of 5e-4, weight decay of 0.05, and gradient clipping with a max norm of 5.
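A minimal sketch of that training recipe in PyTorch; the model is a placeholder, and the total epoch count is an assumption (only the 5-epoch warm-up is stated above):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(384, 1000)      # placeholder standing in for CageViT
epochs, warmup_epochs = 300, 5    # 5-epoch warm-up per the paper; total assumed

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=0.05)
# Linear warm-up followed by cosine decay.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

def train_step(images, targets):
    loss = nn.functional.cross_entropy(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping with a max norm of 5, as in the paper.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss
```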

An experiment was conducted to compare the performance of the proposed CageViT model with other backbones, including transformer-based and ConvNet-based models. As shown in Table 1, the proposed CageViT outperforms existing cutting-edge models.

Table 1. Comparison with state-of-the-art backbones on ImageNet-1k benchmark.

References

[1] Hao Zheng, Jinbao Wang, Xiantong Zhen, H. Chen, Jingkuan Song, and Feng Zheng. CageViT: Convolutional Activation Guided Efficient Vision Transformer (2023).
