How do Vision Transformers work? An Image is Worth 16x16 Words

Sieun Park · Published in CodeX · Jul 30, 2021

Transformers, an architecture built entirely on attention, have outperformed competing NLP models ever since their release. These models are highly efficient to train and scale to hundreds of billions of parameters, as demonstrated by the recent release of GPT-3. They benefit from ever-growing dataset sizes and compute budgets, and they generalize well to other applications, as illustrated by the huge success of pre-trained BERT models being fine-tuned for a wide range of tasks.

However, earlier applications of attention-only networks to large-scale computer vision were not as successful, mostly because the proposed self-attention mechanisms were infeasible for medium and large images: their cost grows with the number of pixels (quadratically, for full self-attention). Some of these specialized attention variants were also tricky to accelerate on GPUs, much like the recurrent models that failed to keep up with transformers.

Say hello to the Vision Transformer (ViT), which applies a transformer to images in a simple way: the image is split into multiple patches. Treating image patches as words, ViT feeds embeddings of the patches to the transformer. ViT achieves competitive performance on mid-sized datasets such as ImageNet and CIFAR-100, and results improve further with larger pre-training datasets, where ViT matches or beats CNNs on several benchmarks.

In this post, let’s look at how ViT works and where it does well or falls short. The paper is available at the link below.

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Disclaimer: This post doesn’t explain the concepts of the original transformer model, nor the building blocks of transformers such as multi-head attention.

Overview of the ViT architecture

Figure: Overview of ViT

The input image is first sliced into patches of size P×P. Each patch is flattened and linearly mapped to a D-dimensional vector; this is the embedding stage of ViT. A learnable class token is prepended to the sequence of patch embeddings, and position embeddings, as in the original transformer, are added to them. The position is encoded as a simple 1D index, since a 2D position embedding based on x, y coordinates wasn’t helpful to the model. This process converts image patches into tokens, and the resulting token input is no different from that of a regular NLP task, so no modification is made to the transformer encoder.
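To make the embedding stage concrete, here is a minimal PyTorch sketch of patch extraction, linear projection, the class token, and position embeddings. The names, shapes, and defaults below are illustrative assumptions, not the paper’s reference implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch of the ViT embedding stage (illustrative, not the reference code)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        # Flatten each P x P x C patch and linearly map it to a D-dimensional vector.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)
        # Learnable [class] token prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable 1D position embeddings, one per token (patches + class token).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Slice the image into non-overlapping P x P patches and flatten each one.
        patches = x.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        tokens = self.proj(patches)                          # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)               # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1)             # (B, N + 1, D)
        return tokens + self.pos_embed                       # add position embeddings
```

For a 224×224 RGB image with P = 16, this yields 14×14 = 196 patch tokens plus one class token, each of dimension 768, which are then fed to an unmodified transformer encoder.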

ViT can also incorporate CNNs to improve performance further. The paper proposes a hybrid model in which the transformer is fed patches of feature maps computed by a CNN instead of patches of the raw image.
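As a rough sketch of the hybrid idea, the token sequence can be formed from the spatial positions of a CNN feature map rather than from raw pixel patches. The ResNet-50 trunk below is a stand-in backbone chosen for illustration; the paper’s hybrid uses its own ResNet configuration.

```python
import torch.nn as nn
import torchvision

class HybridEmbedding(nn.Module):
    """Sketch of the hybrid variant: tokens come from CNN feature maps, not raw pixels."""

    def __init__(self, embed_dim=768):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep the convolutional trunk; drop the average pooling and classification head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, embed_dim)  # ResNet-50 final feature maps have 2048 channels

    def forward(self, x):                              # x: (B, 3, H, W)
        feats = self.backbone(x)                       # (B, 2048, H/32, W/32)
        # Treat each spatial position of the feature map as one "patch" token.
        tokens = feats.flatten(2).transpose(1, 2)      # (B, (H/32)*(W/32), 2048)
        return self.proj(tokens)                       # (B, num_tokens, D)
```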

Training

ViT is pre-trained on a large dataset and fine-tuned on a smaller one. Since the transformer can receive sequences of arbitrary length, the patch size is kept consistent and the number of patches is allowed to change with the image resolution. This way, the linear projection and input layers are kept when transferring between datasets; only the final prediction layer is replaced, because the set of classes differs.
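As a small illustration of the transfer step, only the classifier needs to change. The sketch below assumes a model object whose final linear layer is an attribute named head; that name is an assumption made for illustration, not a fixed API.

```python
import torch.nn as nn

def prepare_for_finetuning(vit_model, num_classes):
    # Keep the pretrained embedding, encoder, and input layers untouched;
    # only the final prediction layer is re-initialized for the new label set.
    embed_dim = vit_model.head.in_features   # `head` is an assumed attribute name
    vit_model.head = nn.Linear(embed_dim, num_classes)
    return vit_model
```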

Experiments

The paper uses three variants of ViT in its experiments, sized following BERT. The first experiment varies the network architecture and compares it with the previous SOTA, Noisy Student, on many datasets. When the ViT-H model (ViT-Huge) is pre-trained on the large JFT-300M dataset, it outperforms the previous baselines on almost all benchmarks, while taking substantially less compute to pre-train. The experiment is summarized in the paper.

The need for a large pre-training dataset

The ViT architecture described above makes little use of the 2D structure of the image. The authors argue that this gives the model less image-specific inductive bias than CNNs, so it must learn spatial relationships from scratch. This naturally raises the need for larger datasets to train and pre-train on.

The corresponding figure in the paper illustrates that the larger models (ViT-H, ViT-L) perform worse when pre-trained on relatively small datasets; they seem to overfit easily. Considering that the large models were heavily regularized, we can see that the number of images required to truly exploit ViT’s capacity is very large.

Scaling to larger transformers

Another experiment assesses training compute against accuracy, tracing what is often called the Pareto curve of a model. ViT is consistently better than its competitor CNNs on this compute/performance tradeoff. Comparing the hybrid model, which uses CNN features, with the plain model that doesn’t, the hybrid does seem to outperform ViT in small-scale settings. However, the performance gap disappears with further scaling; surprisingly, CNNs might not be needed for ViT at all. Another finding is that the performance curve doesn’t seem to saturate yet, unlike the Pareto curves of regular CNNs, which leaves open the possibility of improving performance through further scaling.

Conclusion

In this post, we reviewed the original vision transformer architecture and the properties of ViTs discovered through experiments. ViT converts image patches into tokens and applies a standard transformer directly to them, interpreting them as word embeddings. The experiments showed promising results in image classification compared to CNNs. While ViTs seem to require large datasets to train and pre-train competitively, they have the potential to scale even further.

The paper also suggests that another challenge for ViTs is their application to other computer vision tasks such as segmentation and object detection. To me, these tasks don’t seem solvable with the ViT approach of simply cropping the image into patches. But applications of transformers in vision are gaining more and more interest, and SOTA CNNs have already been replaced by transformers in many fields. Keep an eye on the progression of vision transformers.
