A Brief History of Vision Transformers:
Revisiting Two Years of Vision Research

Merantix Momentum
Merantix Momentum Insights
12 min read · Oct 19, 2022

Part I: Self-attention and the Vision Transformer

Author: Maximilian Schambach

Introduction

After its tremendous success in Natural Language Processing, the Transformer architecture has become increasingly popular and useful in Computer Vision as well. Since its introduction in October 2020 (Dosovitskiy et al. 2021), the Vision Transformer has sparked a surge of interest in Transformers for vision applications, as shown in Figure 1. As it approaches its two-year anniversary, we take this opportunity to give a brief introduction to the Vision Transformer and discuss a selection of its variants that have evolved over the past two years, as well as the challenges arising when applying Transformers to vision tasks.

Figure 1: Publication timeline of a selection of Vision Transformers. Throughout, we reference the official peer-reviewed publications but show the initial publication date of the arXiv preprints here to provide a less distorted picture. See References section for details.

Self-attention and the Transformer architecture

In the context of Natural Language Processing (NLP), the Transformer architecture was introduced in the famous paper “Attention is all you need” back in 2017 (Vaswani et al. 2017). At its core, the Transformer is a sequence-to-sequence model: It takes as input a sequence of so-called tokens, which in NLP are mathematical representations of an input sentence. To obtain the input sequence, each (sub-)word of a sentence is mapped to some vector representation, called its embedding. This sequence of embedding tokens is then processed by the Transformer encoder in several self-attention and fully connected layers, which output a sequence of high-level token representations of the same length as the input. In some tasks, such as translation, the full output sequence is used, while others, e.g. classification, require a single representation. To this end, one typically prepends a special learnable [CLS] token to the sequence, whose output is used to obtain a single representation of a full sentence or paragraph. For example, the representation of the [CLS] token can be passed to a classifier head.

An overview of the Transformer architecture is shown in Figure 2. Here, we focus on the encoder part of the architecture, shown on the left side of Figure 2. In particular, the self-attention mechanism is crucial to the Transformer. In a self-attention layer, all intermediate token representations interact with each other by calculating their pairwise similarities, which are then used as weights for the corresponding token representations of the previous layer. To do so, one usually calculates so-called queries Q, keys K, and values V from the embeddings via simple learnable linear layers. A bidirectional softmax self-attention is then calculated between all keys and all queries to obtain the respective similarities, which are then multiplied with the values. That is, in the self-attention layer every token is compared with every other token via a simple dot product of their respective key and query vectors, followed by a (scaled) softmax operation. In order to learn different forms of self-attention at each stage, multiple self-attention heads are used, as shown on the right in Figure 2. Specifically, each head computes its own queries, keys, and values from the input representations. For example, at some fixed level of the Transformer, different attention heads could focus on short- and long-range or semantic and syntactic relationships between the tokens. The outputs of the different attention heads are concatenated and processed by a linear layer in order to regain the dimension of the input representations, which are passed on via skip connections. We refer readers unfamiliar with the basic concepts of Transformers to the original paper for more details.

Figure 2: The Transformer encoder and self-attention mechanism. (Original images from Vaswani et al. 2017).
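To make this more concrete, below is a minimal PyTorch sketch of a multi-headed softmax self-attention layer. The naming, dimensions, and simplifications (no dropout, masking, or careful initialization) are our own; this is an illustration of the mechanism, not a reference implementation.

```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Minimal multi-head softmax self-attention (sketch, not a reference implementation)."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Learnable linear maps producing queries, keys, and values.
        self.qkv = nn.Linear(dim, 3 * dim)
        # Linear layer merging the concatenated head outputs back to dim.
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence length N, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        # Pairwise similarities between all queries and keys, scaled and softmax-normalized.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Weighted sum of the values, with the head outputs concatenated back to dim.
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


tokens = torch.randn(1, 17, 768)   # e.g. a [CLS] token plus 16 patch tokens
out = SelfAttention()(tokens)      # same shape as the input: (1, 17, 768)
```

Note how the attention matrix has shape (N, N) per head, which is the source of the quadratic scaling discussed below.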

The Vision Transformer

Considering the astonishingly fast pace at which deep learning research is currently conducted, it took a comparatively long time of three years after its introduction to NLP for the first full-Transformer architecture to achieve state-of-the-art performance in Computer Vision (ImageNet classification, to be precise). From an abstract point of view, this may not seem too surprising. After all, the Transformer is a sequence model, which is not directly compatible with high-dimensional quasi-continuous input types such as images or video. That is, unlike sequences, images and videos can be thought of as being discretely sampled from an underlying continuous signal. Furthermore, the Transformer lacks inductive biases such as locality or translational equivariance, which are typically assumed to be useful when dealing with image-like data. Learning these properties using a Transformer arguably requires a substantial amount of training data. In contrast, these biases are implemented in convolutional neural networks (CNNs) by design and have largely contributed to their enormous success in vision applications in the past decade¹. On the other hand, one can argue that the lack of these inductive biases results in a more general architecture. For example, Transformers are able to attend globally already in early layers, whereas CNNs can extract global information only in layers very deep in the network due to their limited receptive fields.

From a practical standpoint, the reason is also related to the poor scaling of the Transformer with respect to the input sequence length: due to the standard bidirectional self-attention mechanism, in which all token representations are compared pairwise, a standard Transformer has quadratic complexity with respect to the input token sequence length, both in time and memory. Consequently, using each pixel of an image as an input token, even small images of, for example, 256 × 256 resolution already yield extremely long sequences of about 65,000 tokens, rendering this naive approach infeasible.
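A quick back-of-the-envelope calculation illustrates the problem; treating every pixel as a token is purely illustrative here:

```python
# Treating every pixel of a 256 x 256 image as one token (purely illustrative):
num_tokens = 256 * 256                    # 65,536 tokens
attention_entries = num_tokens ** 2       # ~4.3 billion pairwise similarities
print(attention_entries * 4 / 1e9, "GB")  # ~17.2 GB for a single float32 attention map
```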

To overcome these limitations, the Vision Transformer (ViT) (Dosovitskiy et al. 2021), shown in Figure 3, uses a simple (yet in some sense ruthless) approach:

As a first step, an input image of size X × X × C is spatially patched into M × M patches of a fixed size P × P, where M = X / P (for simplicity, we assume square inputs here). Each patch, i.e. a small crop from the image of size P × P × C, is flattened into a single vector of size P ⋅ P ⋅ C, which is then mapped to an embedding of dimension D using a trainable linear layer. The resulting vectors are the patch embeddings, which are viewed as the input tokens and assumed to represent the information contained within each small image patch. Finally, the array of M × M patch embeddings is flattened and used as the sequence of tokens of length N = M ⋅ M, which is fed into the Transformer. Equivalently, one can interpret the patching, flattening, and linear projection as a simple 2D convolution using D kernels of size P × P × C with (P, P) stride and without any padding.

Figure 3: The Vision Transformer architecture. Only the output representation of the [CLS] token is used for classification and supervised training. (Image by Dosovitskiy et al. 2021.)
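The patching, flattening, and linear projection described above can be sketched in a few lines of PyTorch as a single strided convolution; the shapes are chosen for illustration and do not correspond to the original implementation:

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768                        # patch size, input channels, embedding dimension
image = torch.randn(1, C, 64, 64)           # a single 64 x 64 RGB image

# Patching, flattening, and linear projection expressed as one 2D convolution:
# D kernels of size P x P x C, stride (P, P), no padding.
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

tokens = patch_embed(image)                 # (1, D, M, M) with M = 64 // P = 4
tokens = tokens.flatten(2).transpose(1, 2)  # (1, N, D) with N = M * M = 16
print(tokens.shape)                         # torch.Size([1, 16, 768])
```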

In full analogy to the standard Transformer in NLP, the embedding tokens, together with a prepended [CLS] token, are passed to a Transformer encoder consisting of several multi-headed self-attention and MLP layers, as shown on the right in Figure 3. In the standard ViT, the latent dimension D is fixed throughout. Since the output of the [CLS] token is used as the input to the classification head, its representation is assumed to incorporate all necessary information about the input image. That is, the output representation of the [CLS] token serves as the latent image representation, similar to, for example, the bottleneck representation of a CNN autoencoder. During supervised training on a classification task, the [CLS] token thereby learns to interact with the image-dependent token embeddings.
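As a rough sketch of how the [CLS] token and the classification head fit together, one could write the following. Note that PyTorch's generic encoder differs from ViT's block in details such as normalization placement and activation function, so this is an approximation of the idea rather than the original model:

```python
import torch
import torch.nn as nn

B, N, D, num_classes = 1, 16, 768, 1000
patch_tokens = torch.randn(B, N, D)                  # output of the patch embedding layer

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # learnable positional embeddings (see below)
encoder = nn.TransformerEncoder(                     # generic encoder, not ViT's exact block
    nn.TransformerEncoderLayer(d_model=D, nhead=12, dim_feedforward=3072, batch_first=True),
    num_layers=12,
)
head = nn.Linear(D, num_classes)

x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1) + pos_embed
x = encoder(x)                                       # (B, N + 1, D)
logits = head(x[:, 0])                               # only the [CLS] output feeds the classifier
```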

For example, the standard ViT (ViT-B) uses a patch size of P = 16, resulting in 4 ⋅ 4 = 16 patches in the case of an RGB image of size 64 × 64 × 3. Each patch is flattened to a vector of length 16 ⋅ 16 ⋅ 3 = 768, which is mapped to a vector of dimension D = 768 using a linear layer. The input token sequence of length 1 + 16 is passed through 12 layers of encoder blocks with 12 heads in each self-attention module. This results in an architecture with roughly 86M parameters, while the larger variants, ViT-L and ViT-H, have roughly 300M and 600M parameters, respectively. While the base model is comparable in size to a standard ResNet-152 architecture (60M parameters), state-of-the-art CNNs are more comparable to the large and huge variants of ViT. For example, ResNet-152x4 has over 900M parameters, whereas EfficientNet-L2 has roughly 500M parameters; these are the CNN architectures that Dosovitskiy et al. compare ViT with.
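These numbers can be roughly verified with a short back-of-the-envelope computation; biases, layer norms, positional embeddings, and the classification head are ignored here:

```python
D, mlp_dim, layers = 768, 3072, 12          # ViT-B: width, MLP hidden dim, encoder depth
attn = 4 * D * D                            # query, key, value, and output projections
mlp = 2 * D * mlp_dim                       # two linear layers per MLP block
patch_embed = (16 * 16 * 3) * D             # linear projection of the flattened patches
total = layers * (attn + mlp) + patch_embed
print(round(total / 1e6), "M")              # ~86M (biases, norms, pos. embeddings, head omitted)
```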

Overall, ViT, in particular its large and huge variants, achieves state-of-the-art results when pretrained on the massive JFT-300M dataset and transferred to small- or medium-sized datasets such as ImageNet or CIFAR. Compared to large CNN-based vision backbones, ViT requires fewer resources for training. Nevertheless, these requirements are still substantial and beyond the means of most small- to mid-scale research facilities. Furthermore, while this approach makes it feasible to transfer the Transformer to vision tasks in principle, one still faces several challenges, which have been tackled by the original ViT paper as well as subsequent publications.

Challenges of Transformers in the vision context

First, the Transformer is invariant with respect to permutations of the input tokens, i.e. it has no notion of any positional relationship between them. For Transformers in NLP, where the 1D positional relationships of the tokens in the sequence, i.e. the order in which words appear in a sentence, are important, this is resolved by using a positional encoding of the token embeddings. To encode the position into a token embedding, a positional embedding (simply a vector of dimension D generated for each token position individually) is added to, or in some architectures concatenated with, the token embedding before passing it to the Transformer. These positional embeddings can be learnable or fixed, for example, generated as sinusoids of different frequencies sampled at different locations depending on the position of the token under consideration (Vaswani et al. 2017). Such fixed positional embeddings are also referred to as Fourier features. In this case, the positional embeddings can be generated for arbitrarily long sequences on the fly, ensuring that the architecture can deal with sentences of different lengths.
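A minimal sketch of such fixed sinusoidal embeddings, loosely following the construction of Vaswani et al.; the exact scaling and implementation details here are our own simplification:

```python
import math
import torch


def sinusoidal_positional_embedding(num_positions: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional embeddings ("Fourier features"), sketched after Vaswani et al."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (N, 1)
    # Geometrically spaced frequencies, one per pair of embedding dimensions.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    pos_embed = torch.zeros(num_positions, dim)
    pos_embed[:, 0::2] = torch.sin(positions * freqs)  # even dimensions: sine
    pos_embed[:, 1::2] = torch.cos(positions * freqs)  # odd dimensions: cosine
    return pos_embed                                   # (N, dim), computable for any N on the fly


# One positional embedding per token position, e.g. [CLS] + 16 patch tokens:
pos = sinusoidal_positional_embedding(num_positions=17, dim=768)
```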

In the case of ViT, however, this is a little trickier, as the Transformer has to regain the 2D positional information of the image patches, i.e. where in the image each patch was extracted from. To do so, ViT uses learnable 1D positional embeddings for each of the N input tokens, which are simply a collection of N learnable vectors of dimension D. During training, there is no explicit supervision of these embeddings, i.e. ViT is never given a notion of the true 2D relationship of its inputs. Yet, Dosovitskiy et al. show, by calculating their pairwise similarity, that the positional embeddings learned by ViT indeed contain the 2D positional relationship of the corresponding patches, an important and remarkable result underlining the capacity and generality of the Transformer architecture.

However, in this approach, the length N of the input sequence needs to be fixed during training. Since the patch size needs to be fixed as well (as the linear embedding layer would change otherwise), this implies that training has to be done using input images of a fixed resolution. While this is, in theory, not the case for CNNs, it is in practice, as mini-batch training requires all input examples within a mini-batch to have the same resolution. Unlike with CNNs, however, care has to be taken for ViT to generalize to larger (or smaller) input resolutions at inference, again assuming a fixed patch size, because larger input resolutions lead to longer input token sequences.

Consider the example in Figure 4. Using the patch size of 16 × 16 from above, a 128 × 128 image is patched into 8 × 8 = 64 patches, as opposed to the 16 patches of a 64 × 64 image seen during training. Since only 16 positional embeddings were learned during training in this example, they have to be generalized to the new resolution while keeping the 2D relationship they represent intact. Luckily, since positional embeddings that share a 2D positional relationship are indeed found to be similar, we can use a 2D interpolation of the embeddings to generalize to arbitrary input resolutions while retaining the learned positional information, as depicted in Figure 4. While this approach, proposed by Dosovitskiy et al., seems reasonable, it is not quantitatively evaluated in the original publication, in which all images are downscaled to a common resolution of 224 × 224 for training and a higher resolution for fine-tuning, which is common practice.

Figure 4: Resampling of the positional embeddings from 4 × 4 to 8 × 8 patches.
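This resampling can be sketched as a bicubic interpolation of the learned embedding grid; the snippet below illustrates the idea and is not the exact fine-tuning code of Dosovitskiy et al.:

```python
import torch
import torch.nn.functional as F

D = 768
old_grid, new_grid = 4, 8                           # 4 x 4 patches at training, 8 x 8 at inference
pos_embed = torch.randn(1, old_grid * old_grid, D)  # learned 1D positional embeddings (without [CLS])

# Reshape the embeddings into their implicit 2D grid, interpolate, and flatten again.
grid = pos_embed.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)       # (1, D, 4, 4)
grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
new_pos_embed = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)  # (1, 64, D)
```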

Second, the positional relationship between pixels within each patch is lost, and dense predictions, such as the pixel-wise classification in a semantic segmentation task, can only be performed natively at patch level. In that sense, there is a tradeoff between expressiveness and throughput: while small patch sizes, down to 1 × 1 px in the limit, retain more of the underlying spatial relationship of the original image and yield dense latent representations, the throughput suffers and the memory demand of the Transformer becomes infeasible. Earlier works have therefore, for example, limited the attention mechanism to local pixel neighborhoods (Parmar et al. 2018) or operated on very small input resolutions obtained via downsampling (Chen et al. 2020, Cordonnier et al. 2020). In the large-patch-size limit, i.e. when a single patch spans the full image, no positional information can be utilized and the token embedding layer becomes the bottleneck of the architecture, which greatly limits its capacity.

Similarly, the relationship of neighboring pixels that are split by patch borders is lost. Since the patching is performed in an image-agnostic way and is therefore arbitrary to some degree, patches group together pixels that may not share contextual similarities. Therefore, important features in the input may not be directly usable when separated by patches or have to be recombined from the corresponding token embeddings. This increases the difficulty of the downstream task and the demands on the patch embedding layer. This is also related to the lack of translational equivariance.

Finally, as previously mentioned, the standard Transformer architecture scales quadratically with the input sequence length, which prohibits the application of ViT to high-resolution images and other higher-dimensional image-like data, as well as the usage of small patch sizes. To address these challenges, many variants have been proposed swiftly after the initial publication of ViT, some of which we discuss in more detail in the second part.

Conclusion

In this first part of our two-part series revisiting Vision Transformers, we have briefly discussed the basics of the Transformer architecture and the challenges arising when applying it to Computer Vision. The standard Vision Transformer, as proposed by Dosovitskiy et al. two years ago, tackled some of these challenges, achieved impressive results on ImageNet classification, and marked the beginning of intense research. As a general-purpose architecture, Transformers have since been intensively studied and further developed. We will discuss some of the recent developments, such as DeiT and Swin Transformers, in the second part of this series.

Footnotes

¹Coincidentally, this month also marks the 10-year anniversary of AlexNet, the first CNN to win the ImageNet challenge, effectively sparking a decade of intense CNN research.

References

(DeiT) Hugo Touvron et al.: “Training data-efficient image transformers & distillation through attention.” In: International Conference on Machine Learning (ICML), 2021.

(DINO) Mathilde Caron et al.: “Emerging Properties in Self-Supervised Vision Transformers.” In: International Conference on Computer Vision (ICCV), 2021.

(FlashAttention) Tri Dao et al.: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” In: arXiv:2205.14135, 2022.

(Perceiver) Andrew Jaegle et al.: “Perceiver: General Perception with Iterative Attention.” In: International Conference on Machine Learning (ICML), 2021.

(Perceiver IO) Andrew Jaegle et al.: “Perceiver IO: A General Architecture for Structured Inputs & Outputs.” In: International Conference on Learning Representations (ICLR), 2022.

(SOFT) Jiachen Lu et al.: “SOFT: Softmax-free Transformer with Linear Complexity.” In: Advances in Neural Information Processing Systems (NeurIPS), 2021.

(STEGO) Mark Hamilton et al.: “Unsupervised Semantic Segmentation by Distilling Feature Correspondences.” In: International Conference on Learning Representations (ICLR), 2022.

(SWIN) Ze Liu et al.: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” In: International Conference on Computer Vision (ICCV), 2021.

(TNT) Kai Han et al.: “Transformer in Transformer.” In: Advances in Neural Information Processing Systems (NeurIPS), 2021.

(XCiT) Alaaeldin Ali et al.: “XCiT: Cross-Covariance Image Transformers.” In: Advances in Neural Information Processing Systems (NeurIPS), 2021.

(Brown et al. 2020) Tom Brown et al.: “Language Models are Few-Shot Learners.” In: Advances in Neural Information Processing Systems (NeurIPS), 2020.

(Chen et al. 2020) Mark Chen et al.: “Generative Pretraining From Pixels.” In: International Conference on Machine Learning (ICML), 2020.

(Cordonnier et al. 2020) Jean-Baptiste Cordonnier et al.: “On the Relationship between Self-Attention and Convolutional Layers.” In: International Conference on Learning Representations (ICLR), 2020.

(Devlin et al. 2019) Jacob Devlin et al.: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

(Dosovitskiy et al. 2021) Alexey Dosovitskiy et al.: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In: International Conference on Learning Representations (ICLR), 2021.

(Parmar et al. 2018) Niki Parmar et al.: “Image Transformer.” In: International Conference on Machine Learning (ICML), 2018.

(Ronneberger et al. 2015) Olaf Ronneberger et al.: “U-Net: Convolutional Networks for Biomedical Image Segmentation.” In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.

(Vaswani et al. 2017) Ashish Vaswani et al.: “Attention is All you Need.” In: Advances in Neural Information Processing Systems (NeurIPS), 2017.
