A Brief History of Vision Transformers
Revisiting Two Years of Vision Research

Merantix Momentum
Merantix Momentum Insights
11 min read · Nov 17, 2022

Part 2: Current Developments

Author: Maximilian Schambach

Introduction

In part one of our two-part series on Vision Transformers, we covered the basics of self-attention and the Vision Transformer (ViT) as introduced by Dosovitskiy et al. (Dosovitskiy et al. 2021). Furthermore, we discussed some of the challenges of applying the Transformer architecture in the vision context. Now, we present some of the ViT variants that have been proposed since its introduction two years ago, further pushing the state of the art in computer vision and tackling some of the aforementioned challenges.

Data-Efficient Image Transformer (DeiT)

Using a more elaborate training strategy, including strong data augmentation, Touvron et al. achieve competitive performance (again, with respect to ImageNet classification) without relying on a huge proprietary dataset, in contrast to ViT, which was trained on Google’s closed-source JFT-300M dataset. They further improve their results using a novel Transformer-specific distillation technique (Touvron et al. 2021). This way, the knowledge of a strong, possibly large or difficult-to-train teacher model is distilled into a Transformer-based student, in the spirit of early work by Hinton et al. (Hinton et al. 2014). To this end, their approach, dubbed Data-Efficient Image Transformer (DeiT), introduces an additional distillation token that plays a role analogous to the [CLS] token, whose representation is used for classification. Similarly, as shown in Figure 1, the output representation of the distillation token is used as input to an additional classification head that is trained to predict the output label of the teacher. As usual, the [CLS] token is passed to (another) classifier head that is trained to predict the ground-truth class labels. The total loss function is then a weighted mean of the conventional cross-entropy loss using the ground-truth labels and a loss term based on the Kullback–Leibler divergence of the distillation head’s output logits with respect to those of the teacher model¹. This way, the Transformer student can leverage the teacher to speed up and enhance its training.

Figure 1: Feature distillation in DeiT. (Image by Touvron et al. 2021)
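To make this two-headed objective concrete, here is a minimal PyTorch sketch of a DeiT-style soft-distillation loss. The function name and tensor arguments are ours for illustration, and we omit details such as the hard-label variant discussed in footnote 1.

```python
import torch
import torch.nn.functional as F

def deit_soft_distillation_loss(
    cls_logits: torch.Tensor,      # output of the [CLS] classification head
    dist_logits: torch.Tensor,     # output of the distillation-token head
    teacher_logits: torch.Tensor,  # logits of the (frozen) teacher model
    labels: torch.Tensor,          # ground-truth class labels
    alpha: float = 0.5,            # weight between the two loss terms
    tau: float = 3.0,              # distillation temperature
) -> torch.Tensor:
    """Weighted mean of cross-entropy (ground truth) and KL divergence (teacher)."""
    # Supervised loss on the [CLS] head using the (possibly noisy) ground truth.
    ce = F.cross_entropy(cls_logits, labels)
    # KL divergence between the tempered student and teacher distributions.
    kl = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * tau**2
    return (1.0 - alpha) * ce + alpha * kl
```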

Furthermore, strong augmentations may lead to invalid ground-truth labels, for example, when the object corresponding to a label is not present in an augmented crop of an image, as shown in Figure 2. Using a loss based not solely on the ground truth but also on the labels predicted by the teacher can mitigate this problem. Also, some images may have ambiguous class labels, since each image is associated with exactly one label despite possibly containing multiple objects. In practice, Touvron et al. use a strong CNN classifier as the teacher, which boosts the student’s baseline performance significantly, likely because the student can utilize the inductive biases incorporated in the teacher model without needing an excessive amount of training data to learn them from scratch. In fact, the distilled model outperforms the teacher and achieves state-of-the-art results on ImageNet classification, closing the gap between Transformer-based models and CNNs. By speeding up the training and distilling the knowledge of a strong teacher into a moderately sized Vision Transformer (the ViT base model has roughly 86M parameters, on the order of a standard ResNet-152), DeiT can be trained on ImageNet using a single 4-GPU node in three days. While its performance is inferior to that of ViT models pre-trained on much larger datasets such as JFT-300M, the computational cost is much smaller.

Figure 2: Example image from the ImageNet dataset (left) for which cropping can result in a different label (right).

Shifted Windows (Swin) Transformer

To overcome the partial loss of spatial information at patch borders and to tackle the quadratic computational complexity of ViT, Liu et al. introduce a Vision Transformer based on hierarchical feature maps and shifted windows (Swin) (Liu et al. 2021). Addressing these two distinct problems, the Swin Transformer makes two core contributions, which are schematically depicted in Figure 3.

Figure 3: Hierarchical feature maps (top) and window shifting of subsequent layers (bottom) as used by Swin. (Image by Liu et al. 2021)

First, to reduce complexity, the calculation of self-attention is limited to non-overlapping local windows, each containing a small number of patches (7 × 7 by default). Note that, in Figure 3, the windows are depicted with 4 × 4 patches. To still capture non-local, long-range spatial relationships in the input image, the patch sequence is successively downsampled across the four stages of the Swin Transformer. To do so, each group of 2 × 2 neighboring patches is concatenated and projected by a trainable linear layer that doubles the feature dimension, resulting in an effective spatial downsampling by a factor of two. This way, a feature pyramid is built which can be used in tasks that require both dense local and global features, similar to the pyramids produced by a conventional U-Net architecture (Ronneberger et al. 2015). Similarly, image pyramids have been used in classical image processing techniques such as SIFT (Lowe 1999) and SURF (Bay et al. 2008) to obtain scale-invariant features. By default, the Swin Transformer does not process a [CLS] token but instead feeds the averaged features of the last stage into a classification head, which is used to train the architecture in a supervised manner on ImageNet.

Second, within each stage, consecutive layers alternate between two different window layouts, a regular and a shifted partition, as shown in Figure 3. The shifted windows contain patches that were previously separated into different windows and thus excluded from each other’s attention calculation. The authors show that using shifted windows gives a significant performance boost over static windows on ImageNet classification, object detection on COCO, and semantic segmentation on ADE20k. Overall, memory and computational complexity are reduced from quadratic to linear in the number of patches, since the number of patches within each window is kept fixed, while the architecture outperforms ViT and DeiT.
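The two building blocks can be sketched in a few lines of PyTorch. The snippet below is a simplified illustration, loosely following the structure of the official implementation: window partitioning splits a feature map into fixed-size attention windows, and patch merging fuses 2 × 2 neighboring patches while doubling the channel dimension to build the hierarchy. Tensor layout and names are our own simplification.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a feature map (B, H, W, C) into non-overlapping windows of
    shape (num_windows * B, window_size * window_size, C) for local attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches (4C features)
    and project them to 2C, halving the spatial resolution."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape
        x0 = x[:, 0::2, 0::2, :]  # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```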

Besides the Swin Transformer, there are numerous approaches to reduce the complexity of Transformers based on the standard bidirectional softmax self-attention, both in NLP and in computer vision research. They range from sparse or low-rank approximations of the softmax self-attention (Wang et al. 2020, Zaheer et al. 2020, Kitaev et al. 2020, Choromanski et al. 2021) and altered attention mechanisms (Ali et al. 2021, Jaegle et al. 2021, Lu et al. 2021, Jaegle et al. 2022), both of which reduce the complexity from quadratic to linear, to optimizations of the standard attention computation, for example with respect to IO operations (Dao et al. 2022). For an in-depth comparison of efficient Transformer architectures, we refer to the fairly recent Long Range Arena benchmark (Tay et al. 2021).
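To get an intuition for why restricting attention to fixed-size windows scales linearly rather than quadratically in the number of patches, here is a back-of-the-envelope count of query–key pairs. The numbers are purely illustrative, assuming 4 × 4 patches and 7 × 7 windows on a 224 × 224 input.

```python
# Rough count of query-key dot products for a 224x224 image with 4x4 patches.
num_patches = (224 // 4) ** 2            # 56 * 56 = 3136 patches
window = 7 * 7                           # 49 patches per local window

global_pairs = num_patches**2                          # ~9.8 million pairs
windowed_pairs = (num_patches // window) * window**2   # 64 windows * 49^2 ~ 154 thousand

print(f"global attention:   {global_pairs:,} query-key pairs")
print(f"windowed attention: {windowed_pairs:,} query-key pairs")
```

Doubling the number of patches doubles the windowed count but quadruples the global one, which is exactly the quadratic-versus-linear scaling discussed above.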

Self-distillation with no labels (DINO)

The Vision Transformers discussed so far were all (pre-)trained on classification tasks in a supervised fashion. In a different line of research, building upon DeiT, Vision Transformers have also shown promising results when combined with self-supervised training techniques. Eliminating the need for the explicit teacher model that DeiT relies on, Caron et al. introduce a method for self-distillation with no labels (DINO) (Caron et al. 2021).

Figure 4: The DINO architecture. (Animation by Meta AI, from the corresponding blog post: https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training/)

As shown in Figure 4, the teacher is defined as the exponential moving average of the student, which is a standard Vision Transformer. Unlike common self-supervised training strategies in NLP, which rely on masking or completing the input token sequence (Devlin et al. 2019, Brown et al. 2020), self-supervised learning for vision typically involves contrastive losses in order to avoid collapse². In the contrastive learning framework, two versions of the network are presented with augmentations of the same or different images and are trained to output similar or dissimilar representations, respectively. However, similar to other recent approaches in self-supervised learning such as BYOL (Grill et al. 2020), Barlow Twins (Zbontar et al. 2021), or VICReg (Bardes et al. 2022), DINO does not make explicit use of negative samples via a contrastive loss. Instead, two augmented views of the same input image are passed to the student and teacher. While the teacher is presented with a global crop of the image spanning a larger region of the input, the student is given a local crop corresponding to a smaller region. The predicted softmax logits, obtained from the output representation of the [CLS] token using a simple MLP projection head, are then compared with the teacher’s logits using a cross-entropy loss. The gradients are propagated solely through the student network. While some additional engineering tricks are involved to avoid collapse, the core idea is that the student predicts a representation that matches the prediction of its more stable variant, the teacher, for a slightly different view of the same image, thereby promoting local-to-global correspondences.

Using this self-distillation approach and training on ImageNet images (without using the labels), Caron et al. show that the representations of images with the same label naturally cluster without any supervision. They achieve impressive results on unsupervised ImageNet classification via a simple k-NN classifier on the latent image representations. The authors further show that DINO effectively learns representations that contain information about the semantic segmentation of the input, as shown in Figure 5. To provide segmentation maps of reasonable resolution, DINO uses a much smaller patch size, as small as 4 × 4, compared to ViT and DeiT. Based on this naturally semantic clustering of DINO representations, recent works have explored self-supervised semantic segmentation (Hamilton et al. 2022), which we will discuss in a future blog post.

Figure 5: Segmentation maps obtained from supervised training of a standard vision backbone (top) and using DINO’s unsupervised approach (bottom). (Image by Caron et al. 2021)
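As a rough illustration of the training loop, here is a highly simplified PyTorch sketch in the spirit of the pseudocode given by Caron et al. The temperatures, momentum, and centering update below are illustrative values, and the full method additionally uses multi-crop augmentation and schedules for these hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, momentum: float = 0.996):
    """The teacher's weights are an exponential moving average of the student's."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's centered, sharpened softmax output
    and the student's softmax output; gradients flow only through the student."""
    teacher_probs = F.softmax((teacher_out.detach() - center) / t_teacher, dim=-1)
    student_logp = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# One (heavily simplified) training step:
#   local_view, global_view = augment(image)                  # two crops of the same image
#   loss = dino_loss(student(local_view), teacher(global_view), center)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()  # update only the student
#   update_teacher(student, teacher)                          # EMA update of the teacher
#   center = 0.9 * center + 0.1 * teacher(global_view).mean(dim=0)  # centering against collapse
```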

Conclusion

Having been the go-to choice in Natural Language Processing for some years, Transformers are slowly superseding Convolutional Neural Networks as the state of the art in vision as well. In particular, large to huge Transformer models such as ViT-L and ViT-H, also dubbed foundation models, have shown great results when used as general-purpose vision backbones, especially when trained on huge datasets. Despite their recent success, Transformers come with distinct challenges when applied to images or videos, as we have outlined, and remain an active area of research. In particular, strategies for sample-efficient self-supervised learning as well as methods to adapt Transformers to more specific domains or datasets, for example, those encountered in industrial applications, remain open challenges and will likely spur more research in applied vision as well.

While many influential Vision Transformer architectures were originally developed and published by tech giants such as Google (ViT), Microsoft (Swin), Huawei (TNT), OpenAI (iGPT), or Meta (DINO) using immense computational resources, there has been a lot of research effort to reduce the complexity as well as the hardware and data requirements for training Vision Transformers from scratch. While we discussed some of these efforts, we cannot cover the vast amount of research published in this area. Still, it is likely that Vision Transformers will become feasible to train from scratch for more and more vision tasks, making them attractive for a variety of more specialized applications and for vision practitioners.

Footnotes:

¹We are simplifying the procedure in our presentation here. In fact, Touvron et al. investigate several distillation techniques, based either on the hard teacher labels, i.e., the argmax of the teacher’s softmax output, or on soft target labels and label smoothing.

²However, masked Vision Transformers have also been investigated recently (He et al. 2022, Bachmann et al. 2022).

References

(Ali et al. 2021) Alaaeldin Ali et al.: “XCiT: Cross-Covariance Image Transformers.” In: Advances in Neural Information Processing Systems (NeurIPS), 2021.

(Bachmann et al. 2022) Roman Bachmann et al.: “MultiMAE: Multi-modal Multi-task Masked Autoencoders.” In: arXiv:2204.01678, 2022.

(Bardes et al. 2022) Adrien Bardes et al.: “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.” In: International Conference on Learning Representations (ICLR), 2022.

(Bay et al. 2008) Herbert Bay et al.: “Speeded-up robust features (SURF).” In: Computer Vision and Image Understanding 110.3, 346–359, 2008.

(Brown et al. 2020) Tom Brown et al.: “Language Models are Few-Shot Learners.” In: Advances in Neural Information Processing Systems (NeurIPS), 2020.

(Caron et al. 2021) Mathilde Caron et al.: “Emerging Properties in Self-Supervised Vision Transformers.” In: International Conference on Computer Vision (ICCV), 2021.

(Chen et al. 2020) Mark Chen et al.: “Generative Pretraining From Pixels.” In: International Conference on Machine Learning (ICML), 2020.

(Choromanski et al. 2021) Krzysztof Marcin Choromanski et al.: “Rethinking Attention with Performers.” In: International Conference on Learning Representations (ICLR), 2021.

(Cordonnier et al. 2020) Jean-Baptiste Cordonnier et al.: “On the Relationship between Self-Attention and Convolutional Layers.” In: International Conference on Learning Representations (ICLR), 2020.

(Dao et al. 2022) Tri Dao et al.: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” In: arXiv:2205.14135, 2022.

(Devlin et al. 2019) Jacob Devlin et al.: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

(Dosovitskiy et al. 2021) Alexey Dosovitskiy et al.: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In: International Conference on Learning Representations (ICLR), 2021.

(Grill et al. 2020) Jean-Bastien Grill et al.: “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” In: Advances in Neural Information Processing Systems (NeurIPS), 2020.

(Hamilton et al. 2022) Mark Hamilton et al.: “Unsupervised Semantic Segmentation by Distilling Feature Correspondences.” In: International Conference on Learning Representations (ICLR), 2022.

(Han et al. 2021) Kai Han et al.: “Transformer in Transformer.” In: Advances in Neural Information Processing Systems (NeurIPS), 2021.

(He et al. 2022) Kaiming He et al.: “Masked Autoencoders Are Scalable Vision Learners.” In: Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

(Hinton et al. 2014) Geoffrey Hinton et al.: “Distilling the Knowledge in a Neural Network.” In: Advances in Neural Information Processing Systems (NeurIPS) Workshop, 2014.

(Jaegle et al. 2021) Andrew Jaegle et al.: “Perceiver: General Perception with Iterative Attention.” In: International Conference on Machine Learning (ICML), 2021.

(Jaegle et al. 2022) Andrew Jaegle et al.: “Perceiver IO: A General Architecture for Structured Inputs & Outputs.” In: International Conference on Learning Representations (ICLR), 2022.

(Kitaev et al. 2020) Nikita Kitaev et al.: “Reformer: The Efficient Transformer.” In: International Conference on Learning Representations (ICLR), 2020.

(Liu et al. 2021) Ze Liu et al.: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” In: International Conference on Computer Vision (ICCV), 2021.

(Lowe 1999) David G. Lowe: “Object Recognition from Local Scale-Invariant Features.” In: International Conference on Computer Vision (ICCV), 1999.

(Lu et al. 2021) Jiachen Lu et al.: “SOFT: Softmax-free Transformer with Linear Complexity.” In: Advances in Neural Information Processing Systems (NeurIPS), 2021.

(Parmar et al. 2018) Niki Parmar et al.: “Image Transformer.” In: International Conference on Machine Learning (ICML), 2018.

(Ronneberger et al. 2015) Olaf Ronneberger et al.: “U-Net: Convolutional Networks for Biomedical Image Segmentation.” In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.

(Tay et al. 2021) Yi Tay et al.: “Long Range Arena: A Benchmark for Efficient Transformers.” In: International Conference on Learning Representations (ICLR), 2021.

(Touvron et al. 2021) Hugo Touvron et al.: “Training Data-Efficient Image Transformers & Distillation Through Attention.” In: International Conference on Machine Learning (ICML), 2021.

(Vaswani et al. 2017) Ashish Vaswani et al.: “Attention Is All You Need.” In: Advances in Neural Information Processing Systems (NeurIPS), 2017.

(Wang et al. 2020) Sinong Wang et al.: “Linformer: Self-Attention with Linear Complexity.” In: arXiv:2006.04768, 2020.

(Zaheer et al. 2020) Manzil Zaheer et al.: “Big Bird: Transformers for Longer Sequences.” In: Advances in Neural Information Processing Systems (NeurIPS), 2020.

(Zbontar et al. 2021) Jure Zbontar et al.: “Barlow Twins: Self-Supervised Learning via Redundancy Reduction.” In: International Conference on Machine Learning (ICML), 2021.
