An Image is Worth 256 Words

Vision Transformer: The Transition of the Transformer Model from Language to Image

Kourosh Sharifi
7 min read · Sep 15, 2023

Convolutional Neural Networks (CNN) have long been the dominant approach to image processing. However, a new contender has emerged: Vision Transformer (ViT). ViTs are based on the transformer architecture, which has been used for Natural Language Processing (NLP) for many years.

by Jeffery Ho on Unsplash — source

In this article, we will go over the following topics to get a better understanding of ViT, and why an image is worth 2⁸ words, instead of 2¹⁰ (see what I did there?):

  • What is the Transformer model?
  • The inception of ViT
  • Architecture and results
  • ViT versus CNN

🤖 What is a Transformer Model?

A transformer model is a type of neural network architecture that was introduced in 2017 by Vaswani et al. in the paper Attention is All You Need. Transformer models were originally developed for NLP tasks, such as machine translation and text summarization.

The Transformer Architecture — from Lil’Log

Furthermore, Transformer models are particularly well-suited for Large Language Models (LLM) because they allow the models to learn long-range dependencies in text. This is important because they need to be able to understand the context of a sentence in order to generate text that is coherent and meaningful.

For more information on how this process works, you can check out this article, which dives into expanding the size of these LLMs via a new Transformer variant.

👀 The Inception of Vision Transformer (ViT)

In the 2021 paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, Dosovitskiy et al. of the Google Research Brain Team introduced the Vision Transformer (ViT) model architecture.

The original paper — source

ViT is a pure transformer applied directly to sequences of image patches and achieves highly competitive performance in benchmarks for several computer vision applications, such as:

  • Image Classification: Image classification involves assigning a label or tag to an entire image based on preexisting training data of already labeled images. (by SuperAnnotate)
  • Object Detection: Object detection is a computer vision technique for locating instances of objects in images or videos. (by MathWorks)
  • Semantic Image Segmentation: The goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented. (by Jeremy Jordan)
Computer vision tasks — source

The code and GitHub repository for this project are available at this link.

🏗️ Architecture of ViT

ViT follows the original Transformer architecture; however, a few changes are needed to adapt the model to image processing. Here is the general flow:

  1. Split the input image into fixed-size patches of P × P pixels (the paper uses P = 14, 16, or 32)
  2. Flatten each patch into a vector, producing a sequence of flattened 2D patches
  3. Map the flattened patches to a fixed latent dimension D with a trainable linear projection (the patch embeddings), prepend a learnable [class] token, and add position embeddings
  4. Feed the resulting sequence into the Transformer encoder, which applies alternating layers of multi-head self-attention and Feedforward Neural Networks (FNN)
  5. Pass the encoder output at the [class] token through a classification head to produce the final prediction
  6. Pre-train the ViT model on large datasets (e.g., ILSVRC-2012 ImageNet, ImageNet-21k, and JFT-300M)
  7. Fine-tune the pre-trained ViT model on smaller downstream tasks by removing the pre-trained prediction head and attaching a zero-initialized D × K feedforward layer (K is the number of downstream classes)

The guiding principle is to apply a standard Transformer directly to images with the fewest possible modifications and to train it on image classification in a plain supervised fashion; a minimal code sketch of this flow follows the figure below.
The architecture of Vision Transformer — from the paper
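To make the flow above concrete, here is a minimal, illustrative PyTorch sketch of steps 1–5. The class name SimpleViT and the hyperparameter defaults are my own assumptions chosen for readability (roughly ViT-Base-like); the official implementation linked above differs in many details such as initialization, dropout, and training tricks.

```python
# A minimal, illustrative ViT sketch (not the official implementation).
# Hyperparameter defaults below are assumptions matching a ViT-Base-like setup.
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # 14 x 14 = 196
        # Steps 1-3: split into patches and project them to dimension D.
        # A strided convolution performs the split and the linear projection at once.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))    # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step 4: a stack of standard Transformer encoder layers.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim, activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 5: classification head applied to the [class] token.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)                         # (B, num_patches, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed          # prepend [class], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                                # logits from the [class] token

logits = SimpleViT()(torch.randn(1, 3, 224, 224))                # shape: (1, 1000)
```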

Also, ViT has several variants, which differ in the number of layers, attention heads, and feedforward dimensions, as well as the input patch size. These differences allow for a trade-off between model size and computational resources and can be used to optimize performance on different image classification tasks.

Table of variants of ViT — from the paper
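In code form, the three model sizes from the paper’s Table 1 look roughly like the dictionary below; the parameter counts in the comments are approximate.

```python
# ViT variants as defined in the paper (Table 1); parameter counts are approximate.
VIT_VARIANTS = {
    "ViT-Base":  dict(depth=12, dim=768,  mlp_dim=3072, heads=12),  # ~86M parameters
    "ViT-Large": dict(depth=24, dim=1024, mlp_dim=4096, heads=16),  # ~307M parameters
    "ViT-Huge":  dict(depth=32, dim=1280, mlp_dim=5120, heads=16),  # ~632M parameters
}
# The patch size is appended to the model name, e.g. ViT-L/16 uses 16 x 16-pixel patches,
# so a smaller patch size means a longer input sequence and more computation.
```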

For more information, you can check out this video by Aleksa Gordić (The AI Epiphany on YouTube) to gain a better understanding of ViT.

📊 Results

By comparing the largest ViT models to state-of-the-art CNNs on several datasets, we can see why ViT might be a better alternative to the previous models.

  • ViT-L/16 pre-trained on JFT-300M outperforms BiT-L on all tasks while requiring less computation.
  • ViT-H/14 further improves performance, especially on challenging datasets.
  • ViT-L/16 pre-trained on ImageNet-21k also performs well, while requiring even less computing power.
  • ViT-H/14 outperforms BiT-R152x4 on Natural and Structured VTAB tasks.
Comparison table — from the paper
Comparison graphs — from the paper

⚔️ ViT vs. CNN

As seen earlier, the results show that ViTs perform better in various image recognition benchmarks. For example, ViT-H/14 pre-trained on JFT-300M and fine-tuned on ImageNet reaches a top-1 accuracy of roughly 88.5%, matching or exceeding state-of-the-art CNN models.

Comparison of accuracy between ViT and other CNNs — from the paper

🔺 Advantages of ViT

The main advantages of ViTs over CNNs are:

  1. Better scalability: ViT performance keeps improving as the model and the pre-training dataset grow, and at large scale ViTs reach state-of-the-art accuracy with less pre-training compute than comparably strong CNNs. Because images are processed as sequences of patches, the computation also parallelizes well on modern accelerators.
  2. Better generalization: ViTs can generalize better to new tasks and datasets than CNNs, due to their ability to learn more abstract and context-aware representations of images. This is because ViTs use self-attention mechanisms to capture long-range dependencies between image patches, which can help them learn more meaningful features.
  3. Fewer architectural constraints: ViTs are less constrained by image-specific architectural choices than CNNs, which can make them easier to design and optimize for specific tasks. For example, ViTs can use different patch sizes and input resolutions, and they do not hard-code convolutional inductive biases such as locality and translation equivariance.
  4. Better transfer learning: ViTs can be pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks, achieving excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
  5. Better interpretability: ViTs are more interpretable than CNNs, as the self-attention mechanism allows for visualization of the importance of different patches in the input image. This can help with understanding how the model makes predictions and identifying potential biases or errors.
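As a rough sketch of the interpretability point, the snippet below converts the attention weights flowing out of the [class] token into a patch-level heatmap. The tensor shapes and the simple head-averaging are assumptions for illustration; published visualizations typically use more careful techniques such as attention rollout across layers.

```python
# Illustrative only: turn [class]-token attention weights into a patch heatmap.
# `attn` is assumed to have shape (heads, tokens, tokens) for a single image,
# where token 0 is the [class] token and the remaining 196 tokens are the 14 x 14 patches.
import torch

def cls_attention_heatmap(attn: torch.Tensor, grid: int = 14) -> torch.Tensor:
    attn = attn.mean(dim=0)                 # average over heads -> (tokens, tokens)
    cls_to_patches = attn[0, 1:]            # attention from [class] to every image patch
    heatmap = cls_to_patches.reshape(grid, grid)
    return heatmap / heatmap.max()          # normalize to [0, 1] for display

# Dummy attention weights standing in for a real model's last attention layer.
dummy_attn = torch.softmax(torch.randn(12, 197, 197), dim=-1)
heatmap = cls_attention_heatmap(dummy_attn)  # (14, 14), overlayable on the input image
```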

🔻 Disadvantages of ViT

On the other hand, there are some drawbacks:

  1. Limited spatial information: ViTs process images as sequences of patches, which can result in loss of spatial information compared to CNNs. This can make it harder for ViTs to capture fine-grained details in images, such as object boundaries or textures.
  2. Higher memory requirements: ViTs must store the attention matrices used in the self-attention mechanism, which grow quadratically with the number of patches, whereas CNN feature maps grow only linearly with image area. This can make it harder to scale ViTs to larger image sizes or datasets (see the back-of-the-envelope calculation after this list).
  3. Longer training times: ViTs can take longer to train than CNNs, due to the larger number of parameters and the need to compute attention matrices for each patch. This can make it harder to optimize ViTs for specific tasks or datasets.
  4. Limited interpretability: While attention maps offer some insight, the self-attention mechanism can also make it harder to interpret how the model arrives at a prediction, because the attention weights are influenced by many factors, including the input image, the task, and the pre-training data.
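To put drawback 2 in perspective, here is the back-of-the-envelope calculation referenced above; the numbers assume a ViT-Base-style model (12 heads, 16 x 16 patches) and float32 storage.

```python
# Rough size of the self-attention matrices for one 224 x 224 image.
image_size, patch_size, heads = 224, 16, 12
tokens = (image_size // patch_size) ** 2 + 1        # 196 patches + [class] token = 197
weights_per_layer = heads * tokens * tokens         # one attention map per head
print(f"{weights_per_layer:,} attention weights per layer "
      f"(~{weights_per_layer * 4 / 1e6:.1f} MB in float32)")
# Doubling the image side to 448 pixels gives 785 tokens, so the attention memory
# grows by roughly 16x, since it scales quadratically with the number of tokens.
```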

There are a few things you can do to make ViT models work better:

  • Use a well-tuned optimizer and learning-rate schedule (the paper pre-trains with Adam and fine-tunes with SGD with momentum)
  • Choose a network depth and patch size that fit your compute budget
  • Tune the hyperparameters (learning rate, weight decay, regularization) for your specific dataset
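As a hedged illustration of these tips, the snippet below sets up a single fine-tuning step, reusing the SimpleViT sketch from the architecture section. The optimizer family mirrors the paper (SGD with momentum for fine-tuning), but the learning rate, schedule length, and dummy data are arbitrary assumptions.

```python
# Illustrative fine-tuning step; hyperparameter values are assumptions, not the paper's recipe.
import torch
import torch.nn as nn

vit = SimpleViT()                       # sketch class defined earlier in this article
vit.head = nn.Linear(768, 10)           # swap in a zero-initialized D x K head (K = 10 here)
nn.init.zeros_(vit.head.weight)
nn.init.zeros_(vit.head.bias)

optimizer = torch.optim.SGD(vit.parameters(), lr=3e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

images = torch.randn(8, 3, 224, 224)    # dummy batch standing in for a downstream dataset
labels = torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(vit(images), labels)
loss.backward()
optimizer.step()
scheduler.step()
```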

✍️ Conclusion

To recap what we went over in this article: Vision Transformers (ViT) are a new type of image recognition model with several advantages over traditional CNN models, but they also come with challenges and drawbacks. Overall, ViT is a promising approach to image recognition and is likely to become an increasingly viable alternative to CNNs as the architecture and its training recipes mature.

I hope this article was helpful to you. If so, feel free to share this post with others.

by Samule Sun on Unsplash — source

You can find me on LinkedIn or GitHub.
