Vision Transformer ~ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Christian Lin
5 min read · Oct 8, 2023


In the ever-evolving landscape of machine learning, the “Vision Transformer” (ViT) marks a significant departure from conventional wisdom. Traditionally, convolutional neural networks (CNNs) have reigned supreme in visual tasks, tailored to process images by focusing on local spatial hierarchies. The Vision Transformer, as presented in the groundbreaking paper, challenges this norm. Rather than leaning on convolutions, ViT employs a transformer architecture, predominantly used for natural language processing tasks until now. The model divides an image into fixed-size patches, linearly embeds them, and feeds the resulting sequence to a transformer encoder, whose self-attention mechanism processes all patches jointly. Remarkably, when provided with ample data and computational power, ViT not only rivals but often surpasses state-of-the-art CNNs on major vision benchmarks. Its success heralds a potential shift in computer vision, underscoring the versatility and power of the transformer architecture beyond textual data.

Preliminary

Historical context:

Computer vision’s evolutionary journey has been predominantly marked by the ascendancy of convolutional neural networks (CNNs). These networks, with their specialized design, focus on local spatial hierarchies — a feature that has proven to be indispensable for interpreting image data. Over the years, from academic benchmarks like ImageNet to myriad real-world applications, CNNs have consistently established and broken performance records, unequivocally emerging as the go-to architecture for visual tasks.

(Figure suggestion: a timeline illustrating the evolution of image recognition models leading up to CNNs.)

The Transformer Revolution:

In the realm of natural language processing, a parallel revolution unfolded with the introduction of the Transformer architecture. Originating from the seminal “Attention is All You Need” paper, the Transformer’s self-attention mechanism offered dynamic, contextually rich sequence representations. In a short span, it unseated many existing models, quickly becoming the de facto standard for processing textual data.

If you want to review the background on the Transformer architecture, you can refer to my previous article below.

Bridging Two Worlds:

Merging the worlds of vision and text, the “Vision Transformer” paper dared to explore uncharted waters. It posed a provocative question: Is it feasible for the transformer, inherently designed for sequential textual data, to challenge, or even surpass, the feats achieved by CNNs in visual domains? The answer, as the paper shows, is a resounding yes.

The Vision Transformer’s methodology embodies elegance and innovation. Instead of the traditional approach of processing images pixel-by-pixel or via local convolutional filters, the Vision Transformer adopts a patch-based mechanism. An image is dissected into a grid of fixed-sized patches. These patches, once transformed into linear embeddings, are sequenced and processed by the transformer. This paradigm shift transforms the perception of an image from a 2D spatial entity into a sequence of informational patches.

Methodology

From Image to Sequence:

At the heart of the Vision Transformer (ViT) lies a radical departure from traditional image processing. Instead of treating images as 2D arrays of pixels, ViT interprets them as sequences. The first step involves partitioning each image into fixed-sized patches, akin to chopping up a picture into uniform tiles.

An illustration of an image being divided into uniform patches.
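
To make this concrete, here is a minimal PyTorch sketch of the patching step (not the authors' code): it cuts a batch of 224x224 RGB images into non-overlapping 16x16 tiles and flattens each tile into a vector, yielding 196 patch vectors per image. The tensor shapes and variable names are illustrative assumptions.

```python
import torch

# Minimal sketch of ViT-style patch extraction (illustrative, not the authors' code).
# Assumes a batch of 224x224 RGB images and a patch size of 16,
# which gives (224 / 16) ** 2 = 196 patches per image.
images = torch.randn(8, 3, 224, 224)   # (batch, channels, height, width)
patch_size = 16

# Carve the spatial dimensions into non-overlapping 16x16 tiles.
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# patches has shape (8, 3, 14, 14, 16, 16); flatten each tile into one vector.
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(8, 196, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([8, 196, 768])
```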

Linear Embeddings & Positional Information:

Each of these patches is then linearly embedded into a flat vector. To retain the spatial configuration and position of each patch within the image, positional embeddings are added. This transformed sequence of vectors, now holding both visual and positional information, is ready to be processed by the transformer.

Left: Filters of the initial linear embedding of RGB values of ViT-L/32. Center: Similarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position embedding of the patch with the indicated row and column and the position embeddings of all other patches. Right: Size of attended area by head and network depth. Each dot shows the mean attention distance across images for one of 16 heads at one layer. See Appendix D.7 for details.
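
A rough sketch of this embedding step might look like the following, assuming ViT-Base-style dimensions (196 flattened patches of dimension 768 projected to a 768-dimensional embedding). The layer and variable names are my own, not from the paper's released code.

```python
import torch
import torch.nn as nn

# Sketch of the linear patch embedding plus learned positional embeddings.
# Dimensions follow the ViT-Base convention; names are illustrative assumptions.
num_patches, patch_dim, embed_dim = 196, 16 * 16 * 3, 768

patch_proj = nn.Linear(patch_dim, embed_dim)                      # linear embedding of each flattened patch
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))  # learned positional embeddings

patches = torch.randn(8, num_patches, patch_dim)   # stand-in for the flattened patches above
tokens = patch_proj(patches) + pos_embed           # visual + positional information
print(tokens.shape)  # torch.Size([8, 196, 768])
```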

Leveraging the Transformer Architecture:

The core processing engine for these sequences is the transformer architecture, renowned for its efficacy in natural language processing tasks. The self-attention mechanism within the transformer processes these sequences, ensuring each patch is contextually aware of every other patch in the image.
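
As a rough stand-in, the encoder can be sketched with PyTorch's built-in transformer layers. The hyperparameters below follow the ViT-Base configuration (12 layers, 12 heads, hidden size 768, MLP size 3072), and the pre-norm, GELU options approximate the blocks described in the paper; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Approximate ViT-Base encoder built from PyTorch's transformer layers.
embed_dim, num_heads, depth = 768, 12, 12

encoder_layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=num_heads, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True,  # pre-norm blocks with GELU MLPs
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

tokens = torch.randn(8, 197, embed_dim)  # 196 patch tokens + 1 class token
encoded = encoder(tokens)                # self-attention lets every patch attend to every other
print(encoded.shape)                     # torch.Size([8, 197, 768])
```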

Classification Head & Output:

After traversing through the transformer’s layers, the sequence reaches the classification head. ViT uses the representation corresponding to the “CLASS” token (a special token introduced at the beginning of the sequence) to generate the final output, usually through a linear layer followed by a softmax for classification tasks.

An illustration showcasing the ‘class’ token’s representation being processed to produce the final class probabilities.
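
The sketch below ties the pieces together: a learnable class token is prepended to the patch embeddings, the sequence runs through the encoder (reusing the `encoder` object from the previous snippet), and a linear head on the class token's output produces the class probabilities. The 1000-class output is an ImageNet-style assumption.

```python
import torch
import torch.nn as nn

# Sketch of the classification step: prepend a learnable class token, encode,
# then map only that token's output to class logits.
embed_dim, num_classes = 768, 1000

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # the special "class" token
head = nn.Linear(embed_dim, num_classes)                # linear classification head

patch_tokens = torch.randn(8, 196, embed_dim)           # embedded patches + positions
tokens = torch.cat([cls_token.expand(8, -1, -1), patch_tokens], dim=1)
encoded = encoder(tokens)                               # (8, 197, 768), encoder from the snippet above
logits = head(encoded[:, 0])                            # take the class token's representation
probs = logits.softmax(dim=-1)                          # class probabilities via softmax
print(probs.shape)                                      # torch.Size([8, 1000])
```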

Experiments

Comparison to SOTAs:

The authors wanted to see how their two biggest models, ViT-H/14 and ViT-L/16, perform against the best image-recognition models of the time. The first baseline is Big Transfer (BiT), which performs supervised transfer learning with large pre-trained ResNets. The second is Noisy Student, a large EfficientNet trained with semi-supervised learning on a mix of labeled and unlabeled images; at the time of the paper, Noisy Student held the state of the art on ImageNet.

As the following table shows, the ViT-L/16 model outperformed BiT-L even though both were pre-trained on the same data, and the proposed model required substantially less compute to train. The ViT-H/14 model did even better, especially on the tougher benchmarks such as ImageNet and CIFAR-100, and it still took fewer resources to pre-train than the other top models. But remember that many factors can affect training efficiency, such as the architecture, the training setup, the optimization algorithm, and the hardware used.

Comparison with state of the art on popular image classification benchmarks.

In this article we provide the big picture of the Vision Transformer model. If you want to know more details about ViT, you can follow the links below to build up the related foundations.

In this article, I briefly shared my viewpoints on the paper, and I hope it helps you learn more about it. I also include a link to a video about the paper; I hope you enjoy it!

If you like the article, please give me some 👏, share the article, and follow me to learn more about the world of multi-agent reinforcement learning. You can also contact me on LinkedIn, Instagram, Facebook and GitHub.


Christian Lin

A CS master's student who used to work at ShangShing as an iOS full-stack developer. Now I'm diving into the AI field, especially multi-agent RL and bio-inspired intelligence.