Paper Explained: Vision Transformers (Bye Bye Convolutions?)

Nakshatra Singh
Published in Analytics Vidhya · 5 min read · Oct 17, 2020
Model Overview. The image is taken from the paper.

The Limitation with Transformers For Images

Transformers work really well for NLP; however, they are limited by the memory and compute requirements of the expensive quadratic attention computation in the encoder block. Images are much harder for Transformers because an image is a raster of pixels, and there are a lot of pixels in an image. Dealing with images at the pixel level is a challenge in itself, even for Convolutional Neural Networks. To feed an image into a Transformer naively, every single pixel has to attend to every single other pixel. An image of, say, 256×256 pixels contains 256² pixels, so full self-attention would cost on the order of 256⁴ operations, which is practically infeasible even on current hardware. People have therefore resorted to other techniques, such as local attention or restricted forms of global attention. The authors of this paper keep global attention, but apply it over image patches rather than individual pixels.
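To make the quadratic cost concrete, here is a back-of-the-envelope comparison (a rough sketch, assuming a 256×256 image and the 16×16 patches described later; the numbers are purely illustrative):

```python
# Rough attention-cost arithmetic (illustrative only)
H = W = 256
pixel_tokens = H * W               # 65,536 tokens if every pixel is a token
pixel_pairs = pixel_tokens ** 2    # ~4.3 billion attention pairs (256⁴)

P = 16
patch_tokens = (H * W) // (P * P)  # 256 tokens when using 16x16 patches
patch_pairs = patch_tokens ** 2    # 65,536 attention pairs

print(pixel_pairs // patch_pairs)  # ~65,000x fewer pairs with patches
```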

Vision Transformer Architecture

Patch Embeddings

The standard Transformer receives input as a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ ℝ^(H×W×C) into a sequence of flattened 2D patches x_p ∈ ℝ^(N×(P²·C)).

Here, (H, W) is the resolution of the original image and (P, P) is the resolution of each image patch. N = HW/P² is then the effective sequence length for the Transformer. The image is split into fixed-size patches; in the image below, the patch size is 16×16 and the image is 48×48, which gives N = (48×48)/16² = 9 patches.

NOTE: The image dimensions must be divisible by the patch size.

Basic Intuition of Reshaped Patch Embeddings
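As a rough sketch of this reshaping step (assuming a PyTorch tensor and the 48×48 image with 16×16 patches from the example above; this is illustrative, not the authors' code):

```python
import torch

B, C, H, W = 1, 3, 48, 48          # one toy RGB image
P = 16                             # patch size
x = torch.randn(B, C, H, W)

# Cut the image into non-overlapping P x P patches:
# unfolding over height and width gives shape (B, C, H/P, W/P, P, P).
patches = x.unfold(2, P, P).unfold(3, P, P)

# Flatten each patch into a vector of length P²·C.
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

print(patches.shape)  # torch.Size([1, 9, 768]) -> N = 9 patches of length 768
```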

Linear Projection of Flattened Patches

Before passing the patches into the Transformer block, the authors found it helpful to first put them through a linear projection. So there is one single learned matrix, and it is called E, for “embedding”. They take a patch, unroll it into a long vector, and multiply it with the embedding matrix to form a patch embedding, and that is what goes into the Transformer along with the positional embedding.

The intuition of Linear Projection Block before feeding in Encoder
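A minimal sketch of this projection (assuming the 9 flattened patches of length 768 from the previous snippet and an illustrative model width D = 64; here E is simply an nn.Linear layer):

```python
import torch
import torch.nn as nn

P, C, N = 16, 3, 9          # patch size, channels, number of patches
D = 64                      # Transformer hidden size (assumed for illustration)

# E maps each flattened P²·C patch vector to a D-dimensional patch embedding
E = nn.Linear(P * P * C, D)

patches = torch.randn(1, N, P * P * C)   # output of the reshaping step
patch_embeddings = E(patches)            # (1, N, D)
print(patch_embeddings.shape)            # torch.Size([1, 9, 64])
```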

Positional Embeddings

Position embeddings are added to the patch embeddings to retain positional information. The authors explored different 2D-aware variants of position embeddings but found no significant gains over standard 1D position embeddings. The resulting sequence of embeddings serves as input to the Transformer encoder.

Each patch has a position index associated with it; in this paper the authors simply number the patches 1, 2, 3, … up to the number of patches. Each index corresponds to a learnable vector, and these vectors are stacked row-wise to form a learnable positional embedding table.

Similar to BERT’s [class] token, a learnable embedding is prepended to the sequence of embedded patches, whose state at the output of the Transformer encoder (z_L⁰) serves as the image representation y. During both pre-training and fine-tuning, the classification head is attached to z_L⁰.

Finally, the row associated with each patch’s position index is looked up from the table (as its positional embedding), added to the patch embedding, and the resulting sequence is fed to the Transformer encoder block.

a.k.a The Vision Block, Complete Mechanism before the Encoder Block.
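The snippet below sketches this vision-block step: prepending a learnable [class] token and adding a learnable positional embedding table (shapes reuse the illustrative D = 64 and N = 9 from above; this is a sketch under those assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

D, N = 64, 9                                              # assumed hidden size and patch count

cls_token = nn.Parameter(torch.zeros(1, 1, D))            # learnable [class] token
pos_embedding = nn.Parameter(torch.zeros(1, N + 1, D))    # one learnable row per position

patch_embeddings = torch.randn(1, N, D)                   # output of the linear projection E

# Prepend the class token, then add the positional embedding row-wise
tokens = torch.cat([cls_token, patch_embeddings], dim=1)  # (1, N+1, D)
encoder_input = tokens + pos_embedding                    # (1, N+1, D)
print(encoder_input.shape)                                # torch.Size([1, 10, 64])
```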

The Transformer Encoder Block

The Transformer encoder consists of alternating layers of multi-headed self-attention (MSA) and MLP blocks. Layer Normalization (LayerNorm) is applied before every block, and a residual connection is added after every block.

Encoder Block. The image is taken from the paper.
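Here is a minimal pre-norm encoder layer in PyTorch that follows this description (the hidden size, number of heads, and MLP expansion ratio are assumptions for illustration):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm encoder layer: LayerNorm -> MSA -> residual, LayerNorm -> MLP -> residual."""
    def __init__(self, dim=64, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # residual around multi-headed self-attention
        x = x + self.mlp(self.norm2(x))  # residual around the MLP block
        return x

out = EncoderBlock()(torch.randn(1, 10, 64))  # (batch, N+1 tokens, dim)
print(out.shape)                              # torch.Size([1, 10, 64])
```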

Hybrid Architecture (A Similar Approach)

As an alternative to dividing the image into patches, the input sequence can be formed from intermediate feature maps of a ResNet. In this hybrid model, the patch embedding projection E is replaced by the early stages of a ResNet. One of the intermediate 2D feature maps of the ResNet is flattened into a sequence, projected to the Transformer dimension, and then fed as an input sequence to a Transformer.
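A rough sketch of the hybrid idea (the small CNN below is only a stand-in for the early ResNet stages, and the feature width and Transformer dimension are assumed for illustration):

```python
import torch
import torch.nn as nn

# Stand-in for the early stages of a ResNet: any CNN producing a 2D feature map works here
backbone = nn.Sequential(
    nn.Conv2d(3, 256, kernel_size=7, stride=4, padding=3),
    nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

D = 64
proj = nn.Linear(256, D)                  # replaces the patch-embedding projection E

x = torch.randn(1, 3, 48, 48)
feat = backbone(x)                        # (1, 256, 6, 6) intermediate feature map
seq = feat.flatten(2).transpose(1, 2)     # (1, 36, 256): each spatial location becomes a token
tokens = proj(seq)                        # (1, 36, D) input sequence for the Transformer
print(tokens.shape)
```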

Training & Fine-tuning

The authors train all models, including ResNets, using Adam with β1 = 0.9, β2 = 0.999, a batch size of 4096, and a high weight decay of 0.1, which they found useful for the transfer of all models, together with a linear learning-rate warmup and decay. For fine-tuning, they used SGD with momentum and a batch size of 512 for all models.
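In PyTorch, those optimizer settings look roughly like the following (the learning rates and momentum value are placeholders, since the paper sweeps them per model, and the linear warmup/decay schedule is omitted here):

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)  # placeholder for a full ViT model

# Pre-training: Adam with the betas and weight decay quoted above (batch size 4096 in the paper)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3,
                             betas=(0.9, 0.999), weight_decay=0.1)

# Fine-tuning: SGD with momentum (batch size 512 in the paper)
finetune_optimizer = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)
```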

Multi-Layer Perceptron Head

The fully-connected MLP head at the output provides the desired class prediction. The main model can be pre-trained on a large dataset of images, and then the final MLP head can be fine-tuned to a specific task via the standard transfer learning approach. The MLP contains two layers with a GELU non-linearity.

The GELU non-linearity and its common approximations; the image is taken from Papers with Code. PyTorch’s exact GELU implementation is sufficiently fast that these approximations may be unnecessary.
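A minimal sketch of such a two-layer GELU head applied to the [class] token output (the hidden size, expansion factor, and number of classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

D, num_classes = 64, 10    # assumed hidden size and number of target classes

# Two linear layers with a GELU non-linearity in between
mlp_head = nn.Sequential(
    nn.Linear(D, 4 * D),
    nn.GELU(),
    nn.Linear(4 * D, num_classes),
)

z_cls = torch.randn(1, D)   # state of the [class] token at the encoder output (z_L⁰)
logits = mlp_head(z_cls)    # (1, num_classes) class scores
print(logits.shape)
```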

Comparison with SOTA

Breakdown of VTAB performance in Natural, Specialized, and Structured task groups. The image is taken from the paper.

If you enjoyed this article and gained insightful knowledge, consider buying me a coffee ☕️ by clicking here :)

References

  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
  2. Visual Transformers.
  3. Attention is all you need.

If you liked this post, please make sure to clap 👏. 💬 Connect? Let’s get social: http://myurls.co/nakshatrasinghh.
