Paper Explained: Vision Transformers (Bye Bye Convolutions?)
The Limitation with Transformers For Images
Transformers work remarkably well for NLP, but they are limited by the memory and compute requirements of the quadratic self-attention computation in the encoder block. Images are much harder for transformers because an image is a raster of pixels, and there are a lot of pixels in an image. The rasterization of images is a problem in itself, even for Convolutional Neural Networks. To feed an image into a transformer naively, every single pixel would have to attend to every other pixel. A modest 256×256 image already contains 256² ≈ 65k pixels, so pixel-level attention would cost on the order of 256⁴ ≈ 4.3 billion pairwise comparisons per layer, which is impractical even on current hardware. People have therefore resorted to approximations such as local attention; the authors of this paper instead keep full global attention, but apply it over image patches rather than individual pixels.
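The quadratic blow-up is easy to see with a back-of-envelope calculation. The sketch below (numbers are illustrative, not from the paper) compares pixel-level attention with patch-level attention for a 256×256 image and 16×16 patches:

```python
# Back-of-envelope cost of full self-attention:
# every token attends to every token, so the number of pairs is seq_len**2.
def attention_pairs(seq_len):
    """Number of pairwise comparisons in one self-attention layer."""
    return seq_len ** 2

pixels = 256 * 256            # pixel-level tokens for a 256x256 image
patches = (256 // 16) ** 2    # 16x16 patches -> only 256 tokens

print(attention_pairs(pixels))   # 4294967296 pairs (~4.3 billion)
print(attention_pairs(patches))  # 65536 pairs
```

Moving from pixels to 16×16 patches shrinks the sequence length by a factor of 256, and the attention cost by a factor of 256² ≈ 65k.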
Vision Transformer Architecture
Patch Embeddings
The standard Transformer receives input as a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ R^{H×W×C} into a sequence of flattened 2D patches x_p ∈ R^{N×(P²·C)}.
Here, (H, W) is the resolution of the original image, C is the number of channels, and (P, P) is the resolution of each image patch. N = HW/P² is then the effective sequence length for the Transformer. The image is split into fixed-size patches; in the image below, the patch size is 16×16 and the image is 48×48, giving N = 9 patches.
NOTE: The image dimensions must be divisible by the patch size.
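The reshaping step can be sketched in a few lines of NumPy (the shapes here match the 48×48 example above; the function name `patchify` is my own, not from the paper):

```python
import numpy as np

def patchify(img, P):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches, N = HW/P**2."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    patches = img.reshape(H // P, P, W // P, P, C)  # split both spatial axes
    patches = patches.transpose(0, 2, 1, 3, 4)      # bring the patch grid up front
    return patches.reshape(-1, P * P * C)           # flatten each patch to a vector

img = np.arange(48 * 48 * 3, dtype=np.float32).reshape(48, 48, 3)
x_p = patchify(img, 16)
print(x_p.shape)  # (9, 768): N = 48*48 / 16**2 = 9 patches, each of length 16*16*3
```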
Linear Projection of Flattened Patches
Before passing the patches into the Transformer encoder, the authors found it helpful to first put them through a linear projection. There is one single learned matrix, called E (for "embedding"), shared across all patches. Each patch is unrolled into a long vector and multiplied by E to form the patch embedding, and that is what goes into the transformer along with the positional embedding.
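As a minimal sketch (dimensions are illustrative; a random matrix stands in for the learned E), the projection is a single matrix multiply:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear projection: each flattened patch (length P*P*C) is multiplied by a
# learned matrix E to produce a D-dimensional patch embedding.
P, C, D = 16, 3, 64
N = 9                                   # e.g. a 48x48 image with 16x16 patches
x_p = rng.normal(size=(N, P * P * C))   # flattened patches (stand-in values)
E = rng.normal(size=(P * P * C, D))     # the "embedding" projection matrix

z = x_p @ E                             # patch embeddings fed to the Transformer
print(z.shape)  # (9, 64)
```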
Positional Embeddings
Position embeddings are added to the patch embeddings to retain positional information. The authors explored different 2D-aware variants of position embeddings but found no significant gains over standard 1D position embeddings. The resulting sequence of embedding vectors serves as input to the Transformer encoder.
Each unrolled patch (before the linear projection) is assigned a position index; in this paper the authors simply use 1, 2, 3, 4, … up to the number of patches. Each index maps to a learnable vector, and these vectors are stacked row-wise to form a learnable positional-embedding table.
Similar to BERT's [class] token, a learnable embedding is prepended to the sequence of embedded patches, whose state at the output of the Transformer encoder (z_L⁰) serves as the image representation y. During both pre-training and fine-tuning, the classification head is attached to z_L⁰.
Finally, the row of the table corresponding to each patch's position index is looked up (as its positional embedding), added to the patch embedding, and the resulting sequence is fed to the Transformer encoder block.
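Putting the [class] token and the positional embeddings together, the input preparation looks roughly like this (random arrays stand in for learned parameters; shapes continue the 9-patch example):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 9, 64

z = rng.normal(size=(N, D))         # patch embeddings from the linear projection
cls = rng.normal(size=(1, D))       # learnable [class] token
pos = rng.normal(size=(N + 1, D))   # learnable positional-embedding table (rows 0..N)

tokens = np.concatenate([cls, z], axis=0)  # prepend the [class] token
tokens = tokens + pos                      # add positional embeddings row-wise
print(tokens.shape)  # (10, 64): N patches plus one [class] token
```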
The Transformer Encoder Block
The Transformer encoder consists of alternating layers of multi-headed self-attention and MLP blocks. Layernorm (Layer Normalization) is applied before every block, and a residual connection is added after every block.
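A single encoder block can be sketched in NumPy as below. This is a simplified single-head version (the paper uses multi-head attention), with random stand-ins for the learned weights, but it shows the pre-norm structure: LayerNorm before each sub-block, residual connection after it.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random stand-ins for learned weights (single attention head for brevity).
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.02 for _ in range(3))
W1 = rng.normal(size=(D, 4 * D)) * 0.02   # MLP expansion layer
W2 = rng.normal(size=(4 * D, D)) * 0.02   # MLP projection layer

def encoder_block(x):
    h = layernorm(x)                          # LayerNorm BEFORE attention
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(D)) @ v  # scaled dot-product self-attention
    x = x + attn                              # residual connection AFTER attention
    h = layernorm(x)                          # LayerNorm BEFORE the MLP
    x = x + gelu(h @ W1) @ W2                 # residual connection AFTER the MLP
    return x

tokens = rng.normal(size=(10, D))  # 9 patches + [class] token
out = encoder_block(tokens)
print(out.shape)  # (10, 64)
```

The full encoder simply stacks L such blocks; the [class] token's output after the last block is z_L⁰.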
Hybrid Architecture (A similar Approach)
As an alternative to dividing the image into patches, the input sequence can be formed from intermediate feature maps of a ResNet. In this hybrid model, the patch embedding projection E is replaced by the early stages of a ResNet. One of the intermediate 2D feature maps of the ResNet is flattened into a sequence, projected to the Transformer dimension, and then fed as an input sequence to a Transformer.
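In shape terms, the hybrid variant just trades the patch grid for the spatial grid of a CNN feature map. A minimal sketch (sizes are illustrative, and a random array stands in for the ResNet feature map):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hybrid variant: flatten an intermediate CNN feature map into a token sequence,
# then project to the Transformer dimension D (here D=64 for illustration).
feat = rng.normal(size=(14, 14, 1024))      # e.g. a 14x14x1024 ResNet feature map
seq = feat.reshape(-1, feat.shape[-1])      # 14*14 = 196 spatial positions -> tokens
proj = rng.normal(size=(1024, 64)) * 0.02   # replaces the patch-embedding matrix E

z = seq @ proj                              # input sequence for the Transformer
print(z.shape)  # (196, 64)
```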
Training & Fine-tuning
The authors train all models, including ResNets, using Adam with β1 = 0.9, β2 = 0.999, a batch size of 4096, and a high weight decay of 0.1, which they found useful for transfer across all models, together with a linear learning-rate warmup and decay. For fine-tuning, the authors used SGD with momentum and a batch size of 512 for all models.
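The hyperparameters above can be collected into a framework-agnostic config (a plain sketch; plug the values into the optimizer of your framework of choice):

```python
# Optimization hyperparameters reported in the paper, as plain config dicts.
pretrain_cfg = {
    "optimizer": "Adam",
    "betas": (0.9, 0.999),
    "batch_size": 4096,
    "weight_decay": 0.1,                       # high, but helps transfer
    "lr_schedule": "linear warmup + linear decay",
}
finetune_cfg = {
    "optimizer": "SGD with momentum",
    "batch_size": 512,
}
print(pretrain_cfg["batch_size"], finetune_cfg["batch_size"])  # 4096 512
```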
Multi-Layer Perceptron Head
The fully-connected MLP head at the output produces the desired class prediction. The main model can be pre-trained on a large image dataset, and then the head can be fine-tuned to a specific task via the standard transfer-learning approach. During pre-training the classification head is an MLP with one hidden layer; at fine-tuning time it is replaced by a single linear layer. (The MLP blocks inside the encoder, by contrast, contain two layers with a GELU non-linearity.)
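A toy version of the pre-training head (one hidden layer with GELU; sizes and random weights are illustrative, and `z_cls` stands in for the final [class] token state z_L⁰):

```python
import numpy as np

rng = np.random.default_rng(0)
D, hidden, num_classes = 64, 128, 10    # illustrative sizes

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Pre-training classification head: one hidden layer on top of z_L^0.
W1 = rng.normal(size=(D, hidden)) * 0.02
W2 = rng.normal(size=(hidden, num_classes)) * 0.02

z_cls = rng.normal(size=(D,))           # stand-in for z_L^0, the image representation
logits = gelu(z_cls @ W1) @ W2          # class scores
print(logits.shape)  # (10,)
```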
Comparison with SOTA
References
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
- Visual Transformers.
- Attention Is All You Need, NeurIPS 2017.