AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Wei Wen
4 min read · Dec 29, 2021


Source: https://unsplash.com/photos/fEVaiLwWvlU

In this post I would like to explain and summarize the paper published by Google at ICLR 2021. The authors compare the performance of ResNet and the Vision Transformer (ViT) on pre-training datasets of different sizes, and the results show that the transformer can achieve better results than a CNN.

Architecture — Transformer in computer vision

ViT follows the original Transformer's architecture as closely as possible.

Steps:

1. Split the image into 16×16 patches.

2. Flatten each patch and project it linearly into a patch embedding, then add position embeddings.

3. Feed the resulting sequence of embeddings (together with a learnable class token) into the Transformer encoder.

4. The output for the class token is sent to an MLP head, which works as a classifier.
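
To make these steps concrete, below is a minimal PyTorch-style sketch of the patch-embedding stage. The class name PatchEmbedding and the default sizes (224×224 images, 16×16 patches, 768-dimensional embeddings, matching the ViT-Base configuration) are my own assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split into patches, flatten, linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                     # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))  # 1D position embeddings

    def forward(self, x):                              # x: (B, 3, 224, 224)
        x = self.proj(x)                               # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)               # (B, 196, 768), one row per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend the class token -> (B, 197, 768)
        return x + self.pos_embed                      # position embeddings are added, not concatenated
```

For example, PatchEmbedding()(torch.randn(2, 3, 224, 224)) returns a tensor of shape (2, 197, 768): 196 patch embeddings plus the class token, which is the sequence the encoder consumes.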

Positional embeddings are added to retain positional information. The authors also tried 2D position embeddings, but saw no significant improvement. A 2D position embedding indexes each patch by its row and column coordinates in the patch grid (1-1, 1-2, 1-3, 2-1, 2-2, 2-3, etc.) instead of a single sequence index, as sketched below.
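
For illustration, here is a sketch of what such a 2D position embedding could look like, following the paper's description of learning one embedding per row and one per column (each of size dim/2) and concatenating them; the class name PosEmbed2D and the implementation details are assumptions.

```python
import torch
import torch.nn as nn

class PosEmbed2D(nn.Module):
    """2D position embeddings: the patch at grid position (i, j) gets [row_i ; col_j]."""
    def __init__(self, grid=14, dim=768):
        super().__init__()
        self.row = nn.Parameter(torch.zeros(grid, dim // 2))   # one embedding per patch row
        self.col = nn.Parameter(torch.zeros(grid, dim // 2))   # one embedding per patch column

    def forward(self):
        grid, half = self.row.shape
        table = torch.cat([
            self.row[:, None, :].expand(grid, grid, half),     # broadcast row embeddings across columns
            self.col[None, :, :].expand(grid, grid, half),     # broadcast column embeddings across rows
        ], dim=-1)                                             # (grid, grid, dim)
        return table.reshape(grid * grid, -1)                  # (num_patches, dim)
```

In the paper this variant brought no significant gains over the simpler learned 1D embeddings used in the PatchEmbedding sketch above.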

Transformer encoder [1]

The embedded patches are first passed through a normalization (LayerNorm) layer and then through a multi-head attention layer; a residual connection and an MLP block (again preceded by LayerNorm) complete each encoder block.
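
A minimal sketch of one such encoder block, using PyTorch's nn.MultiheadAttention as a stand-in for the multi-head attention of [1] (the class name EncoderBlock and the ViT-Base defaults are assumptions):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: LayerNorm -> multi-head attention -> residual,
    then LayerNorm -> MLP -> residual."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                     # x: (B, num_patches + 1, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # self-attention + residual
        x = x + self.mlp(self.norm2(x))                       # MLP + residual
        return x
```

Stacking twelve such blocks gives the ViT-Base encoder; the MLP head for classification is applied to the class-token output of the last block.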

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The attention layer generates the Query, Key and Value matrices by multiplying the input embeddings with trained weight matrices.

The output matrix is calculated as:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

where dₖ is the dimension of the key vectors.
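
As a sanity check, here is a direct implementation of this formula for a single head; the projection matrices W_q, W_k, W_v below are illustrative, not taken from the paper.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # compatibility of every query with every key
    weights = scores.softmax(dim=-1)                     # each query's weights over all keys sum to 1
    return weights @ v                                   # weighted sum of the values

# Q, K and V are obtained by multiplying the input embeddings with learned weight matrices:
x = torch.randn(2, 197, 768)                             # (batch, tokens, dim) from the patch embedding
W_q, W_k, W_v = (torch.randn(768, 64) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)   # (2, 197, 64)
```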

Inductive bias

The Vision Transformer has less image-specific inductive bias than a CNN. In a CNN, inductive biases are hard-coded into the architecture in the form of two strong constraints on the weights: locality and weight sharing. The Vision Transformer's self-attention layers are more flexible, which enables it to learn local and global dependencies collaboratively.

Experiment

The paper evaluates three families of models: ResNet, three Vision Transformer (ViT) variants, and a hybrid (ViT applied to ResNet feature maps). A few pre-training datasets were used in the experiments: the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (referred to as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images, and JFT with 18k classes and 303M high-resolution images.

Pre-trained models are tested on several benchmark datasets: ImageNet on the original validation labels and the cleaned-up ReaL labels, CIFAR-10/100, Oxford-IIIT Pets, and Oxford Flowers-102.

Results

The results show that ViT outperformed ResNet on all of these benchmark datasets when pre-trained on sufficiently large data (ImageNet-21k or JFT), while requiring substantially less compute to pre-train.

The visualization on the left shows that the ViT learns first-layer features similar to those of CNN models. The visualization in the middle shows that the ViT learns to encode the relative location of the patches: it is able to reproduce the image's grid structure. The last visualization shows that the ViT attends to both local and global features in the lower layers, while the higher layers attend mostly globally.

Reference

[1] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. https://arxiv.org/pdf/1706.03762.pdf
