Vision Transformers | one minute summary
New idiom: “An Image is Worth 16x16 Words”
1 min read · Jul 9, 2021
The 2020 paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy, A., et al. (Google) introduced the Vision Transformer. At first it seemed like just a neat extension of NLP Transformers, but it has since proved highly effective for computer vision tasks.
Prerequisites: Transformers
- Why? For computer vision tasks, all the best models have typically been ConvNets (e.g. ResNet, Vgg, Inception). But with Transformers being successful for NLP tasks, can they be used for computer vision as well?
- What? The Vision Transformer (ViT) is a modified NLP Transformer (encoder only) for image classification, with no convolutional layers.
- How? An image is split into 16x16 patches; each patch (16x16x3 = 768 values) is flattened and linearly projected into an embedding space. A learnable positional embedding is added (element-wise, not concatenated) to each patch embedding to encode its location in the image, and a learnable [class] token is prepended to the sequence. The full sequence is fed to a standard Transformer encoder, and a classification head (an MLP) attached to the [class] token’s output produces the prediction.
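The input pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration with random weights, assuming the ViT-Base configuration (224x224 RGB input, 16x16 patches, embedding dimension 768); in the real model the projection matrix, [class] token, and positional embeddings are all learned.

```python
import numpy as np

# Hypothetical illustration of the ViT input pipeline with random weights.
# Assumed ViT-Base setup: 224x224 RGB image, 16x16 patches, D = 768.
rng = np.random.default_rng(0)

H = W = 224               # image height / width
P = 16                    # patch size
C = 3                     # channels
D = 768                   # embedding dimension
N = (H // P) * (W // P)   # number of patches = 196

image = rng.standard_normal((H, W, C))

# 1. Split the image into N patches and flatten each to a (P*P*C,) vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)                # (196, 768)

# 2. Linearly project each patch into the embedding space (learned in practice).
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                                   # (196, 768)

# 3. Prepend the learnable [class] token to the sequence.
cls_token = rng.standard_normal((1, D)) * 0.02
tokens = np.concatenate([cls_token, tokens], axis=0)   # (197, 768)

# 4. Add (not concatenate) the learned positional embeddings.
pos_embed = rng.standard_normal((N + 1, D)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)  # (197, 768) — ready for the Transformer encoder
```

The resulting sequence of 197 vectors is exactly what the standard Transformer encoder consumes; after encoding, only the output at the [class] position is passed to the classification head.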