Vision Transformers | one minute summary

New idiom: “An Image is Worth 16x16 Words”

Jeffrey Boschman
One Minute Machine Learning
1 min read · Jul 9, 2021


Image modified from Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

The 2020 paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy, A., et al. (Google) introduced the Vision Transformer, which at first seemed like just a cool extension of NLP Transformers but has since proved to be very effective for computer vision tasks.

Prerequisites: Transformers

  1. Why? For computer vision tasks, the best models have typically been ConvNets (e.g. ResNet, VGG, Inception). But since Transformers have been so successful for NLP tasks, can they be used for computer vision as well?
  2. What? The Vision Transformer (ViT) applies a standard NLP-style Transformer (encoder only) to image classification, without any convolutional layers.
  3. How? An image is split into patches (16x16x3), which are flattened and linearly projected into a lower-dimensional embedding space. A learnable positional embedding is added to each vector (i.e. linearly-embedded patch) to encode its location in the image, and an extra learnable class token is prepended to the sequence of vectors. The whole sequence is fed to a standard Transformer encoder, and a small fully-connected classification head on top of the class-token output produces the prediction (see the sketch after this list).
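
To make the patch-embedding, positional-embedding, and class-token mechanics concrete, here is a minimal sketch of the ViT pipeline in PyTorch. The class name `ViTSketch` and the default hyperparameters are illustrative assumptions (roughly ViT-Base-sized), not the exact configuration from the paper, and the encoder reuses PyTorch's built-in `nn.TransformerEncoder` rather than a faithful reimplementation.

```python
# Minimal ViT sketch, assuming PyTorch. Names and defaults are illustrative.
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2     # 14*14 = 196 patches
        patch_dim = in_chans * patch_size * patch_size  # 16*16*3 = 768 values per patch

        # Linear projection of flattened patches into the embedding space
        self.patch_embed = nn.Linear(patch_dim, dim)
        # Learnable class token prepended to the patch sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable positional embeddings, one per position (patches + class token)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Standard Transformer encoder (self-attention + MLP blocks)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Classification head applied to the class-token output
        self.head = nn.Linear(dim, num_classes)
        self.patch_size = patch_size

    def forward(self, x):                               # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Split the image into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # (B, N, C*p*p)

        x = self.patch_embed(x)                         # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)          # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed # prepend class token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                       # logits from the class token

# Quick check with a shallow encoder (the real ViT-Base uses depth=12)
logits = ViTSketch(depth=2)(torch.randn(2, 3, 224, 224))  # shape: (2, 1000)
```

Note how the positional information is added element-wise to the patch embeddings rather than appended as extra tokens, and how only the class-token output feeds the classification head.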
