Vision Transformers | one minute summary

New idiom: “An Image is Worth 16x16 Words”

Jeffrey Boschman
One Minute Machine Learning
1 min read · Jul 9, 2021


Image modified from Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

The 2020 paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy, A., et al. (Google) introduced the Vision Transformer, which at first seemed like just a cool extension of NLP Transformers but has since proved to be very effective for computer vision tasks.

Prerequisites: Transformers

  1. Why? For computer vision tasks, the best models have typically been ConvNets (e.g. ResNet, VGG, Inception). But since Transformers have been so successful for NLP tasks, can they be used for computer vision as well?
  2. What? The Vision Transformer (ViT) applies a standard NLP-style Transformer (encoder only) to image classification, without any convolutional layers.
  3. How? An image is split into patches (16x16x3), which are flattened and linearly projected into a lower-dimensional embedding space. A learnable positional embedding is added to each vector (i.e. linearly-embedded patch) to encode its location in the image, and an extra learnable class token is prepended to the sequence of vectors. The whole sequence is fed to a standard Transformer encoder, and a small fully-connected classification head on top of the class-token output produces the prediction (see the sketch after this list).
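
To make the patch-embedding, positional-embedding, and class-token mechanics concrete, here is a minimal sketch of the ViT pipeline in PyTorch. The class name `ViTSketch` and the default hyperparameters are illustrative assumptions (roughly ViT-Base-sized), not the exact configuration from the paper, and the encoder reuses PyTorch's built-in `nn.TransformerEncoder` rather than a faithful reimplementation.

```python
# Minimal ViT sketch, assuming PyTorch. Names and defaults are illustrative.
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2     # 14*14 = 196 patches
        patch_dim = in_chans * patch_size * patch_size  # 16*16*3 = 768 values per patch

        # Linear projection of flattened patches into the embedding space
        self.patch_embed = nn.Linear(patch_dim, dim)
        # Learnable class token prepended to the patch sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable positional embeddings, one per position (patches + class token)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Standard Transformer encoder (self-attention + MLP blocks)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Classification head applied to the class-token output
        self.head = nn.Linear(dim, num_classes)
        self.patch_size = patch_size

    def forward(self, x):                               # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # Split the image into non-overlapping p x p patches and flatten each one
        x = x.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # (B, N, C*p*p)

        x = self.patch_embed(x)                         # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)          # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed # prepend class token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                       # logits from the class token

# Quick check with a shallow encoder (the real ViT-Base uses depth=12)
logits = ViTSketch(depth=2)(torch.randn(2, 3, 224, 224))  # shape: (2, 1000)
```

Note how the positional information is added element-wise to the patch embeddings rather than appended as extra tokens, and how only the class-token output feeds the classification head.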
