Vision Transformers: Transforming Computer Vision to the next level

Limas Jaya Akeh
Bina Nusantara IT Division
Mar 29, 2022

So you’ve been working on your next Computer Vision project, and you wonder — which model should I use? Should I use AlexNet, or maybe ResNet? Convolutional Neural Networks (CNNs) have always been the go-to solution for most tasks in Computer Vision applications. However, a new state-of-the-art approach proposed by Dosovitskiy et al. uses the Transformer encoder (the self-attention one, not the film) for Computer Vision tasks. Yes, you heard it right. The Transformer used in Natural Language Processing tasks is now also used for Computer Vision tasks.

In this post, I will give a very brief, high-level ELI5 explanation that anyone can understand!

Vision Transformer architecture by Dosovitskiy et al.

A Vision Transformer will perform the following:

  1. Convert the image into patches (small fragments of the image)
  2. Add a special vector embedding (called the Class Embedding) at the very start of the sequence, which will serve as the output
  3. Process all patches simultaneously with the Transformer encoder
  4. Take the special embedding back out; this vector now contains information about the whole image
  5. Use it to perform the task you want, such as image recognition (see the sketch right after this list)
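To make those five steps concrete, here is a minimal PyTorch sketch of the whole pipeline. This is my own illustration, not the authors’ reference code; the sizes (16×16 patches, 768-dimensional embeddings, a handful of encoder layers) are ViT-Base-like values I assumed for the example, and the comments map each line back to the numbered steps.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=4,
                 heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2           # 14 * 14 = 196
        # Step 1: split into patches and linearly embed each one (done with a conv)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Step 2: a learnable class embedding prepended to the patch sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step 3: a stack of Transformer encoder layers processes all tokens at once
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 5: an MLP head turns the class token into class scores
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                                  # (B, 3, 224, 224)
        x = self.patch_embed(images)                            # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)                        # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)          # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed         # (B, 197, dim)
        x = self.encoder(x)                                     # Step 3
        cls_out = x[:, 0]                                       # Step 4: take the class token
        return self.head(cls_out)                               # Step 5

logits = TinyViT()(torch.randn(2, 3, 224, 224))                 # (2, 1000)
```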

If you’re like me, you might be confused at first. The simplest explanation is: “Imagine using a Transformer, but you convert the image to words”. In the Transformer architecture, you convert a sentence of words into some sort of math vector; a Vision Transformer basically converts an image into small patches or fragments, which can be thought of as the “words”, with the whole image as the “sentence”.
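To see the analogy in actual tensor shapes, here is a tiny sanity check (my own sketch, assuming a 224×224 RGB image and 16×16 patches, the standard ViT-Base setup): the image really does become a sequence of 196 “word”-like vectors.

```python
import torch

image = torch.randn(3, 224, 224)                     # one RGB image = one "sentence"
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)  # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * 16 * 16)
print(patches.shape)                                 # torch.Size([196, 768])
# 196 "words", each a 768-dimensional vector, much like token embeddings in NLP
```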

These patches, combined with information about their position (since the order of the patches matters for the context of the image), are then passed through the Transformer encoder, which produces information about the image.
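The way that position information is injected is refreshingly simple: the paper adds a learned embedding for each position to its token. A rough sketch of just that step, with ViT-Base-like sizes assumed:

```python
import torch
import torch.nn as nn

num_tokens, dim = 197, 768                 # 196 patch tokens + 1 class token
tokens = torch.randn(4, num_tokens, dim)   # a batch of 4 embedded images
pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))  # learned during training
encoder_input = tokens + pos_embed         # same shape; each token now carries its position
```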

A special embedding called the Class Embedding carries the information about the class or target you’re trying to predict; processing it through the encoder gives you a processed embedding. You can now take this embedding and feed it to any other ML model, for example an MLP head for image recognition tasks.
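Concretely, that last step can be as small as the sketch below (my own example; the 10-class head and the dummy encoder outputs are assumptions, not something from the paper):

```python
import torch
import torch.nn as nn

encoded = torch.randn(8, 197, 768)   # encoder output for a batch of 8 (class token first)
cls_embedding = encoded[:, 0]        # take the processed class embedding, shape (8, 768)

mlp_head = nn.Sequential(            # any downstream model works; here a small MLP head
    nn.LayerNorm(768),
    nn.Linear(768, 10),              # e.g. 10 image classes
)
logits = mlp_head(cls_embedding)     # (8, 10), ready for softmax / cross-entropy
```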
