Are You Ready for Vision Transformer (ViT)?

“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” May Bring Another Breakthrough to Computer Vision

Yoshiyuki Igarashi · Published in TDS Archive · 6 min read · Oct 11, 2020


Life on Earth follows a cycle of rise and fall, and that applies not only to living creatures but also to technologies. Data science has been filled with hype and biased success stories. Having said that, one technology has genuinely driven the growth of the field: the Convolutional Neural Network (CNN). Since AlexNet in 2012, different CNN architectures have made tremendous contributions to real business operations and academic research. Residual Networks (ResNet), introduced by Microsoft Research in 2015, were a real breakthrough for building “deep” CNNs; however, an honorable retirement for this technology may be approaching. Geoffrey Hinton, a father of neural networks and one of the 2018 Turing Award winners, has been pointing out the flaws of CNNs for years; you can find his 2017 seminar “What is wrong with convolutional neural nets?”. A major flaw of CNNs lies in the pooling layers: pooling throws away a lot of valuable information and ignores the relationship between the parts of an image and the whole. As a replacement for CNNs, Hinton and his team published a paper on Capsule Networks in 2018; however, Capsule Networks have not replaced CNNs yet.

Table of Contents

  1. Intro to An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  2. Why Vision Transformer (ViT) Matters
  3. Closing
  4. Materials to Study Further

Intro to An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

I learned about this paper from Andrej Karpathy’s tweet on Oct 3, 2020.

Screenshot taken by author: a tweet by Andrej Karpathy about Vision Transformer.

Andrej Karpathy is the Senior Director of Artificial Intelligence at Tesla, and he taught the Stanford class CS231n, which covered computer vision, in 2016. Even though some of the content is now outdated, he has a great skill for presenting difficult concepts in simple words, and I learned a lot from his class.

The purpose of this post is to give a heads-up to machine learning engineers and data scientists who have not yet studied the Transformer, so that they can prepare themselves before the “innovative tech company” launches a GitHub repository for Vision Transformer.

Who Wrote This Paper?

Screenshot taken by author. The source is the title page of the paper “An Image is Worth 16x16 Words.”

I usually check the names of the authors and their organizations to gauge the credibility of a paper before reading it. This paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, was submitted on Sep 28, 2020, and the authors’ names have not been revealed yet because the paper is under double-blind review. I will not explicitly mention the company’s name. However, you can make an educated guess about who can afford to spend 2,500 TPU days training a model (highlighted below), and there is another clue: the model was trained on JFT-300M, a private dataset of 300 million images.

Screenshot taken by author. The source is Table 2 of the paper “An Image is Worth 16x16 Words.”

Why Vision Transformer (ViT) Matters

This is not the first paper to apply the Transformer to computer vision. Facebook released the Detection Transformer (DETR) in May 2020; however, DETR used a Transformer in conjunction with a CNN. ViT is the most successful application of a Transformer to computer vision so far, and this research can be seen as making three contributions.

High Accuracy with Less Computation Time for Training

ViT cut training time by about 80% compared with Noisy Student (published by Google in Jun 2020) while reaching approximately the same accuracy, as Table 2 of the paper (above) shows. Noisy Student adopted the EfficientNet architecture, and in the near future I will write another blog post about EfficientNet to help readers see how far CNNs have come since ResNet.

Model Architecture without Convolutional Network

The core mechanism behind the Transformer architecture is Self-Attention. It gives the model the capability to understand the connections between its inputs. When Transformers are applied to NLP, they compute the relationships between words in a bidirectional manner, which means the input does not have to be processed in order, unlike with an RNN. A model with the Transformer architecture handles variable-sized input using stacks of Self-Attention layers instead of CNNs or RNNs. You can learn more about the Transformer in my last post, written in layman’s terms for business people, “Minimal Requirements to Pretend You are Familiar with BERT”.
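To make Self-Attention concrete, here is a minimal sketch of scaled dot-product self-attention. This is my own illustrative code, not the paper’s; the shapes and names are arbitrary toy choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq/Wk/Wv: (d_model, d_head) projection matrices
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every token scores every other token
    weights = softmax(scores, axis=-1)           # (n_tokens, n_tokens) attention weights
    return weights @ V                           # each output mixes information from all tokens

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (4, 8)
```

Because every token attends to every other token, the cost of the attention weights grows with the square of the sequence length, which is exactly the problem the next paragraph describes for images.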

A major challenge of applying a Transformer to images without a CNN is applying Self-Attention between pixels. If the input image is 640x640, treating every pixel as a token gives roughly 409K tokens, and Self-Attention has to score every pair of them, on the order of 1.7×10^11 combinations. Also, you can imagine that a pixel in one corner of an image is unlikely to have a meaningful relationship with a pixel in the opposite corner. ViT overcomes this problem by segmenting images into small patches (such as 16x16 pixels). The atom of a sentence is a word, and this research defines a patch, rather than a pixel, as the atom of an image in order to tease out patterns efficiently; a rough sketch of the patch-flattening step follows the figure below.

Screenshot taken by author. The source is Figure 1 of the paper “An Image is Worth 16x16 Words.”
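Here is a minimal sketch of that idea, using my own toy numbers (a 224x224 input and 16x16 patches, which is one configuration the paper describes); this is not the paper’s code.

```python
import numpy as np

# Toy numbers for illustration: a 224x224 RGB image and 16x16 patches.
image = np.random.rand(224, 224, 3)
patch = 16

# Pixel tokens vs. patch tokens: attention cost grows with the square of the token count.
n_pixels = 640 * 640                                      # 409,600 tokens if every pixel were a token
print(f"pixel-token attention pairs: {n_pixels ** 2:,}")  # ~1.7e11

# Split the image into non-overlapping 16x16 patches and flatten each one,
# turning the image into a short "sentence" of patch tokens.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
print(patches.shape)                                      # (196, 768): 196 tokens of 768 dims each
```

In the actual model, each flattened patch is then mapped to the model dimension by a learned linear projection and combined with a position embedding before entering the Transformer encoder.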

Efficacy of Transformer with Small Patches

The paper also examines the internal representations of ViT by inspecting the intermediate outputs of Multi-Head Attention. One finding is that the model encodes the distance between patches in the similarity of their position embeddings. Another is that ViT integrates information across the entire image even in the lowest Transformer layers. As a side note, ViT-Large has 24 layers with a hidden size of 1,024 and 16 attention heads. To quote the paper: “We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model.”
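As a sketch of how the first analysis can be reproduced, here is my own illustrative code; pos_embed is a stand-in for the learned position embeddings you would extract from a trained model, so the random values below will not show the pattern the paper reports.

```python
import numpy as np

# Hypothetical learned position embeddings: one row per patch, laid out on a
# 14x14 grid (as with 224x224 inputs and 16x16 patches).
n_side, d_model = 14, 768
pos_embed = np.random.rand(n_side * n_side, d_model)  # stand-in for real trained weights

# Cosine similarity between every pair of position embeddings.
norm = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
sim = norm @ norm.T                                   # (196, 196)

# Reshape one reference patch's similarities back onto the patch grid;
# in the paper, patches that are spatially close end up with more similar embeddings.
ref = 0                                               # top-left patch as the reference
print(sim[ref].reshape(n_side, n_side).round(2))
```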

Analyzing model performance qualitatively is often as important as analyzing it quantitatively for understanding the robustness of predictions. I usually use Class Activation Maps (by MIT in 2015) to validate robustness: I review the activation maps of correct predictions, false positives, and false negatives to create and test different hypotheses.
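For reference, here is a minimal Class Activation Map sketch for a CNN that ends in global average pooling, using a torchvision ResNet as my own example; it is not tied to ViT or the paper.

```python
import torch
from torchvision import models

# CAM applies to architectures that end in global average pooling + a linear classifier.
model = models.resnet18(pretrained=True).eval()

features = {}
def hook(_, __, output):
    features["maps"] = output           # feature maps just before global average pooling
model.layer4.register_forward_hook(hook)

x = torch.rand(1, 3, 224, 224)          # stand-in for a preprocessed input image
with torch.no_grad():
    logits = model(x)
cls = logits.argmax(dim=1).item()       # predicted class index

# CAM = class-specific weighted sum of the final feature maps.
fmap = features["maps"][0]              # (512, 7, 7)
weights = model.fc.weight[cls]          # (512,) classifier weights for the predicted class
cam = torch.relu(torch.einsum("c,chw->hw", weights, fmap))
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
print(cam.shape)                        # (7, 7) map to upsample over the input image
```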

Closing

I rarely read papers that are still under review, because their contents will be revised and many of them will even be rejected. But I wrote this post because the content is really innovative, and I also like the poetic title of the paper! I plan to update this post when the paper is officially published.

Update: Dec 4, 2020

The official repository for Vision Transformer is now available. Enjoy life with ViT!

Materials to Study Further

  1. You can read the submitted paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, on OpenReview.net.
  2. The Illustrated Transformer by Jay Alammar is the best material for understanding how the Transformer works step by step, with extremely helpful illustrations.
  3. If you want to understand the applications of the Transformer without math, my blog post, Minimal Requirements to Pretend You are Familiar with BERT, will help, since it targets business people and junior-level data scientists.
  4. If you are interested in the state-of-the-art CNN-based computer vision model from the Google Brain and Google Research teams (as of Feb 2021), before Vision Transformer dominates this domain, you can see its anatomy without math in Simple Copy-Paste is a Game Changer for Computer Vision Problems.
  5. The last material is not directly relevant to the concept of the Transformer, but readers have asked me how to implement Transformers. If you already have a basic understanding of Transformers, you can first learn how to use PyTorch from the post Understanding PyTorch with an example: a step-by-step tutorial, and then go through Hugging Face’s quick start to create your first Transformer model (a minimal PyTorch sketch follows this list). Enjoy Transformers!
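To give a flavor of the PyTorch route, here is a minimal sketch using PyTorch’s built-in Transformer encoder modules; the shapes are my own toy numbers, not a recipe for ViT itself.

```python
import torch
import torch.nn as nn

# A tiny Transformer encoder: stacked Self-Attention + feed-forward layers.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# PyTorch's default layout is (sequence_length, batch_size, d_model).
tokens = torch.rand(10, 32, 64)       # 10 tokens, batch of 32, 64-dim embeddings
out = encoder(tokens)                 # same shape: each token now mixes information from all others
print(out.shape)                      # torch.Size([10, 32, 64])
```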

Published in TDS Archive, an archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Written by Yoshiyuki Igarashi, an ML Engineer with 10+ years of experience delivering end-to-end solutions for real business problems. https://www.linkedin.com/in/yoshiyuki-igarashi/