Published in Nerd For Tech
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” Paper Summary & Analysis

Paper: https://arxiv.org/pdf/2010.11929.pdf

Discussion led by Victor Butoi & Cora Wu, Intelligent Systems subteam

Objectives of the Paper

What problem is the paper tackling?
The paper tackles the problem of applying the Transformer architecture to computer vision tasks, reducing the field’s heavy reliance on CNNs. It argues that this transition can produce results comparable to traditional CNNs while requiring substantially fewer computational resources to train.

What is the relevant background for this problem?
Transformers have been used extensively for NLP tasks, as in the state-of-the-art models BERT, GPT, and their variants. Some prior work has applied transformers to image tasks, but those approaches are generally very computationally expensive.

Paper Contributions

What methods did the paper propose to address the problem?
To fit the Transformer’s expected input, the paper reshapes each 2D image into a sequence of flattened 2D patches. A learnable embedding is prepended to the sequence of embedded patches; this token serves the same purpose as BERT’s [class] token. Position embeddings are then added to the patch embeddings to retain positional information.
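The patch pipeline above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the `patchify` name, the 16×16 patch size, and the 768-dimensional width (ViT-Base) are illustrative choices, and the projection `E`, class token, and position embeddings would all be learned parameters in a real model.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened P*P*C patches."""
    H, W, C = image.shape
    P = patch_size
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image, 16)            # (196, 768): a 14x14 grid of patches

D = 768
E = rng.standard_normal((patches.shape[1], D)) * 0.02  # learned linear projection
cls_token = np.zeros((1, D))                           # learnable [class] token
pos_embed = rng.standard_normal((patches.shape[0] + 1, D)) * 0.02

# Prepend the class token, then add position embeddings to the whole sequence.
tokens = np.concatenate([cls_token, patches @ E], axis=0) + pos_embed
print(tokens.shape)  # (197, 768)
```

The extra token at position 0 is what the classification head reads out after the encoder.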

The transformer encoder consists of alternating layers of multi-headed self-attention and MLP blocks. The state of the [class] token at the output of the Transformer encoder serves as the image representation, to which a classification head is attached: an MLP with one hidden layer during pre-training, and a single linear layer during fine-tuning.
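A single encoder layer can be sketched as follows. This is a simplified NumPy rendering with random, untrained weights: it uses the pre-norm residual layout and GELU activation described in the paper, but omits biases and dropout, and the sizes (width 64, 4 heads) are toy values chosen for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU, the activation in the MLP blocks
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def msa(x, Wq, Wk, Wv, Wo, heads):
    """Multi-headed self-attention over a sequence of N tokens."""
    N, D = x.shape
    d = D // heads
    q, k, v = (np.transpose((x @ W).reshape(N, heads, d), (1, 0, 2))
               for W in (Wq, Wk, Wv))
    attn = softmax(q @ np.transpose(k, (0, 2, 1)) / np.sqrt(d))  # (heads, N, N)
    return np.transpose(attn @ v, (1, 0, 2)).reshape(N, D) @ Wo

def encoder_block(x, p, heads=4):
    # Pre-norm residual layout: LN -> MSA -> add, then LN -> MLP -> add
    x = x + msa(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"], heads)
    return x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]

rng = np.random.default_rng(0)
D = 64
p = {k: rng.standard_normal((D, D)) * 0.02 for k in ("Wq", "Wk", "Wv", "Wo")}
p["W1"] = rng.standard_normal((D, 4 * D)) * 0.02  # MLP expansion factor of 4
p["W2"] = rng.standard_normal((4 * D, D)) * 0.02
out = encoder_block(rng.standard_normal((197, D)), p)
print(out.shape)  # (197, 64): shape is preserved, so layers can be stacked
```

Because each block maps a sequence to a sequence of the same shape, the full encoder is just this block repeated L times.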

The Vision Transformer (ViT) is pre-trained on large datasets and then fine-tuned on smaller downstream tasks. Fine-tuning is done by removing the pre-trained prediction head and replacing it with a zero-initialized feedforward layer sized for the downstream classes.
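The head swap is simple enough to show directly. In this sketch, K = 10 is an arbitrary choice of downstream class count; the point is that a zero-initialized linear head starts with no preference among classes.

```python
import numpy as np

D, K = 768, 10            # encoder width, number of downstream classes
head = np.zeros((D, K))   # zero-initialized D x K feedforward (linear) layer

rng = np.random.default_rng(0)
cls_output = rng.standard_normal(D)   # [class] token state from the encoder
logits = cls_output @ head
print(logits)  # all zeros before fine-tuning: every class scored equally
```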

How are the paper’s contributions different from previous related works?
This is not the first paper applying transformers to CV. Facebook had already released DETR (Detection Transformer); however, DETR uses transformers in conjunction with CNNs rather than standalone. This paper distinguishes itself as a successful application of standalone transformers to CV. Its main contributions differ from prior work as follows:

  • Accuracy with less compute time: ViT cut pre-training compute roughly fivefold (about 20% of the compute) relative to Noisy Student, while reaching approximately the same accuracy (see Table 2 of the paper).
  • No Convolutions: In theory, an MLP can outperform a CNN; in practice, data has been a major barrier to MLP performance. The inductive bias imposed by CNNs has greatly advanced the field of CV, and with the very large dataset they use, the authors are able to overcome the need for this inductive bias. A transformer differs slightly from a traditional MLP in that its core mechanism is self-attention, which lets it model the relationships between its inputs. In NLP, it computes relations between words bidirectionally, so word order is less strict than in a unidirectional RNN.
  • Efficacy of Transformer: The paper analyzed the internal representations of ViT by examining the output of the attention heads (similar to BERTology work). It found that the model encodes distance between patches in its position embeddings, and that ViT integrates information from the entire image even in the lower layers: “We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model.” The paper also analyzes model performance quantitatively and qualitatively visualizes the model’s attention maps and focus.
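The "how far does a head look" analysis rests on an attention-weighted distance metric. Below is a simplified version of that idea (the paper averages it per head over many real images; here we use two synthetic attention matrices to show the two extremes). The function name and grid size are illustrative.

```python
import numpy as np

def mean_attention_distance(attn, coords):
    """Attention-weighted average spatial distance between query and key patches.
    attn: (N, N) row-stochastic attention matrix; coords: (N, 2) patch positions."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float((attn * dist).sum(axis=1).mean())

side = 14                                    # a 14 x 14 grid of patches
coords = np.stack(np.meshgrid(np.arange(side), np.arange(side),
                              indexing="ij"), axis=-1).reshape(-1, 2).astype(float)
N = side * side
local = np.eye(N)                            # each patch attends only to itself
global_ = np.full((N, N), 1.0 / N)           # uniform attention over all patches

print(mean_attention_distance(local, coords))    # 0.0: purely local head
print(mean_attention_distance(global_, coords))  # several patch-widths: global head
```

Heads with a large value of this metric in the lowest layers are the ones the quoted finding refers to.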

How did the paper assess its results?
The proposed methodology was evaluated using several datasets: ImageNet (1k classes), ImageNet-21k (21k classes), JFT (18k classes), and the VTAB suite. Results were measured through either fine-tuning accuracy (accuracy after fine-tuning the model on the downstream dataset) or few-shot accuracy (accuracy of a linear model fit on representations of a small subset of the training images).

They compared the transformer model against popular image-classification baselines such as Big Transfer (BiT) and Noisy Student. ViT’s configurations were based on those of BERT, and the ResNet baseline was modified by replacing Batch Normalization with Group Normalization and adopting standardized (weight-standardized) convolutions to improve transfer learning.
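The two ResNet modifications can be sketched as follows. This is an illustration of the normalization math only, under the assumption of the usual (out, in, kH, kW) weight and (N, C, H, W) activation layouts; the learnable affine (scale/shift) parameters that real implementations carry are omitted.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    """Weight Standardization: zero mean, unit variance per output filter."""
    mean = w.mean(axis=(1, 2, 3), keepdims=True)   # w: (out_c, in_c, kH, kW)
    std = w.std(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / (std + eps)

def group_norm(x, groups, eps=1e-5):
    """Group Normalization: normalize within channel groups, no batch statistics."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    g = (g - g.mean(axis=(2, 3, 4), keepdims=True)) / np.sqrt(
        g.var(axis=(2, 3, 4), keepdims=True) + eps)
    return g.reshape(n, c, h, w)

rng = np.random.default_rng(0)
w = standardize_weights(rng.standard_normal((64, 3, 3, 3)))
x = group_norm(rng.standard_normal((2, 32, 8, 8)), groups=8)
print(w.mean(axis=(1, 2, 3)).max())  # ~0: each filter is centered
```

Because Group Normalization uses no batch statistics, the modified ResNet transfers more reliably across datasets with different batch sizes, which is the motivation cited here.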

In addition, the paper conducted a preliminary study on self-supervised pre-training of ViT via masked patch prediction, showing an accuracy gain of about 2% over training from scratch.
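The masked-patch-prediction setup can be sketched at a high level. This is a deliberately simplified version: the paper corrupts 50% of patch embeddings (mostly with a learnable mask embedding, sometimes with a random or unchanged one) and trains the model to predict the mean color of the masked patches; here we just show the corruption step, using zeros as a stand-in for the mask embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_embeddings = rng.standard_normal((196, 768))   # one image's embedded patches

mask_ratio = 0.5                                     # corrupt half the patches
masked_idx = rng.choice(196, size=int(196 * mask_ratio), replace=False)
corrupted = patch_embeddings.copy()
corrupted[masked_idx] = 0.0   # stand-in for a learnable [mask] embedding

# The model would then be trained to predict a property of the masked
# patches (the paper uses their mean color) from the corrupted sequence.
print(corrupted.shape, len(masked_idx))  # (196, 768) 98
```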

Paper Limitations, Further Research, and/or Potential Applications

The paper introduces ViT: the use of Vision Transformers in place of CNNs or hybrid approaches for image tasks. The results are promising but not complete, since performance on vision tasks other than classification, such as detection and segmentation, is not presented. Moreover, unlike in NLP (Vaswani et al., 2017), the improvement transformers offer here over CNNs is much more limited. The authors hypothesize that further pre-training could yield improved performance, since ViT scales comparatively well against other state-of-the-art models.

Furthermore, Kaplan et al. present scaling laws for transformers (primarily in comparison to LSTMs in NLP), demonstrating that transformers can be scaled to much larger datasets. It would be interesting to see whether transformers exhibit similar scaling behavior relative to CNNs; if so, it would be a clear sign that transformer-based techniques will become state of the art in CV as well.

Ultimately, these results point toward the possibility of transformers becoming a universal model, capable of learning across a wide domain of human tasks and of scaling with data to an extraordinary degree. That vision is not here yet, and may never arrive; if it does, this paper will be remembered as a harbinger.
