Vision Transformer: State-of-the-art image identification technology without convolutional operations

Takehiko TERADA
axinc-ai
Jan 11, 2023

Introducing “Vision Transformer (ViT)”, a machine learning model that can be used with the ailia SDK.
You can easily use this model to create AI applications using ailia SDK as well as many other ready-to-use ailia MODELS.

Overview

ViT is a state-of-the-art image identification technology that does not use convolution, announced by Google Research, a research division of Google.
It has recorded the highest accuracy on ImageNet, a famous image recognition benchmark published by Stanford University.

Image identification is the task of predicting which object appears in an image.
ImageNet measures prediction accuracy over 50,000 images across 1,000 identification classes.
In other words, it is like answering 50,000 questions where random guessing would be correct only 0.1% of the time, so it is remarkable that ViT's identification accuracy exceeds 90%.

Source:https://paperswithcode.com/sota/image-classification-on-imagenet

The first version of the ViT paper appears to have been released in October 2020, with the program published a little before that.

In recent years, Convolutional Neural Networks, built on convolutional operations, have regularly occupied the top of the accuracy rankings for image identification.
The Vision Transformer has therefore attracted attention for its novelty: it recorded the highest accuracy without using convolutional operations at all.
The Transformer technique was originally proposed by Google Research not for image processing but for natural language processing.
In other words, ViT adapts a technology from natural language processing to image processing.

Architecture

The network configuration of ViT introduced in the paper is as follows.

Source: https://arxiv.org/pdf/2010.11929.pdf

Here is a brief outline of the flow (a minimal sketch of steps (1) through (3) follows the list).

(1) Divide the image into patches on a grid.
(2) Flatten each patch of (1), transforming the two-dimensional array of pixel luminance values into a one-dimensional vector.
(3) Add adjustable built-in parameters called the “class token” and “position embedding” to (2).
(4) Run the Transformer Encoder calculation on (3) multiple times.
(5) From the features generated in (4), compute softmax scores (≈ probability values) for the 1,000 target classes.
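Below is a minimal sketch in PyTorch of steps (1) through (3), assuming a 224x224 RGB input and 16x16 patches (the ViT-Base configuration); all variable names are illustrative, not taken from the official code.

import torch

image = torch.randn(1, 3, 224, 224)  # Batch, Channel, Height, Width
p, hidden = 16, 768                  # patch size, hidden dimension

# (1) Divide the image into a 14x14 grid of 16x16 patches.
patches = image.unfold(2, p, p).unfold(3, p, p)  # [1, 3, 14, 14, 16, 16]

# (2) Flatten each patch into a one-dimensional vector of 3*16*16 = 768 values.
# (The real model then applies a learnable linear projection to the hidden
# dimension; for ViT-Base the flattened size already happens to equal 768.)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * p * p)

# (3) Prepend a learnable "class token" and add a learnable "position embedding".
cls_token = torch.zeros(1, 1, hidden, requires_grad=True)
pos_embedding = torch.zeros(1, 14 * 14 + 1, hidden, requires_grad=True)
tokens = torch.cat([cls_token, patches], dim=1) + pos_embedding
print(tokens.shape)  # torch.Size([1, 197, 768])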

Here is an interesting fact: although the paper explains the process as above, the GitHub repository published by Google Research actually implements (1) differently.
The code for the relevant part is shown below.

What is actually being done is the following:

(1') A large convolution filter (e.g. 16x16) is applied with an equally large stride (e.g. 16), performing a highly compressing convolution operation that substitutes for cutting out and flattening the patches.

The program code also contains the following comment:

# We can merge s2d+emb into a single conv; it's the same.
x = nn.Conv(
    features=self.hidden_size,
    kernel_size=self.patches.size,
    strides=self.patches.size,
    padding='VALID',
    name='embedding')(x)

So here’s an illustration of what’s actually going on:

Source : https://pixabay.com/photos/labrador-retriever-dog-pet-labrador-6244939/

The Embedding process therefore begins with a high-compression convolution operation.
The shapes of the arrays for the input image and for the highly compressed convolutional features are as follows.

Shape of input data : [1, 3, 224, 224] # Batch, Channel, Height, Width
Shape of convolution feature : [1, 768, 14, 14] # Batch, Channel, Height, Width
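Since 224 / 16 = 14, the 16x16-stride convolution turns the image into a 14x14 grid of 768-channel features. Here is a quick sketch (not the official code) that checks those shapes with an equivalent PyTorch convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # Batch, Channel, Height, Width
# A 16x16 kernel moved with a stride of 16 embeds each patch in one step.
embedding = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)
print(embedding(x).shape)  # torch.Size([1, 768, 14, 14])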

The visualization looks like this:
The upper left is the original image, and the rest are 19 of the 768 high-compression convolution features.

Embedding in the above processing flow is performed only once, while the Encoder is executed multiple times.
This is what the loop in the diagram, where the output of the Encoder feeds back into its input, represents.

Inside the encoder, the upper stage performs the processing called Self Attention, and the lower stage performs a general multi-layer perceptron computation (hereafter MLP, an acronym for Multi Layer Perceptron).
The configuration that repeats these two is the Transformer Encoder in ViT; a minimal sketch of one such block follows.
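The sketch below shows one encoder block in PyTorch, assuming the pre-LayerNorm arrangement and ViT-Base hyperparameters; the class itself is illustrative, not an excerpt from any of the repositories discussed here.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, hidden_size),
        )

    def forward(self, x):
        # Upper stage: Self Attention with a residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Lower stage: MLP with a residual connection.
        return x + self.mlp(self.norm2(x))

tokens = torch.randn(1, 197, 768)    # class token + 196 patch tokens
print(EncoderBlock()(tokens).shape)  # torch.Size([1, 197, 768])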

After the encoder process has been repeated multiple times, the resulting features are passed through a general MLP with softmax output, which completes the network's output.

Various repositories of ViT implementation

This is slightly off topic, but there are several ViT repositories on GitHub, each with its own characteristics, so I would like to introduce them here.

First is the vit-pytorch repository by lucidrains, whose animated GIF makes the processing intuitive and easy to understand.

While the official repository by Google Research is a JAX implementation, this repository is a PyTorch implementation.
Many people prefer PyTorch, so this is much appreciated.
This is probably why its number of “Stars”, GitHub's equivalent of a “Like” rating, exceeds that of the official repository.

This repository performs Embedding as described in the paper.
In other words, the following step is implemented exactly as written, without any convolution operations (see the sketch after this step).

(1) Divide the image into patches on a grid.
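As a rough sketch in the spirit of that implementation (vit-pytorch uses einops; this is not a verbatim excerpt), the patches can be cut out and flattened with a pure rearrangement followed by a linear projection:

import torch
from einops.layers.torch import Rearrange

p = 16  # patch size
# Rearrange cuts the pixel grid into 16x16 patches and flattens each one;
# a plain Linear layer then projects them. No convolution is involved.
to_patch_embedding = torch.nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p),
    torch.nn.Linear(3 * p * p, 768),
)
print(to_patch_embedding(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 196, 768])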

Therefore, this repository may be useful when you want to reproduce the paper's implementation in its pure form.
However, since no pre-trained models are explicitly provided, you either need to prepare your own training data and train from scratch, or convert a JAX-trained model from the official repository to PyTorch.

Next is PyTorch-Pretrained-ViT by lukemelas, which does not have that many “Stars” but is very precisely implemented.

This repository is also a PyTorch implementation, but its main concept is to convert the pre-trained models provided by Google Research's official repository to PyTorch.
The weight values of the various JAX pretrained models are converted and loaded into a general-purpose ViT model implemented in PyTorch; a sketch of the kind of layout conversion this involves follows.
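For example, Flax stores convolution kernels in (height, width, input, output) order, while PyTorch expects (output, input, height, width), so the axes have to be permuted during the copy. A minimal sketch of this idea (variable names are illustrative):

import numpy as np
import torch

# A JAX/Flax conv kernel for the 16x16 patch embedding: H, W, In, Out.
jax_kernel = np.zeros((16, 16, 3, 768), dtype=np.float32)

# Permute the axes to PyTorch's Out, In, H, W layout before copying the
# values into the PyTorch model's state dict.
torch_weight = torch.from_numpy(np.ascontiguousarray(jax_kernel.transpose(3, 2, 0, 1)))
print(torch_weight.shape)  # torch.Size([768, 3, 16, 16])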

Finally, there is jeonsworld’s ViT-pytorch repository, which even implements an attention map.

https://github.com/jeonsworld/ViT-pytorch

Basically, it implements the same algorithms as the official repository in PyTorch, converting and loading a specific pre-trained model.
A notebook is also provided that visualizes the process and walks through the full series of operations, which is very helpful.
The output of the notebook is as follows.

After identifying the target, it also visualizes an attention map that darkens the areas that did not provide the basis for the decision.
The implementation of this attention map is not included in the official repository of Google Research; jeonsworld appears to have followed the implementation from the following repository.

The paper on Attention Flow, and the conceptual diagram it contains, are as follows:

Source : https://arxiv.org/pdf/2005.00928.pdf
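The core of the paper's “Attention Rollout” idea is to multiply the attention matrices of all layers together, adding the identity matrix at each layer to account for the residual connections. A minimal sketch, assuming the per-layer attention matrices (averaged over heads) have already been collected:

import torch

def attention_rollout(attentions):  # list of [tokens, tokens] matrices
    rollout = torch.eye(attentions[0].size(-1))
    for attn in attentions:
        attn = attn + torch.eye(attn.size(-1))        # account for the residual
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize each row
        rollout = attn @ rollout                      # accumulate across layers
    return rollout

# The class token's row (minus itself) scores how much each of the 196
# patches contributed to the final decision.
layers = [torch.rand(197, 197).softmax(dim=-1) for _ in range(12)]
mask = attention_rollout(layers)[0, 1:]
print(mask.reshape(14, 14).shape)  # torch.Size([14, 14])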

The ViT in ailia-models also follows jeonsworld’s implementation.
In addition, ailia-models also supports video input, so you can continuously check how the attention map changes from frame to frame.

Using from ailia SDK

The vit programs provided by the ailia SDK are as follows:

To perform processing on an image, use the command below.

$ python vit.py -i input.png
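To process a camera feed or a video file instead, so that you can watch the attention map change frame by frame, ailia-models scripts conventionally accept a video option; assuming vit.py follows that convention:

$ python vit.py -v input.mp4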

An execution example is below.

Source : https://pixabay.com/videos/car-racing-motor-sports-action-74/

ax Inc. has developed ailia SDK, which is a self-contained cross-platform high speed inference SDK for rapid AI application development.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
