ViT (Vision Transformer): Paper Summary
#01 Review.
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
Tags: #ML #CV
🧐 Short overview.
The authors show that a pure transformer encoder, applied directly to sequences of image patches, can perform image classification. The intuition is the following: represent an image as a sequence of vectors, feed the model a lot of data, and see what happens. Spoiler: they match or beat previous SOTA results with a model pretrained on large-scale image datasets.
github — https://github.com/google-research/vision_transformer
paper — https://arxiv.org/abs/2010.11929
🤿 Motivation.
- Transformers dominate in solving NLP tasks.
- Can we adapt this approach to images and replace standard CNN architectures?
- A transformer has a global receptive field, in contrast to the local receptive fields of CNNs. This might be helpful for many kinds of tasks.
The authors test these hypotheses on commonly used large image datasets.
🍋 Main Ideas.
[For a visualization, see Figure 1 in the attached poster.]
1) Patch extraction
Split the image into non-overlapping 16x16 patches (equivalently, a sliding window with stride 16). Each patch is flattened into a vector of dimension 16*16*3 = 768.
A learned linear projection then maps each flattened patch to a token of the model dimension.
2) Learnable position embedding:
Position embeddings are needed to retain spatial information, since self-attention itself is permutation-invariant. The authors use learnable position embeddings instead of fixed ones (e.g. sinusoidal) and find that learnable embeddings work better than fixed.
❕ Positional embeddings are vectors added to the sequence of tokens to encode position information.
3) Learnable classification token [CLS].
An additional learnable token is prepended to the sequence of image tokens. The motivation is that this token can aggregate global information from the other tokens, and its final representation is used for classification. A minimal sketch of all three steps is given below.
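The sketch below combines the three steps: patch extraction via a strided convolution, a learnable [CLS] token, and learnable position embeddings. It is PyTorch-style and illustrative only (the official implementation is in JAX/Flax); module and variable names are my own, and the hyperparameters follow the ViT-Base defaults.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Patchify an image and map patches + [CLS] token to D-dim embeddings.

    Illustrative sketch, not the official implementation.
    Defaults follow ViT-Base: 16x16 patches, D = 768.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to splitting
        # the image into non-overlapping patches and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and learnable 1D position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [CLS]: (B, N+1, D)
        return x + self.pos_embed              # add position information

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```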
📈 Experiment insights.
- First, Vision Transformers dominate ResNets on the performance/compute trade-off: ViT uses approximately 2–4× less computation to attain the same performance (average over 5 datasets).
- Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size.
- Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.
- Position embeddings are tied to a fixed number of patches, so they cannot be reused directly for a different input resolution. When fine-tuning at a higher (or lower) resolution, the authors 2D-interpolate the pretrained position embeddings to the new patch grid (see the sketch after this list).
- Analyzing the attention weights of the [CLS] token reveals semantically consistent image regions.
- They also test self-supervised pretraining: it performs worse than supervised pretraining, but better than no pretraining at all.
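A rough sketch of the 2D interpolation of position embeddings, assuming a square patch grid; the function name and grid sizes are illustrative choices, not taken from the paper or the official code.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate learned position embeddings to a new patch grid.

    pos_embed: (1, 1 + old_grid*old_grid, D), with the [CLS] embedding first.
    Returns:   (1, 1 + new_grid*new_grid, D).
    Illustrative sketch of the idea described in the paper.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    # Reshape to a 2D grid, bilinearly interpolate, then flatten back.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# Example: 224px pretraining (14x14 patches) -> 384px fine-tuning (24x24 patches).
pe = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pe, 14, 24).shape)  # torch.Size([1, 577, 768])
```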
Training parameters:
- Linear learning rate warmup and decay
- Adam with a batch size of 4096 and a high weight decay of 0.1, which the authors find improves transferability (a minimal config sketch is given below).
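A minimal sketch of this training setup: the batch size, weight decay, and Adam betas follow the paper, while the model, learning rate, and step counts are placeholders chosen for illustration.

```python
import torch

# Stand-in for a ViT model; the real model and dataloader are omitted here.
model = torch.nn.Linear(768, 1000)
batch_size = 4096                            # paper's pretraining batch size
total_steps, warmup_steps = 10_000, 1_000    # illustrative step counts

# Adam with the paper's betas and a high weight decay of 0.1 (lr value is illustrative).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=0.1)

# Linear warmup to the base lr, then linear decay to zero.
def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call optimizer.step() and scheduler.step() once per step.
```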
Dataset description:
- ImageNet: 1k classes, 1.3M images
- ImageNet-21k: 21k classes, 14M images
- JFT-300M: 18k classes, 303M high-resolution images
✏️ My Notes. How can we use it for brain signal analysis?
- The application of transformers to image classification shows their potential in other fields as well. In particular, they can be used for time-series analysis, since self-attention captures interactions between input components, and the [CLS] token should capture features describing global class attributes.
- The capacity of transformer models is large enough to approximate much more complicated functions than CNNs. In principle, all we need to do is feed them enough data, and everything should work well.
- We can try to adapt this model for brain signal analysis:
  - We can represent brain activity as a sequence of tokens (a feature vector over electrodes at each time step); a toy tokenization sketch follows after this list.
  - The open question is how to get enough data to pretrain such a model. Perhaps this is less critical, since the functions approximated in BCI research may require fewer parameters.
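A purely hypothetical sketch of this tokenization idea: the channel count, window length, and embedding size are made-up assumptions, and nothing here comes from the ViT paper.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a 64-channel recording with 256 time steps per trial.
n_channels, n_steps, dim = 64, 256, 128

class BrainSignalEmbedding(nn.Module):
    """Map each time step's electrode vector to a token, then add [CLS] and positions."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(n_channels, dim)               # per-time-step projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_steps + 1, dim))

    def forward(self, x):                                    # x: (B, n_steps, n_channels)
        tokens = self.proj(x)                                # (B, n_steps, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

trial = torch.randn(8, n_steps, n_channels)                  # a batch of 8 trials
print(BrainSignalEmbedding()(trial).shape)                   # torch.Size([8, 257, 128])
```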
Made in collaboration with Алексей Тимченко