ViT — VisionTransformer, a PyTorch implementation

Alessandro Lamberti · Published in Artificialis · Aug 19, 2022


The paper Attention Is All You Need revolutionized the world of Natural Language Processing, and Transformer-based architectures became the de facto standard for NLP tasks.

It was only a matter of time before someone tried to reach the state of the art in Computer Vision with attention mechanisms and Transformer architectures.

Although convolution-based architectures remain the state of the art for image classification, the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale shows that this reliance on CNNs is not necessary: a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks.
How?

At a very high level, an image is split into fixed-size patches, and the sequence of linear embeddings of these patches is provided as input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application.
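
To make this concrete, here is a minimal sketch of the patch-embedding step in PyTorch (the PatchEmbedding class and its hyperparameters are illustrative assumptions, not taken from any particular implementation). A Conv2d whose kernel size and stride both equal the patch size is equivalent to slicing the image into non-overlapping patches and applying the same linear projection to each flattened patch.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # kernel_size == stride == patch_size applies the same linear
        # projection to every flattened 16x16 patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width), e.g. (B, 3, 224, 224)
        x = self.proj(x)                  # (B, embed_dim, 14, 14) for a 224x224 input
        x = x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim): one token per patch
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

A 224x224 image thus becomes a sequence of 14 x 14 = 196 tokens, exactly like a 196-word sentence in NLP.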

However, Transformers natively lack CNNs’ inherent inductive biases, such as locality, and therefore do not generalize well when trained on insufficient amounts of data. When trained on large datasets, though, ViT matches or beats the state of the art on multiple image recognition benchmarks.
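
Before getting into the full implementation, the end-to-end flow can be summarized in a compact sketch built from PyTorch’s stock modules (MiniViT and every hyperparameter below are illustrative placeholders, much smaller than the paper’s configurations): embed the patches, prepend a learnable [CLS] token, add position embeddings, run a Transformer encoder, and classify from the [CLS] output.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """A deliberately small ViT-style classifier, for illustration only."""

    def __init__(self, num_classes=10, img_size=224, patch_size=16,
                 embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding, as in the previous sketch
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and position embeddings (patches + [CLS])
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Stock Transformer encoder; norm_first=True matches ViT's pre-LayerNorm
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)    # one [CLS] per image
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # classify from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

Leaning on nn.TransformerEncoder keeps the sketch short; a from-scratch implementation, like the one this article builds, unpacks the attention and MLP blocks explicitly.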
