An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (Brief Review of the ICLR 2021 Paper)

Stan Kriventsov
Published in The Startup
Oct 9, 2020


In this post I would like to explain, without going into too much technical detail, the significance of the new paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," submitted by its authors (so far anonymously, due to the double-blind review requirements) to the ICLR 2021 conference. In another post, I provide an example of using this new model (called the Vision Transformer) with PyTorch to make predictions on the standard MNIST dataset.
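To give a rough sense of the paper's central idea up front: the model cuts an image into fixed-size 16x16 patches, linearly projects each patch to a vector, and feeds the resulting sequence of "patch tokens" into a standard Transformer encoder, just as words are fed into a language model. Here is a minimal PyTorch sketch of that pipeline. The dimensions follow the ViT-Base configuration from the paper, but the encoder depth and all variable names are illustrative, not the authors' actual code:

```python
import torch
import torch.nn as nn

patch_size = 16   # the "16x16 words" from the title
embed_dim = 768   # token dimension (ViT-Base uses 768)

# A stride-16 convolution with a 16x16 kernel is equivalent to
# flattening each non-overlapping 16x16 patch and applying a
# shared linear projection to it.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)         # one 224x224 RGB image
tokens = patch_embed(image)                 # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens

# The full model also prepends a learnable [class] token and adds
# position embeddings before the encoder; both are omitted here.
# Depth is truncated to 2 layers for the sketch (ViT-Base uses 12).
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(tokens)                       # (1, 196, 768)
```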

October 29 update: The paper, although still under review, has now been posted on arXiv and lists Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov et al. from Google Research as its authors. The fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer.

Background

Deep learning (machine learning using neural networks with more than one hidden layer) has been around since the 1960s, but it truly came to the forefront in 2012, when AlexNet, a convolutional network (in simple terms, a network that first looks for small patterns in each part of the image and then combines them into an overall picture) designed by Alex Krizhevsky, won the annual ImageNet image classification competition by a large margin.
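To make that parenthetical a bit more concrete, here is a tiny PyTorch sketch of what a single convolutional layer does: it slides small learned filters across the image and records where each local pattern appears. The filter count and sizes are arbitrary, chosen purely for illustration, and this is not AlexNet's actual architecture:

```python
import torch
import torch.nn as nn

# Eight learned 3x3 filters, each scanning the whole image for
# one kind of small local pattern (an edge, a blob of color, etc.).
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 3, 32, 32)   # one small RGB image
feature_maps = conv(image)          # (1, 8, 32, 32): 8 local-pattern maps

# Stacking such layers lets later ones combine local patterns
# into progressively larger structures -- the "overall picture."
print(feature_maps.shape)
```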

Over the following years, deep computer vision techniques experienced a true revolution, with new convolutional…
