An unofficial colab walkthrough of the Vision Transformer

Let’s see how it works by running the code!

Hiroto Honda
Dec 12, 2020

Hi, I’m Hiroto Honda, a computer vision researcher. [homepage] [twitter]

Playing with code is often a much more effective way to understand machine learning methods than reading papers.

This time I have created a colab notebook for a simple walkthrough of the Vision Transformer.

>>>> [colab notebook] <<<<

You can run the cells directly or make a copy of the notebook in your drive.

(Figure: Schematic of the Vision Transformer inference pipeline from the colab notebook.)

I hope you will be able to understand how the model works by looking at the actual data flow during inference.
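For reference, here is a minimal sketch of that data flow, written against timm’s VisionTransformer (the same module the notebook loads). The model name `vit_base_patch16_224` and the attribute names (`patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, `head`) are assumptions based on timm and may differ between versions; the notebook itself is the walkthrough to follow.

```python
# A minimal sketch of the ViT-Base/16 inference data flow.
# Assumes the timm package is installed (pip install timm); attribute names
# follow timm's VisionTransformer and may vary across timm versions.
import torch
import timm

model = timm.create_model('vit_base_patch16_224', pretrained=False).eval()

x = torch.randn(1, 3, 224, 224)  # a dummy input image batch

with torch.no_grad():
    # 1. Split the image into 16x16 patches and project each to a 768-d token.
    tokens = model.patch_embed(x)                        # (1, 196, 768)

    # 2. Prepend the learnable [class] token and add position embeddings.
    cls = model.cls_token.expand(x.shape[0], -1, -1)     # (1, 1, 768)
    tokens = torch.cat((cls, tokens), dim=1) + model.pos_embed  # (1, 197, 768)

    # 3. Run the token sequence through the stacked Transformer encoder blocks.
    tokens = model.blocks(tokens)                        # (1, 197, 768)
    tokens = model.norm(tokens)

    # 4. Classify from the [class] token only.
    logits = model.head(tokens[:, 0])                    # (1, 1000)

print(logits.shape)  # torch.Size([1, 1000])
```

The shapes in the comments correspond to ViT-Base/16 with a 224x224 input: 196 patch tokens plus one class token, each 768-dimensional.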

Have fun!

Credits

  • Paper: Alexey Dosovitskiy et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”, https://arxiv.org/abs/2010.11929
  • Model implementation: the notebook loads (and is inspired by) Ross Wightman (@wightmanr)’s amazing timm module: https://github.com/rwightman/pytorch-image-models/tree/master/timm. For the detailed code, please refer to the repo.
  • The notebook was presented at a paper reading group meeting of DeNA Co., Ltd. and Mobility Technologies Co., Ltd.
