An unofficial colab walkthrough of the Vision Transformer

Let’s see how it works by running the code!

Hiroto Honda
Dec 12, 2020

Hi, I’m Hiroto Honda, a computer vision researcher. [homepage] [twitter]

Playing with code is often a much more effective way to understand machine learning methods than reading papers.

This time I have created a colab notebook for a simple walkthrough of the Vision Transformer.

>>>> [colab notebook] <<<<

You can run the cells directly or make a copy of the notebook in your drive.

(Figure: Schematic of the Vision Transformer inference pipeline from the colab notebook.)

I hope you will be able to understand how the model works by looking at the actual data flow during inference.
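For reference, here is a minimal sketch of that data flow, written against timm’s VisionTransformer (the same module the notebook loads). The model name `vit_base_patch16_224` and the attribute names (`patch_embed`, `cls_token`, `pos_embed`, `blocks`, `norm`, `head`) are assumptions based on timm and may differ between versions; the notebook itself is the walkthrough to follow.

```python
# A minimal sketch of the ViT-Base/16 inference data flow.
# Assumes the timm package is installed (pip install timm); attribute names
# follow timm's VisionTransformer and may vary across timm versions.
import torch
import timm

model = timm.create_model('vit_base_patch16_224', pretrained=False).eval()

x = torch.randn(1, 3, 224, 224)  # a dummy input image batch

with torch.no_grad():
    # 1. Split the image into 16x16 patches and project each to a 768-d token.
    tokens = model.patch_embed(x)                        # (1, 196, 768)

    # 2. Prepend the learnable [class] token and add position embeddings.
    cls = model.cls_token.expand(x.shape[0], -1, -1)     # (1, 1, 768)
    tokens = torch.cat((cls, tokens), dim=1) + model.pos_embed  # (1, 197, 768)

    # 3. Run the token sequence through the stacked Transformer encoder blocks.
    tokens = model.blocks(tokens)                        # (1, 197, 768)
    tokens = model.norm(tokens)

    # 4. Classify from the [class] token only.
    logits = model.head(tokens[:, 0])                    # (1, 1000)

print(logits.shape)  # torch.Size([1, 1000])
```

The shapes in the comments correspond to ViT-Base/16 with a 224x224 input: 196 patch tokens plus one class token, each 768-dimensional.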

Have fun!

Credits

  • Paper: Alexey Dosovitskiy et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”, https://arxiv.org/abs/2010.11929
  • Model implementation: the notebook loads (and is inspired by) Ross Wightman (@wightmanr)’s amazing timm module: https://github.com/rwightman/pytorch-image-models/tree/master/timm. For the detailed code, please refer to the repo.
  • The notebook was presented at a paper reading group meeting of DeNA Co., Ltd. and Mobility Technologies Co., Ltd.
