How using self-attention for image classification reduces the inductive biases inherent to CNNs, such as translation equivariance and locality, and why this improves performance over ResNets when pre-training on much larger datasets such as ImageNet-21k. *This post’s associated Colab Notebook contains step-by-step code for downloading pre-trained ViT model checkpoints, defining a model instance, and fine-tuning ViT.*