Google Brain Uncovers Representation Structure Differences Between CNNs and Vision Transformers

Although convolutional neural networks (CNNs) have dominated the field of computer vision for years, new vision transformer models (ViTs) have also shown remarkable abilities, achieving comparable and even better performance than CNNs on many computer vision tasks. The success of ViTs has raised a number of questions: How are…