Understanding representations of concepts in Visual Transformers by analyzing attention maps from pre-trained models

Braian O. Dias
8 min read · Dec 16, 2021


Convolutional networks have been widely studied since they became the main method for computer vision tasks such as image classification. Their internal mechanisms, and how they represent visual concepts layer by layer, are well understood. Recently, researchers have started applying Transformers [1], the architecture recognized as state-of-the-art for many NLP tasks, to image processing. These architectures have shown they can achieve superior performance on vision tasks, which raises the question of how they represent visual concepts internally.

In this project, we attempt to understand these internal representations by looking at attention maps from pre-trained Vision Transformers. The main goal is to see whether a pattern emerges as the network accumulates information from image tokens while building the image’s representation.

Attention map.

Background

Most of the recent breakthrough achievements in Natural Language Processing (NLP) are attributed to self-attention-based architectures, in particular Transformers. Originally designed for language translation, their core mechanism is able to learn long-range dependencies in sequences of tokens (words) and create representations that generalize across different tasks. These models, such as BERT and GPT, are often pre-trained on a large corpus and later fine-tuned for a specific language task, offering a robust yet flexible way to achieve state-of-the-art results in many NLP tasks.

Convolutional Neural Networks (CNNs) have enjoyed similar success in computer vision for many years and are still dominant in tasks such as image classification [2]. Unlike Transformers, CNN-like architectures are not very good at tracking long-range dependencies across an image and require a substantial increase in computational power to do so. Some works have combined CNNs with self-attention; however, ResNet-like architectures still prevail in large-scale image recognition competitions.

With that in mind, recent work has focused on applying standard Transformers to images, without combining them with existing CNN architectures. The basic analogy of tokens (words) in NLP is translated to image patches, which maintain some form of dependency on their neighbors. This approach has led to results close to state-of-the-art convolutional networks, as in Vision Transformers (ViT).

To understand how these models process image data, more investigation of their internal representations is needed. This article investigates the attention maps of a pre-trained network, ViT, at different depths, trying to understand how the model represents visual concepts.

Motivation

Recurrent neural networks (RNNs), long short-term memory (LSTM) and gated recurrent networks all proved successful in sequence modeling and performed extremely well in language modeling and machine translation tasks. The introduction of attention mechanisms allowed these models to handle dependencies regardless of their distance in the input or output sequences. Despite their good performance, they were often complex and ran into computational limitations. The Transformer architecture, introduced in “Attention Is All You Need” [1], is a simpler approach based only on attention mechanisms, reporting better performance and lower training cost compared to previous models based on recurrent or convolutional neural networks. Today, this approach is considered the state-of-the-art architecture for language modeling and machine translation.

In Transformers, self-attention allows every element of a sequence (token) to interact with all the others and figure out which ones are most relevant.

Self-attention in a sequence of words. The highlighted word “it” in the right column is related to other words, but is most strongly related to “street”.

The image above shows the relation between the word “it” and all other words in the sequence. Humans can easily understand from the text that “it” refers to “street”. In self-attention, every word (token) has a corresponding weight for every other word indicating how much it depends on that word, which allows the model to represent dependencies very well.
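As a minimal sketch of how these weights are computed (a single head with illustrative names, not any particular implementation), scaled dot-product self-attention can be written as:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) sequence of token embeddings.
    w_q, w_k, w_v: (d_model, d_model) projection matrices.
    Returns the attended values and the (seq_len, seq_len) attention map.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # token-to-token similarity
    attn = F.softmax(scores, dim=-1)          # each row sums to 1
    return attn @ v, attn

# Toy example: 5 tokens with 8-dimensional embeddings.
d = 8
x = torch.randn(5, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(attn.shape)  # torch.Size([5, 5]) -- one weight per token pair
```

The rows of `attn` are exactly the kind of weights visualized in the figure above: for each token, a distribution over all other tokens.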

Illustration from [5]

In vision models using vanilla CNNs, only local dependencies can be captured by regular-sized filters; larger filters can capture longer-range dependencies, but at the expense of computational cost. For example, the internal representation of both faces in the picture would be similar in a CNN, since it does not encode the position of features relative to each other.

To tackle this problem, various works tried to embed attention mechanisms into CNNs [3], with relative success compared to vanilla CNNs, but they did not scale well on modern hardware accelerators, leaving classic ResNet-like architectures dominant in large-scale image recognition.

Introduced by Google in the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” [4] in December 2020, the Vision Transformer (ViT) is a vision model inspired by the work done with Transformers in NLP tasks. It tries to stay as close as possible to the original concept without relying on convolutional networks. The idea of splitting a sequence of inputs into tokens is adapted to the image domain by splitting an image into a sequence of image patches, as shown below.

How ViT splits an image into tokens and processes it.
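A rough sketch of this patch-splitting step using plain tensor reshaping; the 224x224 input and 16x16 patch size match the ViT-Base configuration used later, and all names are illustrative:

```python
import torch

def image_to_patches(img, patch_size=16):
    """Split a (C, H, W) image into a sequence of flattened patches.

    Returns a (num_patches, C * patch_size * patch_size) tensor,
    i.e. one "token" per image patch.
    """
    c, h, w = img.shape
    p = patch_size
    patches = img.unfold(1, p, p).unfold(2, p, p)   # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4)        # (H/p, W/p, C, p, p)
    return patches.reshape(-1, c * p * p)

tokens = image_to_patches(torch.randn(3, 224, 224))
print(tokens.shape)  # torch.Size([196, 768]) -- 14x14 patches of 16x16x3 pixels
```

In the actual model each flattened patch is then linearly projected into an embedding and combined with a positional embedding before entering the encoder.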

ViT achieved results comparable to the state of the art on multiple image recognition benchmarks while requiring fewer computational resources to train. Since research on Transformer-based vision models is relatively new, there is still work to do on understanding the relationship between the different layers of a ViT network and how visual concepts are represented internally.

This blog post tries to shed some light on the topic by visualizing the attention maps of each layer of a ViT-Base model. The goal is to answer two questions:

  1. What does the network “see” in each layer?
  2. How well does its internal representation match the input?

Methodology

The original Vision Transformer paper presents three main ViT architectures. To keep the experiment manageable even on common hardware, the experiments used the ViT-Base model, which has 12 layers and 12 attention heads. The model checkpoint used was pre-trained on the ImageNet-21k and ImageNet datasets, and it accepts images of size 224x224 (with a patch size of 16x16).

ViT models.

The Caltech101 dataset was chosen as the test set for extracting the attention maps. It contains images from 101 different categories, roughly 300x200 pixels in size, which is close to the network’s input size and therefore requires little pre-processing of the original images.
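A minimal sketch of the pre-processing step, assuming standard torchvision transforms; the file path is illustrative, and the 0.5 mean/std normalization is an assumption based on the values commonly used with ViT checkpoints:

```python
from PIL import Image
from torchvision import transforms

# Resize/crop to the 224x224 input expected by ViT-B/16 and normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

img = Image.open("caltech101/airplanes/image_0001.jpg").convert("RGB")
x = preprocess(img).unsqueeze(0)  # (1, 3, 224, 224) batch of one image
```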

To obtain the attention maps, we first need to run an image through the network and then inspect its intermediate layers. The PyTorch implementation of the ViT model from https://github.com/jeonsworld/ViT-pytorch [6] has an option on the Encoder that exposes the intermediate attention weights of each layer. At first that seemed enough for visualizing the attention maps; however, one has to understand the output of each layer.
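A rough sketch of loading the pre-trained checkpoint and running one image with attention outputs enabled. The class and function names (VisionTransformer, CONFIGS, load_from, the vis flag) follow that repository as best I recall and should be treated as assumptions, as should the checkpoint file name and number of classes:

```python
import numpy as np
import torch
from models.modeling import VisionTransformer, CONFIGS  # from jeonsworld/ViT-pytorch

# Build ViT-B/16 with vis=True so the forward pass also returns
# the attention weights of every layer.
config = CONFIGS["ViT-B_16"]
model = VisionTransformer(config, img_size=224, num_classes=1000, vis=True)
model.load_from(np.load("ViT-B_16-224.npz"))  # pre-trained checkpoint weights
model.eval()

x = torch.randn(1, 3, 224, 224)  # or the pre-processed image from the snippet above

with torch.no_grad():
    logits, attn_weights = model(x)

# attn_weights is a list with one tensor per layer,
# each of shape (batch, num_heads, num_tokens, num_tokens).
print(len(attn_weights), attn_weights[0].shape)
```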

First, the input image is split into 16x16 patches, yielding 196 patch tokens; a learnable class token is prepended, so the encoder actually attends over 197 tokens. The encoder has 12 layers with 12 attention heads each, so the stacked attention weights have shape (12 x 12 x 197 x 197), where the first 12 is the number of layers, the second 12 is the number of attention heads, and the last two dimensions are the token-to-token attention weights.
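Continuing the sketch above, the per-layer tensors returned by the forward pass can be stacked into a single array with that layout (a dummy stand-in is used here so the snippet runs on its own):

```python
import torch

# `attn_weights` comes from the forward pass sketched above: a list with one
# tensor per layer, each of shape (batch=1, num_heads, num_tokens, num_tokens).
attn_weights = [torch.rand(1, 12, 197, 197) for _ in range(12)]  # dummy stand-in

# Stack the layers and drop the batch dimension:
# (num_layers, num_heads, num_tokens, num_tokens).
att_mat = torch.stack(attn_weights, dim=0).squeeze(1)
print(att_mat.shape)  # torch.Size([12, 12, 197, 197])
```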

Results

At first, plotting the raw weights without any pre-processing was discouraging. The results did not show any pattern, and in most cases only a diagonal line could be seen. After some research, we came across an implementation of feature map visualization which performed a few extra steps to generate the final visualization:

  1. Attention weights were averaged across all attention heads;
  2. To account for residual connections, an identity matrix was added to the attention matrix.

Step (1) aggregates all the attention heads within each layer, so the result can be shown as one attention map per layer.

Also, since the output of each layer is fed into the next one, the same logic is applied across layers by multiplying the (augmented) attention matrices together, yielding a final attention map, here called the joint attention map. Below is an example of a random image from the Caltech dataset classified with ViT, together with its top-5 predicted classes and joint attention map.

ViT prediction and attention map sample.
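A sketch of how such a joint attention map can be computed from the stacked weights, in the spirit of attention rollout; the row re-normalization after adding the identity and the final bilinear upsampling are my own choices, following common implementations of this visualization:

```python
import torch
import torch.nn.functional as F

def joint_attention_map(att_mat, grid=14):
    """Fuse per-layer attention into one map over image patches.

    att_mat: (num_layers, num_heads, num_tokens, num_tokens), class token first.
    Returns a (grid, grid) map of how much the class token attends to each patch.
    """
    att = att_mat.mean(dim=1)                      # (1) average over attention heads
    att = att + torch.eye(att.size(-1))            # (2) add identity for residual connections
    att = att / att.sum(dim=-1, keepdim=True)      # re-normalize rows (assumption)

    joint = att[0]
    for layer in att[1:]:                          # multiply the matrices layer by layer
        joint = layer @ joint

    cls_to_patches = joint[0, 1:]                  # class token -> 196 image patches
    return cls_to_patches.reshape(grid, grid)

# Upsample the 14x14 map to the 224x224 input resolution for overlaying on the image.
att_mat = torch.rand(12, 12, 197, 197)             # dummy stand-in for the stacked weights
mask = joint_attention_map(att_mat)
mask = F.interpolate(mask[None, None], size=(224, 224),
                     mode="bilinear", align_corners=False)[0, 0]
print(mask.shape)  # torch.Size([224, 224])
```

The resulting mask can be normalized to [0, 1] and multiplied with (or overlaid on) the original image to produce visualizations like the one above.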

Next, we randomly sampled four different images and analyzed their attention maps for each layer, side by side.

Attention maps for each layer.

Conclusions

The visualizations showed a pattern similar to what is seen when visualizing convolutional filters in CNNs. Vision Transformers can already express a complex object in their first layer, but the best visualization is achieved by multiplying the attention weights of all layers together (the joint attention map).

In deeper layers of the network, the main object is not totally clear, but a “shadow” of it can still be observed (with the exception of the very first column, where layer 12 highlights the main object).

It is also shown that the network “sees” the object at multiple stages, and in the case of the joint attention map the visualization matches the input picture nicely.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

[2] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.

[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[5] https://becominghuman.ai/transformers-in-vision-e2e87b739feb

[6] https://github.com/jeonsworld/ViT-pytorch
