Attention Mechanism in Vision Models

Arvind
16 min read · Nov 12, 2021


In this article, we explore the attention mechanism and its application in vision models. Attention was first introduced by Bahdanau et al. [2] for neural machine translation. It is a technique that enables a network to focus on the parts of the input data that are most important for making a prediction. Since its introduction, it has revolutionized the field of NLP, becoming a key component of the state-of-the-art models for a wide variety of tasks.

We will be exploring the following four papers where attention is used:

  1. Attention Is All You Need
  2. CBAM: Convolutional Block Attention Module
  3. Attention Augmented Convolutional Networks
  4. Stand-Alone Self-Attention in Vision Models

The first paper we discuss is ‘Attention Is All You Need’, published by Google Brain. Although this revolutionary paper (more than 30K citations!) deals only with NLP, it was the first to rely purely on attention, via the Transformer architecture. This makes it an important paper to understand before moving on to the three vision-focused papers that follow.

Transformers Explained

The ‘Attention Is All You Need’ paper tackles the task of machine translation in NLP. A major goal of this work is to better model long-range dependencies in input sequences so as to make sense of the global context. The authors also seek better parallelization of the computation (unlike the sequential status quo at the time), leading to better utilization of hardware resources such as GPUs. To achieve this, they propose the novel Transformer architecture, which relies entirely on the attention mechanism and eschews sequential modeling architectures (RNNs, LSTMs, etc.).

To keep this article succinct, we provide only a brief overview of the paper here. Interested readers may refer to this excellent article for a more in-depth yet simple-to-understand description.

The Transformer — model architecture

At a high level, the proposed model has an encoder-decoder architecture, where the encoder maps the input into d_model-dimensional representations and the decoder generates the output sequence. There are 6 encoder layers and 6 decoder layers. Each encoder layer is composed of two sub-layers: a multi-head self-attention module (described later) followed by a fully connected feedforward network. Each decoder layer has a third sub-layer: multi-head attention performed over the outputs of the encoder.

Multi-Head Attention

What is the attention function? As written in the paper, “An attention function can be described as mapping a query and a set of key-value pairs to an output where the query, key, values, and output are all vectors”.

The authors implement a version of attention known as scaled dot-product attention. The keys and queries are d_k-dimensional vectors and the values are d_v-dimensional vectors. We apply a softmax to the scaled dot products between a query and all the keys, and the resulting outputs act as weights for the value vectors. Essentially, the query helps the network figure out which values are important (i.e., which to attend to). In addition, this attention is computed not just once but h (=8) times, using learned linear projections of the queries, keys, and values. All of these attention computations are performed in parallel, hence the name multi-head attention. Multi-head attention is used because it allows the network to jointly attend to information from different learned representation subspaces at different positions in the sequence. The equations and the figure below are useful for better understanding.

Equations for Multi-Head Attention:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Scaled Dot-product Attention (left) and Multi-Head Attention (right)
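
To make the mechanics concrete, below is a minimal PyTorch sketch of scaled dot-product attention and multi-head attention as described above. This is only an illustrative sketch, not the reference implementation: the names (scaled_dot_product_attention, SimpleMultiHeadAttention) and the defaults (d_model=512, 8 heads) are our own choices, and details such as masking and dropout are omitted.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        # q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # dot products, scaled by sqrt(d_k)
        weights = F.softmax(scores, dim=-1)                # attention weights over the keys
        return weights @ v                                 # weighted sum of the values

    class SimpleMultiHeadAttention(nn.Module):
        def __init__(self, d_model=512, num_heads=8):
            super().__init__()
            assert d_model % num_heads == 0
            self.h, self.d_k = num_heads, d_model // num_heads
            # learned linear projections for queries, keys, values, and the output
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)

        def forward(self, query, key, value):
            b, n, _ = query.shape
            # project, then split into h heads: (batch, heads, seq_len, d_k)
            q = self.w_q(query).view(b, n, self.h, self.d_k).transpose(1, 2)
            k = self.w_k(key).view(b, -1, self.h, self.d_k).transpose(1, 2)
            v = self.w_v(value).view(b, -1, self.h, self.d_k).transpose(1, 2)
            out = scaled_dot_product_attention(q, k, v)    # every head attends in parallel
            out = out.transpose(1, 2).reshape(b, n, self.h * self.d_k)
            return self.w_o(out)                           # final learned projection

In the encoder this block is used with query = key = value (self-attention); in the decoder's third sub-layer, the queries come from the decoder while the keys and values come from the encoder output.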

Another key aspect of this architecture is the positional embedding. Unlike the prior art, this network is not sequential in nature; therefore, we need a way of representing the positions of the input tokens in a sequence. To do this, the authors use sine and cosine functions of different frequencies. The equation below captures this key detail:

Equation for positional embedding:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position and i is the dimension.
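
As a small illustrative sketch (the helper name sinusoidal_positional_encoding is ours), the whole table of positional encodings can be precomputed and added to the token embeddings before the first encoder layer:

    import torch

    def sinusoidal_positional_encoding(max_len, d_model):
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)); assumes d_model is even
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimensions
        angles = pos / torch.pow(10000.0, i / d_model)                 # (max_len, d_model / 2)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe

    # e.g. embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, 512)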

Experimental Results

The authors analyze the performance of the model on the task of machine translation. One dataset used is the WMT 2014 English-German dataset, which contains about 4.5 million sentence pairs. They achieve a BLEU score of 28.4 using the ‘big’ version of the Transformer, 2.0 BLEU more than the previous best-performing model. Similarly, for the English-French translation task, their model does much better than the previous state of the art at a quarter of the training cost. These results clearly demonstrate the effectiveness of the new architecture, both in translation accuracy and in the computational cost required. More intricate details can be found in the paper.

BLEU scores for the machine translation task

The authors also carry out ablation studies by varying the attention key and value dimensions and the number of attention heads while keeping the amount of computation constant. These experiments lead them to the configuration described above. The authors also check how the model generalizes to the English constituency parsing task and achieve state-of-the-art performance.

Insights, observations, and future work

  • The authors’ novel method outperforms all state-of-the-art models for machine translation while eschewing all forms of sequential modeling. The paper has stood the test of time in the four years since it was published: the Transformer architecture has become the baseline for the latest state-of-the-art models, making this work highly significant and influential. It is important to understand this paper in detail before moving on to the latest methods.
  • The strong performance of this model makes it worthwhile to see whether its benefits can be extended to other modalities such as images and video. We do exactly that through the three subsequent papers in this article.

Now we will move on to the ‘CBAM: Convolutional Block Attention Module’ paper [1].

CBAM Explained

ResBlock + CBAM

This paper’s main contribution is a novel module called the Convolutional Block Attention Module (CBAM). The module can be added to any convolutional neural network (CNN) in a plug-and-play fashion, and it improves performance on classification and detection benchmarks such as ImageNet-1K, MS COCO, and VOC 2007 when added to existing baseline architectures.

We will now describe this module in more detail. It is straightforward and easy to understand. Say we have an intermediate feature map F ∈ R^(C×H×W) as the output of a convolutional layer. Depending on the depth of the layer, such feature maps capture information ranging from simple edges and shapes to more complex semantic representations of the input. We would like the network to focus more (i.e., give attention) on the important parts of these feature maps. To do this, the authors propose the channel attention module and the spatial attention module as the two components of CBAM.

Channel Attention Module

Channel Attention Module

The channel attention module identifies the important channels of the feature map by using max-pooling and average-pooling operations over the spatial dimensions. This generates two context descriptors F^c_avg ∈ R^(C×1×1) and F^c_max ∈ R^(C×1×1). These descriptors are then fed into a shared single-hidden-layer multi-layer perceptron to generate the channel attention map M_c ∈ R^(C×1×1), which is element-wise multiplied with the original feature map to produce a refined intermediate feature map. These operations are visualized in the figure above and the mathematical expression below.

Equation for channel attention module operation
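
A minimal PyTorch sketch of the channel attention module is given below. The class name ChannelAttention and the reduction ratio of 16 used in the shared MLP are our illustrative assumptions; consult the paper and its official code for the exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChannelAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            # shared single-hidden-layer MLP applied to both pooled descriptors
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, x):                                       # x: (B, C, H, W)
            b, c, _, _ = x.shape
            avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))  # from the average-pooled descriptor
            mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))   # from the max-pooled descriptor
            m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)          # channel attention map M_c
            return x * m_c                                          # refined feature map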

Spatial Attention Module

The authors do not stop there: they add an additional attention mechanism called the spatial attention module. Complementary to channel attention, max-pooling and average-pooling operations are applied along the channel axis to generate F^s_avg ∈ R^(1×H×W) and F^s_max ∈ R^(1×H×W). These two maps are concatenated and passed through a single convolution layer, which produces the spatial attention map M_s ∈ R^(1×H×W). This map is then element-wise multiplied with the feature map obtained after applying channel attention. According to the authors, spatial attention helps the network learn ‘where’ the important information lies. This is summarized in the figure and the mathematical expression below.

Spatial Attention Module
Equation for spatial attention module operation
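
A matching sketch of the spatial attention module follows, again with our own naming; the 7x7 convolution kernel mirrors the choice reported in the paper. The final comment shows how the two modules compose sequentially into the full CBAM block.

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x):                                # x: (B, C, H, W)
            avg = x.mean(dim=1, keepdim=True)                # average-pool along channels: (B, 1, H, W)
            mx, _ = x.max(dim=1, keepdim=True)               # max-pool along channels:     (B, 1, H, W)
            m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial map M_s
            return x * m_s

    # CBAM applies the two modules one after the other:
    #   refined = SpatialAttention()(ChannelAttention(num_channels)(feature_map))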

Experimental Results

We will now provide some highlights of the experiments performed and the results achieved by the authors. More intricate details can be found in the paper.

The authors test the performance of the module on ImageNet-1K classification. The paper was published in July 2018, and the winner of ILSVRC 2017 (the ImageNet competition) was the Squeeze-and-Excitation network (SENet). The CBAM-integrated networks outperform their SENet counterparts: for example, while ResNet50 + SE achieves a Top-1 error of 23.14%, ResNet50 + CBAM achieves 22.66% on ImageNet-1K (single-crop validation error). More details can be found in the table below:

ImageNet-1K validation error for classification

Similarly, performance on MS COCO and Pascal VOC was benchmarked for the object detection task. Using ResNet50 as the backbone and Faster R-CNN as the detector, they achieve good performance gains: mAP@.5 improves from 46.2 to 48.2. More details are in the table below:

MS COCO validation error for object detection

To provide some qualitative evaluation of the proposed module, the authors use Grad-CAM, which uses gradients to show which parts of the input image most impact the model’s decision. As the picture below shows, feature refinement using CBAM helps the network learn better underlying representations of the input images.

Grad-CAM visualization

Insights, observations, and future work

  • This module can be easily added to any existing architecture with minimal modifications, giving straightforward improvements in performance. This makes it very useful in practice.
  • The paper compares its performance with Squeeze-and-Excitation Networks (SENet). In fact, if CBAM uses only the channel attention module with average pooling (no max-pooling), it reduces to the SE block. From that perspective, this module is a simple extension of SENet with some bells and whistles added.
  • We can empirically see that the channel attention module and the spatial attention module improve performance over the baseline. However, it is not very clear why it works as well as it does; the paper could have provided more analysis explaining the ‘why’ behind the module.
  • One potential area of further research would be to analyze if similar performance benefits can be seen for tasks such as segmentation, human pose estimation, dense depth estimation, etc.

Attention Augmentation Explained

As we have seen above, attention has been applied successfully to visual problems, for example in the CBAM module and others like it. The idea behind the ‘Attention Augmentation’ paper is to augment a baseline convolutional network with attention maps. Instead of applying attention as a transform to refine features, as in CBAM, here attention maps are computed in parallel with the convolution (see figure below) and the results are concatenated. The high-level idea is that the convolution captures short-range interactions while the attention map, being global in this case, captures long-range interactions.

How are the attention maps computed?

The attention map is computed globally for a pixel, as follows: given an input tensor of shape (H, W, F_in), it is first flattened into an HW x F_in matrix. Then multi-head attention is applied as described in the Transformer[3] paper. The output of multi-head attention is the concatenation of the attention outputs of each head, multiplied by a learned weight matrix. This is then reshaped to (H, W, d_v), where d_v is the depth of the values in multi-head attention. The result now has the same spatial shape as the convolution output, so the two can be concatenated.
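
The flatten-attend-reshape-concatenate pattern can be sketched as follows. For brevity we use PyTorch’s built-in nn.MultiheadAttention for the global attention branch; the class name AugmentedConv2d, the extra d_v projection, and the default sizes are our own simplifications, and the 2D relative position embeddings discussed below are omitted.

    import torch
    import torch.nn as nn

    class AugmentedConv2d(nn.Module):
        # Illustrative sketch: a convolution and a global self-attention branch
        # computed in parallel, concatenated along the channel dimension.
        def __init__(self, in_ch, out_ch, kernel_size=3, d_v=32, num_heads=4):
            super().__init__()
            # in_ch must be divisible by num_heads, and out_ch must exceed d_v
            self.conv = nn.Conv2d(in_ch, out_ch - d_v, kernel_size,
                                  padding=kernel_size // 2)
            self.attn = nn.MultiheadAttention(embed_dim=in_ch, num_heads=num_heads,
                                              batch_first=True)
            self.proj_v = nn.Linear(in_ch, d_v)    # map the attention output to d_v channels

        def forward(self, x):                      # x: (B, C_in, H, W)
            b, c, h, w = x.shape
            conv_out = self.conv(x)                            # (B, out_ch - d_v, H, W)
            seq = x.flatten(2).transpose(1, 2)                 # flatten to (B, H*W, C_in)
            attn_out, _ = self.attn(seq, seq, seq)             # global attention over all positions
            attn_out = self.proj_v(attn_out)                   # (B, H*W, d_v)
            attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)  # back to (B, d_v, H, W)
            return torch.cat([conv_out, attn_out], dim=1)      # (B, out_ch, H, W)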

There is still a detail missing, however: self-attention on its own is permutation equivariant, i.e., if we permute the input elements (in the earliest layers, these would be pixels), the output is simply permuted in the same way. Intuitively, this seems bad: images are structured data, and an operator that ignores that structure would make for a poor feature extractor. To fix this, we need a positional encoding.

The authors tried various known encodings, such as the sinusoidal encoding from the Transformer[3] paper and positional channel concatenation from CoordConv, but found that they are not effective because they are not translation equivariant: when the input is translated, the output is not simply a translated version of the previous output. Translation equivariance is a useful property of convolutions (especially when dealing with images!) and is one of the reasons CNNs generalize well, so giving it up naturally hurts performance.

So the authors needed a position encoding that is translation equivariant, and they obtained one by extending relative position encoding[4] from 1D to 2D. Working in 2D introduced memory constraints, so they devised a memory-efficient implementation (extending an implementation from another paper[5] to 2D), reducing the memory cost from O((HW)^2 N_h) to O(HW d_k^h).

How exactly is the “augmentation” in attention augmentation done?

Once the attention maps are calculated, the result is concatenated with the convolution output. This combined result is translation equivariant and hence promising for vision tasks on image data.

Storing attention maps is still prohibitively memory-intensive, owing to the fact that they are global. Hence, the networks are augmented starting from the last layer, replacing each convolutional layer with its attention-augmented counterpart until memory constraints are hit. To further reduce the memory footprint, the attention branch is sometimes downsampled with 3x3 average pooling and later upsampled with bilinear interpolation to restore the original size (which is required for concatenation).
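
A rough sketch of this memory-saving trick is shown below; the stride-2 pooling and the helper name are our assumptions, and the exact pooling configuration in the official implementation may differ.

    import torch.nn.functional as F

    def attend_with_downsampling(attention_branch, x):
        # Apply a memory-hungry global attention branch at reduced resolution:
        # 3x3 average pooling shrinks the spatial grid, and bilinear interpolation
        # restores the original H x W so the result can still be concatenated
        # with the convolution output.
        h, w = x.shape[-2:]
        x_small = F.avg_pool2d(x, kernel_size=3, stride=2, padding=1)
        y_small = attention_branch(x_small)
        return F.interpolate(y_small, size=(h, w), mode='bilinear', align_corners=False)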

Results on Vision Tasks

  • AA WideResNet-28–10 outperforms Squeeze and Excite, Gather-Excite and baseline versions of WideResNet on CIFAR-100 for low-resolution image classification.
  • AA ResNet-50 outperforms SE, BAM, CBAM and GALA for high-resolution image classification on ImageNet, while using fewer parameters (see below table).
  • On ImageNet, at different scales, AA ResNet outperforms the baseline and SE at every scale. In fact, AA ResNet-50 matches ResNet-101 and AA ResNet-101 outperforms ResNet-152, indicating that attention augmentation is preferable to simply making the network deeper.
  • AA MnasNet outperforms the baseline MnasNet on ImageNet. It uses more parameters but the advantage remains even when adjusting for parameter count.
  • AA ResNet in RetinaNet outperforms SE and baseline on COCO object detection while using fewer parameters.

Ablation Study

  • The ablation study (done by varying the ratio of attentional filters to original filters in {0.25, 0.5, 0.75, 1.0}, where 1.0 means fully attentional except for the pointwise convolutions and the stem) shows that even when the network is made almost fully attentional, performance barely decreases, which suggests that attention can serve as a standalone primitive. In fact, this variant uses 25% fewer parameters than the baseline! (see figure below)
  • Position encoding proved important when networks were more attentional, with relative position encoding being significantly better than no position encoding and also outperforming sinusoidal and CoordConv, as mentioned earlier. (see figure below)

Future Work

The main insight from this paper that could be explored further is the surprising result that making the network almost fully attentional does not reduce performance much while using significantly fewer parameters. Future work could focus on designing a standalone self-attention primitive and building fully attentional architectures, either manually or perhaps by using this primitive in architecture search. This idea is partly covered by the next paper in this blog post.

Stand-Alone Self-Attention Explained

From the previous paper we have seen that attention is a promising stand-alone primitive for vision models. This paper expands upon that idea. The authors develop a fully attentional layer and build a fully attentional network by replacing all spatial convolutional layers in a standard CNN architecture with their attention layers. This works quite well, in general giving performance comparable to the baseline network with fewer parameters and FLOPS.

How does an attentional layer function?

The authors propose a local attention mechanism, which differs from prior attention works that use global attention between all pixels. Global attention can only be applied after a large amount of downsampling because it is computationally expensive, so it cannot be used in ALL layers. In fact, we saw this in the previous paper: the network was only almost fully attentional because the stem and the pointwise convolutions were left unchanged.

How is the attention local? For each pixel, a local region of pixels with spatial extent k, called the memory block, is extracted, and attention is computed only over this block. The paper’s Figure 3 shows how this works.

Once again, 2D relative position embeddings (covered earlier) are used when computing self-attention, for the same reasons as before (see the paper’s Figure 4).
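
A rough, single-head PyTorch sketch of such a local attention layer is given below. It uses F.unfold to gather each pixel’s k x k memory block; the class name and structure are our own, and the relative position embeddings and multi-head splitting are omitted for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalSelfAttention2d(nn.Module):
        # assumes an odd spatial extent k so that padding k // 2 preserves H x W
        def __init__(self, in_ch, out_ch, k=7):
            super().__init__()
            self.k = k
            # 1x1 convolutions produce the queries, keys, and values
            self.q = nn.Conv2d(in_ch, out_ch, 1)
            self.key = nn.Conv2d(in_ch, out_ch, 1)
            self.v = nn.Conv2d(in_ch, out_ch, 1)

        def forward(self, x):                        # x: (B, C_in, H, W)
            b, _, h, w = x.shape
            q = self.q(x)                                               # (B, C_out, H, W)
            # gather the k x k neighbourhood (memory block) around every pixel
            k_unf = F.unfold(self.key(x), self.k, padding=self.k // 2)  # (B, C_out*k*k, H*W)
            v_unf = F.unfold(self.v(x), self.k, padding=self.k // 2)
            c = q.shape[1]
            k_unf = k_unf.view(b, c, self.k * self.k, h * w)
            v_unf = v_unf.view(b, c, self.k * self.k, h * w)
            q = q.view(b, c, 1, h * w)
            logits = (q * k_unf).sum(dim=1)                  # query . key for each neighbour
            weights = F.softmax(logits, dim=1)               # softmax over the memory block
            out = (weights.unsqueeze(1) * v_unf).sum(dim=2)  # weighted sum of the values
            return out.view(b, c, h, w)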

How to construct a fully attentional architecture?

As mentioned earlier, this is done by replacing the convolutional layers with the new attention layers: any convolution with spatial extent k > 1 is swapped for an attention layer, and 2x2 average pooling with stride 2 is used when downsampling is required. The rest of the network structure is preserved. This is possibly suboptimal, and the authors acknowledge the point: we might get better architectures by designing them from scratch with attention as a core component, or by using architecture search. Nevertheless, this simple replacement approach already gives good results!

That covers the layers deeper in the network, but what about the early layers (which form the “convolutional stem” of a network)? Replacing the stem is hard because convolutions easily learn local features such as edge detectors at the pixel level, which the deeper layers can take advantage of, whereas content-based mechanisms such as self-attention struggle to do so. In fact, the authors find that using relative self-attention in the stem underperforms compared to using convolutions. To bridge the gap, spatially-varying linear transformations are used to inject distance-based information into the 1x1 convolution that computes the values. The stem then applies an attention layer on top of these spatially aware value features, followed by max pooling.

Results

  • Conv-stem + attention gives the best accuracy on ImageNet classification for ResNet-26 & 28, but the fully attentional ResNet-50 outperforms even that. All fully attentional models outperform the baseline with significantly fewer FLOPS and parameters. (see figure below)
  • On COCO object detection, performance is similar to the baseline when the backbone and the detection heads + FPN (the two parts of RetinaNet) are replaced with fully attentional counterparts. In fact, the fully attentional models match the baseline’s performance with 34% fewer parameters and 39% fewer FLOPS.

Ablation Study

  • In the ablation tests, they find that a convolutional stem performs well regardless of whether the rest of the network is convolutional or attentional. Also, across the remaining layer groups, the best-performing networks use convolutions in the first few groups and attention in the remaining ones. (see figure below)
  • For the spatial extent, accuracy plateaus beyond a certain k, while small k values give noticeably lower accuracy.
  • Positional information is important (as we have seen earlier) and relative positional encoding performs 2% better than absolute encodings. (see figure below)

Why does attention use less parameters and FLOPS?

This is an interesting question. The answer is that the learned W_Q, W_K, W_V matrices have dimensions that depend only on the number of input and output channels d_in and d_out, and do not depend on k, the spatial extent of the local attention block. Because of this, the parameter count for attention does not depend on k, whereas for convolutions the parameter count grows quadratically with k. Of course, since the 1x1 convolutions were left intact (they can be viewed as fully connected layers) and have many parameters, this does not translate into a quadratic decrease in the overall parameter count, but it still yields a significant reduction. The number of computations is also lower for attention, even with a much larger k than the convolution: the example given by the authors is that for d_in = d_out = 128, a convolution layer with k = 3 has the same computational cost as an attention layer with k = 19. This is why fully attentional models have significantly fewer parameters and FLOPS than their convolutional counterparts.
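
The back-of-the-envelope arithmetic behind that example can be written out explicitly. The formulas below are our simplification (per-position multiply-adds, ignoring the relative position embeddings and biases), but they reproduce the rough equivalence the authors quote.

    # Parameters: a convolution needs k*k*d_in*d_out weights, while the attention
    # projections need only 3*d_in*d_out, independent of the spatial extent k.

    def conv_cost(d_in, d_out, k):
        return k * k * d_in * d_out              # grows quadratically with k

    def local_attention_cost(d_in, d_out, k):
        qkv = 3 * d_in * d_out                   # 1x1 projections for q, k, v (independent of k)
        attend = 2 * k * k * d_out               # logits plus weighted sum over the k*k memory block
        return qkv + attend

    d = 128
    print(conv_cost(d, d, 3))                    # 147456 multiply-adds per position
    print(local_attention_cost(d, d, 19))        # 141568 -- roughly the same cost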

Insights and future work

From these last two papers, we have learned and then reinforced the fact that stand-alone attention is a viable primitive for vision models. Based on the ablation studies, attention is most useful in the later parts of the network, while the stem still favors convolution: we get the best performance when the stem and early layers are convolutional and the later layers are attentional. Another important insight is that positional information matters when using attention for image-based tasks, and that relative position encoding outperforms having no positional encoding and even performs significantly better than absolute position encoding.

With these insights, we have a few ideas for future work. One is to see whether convolution and self-attention can somehow be unified, as each has unique advantages. Another issue the authors mention is that optimized attention kernels for hardware accelerators do not yet exist; implementing them would improve the performance of attentional models. Finally, and perhaps most concretely, this stand-alone attention layer (or something similar) could be used in architecture search, or to manually craft novel fully attentional architectures for vision tasks.

References:

[1] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[4] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.

[5] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck. Music Transformer. In Advances in Neural Information Processing Systems, 2018.
