Attention Mechanisms in Vision Models

Himanshu Arora
Jumio Engineering & Data Science
9 min read · Dec 8, 2020

Neuroscience and Machine Learning maintain a continuous exchange of ideas. Many innovations in machine learning are modelled on phenomena in neuroscience and vice-versa. For example, in 1961, Hubel et al. conducted an experiment to determine how different parts of a cat’s visual cortex respond to varying patterns of light. This inspired modern-day convolutional neural networks, where each filter acts like a biological neuron and is activated upon ‘seeing’ certain patterns in an image.

Another well-studied mechanism in psychology and neuroscience making an impact in machine learning is attention. This mechanism describes the brain’s ability to focus on certain sensory input while discarding irrelevant information. For example, while reading a book, your brain focuses on the text and blurs out the background. This idea is so simple and powerful that it is widely used across many different tasks and domains in machine learning. Consider the case of face recognition, where the background of a portrait has nothing to do with the identity of the face. Filtering out noise from the input can make the model more accurate, as it only needs to process relevant information. The same idea generalizes to the individual layers of a neural network, improving learning throughout the network.

Implementations

Many options are available for implementing attention; the key commonality among them is producing a tensor of importance weights for every element of the input. The input is then re-weighted with these values before applying a convolution. The attention weight for each element lies between 0 and 1, where 0 means no attention at all and 1 means full attention. This is known as soft attention. A sigmoid or a softmax activation is generally applied for this purpose. Note that the output of a softmax activation is relatively sparse, which makes it useful when only the topmost important elements should be kept.
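As a toy illustration of soft attention (assuming PyTorch; the module and its 1x1 convolution are illustrative choices, not taken from any particular paper), a sigmoid weight per spatial position re-weights the input feature map:

```python
import torch
import torch.nn as nn

class SoftAttentionGate(nn.Module):
    """Illustrative soft-attention gate: predicts a weight in [0, 1]
    per spatial position and re-weights the input feature map."""
    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution collapses the channels into a single attention map.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (N, C, H, W)
        weights = torch.sigmoid(self.score(x))       # (N, 1, H, W), values in [0, 1]
        return x * weights                           # broadcast re-weighting of the input

x = torch.randn(1, 64, 32, 32)
out = SoftAttentionGate(64)(x)                       # same shape as x: (1, 64, 32, 32)
```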

From left to right: input image, soft-attention output, and hard-attention output. Note how soft attention smoothly suppresses irrelevant information, whereas hard attention either completely keeps or completely removes information from the image.

Another possible output is one in which the attention weights are binarized. This is known as hard attention and it completely eliminates irrelevant information from the input. Hard attention is not as popular as soft attention in practice, however, because it is not differentiable and requires additional tricks to work.

Self-Attention

Attention can be computed between different tensors (as is generally done in machine translation), but this post focuses on self-attention. Using this technique, attention is computed on input feature maps with respect to themselves. Self-attention is exhaustive in nature; each pixel of an input feature map has an associated array of attention weights for every other pixel in the map. This form of attention is particularly useful for modelling long-range dependencies.

Self-attention module in SAGAN [2018] by Zhang et al.

Self-attention gained popularity through natural language processing, where long-term dependency modelling increases in difficulty as text sequences increase in length. These dependencies provide useful correlations in vision as well. For example, a face recognition model could benefit from the relative positions of the eyes to recognize that there is a face in the image. These dependencies may be modelled in CNNs by increasing the kernel size to incorporate a larger region of the input or by increasing the depth of the network to enlarge the receptive field. Unfortunately, these approaches become inefficient as the number of parameters continues to increase.

Computing self-attention draws inspiration from retrieval systems, such as search engines. These systems map textual user queries to keys, like the title or the description of a website, then display the matched pages, or values, to the user. Similarly, in self-attention, an input tensor of size C x H x W is mapped to three latent representations: Query, Key, and Value. The query and the key can be of an arbitrary hidden dimension such as C′. A dot product over the channel dimension serves as a similarity measure between the query and the key, yielding a tensor of size (H x W) x (H x W) that contains, after a softmax normalization, attention weights for all pairs of input positions. This is then multiplied with the value representation to get the final C x H x W attended feature maps. Note that, in contrast to retrieval systems, rather than a user issuing a query and a system matching keys and returning values, the model derives all three from the input.
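Below is a compact sketch of this computation in the spirit of the SAGAN module above, assuming PyTorch; the 1x1 convolutions and the C/8 hidden dimension are common choices rather than requirements, and the class name is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a C x H x W feature map."""
    def __init__(self, c, c_hidden=None):
        super().__init__()
        c_hidden = c_hidden or max(1, c // 8)   # the reduced dimension C'
        self.query = nn.Conv2d(c, c_hidden, 1)  # 1x1 convs produce Q, K, V
        self.key   = nn.Conv2d(c, c_hidden, 1)
        self.value = nn.Conv2d(c, c, 1)

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (N, HW, C')
        k = self.key(x).flatten(2)                         # (N, C', HW)
        v = self.value(x).flatten(2)                       # (N, C, HW)
        attn = F.softmax(q @ k, dim=-1)                    # (N, HW, HW) attention weights
        out = v @ attn.transpose(1, 2)                     # weighted sum of values
        return out.reshape(n, c, h, w)                     # attended feature maps
```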

A monumental paper by Google Brain’s Vaswani et al., Attention Is All You Need [2017], popularized this form of self-attention. Their model, known as the Transformer, achieves better performance by, among other things, using self-attention as a replacement for recurrent neural networks in sequence modelling. A Transformer applies self-attention in parallel across different groups of hidden dimensions in the input, which the authors call multi-head self-attention.
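For intuition, here is a small usage sketch (assuming PyTorch, whose built-in nn.MultiheadAttention implements this mechanism) that treats the H x W positions of a feature map as a sequence of tokens and attends over them with four heads; the tensor sizes and head count are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Treat the H*W positions of a C x H x W feature map as a sequence of HW tokens
# with C-dimensional embeddings, and attend across 4 heads in parallel.
c, h, w = 64, 16, 16
mha = nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)

x = torch.randn(1, c, h, w)
tokens = x.flatten(2).transpose(1, 2)             # (N, HW, C)
attended, weights = mha(tokens, tokens, tokens)   # self-attention: Q = K = V
attended = attended.transpose(1, 2).reshape(1, c, h, w)
```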

Self-attention kernel in Stand-Alone Self-Attention [2019] by Ramachandran et al.

Google Brain’s Ramachandran et al. later applied a similar approach to CNNs in their paper, Stand-Alone Self-Attention in Vision Models [2019]. Their approach applies self-attention kernels to a ResNet model in place of convolutional kernels. This achieved a 0.7% accuracy gain on the ImageNet dataset compared to the vanilla ResNet-50 model, while simultaneously reducing the number of computations by over 1 billion FLOPs and the parameter count by over 7.6 million. Their experiments also suggest that convolution operations may better capture low-level features while stand-alone attention layers may better integrate global information, which motivates hybrid architectures that exploit the comparative advantages of both.

Although theoretically more efficient, the stand-alone self-attention network takes longer to train in practice than vanilla ResNets. This is primarily because convolutional kernels are highly optimized for commonly available hardware. Because of its potential, however, optimizing self-attention is currently a hot topic of research, as evidenced by the popularity of recent papers such as Linformer [2020], Performers [2020], and Visual Transformers [2020]. Linformer and Performers approximate the self-attention weight matrices to reduce the time complexity from quadratic to linear. Visual Transformers, on the other hand, focus on the practical side by adapting the standard and already efficient Transformer model to vision with the fewest possible changes.

Channel and Spatial Attention

Another way to compute attention is to determine exactly one attention weight for each element of the C x H x W input. This approach can be seen as layer-focused: it identifies which pixels each layer should focus on. This contrasts with the pixel-focused view of self-attention, which identifies which other pixels each individual pixel correlates with.

Chen et al. introduced a clever solution to reduce the complexity of this approach in their paper Spatial and Channel-Wise Attention in CNNs [2017]. Rather than generating a single attention weight tensor, they factorize it into two tensors: channel attention and spatial attention. The first is a vector of attention weights of size C, one per feature map; the second is a 2D matrix of size H x W containing a weight for each position in the feature maps. These attention tensors are broadcast to the size of the input and multiplied with it: the input is enriched with channel attention and then spatial attention, sequentially. The factorization makes this approach much cheaper computationally than self-attention. Channel attention denotes what features to look for, since each feature map depends on only one filter in the convolution. Spatial attention, meanwhile, signifies where to focus while processing an image.

SCA-CNN [2017] by Chen et al. uses a single fully-connected layer for each of the channel and spatial attentions. The input tensor is first passed through a global average pooling layer before computing channel attention.
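A loose sketch of this factorized design, assuming PyTorch, is shown below. It follows the simplified description above (global average pooling plus a single fully-connected layer for channel attention, with a 1x1 convolution standing in for the per-position spatial layer); the paper’s full formulation differs in details, and all names are illustrative.

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Simplified channel-then-spatial attention in the spirit of SCA-CNN."""
    def __init__(self, c):
        super().__init__()
        self.channel_fc = nn.Linear(c, c)      # single FC layer -> C channel weights
        self.spatial_fc = nn.Conv2d(c, 1, 1)   # per-position layer -> H x W spatial weights

    def forward(self, x):                      # x: (N, C, H, W)
        n, c, h, w = x.shape
        pooled = x.mean(dim=(2, 3))            # global average pooling: (N, C)
        ch = torch.sigmoid(self.channel_fc(pooled)).view(n, c, 1, 1)
        x = x * ch                             # broadcast channel attention first
        sp = torch.sigmoid(self.spatial_fc(x)) # (N, 1, H, W)
        return x * sp                          # then broadcast spatial attention
```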

Squeeze-and-Excite Network (SE-Net)

SE-Net [2017] by Hu et al. is a channel attention module heavily inspired by the channel attention in the SCA-CNN architecture. An SE attention block first applies global average pooling on the input tensor, much as SCA-CNN does. This is called a squeeze operation, as it reduces a C x H x W tensor to C x 1 x 1. This tensor acts as a global summary of the information present in each channel. The tensor is then excited, or activated, using a fully-connected bottleneck architecture: it is reduced to C/r x 1 x 1 (r > 1) and then restored to C x 1 x 1 using two fully-connected layers. A sigmoid activation on the resulting C x 1 x 1 tensor provides the final channel attention map. The bottleneck structure of the SE blocks adds a parameter complexity of C/r x C + C x C/r = 2C²/r. The upshot of this added complexity is an accuracy improvement of 1.51% on the ImageNet dataset when using a ResNet-50 variant with SE blocks.

An SE block with squeeze, excite, and scale operations
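In code, the block can be sketched roughly as follows (assuming PyTorch; the reduction ratio r = 16 is a typical choice, and the class name is illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excite block: squeeze (global average pool),
    excite (FC bottleneck), scale (channel-wise re-weighting)."""
    def __init__(self, c, r=16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(c, c // r),   # reduce to C/r   (C x C/r weights)
            nn.ReLU(inplace=True),
            nn.Linear(c // r, c),   # restore to C    (C/r x C weights) -> 2C^2/r in total
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                   # squeeze: (N, C)
        w = self.excite(s).view(n, c, 1, 1)      # excite:  (N, C, 1, 1) channel weights
        return x * w                             # scale: channel-wise re-weighting
```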

Convolutional Block Attention Module (CBAM)

CBAM [2018] by Woo et al. also mirrors SCA-CNN by applying channel attention followed by spatial attention to the input tensor.

CBAM follows the same overall architecture as SCA-CNN

The approach diverges by adding max pooling alongside average pooling to determine attention weights; max pooling preserves salient edge features that average pooling tends to blur out. CBAM then employs the fully-connected bottleneck structure discussed with SE-Net to derive channel attention weights from the pooled features.

Global Max Pooled feature vectors are additionally used to compute channel attention
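A rough PyTorch-style sketch of this channel attention, with a shared bottleneck MLP applied to both pooled descriptors (the class name and reduction ratio are illustrative):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: average- and max-pooled descriptors
    pass through a shared FC bottleneck and are combined."""
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c, c // r),
            nn.ReLU(inplace=True),
            nn.Linear(c // r, c),
        )

    def forward(self, x):                          # x: (N, C, H, W)
        n, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))         # global average pooling branch
        mx  = self.mlp(x.amax(dim=(2, 3)))         # global max pooling branch
        w = torch.sigmoid(avg + mx).view(n, c, 1, 1)
        return x * w                               # channel-wise re-weighting
```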

To keep the computation of spatial attention cheap, CBAM again uses both max and average pooling, this time across the channel dimension, to obtain a 2 x H x W tensor of channel-pooled features. A 2D convolution then captures local spatial interaction, “mixes” the average- and max-pooled information, and outputs a single H x W matrix of spatial attention weights.
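And a matching sketch of the spatial attention step, again assuming PyTorch; the 7 x 7 kernel size is only an illustrative choice:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise average and max pooling
    produce a 2 x H x W tensor that a 2D convolution mixes into H x W weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                               # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)               # (N, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)              # (N, 1, H, W)
        pooled = torch.cat([avg, mx], dim=1)            # (N, 2, H, W) channel-pooled features
        w = torch.sigmoid(self.conv(pooled))            # (N, 1, H, W) spatial weights
        return x * w
```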

Efficient Channel Attention (ECA)

ECA-Net [2020] by Wang et al. is a channel attention block that improves upon the SE block. The authors raise two important points:

  1. The bottleneck structure leads to inaccurate channel attention prediction.
  2. Using a fully-connected layer to model the interaction between all of the channels is inefficient and unnecessary.

The paper describes a new mechanism, Efficient Channel Attention, to address these issues. Swapping the bottleneck structure for a single layer solves the first problem; modelling local cross-channel interactions with a 1D convolution rather than fully-connected neurons addresses the second. This elegant design requires a negligible increase in parameters yet yields over a 2% average improvement in model accuracy over a vanilla ResNet-50 model.

An ECA block uses only global average pooling and 1D convolutions to efficiently compute channel attention
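A minimal sketch of such a block, assuming PyTorch; the kernel size k is fixed here for simplicity, whereas the paper determines the neighbourhood size from the channel count:

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA-style channel attention: a single 1D convolution over the
    globally pooled channel descriptor models local cross-channel interaction."""
    def __init__(self, k=3):                      # k: local neighbourhood of channels
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3)).unsqueeze(1)       # (N, 1, C): pooled channel descriptor
        w = torch.sigmoid(self.conv(s)).view(n, c, 1, 1)
        return x * w                              # channel-wise re-weighting
```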

Conclusion

Attention mechanisms provide a critical avenue of optimization in image processing models. The relative simplicity of implementing attention is a major part of its appeal, and the performance gains over vanilla models are very promising. The field is evolving quickly, with each new approach advancing the state of the art in important ways.

ImageNet-1k classification performance comparison of attention mechanisms on a ResNet-50 model. Data Source: ECA-Net [2020]

The approaches to computing attention are varied and readily applicable to multiple problem domains. Continued research should see their computational efficiency come to match their algorithmic efficiency. Improved efficiency and accuracy will make it practical to exploit larger datasets at reduced computational cost, extending the utility of these networks.
