The Attention Mechanism in Computer Vision: Explained for the ‘non-expert’.

This article outlines what attention is, its roots in other fields like cognitive psychology and deep learning, and how it is used in computer vision. The mathematics and lower-level details are left out to keep the read accessible.

Fletcher
7 min read · Jul 23, 2022


Attention in Cognitive Psychology: Where it all began…

In cognitive psychology, attention is defined as the cognitive process of selectively concentrating on the most salient aspects of available information; human attention is considered a combination of bottom-up primary visual features with top-down guidance. Human cognitive attention has been studied extensively by cognitive psychologists, with our modern understanding of attention dating back to the latter part of the nineteenth century. Because behavioural psychology dominated much of the early twentieth century, the internal mechanisms involved in selective attention were largely ignored, until major advancements in neuroimaging technology gave scientists a better understanding of the neural systems in the brain involved in visual attention.

Attention models in Cognitive Psychology: Broadbent and Treisman

This cognitive process was first computationally modelled by Broadbent, whose model features an attention filter mechanism that ‘glimpses’ at the low-level physical features of the sensory data held in STM (short-term memory) and attends to salient regions. See the Broadbent diagram below for an illustration.

Broadbent’s model of cognitive attention.

Experiments like the ‘cocktail party effect’ and the ‘Dear Aunt Jane’ experiment demonstrated that humans also attend to higher-level properties like semantic meaning and context in the glimpse, suggesting a more complex attention mechanism like that proposed by Treisman, shown in the diagram below.

Treisman’s model of cognitive attention.

The attenuator analyses data hierarchically: physical low-level characteristics sit at the top of the hierarchy, while higher-level characteristics like meaning and language sit lower and are only processed if the low-level characteristics are unclear. The attended data (green arrow) and the unattended data (red arrow) are then evaluated against the ‘dictionary’, where each possible datapoint has an activation threshold that must be met for it not to be lost. This threshold varies: it is low for an auditory datapoint like one’s own name but high for an irrelevant word. In summary, these models can be categorised into ‘early selection models’ (like Broadbent’s model) and ‘late selection models’ (like Treisman’s model).

Attention in Deep Learning: The next step…

The use of attention in deep learning is rooted in this understanding of attention from cognitive psychology. In deep learning, implicit attention is a mechanism that is not specifically programmed but is observable in all deep neural networks: the network’s learned parametric/non-parametric function responds more strongly to salient regions or items of the data. Explicit attention mechanisms are used extensively in sequence modelling and transduction problems. RNN (Recurrent Neural Network) models are often combined with an attention mechanism, typically in an encoder–decoder architecture, where the attention mechanism directs the RNN’s focus onto certain parts of the input sequence, allowing better performance on longer input sequences. The transformer network goes further: it is based solely on attention mechanisms, repeatedly transforming a complete sequence with attention rather than relying on recurrence as in RNN models, and it is now the preferred solution for NLP (Natural Language Processing) tasks, showing great performance in machine translation, text classification, and document summarisation. The Vision Transformer (ViT) adapts the same transformer architecture used for language tasks to vision tasks, where self-attention mechanisms learn the complex dependencies occurring in the input.
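To make the idea of self-attention concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention at the heart of transformer networks. The projection matrices `Wq`, `Wk`, `Wv` and the toy input are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise similarities, (n, n)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # each output mixes all values

rng = np.random.default_rng(0)
n, d = 4, 8                                   # toy sequence length and width
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Every output position is a weighted mix of every input position, which is how the transformer captures long-range dependencies without recurrence.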

Now, we have arrived at Attention in Computer Vision…

The General Attention Function

The innate attention mechanism present in the human cognitive process has heavily inspired the research and incorporation of such ideas in computer vision systems. In a vision network, an attention mechanism is essentially a dynamic weight adjustment function, superimposed between the convolutional layers, that combines an attention function g(x) with the input feature map x. Its role is to tell the next layer of the deep network which features are more or less important. This function is shown below.

The General Attention Function
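A minimal sketch of this general form, where g(x) scores the input and a second function applies those scores back to x, might look as follows. The sigmoid-of-GAP scoring here is an illustrative assumption (one simple choice of g), not the only option:

```python
import numpy as np

def g(x):
    # Attention function: score each channel by its global average activation,
    # squashed to (0, 1) with a sigmoid. x has shape (channels, height, width).
    return 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2))))

def f(weights, x):
    # Apply the attention: rescale each channel's feature map by its weight,
    # telling the next layer which channels matter more.
    return weights[:, None, None] * x

x = np.random.rand(16, 8, 8)   # toy feature map between two conv layers
y = f(g(x), x)                 # attention-modulated features
print(y.shape)                 # (16, 8, 8)
```

The output has the same shape as the input, so the module can be dropped between any two convolutional layers.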

The terms: Channel, Spatial and Temporal

One way attention mechanisms are categorised is according to the data domain they operate on. What to attend to (channel), where to attend to (spatial), when to attend to (temporal), and which to attend to (branch) are four fundamental categories of attention mechanisms in computer vision. A further two combined categories exist: what and where to attend to (channel and spatial), and where and when to attend to (spatial and temporal). It is also worth exploring a sub-category of the spatial attention mechanisms called self-attention-based mechanisms, such as Vision Transformer (ViT) networks, which have become very popular in recent years. See the diagram below for a visual demonstration of the axes within the feature tensor that each attention category relates to.

Data domains that different attention mechanisms operate on.

The terms: Soft vs Hard and Location-wise vs Item-wise

Alternatively, attention mechanisms may be categorised (although these categories are more specific to RNN models) as: item-wise soft attention, item-wise hard attention, location-wise soft attention, and location-wise hard attention. ‘Hard’ mechanisms add a weight mask between the input and output layers that forces the network to focus on the content of the image that needs attention. This weight mask is a matrix of values fixed according to the sampling probability of each location in the image. Hard mechanisms are therefore non-differentiable, meaning the deep NN cannot update its parameters through backpropagation and must instead be trained through reinforcement learning. Because they rely on sampling probabilities, hard mechanisms break down when the input image is already sampled, making them counter-productive for sampled images. Mechanisms are categorised as ‘soft’ if their learnable parameters are trained through gradient descent and backpropagation: they consist of a matrix of weights computed differentiably, either through backpropagation of the entire network or as a separate training task for the weight model. A mechanism is ‘location-wise’ if its input is the entire input feature map, whereas an ‘item-wise’ mechanism operates on explicit items in the input; for computer vision tasks involving images, this requires an additional step to extract items from the image. Typically, for object detection tasks, location-wise hard attention mechanisms are most appropriate.
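The soft/hard distinction above can be sketched in a few lines of NumPy. The item scores and two-dimensional items are toy assumptions; the point is that the soft path is a differentiable weighted average, while the hard path samples a single item and so cannot be backpropagated through:

```python
import numpy as np

def soft_attention(scores, items):
    # Soft: a differentiable weighted average over all items (softmax weights).
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ items

def hard_attention(scores, items, rng):
    # Hard: sample one item with probability proportional to its score.
    # The sampling step is non-differentiable, which is why hard attention
    # is usually trained with reinforcement learning instead.
    p = np.exp(scores - scores.max())
    p = p / p.sum()
    idx = rng.choice(len(items), p=p)
    return items[idx]

items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy items
scores = np.array([2.0, 0.5, 0.1])
soft = soft_attention(scores, items)   # a blend dominated by the first item
hard = hard_attention(scores, items, np.random.default_rng(0))  # exactly one item
```

The soft output is a smooth mixture of all items, whereas the hard output is always one of the original items verbatim.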

A Literature Review of Attention in Computer Vision

The channel attention mechanism was pioneered with SENet (Squeeze-and-Excitation Networks) in 2017, and since then there have been reconstructions of the inner mechanisms of both the squeeze and excite modules that have improved its attentive performance. SENet uses GAP (Global Average Pooling) to infer global spatial information, a choice that has since been criticised because GAP cannot model higher-order statistical global information. One paper replaced SENet’s squeeze module (GAP → FCL → ReLU) with a GSOP (global second-order pooling) mechanism that is a sequential combination of a two-dimensional convolution, a covariance pooling layer, and a row-wise convolution layer (2D Conv → Cov-Pooling → Conv). Another paper proposed a style pooling mechanism that combines GAP with ‘global standard deviation pooling’ to predict the channel-wise importance of feature maps more adaptively than GAP does alone. There is therefore a strong case that mechanisms using higher-order statistical information learn the relative importance of channels better than the standard GAP mechanism in SENet.

Another widely referenced attention mechanism is CBAM, which works on both the channel and spatial domains and models spatial information using convolutions with large kernel sizes. As demonstrated in the figure showing the categories of attention, combined channel-and-spatial mechanisms consider both the channel axis and the spatial dimensions of the feature map. In CBAM, the channel and spatial attention modules are stacked in series; its channel module collects global information with GAP, as SENet does, but is more complex because it also uses other pooling operations. CBAM likewise uses GAP to infer global spatial information, whereas BAM uses dilated convolutions for this task.
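A minimal NumPy sketch of the SE block described above (squeeze via GAP, excite via a bottlenecked FC → ReLU → FC → sigmoid, then channel rescaling) may help. The weight shapes and reduction ratio `r` are illustrative assumptions; real implementations learn these weights during training:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a feature map x of shape (C, H, W).

    w1: (C, C//r) reduction weights, w2: (C//r, C) expansion weights.
    """
    z = x.mean(axis=(1, 2))          # squeeze: GAP -> one descriptor per channel
    s = sigmoid(relu(z @ w1) @ w2)   # excite: FC -> ReLU -> FC -> sigmoid
    return s[:, None, None] * x      # rescale each channel by its importance

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2              # toy sizes; r is the reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C, C // r))
w2 = rng.standard_normal((C // r, C))
out = se_block(x, w1, w2)
print(out.shape)  # (8, 4, 4)
```

The GAP in the squeeze step is exactly the part the GSOP and style-pooling papers mentioned above replace with richer statistics.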

Further readings

For a more in-depth review of the role of the attention mechanism in computer vision, I would highly advise the reader to peruse this paper.¹

  1. M. Guo, T. Xu, J. Liu, Z. Liu, P. Jiang, T. Mu, S. Zhang, R. R. Martin, M. Cheng, and S. Hu. Attention mechanisms in computer vision: A survey. CoRR, abs/2111.07624, 2021.
