A Survey of Attention Mechanism and Using Self-Attention Model for Computer Vision

Swati Narkhede
Published in The Startup · Nov 1, 2020

Original Survey Authors: Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, Varun Mithal

1. Introduction

The Attention Model, first introduced for Machine Translation, has gained massive popularity in the Artificial Intelligence community. Over the years, it has become a significant part of neural network architectures for Natural Language Processing, Statistical Learning, Speech Recognition, and Computer Vision applications. Attention modeling in neural networks has advanced rapidly because these models achieve state-of-the-art results on multiple tasks, make neural networks more interpretable, and overcome some limitations of Recurrent Neural Networks (RNNs).
The idea behind the Attention mechanism is best illustrated by how the human biological system handles tasks like vision, speech, translation, and summarization: it focuses only on the part of the input that is relevant for gaining the required knowledge or performing the task, and ignores irrelevant details. The Attention model integrates this notion of relevance by attending only to the aspects of the given input that matter for performing the task well.

2. Use of Attention to overcome drawbacks of traditional Encoder-Decoder

An encoder-decoder architecture is the core of a sequence-to-sequence model. Both the encoder and decoder are Recurrent Neural Networks (RNNs): the encoder takes an input sequence {x1, x2, x3, ..., xT} of length T and encodes it into a series of hidden states {h1, h2, h3, ..., hT}, which are compressed into a single fixed-length vector, often called the context vector. The decoder takes this fixed-length vector as input and generates the output sequence {y1, y2, y3, ..., yT}. Because the encoder squeezes the entire input sequence into one fixed-length vector, information is lost, and the decoder cannot focus on the input tokens that are most relevant for generating each output token. Passing a single fixed-length context vector to the decoder therefore prevents it from aligning the input sequence with the output sequence.

(a) Encoder-Decoder Architecture and (b) Encoder-Decoder Architecture with Attention Mechanism

To address this drawback of Encoder-Decoder architectures, the attention model introduces weights α over the encoded input sequence. Each weight αi indicates how relevant the encoder hidden state hi (the candidate state) is to the current decoder state (the query state). These attention weights are then used to build the context vector that is fed to the decoder at each step: the context vector is a weighted sum of all candidate states of the encoder, so the decoder effectively has access to the entire encoded input sequence when generating each output token.
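To make this concrete, here is a minimal NumPy sketch of the attention step just described: the current decoder state is scored against every encoder hidden state, the scores are normalized into weights αi, and the context vector is the weighted sum of the candidate states. The dot-product scoring function and the shapes are illustrative assumptions; Bahdanau-style attention, for example, uses a small feed-forward network to compute the scores instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    """encoder_states: (T, d) candidate states; decoder_state: (d,) query state."""
    scores = encoder_states @ decoder_state   # alignment score for every input token
    alphas = softmax(scores)                  # attention weights, sum to 1
    context = alphas @ encoder_states         # weighted sum of the candidate states
    return context, alphas

h = np.random.randn(5, 8)   # encoder hidden states h_1..h_T (T=5, d=8)
s = np.random.randn(8)      # current decoder (query) state
context, alphas = attention_context(h, s)
print(alphas.round(3), context.shape)        # weights over 5 tokens, context of size 8
```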

3. Categories of Attention

There are four broad categories of attention models, and each category contains several types. These categories are not mutually exclusive: an attention model can combine types from different categories, which is why the categories can also be thought of as dimensions.

Types of Attention Models

a. Number of Sequences:

In Distinctive attention models, the candidate states and query states belong to two distinct sequences: the input and the output of the encoder-decoder. Co-attention models operate on multiple input sequences at the same time, and the attention weights are learned jointly over all of them; co-attention is often used when one of the inputs is an image. In recommendation and text classification problems, the input is a sequence but the output is not; for such problems, Self-Attention is used, where the candidate states and the query states both belong to the same input sequence.
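For contrast with the encoder-decoder sketch above, here is a minimal self-attention sketch in which queries, keys, and values all come from the same input sequence, so every token attends to every other token of that sequence. The scaled dot-product scoring and the random projection matrices are illustrative assumptions.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (T, d) one input sequence; wq/wk/wv: (d, d_k) learned projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (T, T) token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v                                 # (T, d_k) contextualized tokens

T, d, d_k = 6, 16, 8
x = np.random.randn(T, d)
out = self_attention(x, *(np.random.randn(d, d_k) for _ in range(3)))
print(out.shape)  # (6, 8): one attended representation per input token
```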

b. Number of Abstractions:

Attention models with a single level of abstraction compute attention weights only on the original input sequence. Multi-level attention models apply attention at multiple levels of abstraction of the input sequence (for example, word-level and then sentence-level attention for document classification): the context vector of the lower abstraction level becomes the query state for the higher abstraction level. Such models can be further classified as top-down or bottom-up.

c. Number of Positions:

This category classifies attention models by the positions of the input sequence at which the attention function is computed. In Soft attention models, the context vector is a weighted average of all hidden states of the input sequence. These models keep the network fully differentiable, so it can be trained efficiently with backpropagation; however, computing a weight for every input position at every output step makes them computationally expensive for long sequences. In Hard attention models, the context vector is built from hidden states that are stochastically sampled from the input sequence, which makes training harder because the sampling step is not differentiable. The Global attention model is essentially the same as soft attention, whereas the Local attention model is an intermediate between the soft and hard mechanisms: it applies soft attention only within a window around a predicted position.
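The following small sketch contrasts the two position-based variants, assuming the attention weights have already been computed: soft attention takes a weighted average of all hidden states (fully differentiable), while hard attention stochastically samples a single state, which is why it cannot be trained with plain backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))           # T=4 hidden states of dimension 8
alphas = np.array([0.1, 0.6, 0.2, 0.1])    # attention weights, sum to 1

soft_context = alphas @ hidden             # soft: weighted average of all states
hard_index = rng.choice(len(alphas), p=alphas)
hard_context = hidden[hard_index]          # hard: one stochastically sampled state
print(soft_context.shape, hard_index)
```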

d. Number of Representations:

Multi-representational attention models capture different aspects of the input sequence through multiple feature representations of the same sequence; attention is used to assign importance weights to these representations and decide which aspects are most relevant. In Multi-dimensional attention, weights are computed for every feature dimension of the input, so the model can determine which dimensions of a representation are relevant. Both types are commonly used in natural language processing applications.
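As a rough illustration of the multi-dimensional case, the sketch below computes one attention weight per feature dimension of every token instead of a single scalar per token; the scoring matrix is an illustrative assumption, not the exact formulation used in the literature.

```python
import numpy as np

def multi_dim_attention(x, w):
    """x: (T, d) input sequence; w: (d, d) per-dimension scoring matrix."""
    scores = x @ w                                   # (T, d): a score per token per dimension
    weights = np.exp(scores - scores.max(axis=0))
    weights /= weights.sum(axis=0, keepdims=True)    # softmax over tokens, per dimension
    return (weights * x).sum(axis=0)                 # (d,): dimension-wise weighted summary

x = np.random.randn(5, 8)
w = np.random.randn(8, 8)
print(multi_dim_attention(x, w).shape)  # (8,)
```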

4. Network Architectures with Attention

The following neural network architectures are commonly used in combination with attention models.

a. Encoder-Decoder:

The ability of attention models to separate the input representation from the output makes hybrid encoder-decoders possible. A popular hybrid is one in which a Convolutional Neural Network (CNN) is used as the encoder and a Long Short-Term Memory (LSTM) network as the decoder. This architecture is useful for image and video captioning, speech recognition, and similar tasks.
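Below is a compact PyTorch sketch of such a hybrid, roughly in the style of image captioning models: a toy CNN encodes the image into a grid of feature vectors, and an LSTM decoder attends over that grid at every decoding step. The layer sizes, the tiny CNN, and the dot-product attention are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CNNEncoderLSTMDecoder(nn.Module):
    def __init__(self, vocab=1000, feat=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # toy CNN encoder
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTMCell(hidden + feat, hidden)  # one decoder step
        self.attn = nn.Linear(hidden, feat)             # query projection for attention
        self.out = nn.Linear(hidden, vocab)

    def forward(self, image, captions):
        feats = self.cnn(image).flatten(2).transpose(1, 2)   # (B, HW, feat) candidate states
        B = image.size(0)
        h = c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            q = self.attn(h)                                  # (B, feat) query state
            alpha = torch.softmax((feats * q.unsqueeze(1)).sum(-1), dim=1)  # (B, HW) weights
            context = (alpha.unsqueeze(-1) * feats).sum(1)    # attended image context
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                     # (B, T, vocab)

model = CNNEncoderLSTMDecoder()
print(model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 5))).shape)
```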

b. Memory Networks:

For applications such as chatbots and question answering, the input to the network is a knowledge database together with a query, and some facts in the database are more relevant to the query than others. End-to-end memory networks handle such problems by storing the facts in an array of memory blocks and using an attention model to determine how relevant each fact is for answering the query.
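A hedged PyTorch sketch of this idea is shown below: each fact is embedded into a memory slot, the query is scored against every slot with attention, and the attended memory combined with the query produces answer scores. The single memory hop and the embedding sizes are simplifying assumptions relative to the full end-to-end memory network.

```python
import torch
import torch.nn as nn

class MemoryNetwork(nn.Module):
    def __init__(self, vocab=500, dim=32):
        super().__init__()
        self.mem_embed = nn.EmbeddingBag(vocab, dim)   # memory (key) embedding of each fact
        self.out_embed = nn.EmbeddingBag(vocab, dim)   # output (value) embedding of each fact
        self.q_embed = nn.EmbeddingBag(vocab, dim)     # query embedding
        self.answer = nn.Linear(dim, vocab)

    def forward(self, facts, query):
        # facts: (num_facts, fact_len) word ids; query: (1, query_len) word ids
        keys = self.mem_embed(facts)                       # (num_facts, dim)
        values = self.out_embed(facts)                     # (num_facts, dim)
        q = self.q_embed(query).squeeze(0)                 # (dim,)
        attn = torch.softmax(keys @ q, dim=0)              # relevance of each fact to the query
        memory = attn @ values                             # attended summary of the facts
        return self.answer(memory + q)                     # answer scores over the vocabulary

net = MemoryNetwork()
print(net(torch.randint(0, 500, (10, 6)), torch.randint(0, 500, (1, 6))).shape)  # (500,)
```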

5. Applications

Attention modeling is an active area of research. Below, I discuss applications of attention modeling in four domains:

a. Natural Language Generation:

The Natural Language Generation domain involves tasks in which natural language text is generated as output. Machine translation, question answering, and multimedia description are applications in this domain that benefit from attention models.

b. Classification:

Multi-level, multi-dimensional, and multi-representational self-attention models are used for document classification. Sentiment classification also benefits from attention models.

c. Recommender System:

The attention mechanism is widely used in recommender systems for user profiling: attention weights are assigned to the items a user has interacted with in order to capture the user's long-term and short-term interests. Self-attention is typically used for this task.

d. Computer Vision:

Attention models are used for various Computer Vision problems such as image captioning, image generation, and video captioning. Many works augment Convolutional Neural Networks (CNNs) with self-attention for computer vision tasks.

6. Stand-Alone Self-Attention Model for Computer Vision

Convolutional Neural Networks (CNNs) are enormously popular in the Computer Vision domain and are considered the basic building block of computer vision architectures. However, convolutions scale poorly to large receptive fields, so attention models are often added to capture long-range interactions; in such works, attention is used on top of a convolutional network rather than on its own. A group of researchers therefore built a fully stand-alone self-attention vision model: they took an existing convolutional architecture (ResNet), replaced every instance of spatial convolution with a form of local self-attention, and also replaced the convolutional stem.
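To illustrate the core building block, here is a hedged PyTorch sketch of a local self-attention layer that can stand in for a k×k spatial convolution: each output pixel is a weighted sum of the values in its k×k neighborhood, with weights given by a softmax over query-key dot products. The relative position embeddings and multiple attention heads used in the original paper are omitted for brevity, and the sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k, self.out_ch = k, out_ch
        self.to_q = nn.Conv2d(in_ch, out_ch, 1)   # 1x1 (pointwise, not spatial) projections
        self.to_k = nn.Conv2d(in_ch, out_ch, 1)
        self.to_v = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        B, _, H, W = x.shape
        q = self.to_q(x)                                          # (B, C, H, W)
        k = F.unfold(self.to_k(x), self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        v = F.unfold(self.to_v(x), self.k, padding=self.k // 2)
        k = k.view(B, self.out_ch, self.k * self.k, H * W)
        v = v.view(B, self.out_ch, self.k * self.k, H * W)
        q = q.view(B, self.out_ch, 1, H * W)
        attn = torch.softmax((q * k).sum(1), dim=1)               # (B, k*k, H*W) weights per pixel
        out = (attn.unsqueeze(1) * v).sum(2)                      # attend over each neighborhood
        return out.view(B, self.out_ch, H, W)

layer = LocalSelfAttention2d(16, 32, k=3)
print(layer(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 32, 8, 8])
```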

7. Experiments performed using Stand-Alone Self-Attention Model

a. ImageNet Classification:

The researchers experimented on the ImageNet classification task, which contains 1.28 million training images and 50,000 test images. They replaced the spatial convolution layers with self-attention layers and used a position-aware attention stem. The attention models outperform the convolutional baseline across all network depths.

Comparison of the Full Attention model with the convolutional ResNet baseline on the classification task.

b. COCO Object Detection:

The stand-alone self-attention model was also evaluated on the COCO object detection task using the RetinaNet architecture, with the attention-based network serving as the RetinaNet backbone. The fully self-attentional model performed well across the vision tasks considered.

Comparison of the Full Attention model with the RetinaNet baseline on the COCO object detection task.

8. Conclusion

This article has provided a high-level overview of attention models: their architecture, categories, and applications, along with an overview of the Stand-Alone Self-Attention model for Computer Vision. For more detailed information, please refer to the original papers linked below.

9. References

  1. Chaudhari et al., "An Attentive Survey of Attention Models": https://arxiv.org/pdf/1904.02874.pdf
  2. Ramachandran et al., "Stand-Alone Self-Attention in Vision Models": http://papers.nips.cc/paper/8302-stand-alone-self-attention-in-vision-models.pdf
