Selective Attention: The Key to Unlocking the Full Potential of Deep Learning

Published in

Artificial Corner

11 min readMay 7, 2023

State-of-the-art Applications

Attention mechanisms have become crucial to natural language processing (NLP) and computer vision (CV) in recent years. It enables machines to selectively focus on parts of the input most relevant to the task. Attention mechanism has been extensively used in transformers, a type of neural network architecture that has become state-of-the-art for many NLP tasks such as language translation, question answering, and sentiment analysis.

In transformers, attention mechanisms enable the models to look at all the tokens in the input sequences and selectively attend to the most relevant ones when generating the output sequences. This attention-based approach has shown to be highly effective in capturing long-term dependencies and improving the quality of the generated output.

For example, in natural language translation, the attention mechanism allows the AI model to focus on the most relevant words in the source language sentence when generating the corresponding words in the target language sentence. Similarly, in image captioning, the attention mechanism enables the AI model to selectively attend to the most relevant parts of the image when generating the corresponding caption.

Dot Product Attention

Dot product attention is an attention mechanism that computes the attention weights as the dot product of the decoder state and the encoder states. In dot product attention, the alignment score function is calculated as follows:

where h refers to the hidden states for the encoder, and s is the hidden states for the decoder.

Computationally, given a decoder state and a set of encoder states, the attention weights are computed as the dot product between the decoder state and each encoder state, followed by normalization of the scores. It is equivalent to multiplicative attention without a trainable weight matrix.

Dot product attention can be used to improve the accuracy of outputs for tasks such as sentiment analysis and question answering. One example of its application is in the GPT-3.5 (GPT stands for Generative Pre-trained Transformer). GPT-3.5 uses dot product attention in its transformer architecture to enable the model to attend to the most important words in the input sequence when generating the corresponding words in the output sequence. This allows the model to generate high-quality text that is coherent and consistent with the input context.

Scoring, Normalization, and Convenience Functions

In the context of the attention mechanism of a transformer, a scoring function determines how vital each input token is to the output at each time step. The scoring function computes a score for each token based on its similarity to the current decoder state and then normalizes the scores using a softmax function. The resulting attention weights are then used to compute a weighted sum of the input tokens, which, in turn, is used to compute the context vector for the decoder.

The softmax function ensures that the attention weights computed by the scoring function form a valid probability distribution over the input tokens. This allows the model to selectively attend to the most relevant parts of the input sequence and suppress irrelevant information.

Mathematically, the softmax function is typically defined as follows:

Where x is a vector of scores computed by the scoring function, and exp(x) is the element-wise exponential function. The denominator of the softmax function is the sum of the exponential scores, which ensures that the resulting attention weights sum to 1.

The softmax function has several desirable properties that make it a good choice for normalizing the attention scores:

It is a smooth and continuous function that is differentiable, making it easy to incorporate into the training of machine learning models.
It produces a valid probability distribution over the input tokens, which is necessary for the attention mechanism to work effectively.
It is computationally efficient and can be easily implemented using standard numerical libraries.

While the softmax function is the most commonly used normalization function in the attention mechanism, there are other functions that can be used to achieve a similar effect. For example, the sigmoid function can be used to normalize the scores, although it has a different range of values and is not as effective at suppressing irrelevant information. Other functions, such as the log-softmax function and the softplus function, have also been used in some research studies, although they are less commonly used than the softmax function.

Taking language translation by a transformer as an example, scoring functions enable the model to focus on the most relevant words in the source sentence when generating the corresponding words in the target sentence. This selective attention mechanism enables the model to capture long-term dependencies between the input and output, consequently improving the translation quality.

Both softmax and linear transformations are also known as convenience functions, as they can help to simplify the computation of attention scores in transformers. Compared to softmax, a linear transformation typically involves multiplying the input representations with a learnable weight matrix, followed by a bias term. This enables the model to learn a linear transformation of the input representations specific to the task at hand.

One example of an application of convenience functions in transformers is in the BERT (short for Bidirectional Encoder Representations from Transformers). BERT uses both softmax and linear transformations in its attention mechanism to enable the model to attend to the most relevant parts of the input sequence when computing the representation of the sentence. This approach has shown to be highly effective in improving the accuracy of tasks, such as question answering and named entity recognition.

Scaled Dot Product Attention

Scaled dot product attention is a variant of dot product attention that uses a scaling factor to prevent the dot product from exploding as the dimensionality of the input increases. Computationally, the attention weights are calculated as the dot product between the decoder state and the encoder states, divided by the square root of the dimensionality of the input. For a mini-batch of n queries and m key pairs, the corresponding softmax function can be expressed as follows:

Figure 1 illustrates the flow of actions in a simplified example where only one query is presented.

**Figure 1**. A hypothetical scheme where the output of attention pooling is calculated as a weighted average of values while the weights are computed with the attention scoring function α and normalized by the softmax operation (drawn by the author).

Scaled dot product attention is widely used in transformers, not only because it can capture the most relevant information from the input sequence effectively but also because it can effectively suppress numerical instability. In transformers, scaled dot product attention is typically used in both the encoder and decoder layers.

One example of an application of scaled dot product attention in transformers is in the Transformer-XL. Transformer-XL uses scaled dot product attention in its architecture, enabling the model to generate high-quality and coherent text consistent with the input context. Another example of an application of scaled dot product attention in transformers is in the T5 (short for Text-To-Text Transfer Transformer). T5 uses scaled dot product attention in its architecture to enable the model to generate high-quality outputs in various tasks such as summarization, question answering, and language translation.

Additive Attention

Additive attention is a type of attention mechanism that computes the attention weights as the sum of the decoder and encoder states, followed by a non-linear activation function such as the hyperbolic tangent. Specifically, given a decoder state and a set of encoder states, the attention weights are computed as the sum of the decoder state and each encoder state, followed by a non-linear activation function.

Additive attention is less commonly used in transformers than dot-product attention and scaled dot-product attention. However, it has been shown to be effective in capturing non-linear relationships between the input and output.

In transformers incorporating additive attention, such as Sparse Transformer, it is typically deployed in both the encoder and decoder layers. This enables the model to reduce the computational complexity while still attending to the most relevant parts of the input sequence and achieving state-of-the-art performance.

Another example is the Reformer model. Reformer uses additive attention in its transformer architecture to enable the model to reduce the attention mechanism’s memory requirements yet achieve high-quality performance in language modeling.

From NLP to Computer Vision

Before the advent of attention mechanisms, many machine learning models relied on fixed-length input representations such as bag-of-words or fixed-length feature vectors. While these models were effective in some applications, they could not capture the complex and dynamic relationships between the input and output.

For example, traditional models such as n-gram models, hidden Markov models, and conditional random fields were widely used in NLP. However, they were limited in capturing long-range dependencies between words in a sentence. As a result, they often struggled with tasks such as language translation and sentiment analysis, where understanding the context and relationship between words is crucial.

Introducing attention mechanisms in models such as the transformer represented a major breakthrough in NLP, enabling models to selectively attend to the most relevant parts of the input sequence and capture long-range dependencies between words.

Similarly, in computer vision, traditional models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) were effective in some applications. However, they struggled with tasks such as object recognition and image captioning, where understanding the context and relationship between different parts of an image is crucial.

Introducing attention mechanisms in models such as the Vision Transformer (ViT) and the DenseNet-based Spatial Attention Networks (SAN) has enabled these models to achieve state-of-the-art performance on various computer vision tasks.

To understand how models such as ViT and SAN have outperformed traditional models such as CNNs and RNNs, let’s briefly examine their respective underlying architectures.

In traditional CNNs, the input image is passed through a series of convolutional and pooling layers to extract features at different scales. These features are then passed through one or more fully connected layers to generate the final output. One weakness of such a CNN architecture is its limited capability in object recognition and image-captioning, where it is vital to understand the context and relationship between different parts of an image.

Incorporating an attention mechanism into the architecture of a CNN has addressed this limitation. For example, in SAN, the attention mechanism is used to compute spatial attention maps that allow the model to focus on the most informative regions of the input image.

Similarly, in the ViT architecture, the input image is first divided into a grid of patches, which are then passed through a sequence of linear projections to generate a sequence of feature vectors. The attention mechanism is then used to compute the relevance of each patch to the final output, allowing the model to selectively attend to the most relevant patches.

Mathematically, the attention mechanism in ViT and SAN can be expressed as a weighted sum of the input features, where the weights are computed using a softmax function applied to a scoring function. The scoring function can be based on the dot product, Euclidean distance, or other distance measures, depending on the specific application.

Figure 2 illustrates the principle of a ViT architecture where the attention mechanism is implemented via the form of multi-head attention.

**Figure 2**. An illustrative architecture to implement the attention mechanism in a ViT (drawn by the author).

One advantage of the attention mechanism in these models is that it allows the model to capture long-range dependencies between different parts of the input, which is crucial for tasks such as image captioning and object recognition. Moreover, the attention mechanism is able to focus on the most informative regions of the input, reducing the impact of noisy or irrelevant features.

In summary, the attention mechanism has enabled models such as ViT and SAN to achieve high-quality performance on a wide range of computer vision tasks by allowing them to selectively attend to the most relevant parts of the input image and capture long-range dependencies between different parts of the input. By incorporating attention mechanisms into traditional CNNs and RNNs, researchers are able to build more powerful and accurate models that can solve increasingly complex problems in computer vision.

Limitations

Despite its many advantages, it is important to note that the attention mechanism is not a silver bullet that can solve all problems in deep learning. Not only it may not be effective in all types of datasets or for all types of tasks, but also its performance may be impacted by factors such as the size of the input sequence and the complexity of the model architecture. Like any tool, its limitations require careful consideration when designing and implementing deep learning models.

For example, one of the challenges is the computational complexity of attention-based models, which may require significant resources to train and deploy. Consequently, explorations of new techniques, such as sparse attention and low-rank approximation, are called for to reduce the computational cost associated with the attention mechanism.

Another one of the challenges is the interpretability of attention-based models. While the attention mechanism can provide insights into which parts of the input are most relevant for a given output, it can be challenging to interpret and understand how the models arrive at their overall predictions. To address this challenge, new techniques, such as attention visualization and attention attribution, may need to be explored to understand how attention-based models arrive at their predictions and make them more trustworthy and explainable.

Future Opportunities

With the increasing availability of large-scale datasets and powerful computing resources, attention mechanisms will continue to be refined and optimized, leading to even more impressive results on various tasks. Researchers are also exploring new directions in attention mechanism research, such as multi-head attention and self-attention, which have the potential to further improve the performance of machine learning models.

Furthermore, the attention mechanism can be used in conjunction with other techniques, such as transfer learning and pre-training, to further improve the performance of computer vision models. For example, ViT can be pre-trained on large-scale image datasets using techniques such as contrastive learning or self-supervised learning, allowing it to learn general features that can be fine-tuned for specific tasks.

Similarly, SAN can be combined with transfer learning techniques, such as fine-tuning pre-trained CNNs, to improve their performance on specific tasks. By incorporating attention mechanisms into these pre-training and transfer learning techniques, we should be able to build more effective and efficient models that can learn from large-scale datasets and achieve state-of-the-art performance on a wide range of computer vision tasks.

In addition to its applications in NLP and computer vision, attention mechanisms can also be explored in other domains such as speech recognition, finance, healthcare, and climate science. For example, the attention mechanism is used in speech recognition models to selectively attend to the most informative parts of the audio signal and improve accuracy.

Overall, the attention mechanism has revolutionized the field of deep learning. By addressing the challenges and limitations of attention mechanisms and exploring their full potential in different domains, we should be able to unlock new opportunities and achieve even greater advancements in artificial intelligence.

Allow me to take this opportunity to thank you for being here! I would be unable to do what I do without people like you who follow along and take that leap of faith to read my postings.

If you like my content, please press the “Follow” button at the upper-right corner of your screen (below my photo). Once you have pressed the button, Medium will remember to update you on my publications in real time. I can also be contacted on LinkedIn, Facebook, or Twitter: Twitter.

Attention Mechanisms in Transformers | Artificial Corner (medium.com)