Attention Mechanisms in Transformers

A Deep Dive into Queries, Keys, Values, and Pooling Techniques

Eric S. Shi 舍予
Artificial Corner
9 min read · Apr 30, 2023


What Is the Attention Mechanism?

Ever since the arrival of ChatGPT (GPT = Generative Pretrained Transformer), transformers, serving as Large Language Models (LLMs), have outshone other AI models. While the public hype focuses on GPTs’ surprising capabilities and the threats they pose to human employment and employability, fewer people have paid attention to the techniques under the hood that enable GPTs’ outstanding performance, and to the limitations that come with those techniques.

The attention mechanism is clearly one of the vital techniques behind that performance. In fact, it is a critical component of many deep learning models, especially in natural language processing (NLP). It enables AI models to focus on the relevant parts of the input sequence and thereby generate accurate predictions. The attention mechanism has become increasingly popular in recent years because of its ability to improve performance on NLP tasks such as language modeling, machine translation, and question answering.

So, what is the attention mechanism? In the context of our daily life, attention is the cognitive process of selectively concentrating on one or a few things while ignoring others. Suppose you are given a photo of a group of kindergarten kids playing in a room, and someone asks you, “How many people are there?” How would you derive your answer? You count heads. You don’t need to consider anything else, such as the clothes the kids wear, the furniture in the room, or the colors of the walls. Now, if someone asks a different question, “Who is the teacher?”, your eyes will immediately search for the adult in the photo, right? The rest of the features will be ignored. This is the attention mechanism that the human brain is so adept at implementing.

For a neural network, the attention mechanism is a computational realization of the same process of selectively concentrating on one or a few things while ignoring others. This post provides an overview of the attention mechanisms used in transformer models, covering the concepts of queries, keys, and values; attention pooling by similarity; attention pooling via Nadaraya-Watson regression; and adapting attention pooling to different applications. The examples will demonstrate how attention mechanisms have revolutionized NLP and image classification, and why they will continue to be an essential area of research.

The importance of attention mechanisms in GPTs cannot be overstated. With the rise of deep learning and the availability of large amounts of data, attention mechanisms have become a crucial tool for improving the performance of models on a wide range of NLP tasks. By enabling models to focus on relevant parts of the input sequence, attention mechanisms have significantly improved the accuracy of predictions and have become a standard technique in many transformer models.

Queries, Keys, and Values

Queries, keys, and values are the fundamental concepts of the attention mechanism in transformer models. The attention mechanism compares the queries with the keys to generate a similarity score, which is then used to weigh the values. The weighted values are then summed to generate the output of the attention mechanism.

Suppose that we have a simple database D consisting of tuples of keys (k) and values (v), e.g., {(“McDonald”, “Brent”), (“Smith”, “Arron”), (“Wang”, “Ke”), (“Sydney”, “Alex”), (“Hinton”, “Zachary”), (“Sato”, “Rachel”)}, with the last name being the key and the first name the value. We can operate on D with an exact query (q) for “Smith”, which would return the value “Arron”. If (“Smith”, “Arron”) were not a record in D, there would be no valid answer. If we also allowed for approximate matches, we would retrieve (“McDonald”, “Brent”) instead.
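To make the analogy concrete, here is a minimal Python sketch of this kind of key-value lookup. The database, the query strings, and the prefix-overlap similarity used for the “approximate” match are all illustrative assumptions; in particular, which record counts as the closest match depends entirely on how similarity is defined.

```python
# A toy key-value "database": last names are keys, first names are values.
db = {"McDonald": "Brent", "Smith": "Arron", "Wang": "Ke",
      "Sydney": "Alex", "Hinton": "Zachary", "Sato": "Rachel"}

def exact_lookup(query):
    # Exact match: return the value only if the key is present.
    return db.get(query)

def approximate_lookup(query):
    # Approximate match: pick the key sharing the most leading characters
    # with the query (a deliberately crude stand-in for a similarity measure).
    def similarity(key):
        return sum(a == b for a, b in zip(query.lower(), key.lower()))
    best_key = max(db, key=similarity)
    return db[best_key]

print(exact_lookup("Smith"))           # -> "Arron"
print(exact_lookup("Smithers"))        # -> None (no exact record)
print(approximate_lookup("Smithers"))  # -> "Arron" (closest key under this measure)
```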

Over a generic database DB = {(k1, v1), (k2, v2), …, (km, vm)} consisting of m tuples of keys and values, and given a query q, we can define the attention over DB as

Attention(q, DB) = α(q, k1) v1 + α(q, k2) v2 + … + α(q, km) vm,

where α(q, ki) ∈ R (i = 1, …, m) are scalar attention weights. This operation is typically referred to as attention pooling. The name “pooling” derives from the fact that the right side of the equation is a summation; the name “attention” derives from the fact that the outcome is mainly dictated by (as if paying particular attention to) the terms for which the weight α is large. As such, the attention over DB generates a linear combination of the values contained in the database.
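As a numerical illustration of this formula, the sketch below computes Attention(q, DB) for a toy set of vector-valued keys and values. The particular choice of weights, a softmax over negative squared distances between the query and each key, is an assumption made here for illustration; any weighting scheme that depends on the compatibility of q and ki would do.

```python
import numpy as np

# Toy database: m = 4 tuples, each key and value is a 2-d vector.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0], [2.0, 2.0]])
query = np.array([0.9, 0.1])

# One possible choice of attention weights alpha(q, k_i): a softmax over
# negative squared distances, so keys closer to the query get larger weights.
scores = -np.sum((keys - query) ** 2, axis=1)
alpha = np.exp(scores) / np.exp(scores).sum()

# Attention pooling: a linear combination of the values.
output = alpha @ values  # equivalent to sum_i alpha_i * v_i

print("weights:", np.round(alpha, 3))
print("output :", np.round(output, 3))
```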

Figure 1 expresses the operation stated above as a flow-chart diagram.

Figure 1. Illustration of the computational realization of attention mechanism (attention pooling), i.e., a linear summation over values vi, where weights are derived according to the compatibility between a query q and keys ki.

Depending on the input data, queries, keys, and values can be represented in various ways, such as vectors, matrices, or tensors. In the context of NLP, queries and keys are often represented as word embeddings, while values are represented as contextualized embeddings.

Visualizing queries, keys, and values can help enhance understanding. For example, the transformer model’s attention mechanism can be visualized as a matrix, where the rows correspond to the queries and the columns correspond to the keys. The values are represented as a matrix, where each row corresponds to a position in the input sequence.
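As a small illustration of this matrix view, the sketch below builds a toy attention-weight matrix whose rows index queries and whose columns index keys; the dimensions and the plain softmax-over-dot-products weighting are assumptions chosen only to show the shapes involved.

```python
import numpy as np

# Illustrative shapes only: 3 queries and 5 keys, each embedded in 8 dimensions.
num_queries, num_keys, dim = 3, 5, 8
rng = np.random.default_rng(42)
Q = rng.normal(size=(num_queries, dim))
K = rng.normal(size=(num_keys, dim))

# Raw similarity scores: one row per query, one column per key.
scores = Q @ K.T  # shape (3, 5)

# Normalize each row into attention weights that sum to 1.
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)

print(weights.shape)        # (num_queries, num_keys): the "attention matrix"
print(weights.sum(axis=1))  # each row sums to 1.0
```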

Queries, keys, and values are used in many transformer models, such as BERT and RoBERTa. In these models, the attention mechanism enables the model to attend to relevant parts of the input sequence, generating accurate predictions.

For example, in the BERT model, the queries and keys are generated from the input text, and the values are generated from the contextualized embeddings of the input text. The attention mechanism enables the model to attend to, for instance, the subject and object of a sentence, generating accurate predictions for tasks such as sentence classification and named entity recognition.

In short, queries, keys, and values are the fundamental concepts of the attention mechanism in transformer models. Represented as vectors, matrices, or tensors, and aided by visualizations, they enable the models to focus on relevant parts of the input sequence, generating accurate predictions for a wide range of NLP tasks.

Attention Pooling by Similarity

Attention pooling by similarity is a technique used in transformer models to generate a weighted sum of the input sequence based on a similarity measure. The attention mechanism compares the queries with the keys and generates a score representing their similarity. The scores are then normalized and used as weights to generate a weighted sum of the values, which is used as the output of the attention mechanism.

Kernels and data are used to implement attention pooling by similarity in transformer models. The kernel function is used to measure the similarity between the queries and the keys, and it can be chosen based on the characteristics of the data. For example, the dot-product kernel and the scaled-dot-product kernel are commonly used in transformer models. The dot-product kernel measures the dot product between the queries and the keys, while the scaled-dot-product kernel scales the dot product by the square root of the dimension of the queries and keys.
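A minimal NumPy sketch of attention pooling with these two kernels is shown below, assuming single-head attention and omitting batching and masking for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V, scaled=True):
    """Attention pooling with a (scaled) dot-product kernel.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    """
    scores = Q @ K.T                            # dot-product kernel
    if scaled:
        scores = scores / np.sqrt(Q.shape[-1])  # scale by sqrt(d)
    weights = softmax(scores, axis=-1)          # normalize scores per query
    return weights @ V                          # weighted sum of the values

# Toy example: 2 queries, 6 keys/values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
print(dot_product_attention(Q, K, V).shape)  # (2, 3): one pooled value per query
```

Scaling by the square root of the dimension keeps the raw dot products from growing with d, which would otherwise push the softmax into a regime where nearly all of the weight lands on a single key.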

The data are used to represent the input sequence, and they can be represented in various ways, such as word embeddings or image patches. The data can also be processed in different ways, such as convolutions or self-attention.

A transformer model can use attention pooling by similarity to generate contextual representations of the input text or to perform language translation tasks. In these cases, attention pooling by similarity enables the models to attend to relevant parts of the input sequence, generating accurate predictions.

Another example of attention pooling by similarity is the Vision Transformer (ViT) model, which uses it to generate visual embeddings of image patches. In this model, attention pooling by similarity enables the model to attend to the most relevant image patches, improving its accuracy in image classification tasks.

So, attention pooling by similarity is a powerful technique widely used in transformer models. Through the use of kernels and data, attention pooling by similarity enables the models to focus on relevant parts of the input sequence.

Attention Pooling via Nadaraya-Watson Regression

Attention pooling via Nadaraya-Watson regression is a non-parametric regression technique used in transformer models to generate a weighted sum of the input sequence based on a kernel function. Nadaraya-Watson regression assigns weights to each data point based on its distance from the query and then normalizes the weights to generate a weighted sum of the values.
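A minimal one-dimensional sketch of the Nadaraya-Watson estimator with a Gaussian kernel is given below; the bandwidth value and the toy sine-wave data are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(query, x_train, y_train, bandwidth=0.5):
    # Gaussian kernel of the distance between the query and each training input.
    scores = -0.5 * ((query - x_train) / bandwidth) ** 2
    weights = np.exp(scores)
    weights = weights / weights.sum()  # normalize the weights to sum to 1
    return np.dot(weights, y_train)    # weighted sum of the training values

# Toy data: noisy samples of y = sin(x).
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0.0, 5.0, size=50))
y_train = np.sin(x_train) + 0.3 * rng.normal(size=50)

print(nadaraya_watson(2.0, x_train, y_train))  # smoothed estimate near sin(2.0) ≈ 0.91
```

Note the correspondence with attention pooling: the query plays the role of q, the training inputs play the role of keys, and the training targets play the role of values.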

Nadaraya-Watson regression has been connected to many transformer models, such as RoBERTa and XLNet, whose attention layers can be viewed as generating contextual embeddings of words in this kernel-regression manner. Under this view, Nadaraya-Watson-style pooling helps RoBERTa attend to the most relevant parts of the input sequence, which improves its accuracy in generating predictions. Similarly, it helps XLNet capture dependencies between tokens and generate contextualized embeddings, which improves its accuracy in language modeling tasks.

One advantage of attention pooling via Nadaraya-Watson regression is that it is computationally efficient, making it well-suited for large-scale applications. Compared to other attention pooling techniques that require large amounts of memory to store the similarity scores between the query and the keys, Nadaraya-Watson regression only requires the computation of the kernel function, which is typically less computationally expensive.

However, there are also disadvantages to attention pooling via Nadaraya-Watson regression: it relies heavily on the chosen kernel function, which may not always capture the nuances of the data. Consequently, it may not capture complex relationships between the input sequence and the output sequence as well as other attention mechanisms do. Additionally, it assumes that the data are independent and identically distributed (iid), which may not hold for some applications.

Despite these limitations, attention pooling via Nadaraya-Watson regression remains valuable in many transformer models. It is particularly effective in generating contextual embeddings of words and capturing dependencies between tokens, making it well-suited for some NLP applications. Its effectiveness depends on the specific application and the kernel function chosen.

Adapting Attention Pooling

Attention pooling can be adapted to different applications by changing the kernel function, modifying the data points, or introducing additional constraints. For example, dynamic pooling can be used to adapt attention pooling to time-series data, while adaptive pooling can be used to adapt it to non-stationary data.
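One simple way to see how the kernel choice reshapes attention pooling is to swap the kernel, or vary its bandwidth, in a one-dimensional setting: narrow kernels concentrate the weights on the closest keys, while wide or hard-cutoff kernels spread them out. The kernels and bandwidth below are illustrative assumptions, not the specific mechanisms used in any particular transformer model.

```python
import numpy as np

def attention_weights(query, keys, kernel="gaussian", bandwidth=1.0):
    """Normalized attention weights for a scalar query under different kernels."""
    dist = np.abs(keys - query)
    if kernel == "gaussian":
        scores = np.exp(-0.5 * (dist / bandwidth) ** 2)           # smooth decay
    elif kernel == "boxcar":
        scores = (dist < bandwidth).astype(float)                 # hard cutoff
    elif kernel == "epanechnikov":
        scores = np.maximum(1.0 - (dist / bandwidth) ** 2, 0.0)   # parabolic, compact support
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    return scores / scores.sum()

keys = np.linspace(0.0, 4.0, 9)  # evenly spaced scalar keys
for kernel in ("gaussian", "boxcar", "epanechnikov"):
    print(kernel, np.round(attention_weights(2.0, keys, kernel), 2))
```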

Dynamic pooling has been used in certain transformer models, such as the TimeSformer model, which uses it to generate spatiotemporal representations of videos. In this model, dynamic pooling adapts attention pooling to the time-series nature of the input data.

Adaptive pooling has been used in transformers such as ViTs, which use it to generate visual embeddings. In these models, adaptive pooling adapts attention pooling to the non-stationary nature of the input data.

Adapting attention pooling is a promising research area, as it enables attention mechanisms to be applied to various applications. Different methods for adapting attention pooling, such as dynamic pooling and adaptive pooling, allow attention mechanisms to be tailored to specific applications, improving the accuracy of the predictions.

Moreover, attention pooling can be adapted to various tasks, such as language modeling, translation, and image classification. For example, in the Text-to-Text Transfer Transformer (T5) model, the attention mechanism is adapted to perform text-to-text transformations, generating accurate predictions for various tasks.

Examples of the applications of attention mechanisms to transformer models, such as the BERT, GPT, and RoBERTa language models, the T5 and BART models, the XLNet and TimeSformer models, and the ViT model, demonstrate the wide range of use cases that attention mechanisms can handle. Through visualization, kernel functions, data, and adaptations, attention mechanisms have helped transformer models to revolutionize the field of natural language processing and image classification, and they will continue to be an important research area in the future.

Future

Attention mechanisms have been essential for transformer models, enabling them to generate accurate predictions by attending to relevant parts of the input sequence. Attention pooling by similarity and attention pooling via Nadaraya-Watson regression generate weighted sums of the input sequence, while adapting attention pooling to different applications, through methods such as dynamic pooling and adaptive pooling, improves the accuracy of the predictions.

With the advancements in deep learning and the development of new transformer models, attention mechanisms may continue to play an essential role in improving their performance and making them more adaptable to different applications. As a result, attention mechanisms may remain an important area of research for computer engineers, data scientists, and software engineers alike.

Allow me to take this opportunity to thank you for being here! I would be unable to do what I do without people like you who follow along and take that leap of faith to read my postings.

If you like my content, please feel free to press the “Follow” button at the upper-right corner of your screen (below my photo). Once you have pressed the button, Medium will keep you updated in real time. I can also be contacted on LinkedIn, Facebook, or Twitter.

