What is all the fuss about Attention in Generative AI?

Freedom Preetham · Published in Autonomous Agents · Feb 19, 2023 · 4 min read

I am surprised by how many people in the AI and ML field regurgitate the famous “Attention” mechanism from Vaswani et al.’s “Attention Is All You Need” paper without actually knowing what it is. Did you know that attention has been around since the 1990s in the form of sigma-pi units? Here is an essential primer for beginners, at a very high level, without getting into intricate details.

What are Transformers?

Transformers are a type of deep learning model that has gained much popularity in natural language processing (NLP) tasks. One of the essential components of Transformers is attention, which allows the model to focus on certain parts of the input sequence while processing it.

What is Attention?

Attention is a mechanism that allows a deep learning model to selectively focus on certain parts of the input sequence while processing it. The idea behind attention is inspired by how humans focus on different parts of a scene when processing visual information. For example, we tend to focus more on certain words that convey important information when reading a sentence.

Attention in Transformers

Transformers use a specific attention mechanism called self-attention, which allows the model to relate different positions of the input sequence to one another while processing it. In practice, Transformers apply self-attention in a multi-headed form, commonly referred to as multi-head attention.

Self-attention works by computing a weighted sum over the input sequence, where the weights are derived from the similarity between each element in the sequence and a query vector. Each position produces its own query, computed from the current representation of the input sequence, which is updated at every layer of the model.

To compute the weights, each element of the input sequence is projected into three vectors:

  1. Q: the query vector,
  2. K: the key vector, and
  3. V: the value vector.

These vectors are produced by applying three different learned linear transformations to the input sequence. The query for a given position is compared against the keys of all positions to score how relevant each element is; those scores, normalized with a softmax, determine how much attention each element receives, and the value vectors are what actually get aggregated according to those weights.

Once the weights are computed, they are used to compute a weighted sum of the value vectors, which gives us the attention vector. This attention vector is then combined with the current representation of the input sequence to produce the output of the self-attention layer.
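To make the mechanics concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices W_q, W_k, W_v and the dimensions are illustrative assumptions, not the configuration from the paper.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax over the chosen axis.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_model) current representations of the sequence.
    # W_q, W_k, W_v: (d_model, d_k) learned projection matrices.
    Q = x @ W_q                          # one query per position
    K = x @ W_k                          # one key per position
    V = x @ W_v                          # one value per position
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values = attention output

# Toy usage with random (untrained) projections, purely for shape checking.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)   # shape: (5, 8)
```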

Multi-Headed Attention

Transformers also use multi-headed attention, which allows the model to attend to information from different representation subspaces at different positions. Multi-headed attention performs multiple attention operations in parallel, each with its own query, key, and value vectors. The outputs of these parallel attention operations are concatenated and linearly transformed to produce the final output of the self-attention layer.
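As a rough sketch of how the heads fit together, reusing the self_attention helper and the toy tensors from the sketch above (again with illustrative, randomly initialized projections rather than the paper's exact configuration), multi-head attention simply runs several independent attention computations and concatenates their outputs:

```python
def multi_head_attention(x, heads, W_o):
    # heads: list of (W_q, W_k, W_v) projection triples, one per head.
    # W_o:   (num_heads * d_k, d_model) output projection back to model width.
    head_outputs = [self_attention(x, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, num_heads * d_k)
    return concat @ W_o                              # (seq_len, d_model)

num_heads = 2
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_k, d_model))
y = multi_head_attention(x, heads, W_o)              # shape: (5, 16)
```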

The Real Seminal Work Related to the Attention Mechanism

If you need to cite the seminal and impactful uses of the attention mechanism, here are three ‘real’ significant breakthroughs:

1) Larochelle and Hinton (2010), whose mind-blowing insights on attention later gave rise to CapsNet (“Learning to combine foveal glimpses with a third-order Boltzmann machine”).

2) Content-based attention from Graves et al. (2014) and self-attention from Cheng et al. (2016).

3) Additive attention (the real seminal paper) from Bahdanau et al. (2014) and multiplicative attention from Luong et al. (2015). These were indeed breakthroughs.

Neither self-attention nor multiplicative (dot-product) attention is new; both predate the Vaswani et al. paper by years.

Technically, everything above is already a Transformer in the broad sense (an encoder-decoder sequence transduction model with attention).

What the Vaswani et al. paper did as an incremental innovation was two things (which are pretty beautiful and poetic, as it turned out later):

1) Dispensing with recurrence and convolutions altogether (which leaves you with ONLY attention).
2) Using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

The beauty of this was the scaled dot-product attention (built on top of Luong's multiplicative attention).
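In formula terms (standard notation with query q, key k, and key dimension d_k; a comparison sketch, not a quotation from either paper), the multiplicative score and its scaled variant differ only by the scaling factor:

```latex
\mathrm{score}_{\text{dot}}(q, k) = q^{\top} k \quad \text{(Luong, multiplicative)}
\qquad
\mathrm{score}_{\text{scaled}}(q, k) = \frac{q^{\top} k}{\sqrt{d_k}} \quad \text{(Vaswani et al.)}
```

Dividing by the square root of d_k keeps the dot products from growing with the key dimension and pushing the softmax into regions with vanishing gradients.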

We can get to the math debates on the side later, but one must learn the technical intricacies a priori before debating ad nauseam, philosophically, about what is radical versus iterative! It does not matter.

Attention comes in the following forms:
1) Implicit vs. explicit attention
2) Soft vs. hard attention
3) Global vs. local attention
4) For convolutions: spatial vs. channel attention

Also, there are different types of alignment scores for attention (a small sketch comparing these score functions follows this list):
1) Content-based, using cosine similarity scores
2) Additive (Bahdanau et al.)
3) Location-based, general, and multiplicative (three separate intricacies) (Luong)
4) Scaled dot product (Vaswani et al.)
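Here is a minimal sketch contrasting several of these score functions for a single query against a small set of keys. The weight matrices, vector v, and dimensions are illustrative assumptions, and the location-based variant (which scores positions rather than content) is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
q = rng.normal(size=d)           # a single query vector
K = rng.normal(size=(6, d))      # six key vectors
W = rng.normal(size=(d, d))      # learned matrix for Luong's "general" score
W1 = rng.normal(size=(d, d))     # learned matrices for Bahdanau's additive score
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)           # learned vector for Bahdanau's additive score

# One score per key, computed with each alignment function:
content_based  = (K @ q) / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))  # cosine similarity
additive       = np.tanh(q @ W1 + K @ W2) @ v                               # Bahdanau-style
general        = K @ W @ q                                                  # Luong "general"
multiplicative = K @ q                                                      # Luong dot product
scaled_dot     = (K @ q) / np.sqrt(d)                                       # Vaswani et al.
```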

Conclusion

Attention is a critical component of Transformers, allowing the model to selectively focus on certain parts of the input sequence while processing it. Self-attention works by computing a weighted sum of the input sequence, where the weights are computed based on the similarity between each element in the sequence and a query vector.

Multi-headed attention allows the model to attend to information from different representation subspaces at different positions, improving the model’s ability to capture complex patterns in the input sequence. Overall, attention has been key to the success of Transformers in NLP tasks, and will likely continue to be an important area of research in deep learning.
