What’s the Difference Between Self-Attention and Attention in Transformer Architecture?

The one article you need if you’re still confused about which is which

Witold Wydmanski
Dec 3, 2022

Are you interested in learning about the transformer architecture, a popular neural network model used in natural language processing (NLP) tasks? If so, you may have heard about self-attention and attention, two related but distinct concepts that are central to the transformer model. In this blog post, we will explain the difference between self-attention and attention in transformer architecture and why they are important for the performance of transformer models.

We won’t go deep into the maths involved in these algorithms — there are already blog posts like this one which explain it much better than I could. Instead, we will aim to get a high-level overview of their differences.


Attention

Self-attention and attention are both mechanisms that allow transformer models to attend to different parts of the input or output sequences when making predictions. These mechanisms are crucial for the performance of transformer models in tasks such as language translation, text summarization, and sentiment analysis, where the model needs to understand the relationships between different words or phrases in the input and output sequences.

Attention refers to the ability of a transformer model to attend to different parts of another sequence when making predictions. This is often used in encoder-decoder architectures, where the encoder vectorizes the input sequence and the decoder attends to the encoded representation of the whole input when making predictions. For example, in a language translation task, the encoder processes the source-language sentence and generates an encoded representation of it, which the decoder then attends to when generating the translation in the target language.

Decoder using the attention mechanism to produce a single output Y_t from the encoder-created vectors h
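
To make this concrete, here is a minimal NumPy sketch of the step pictured above (my own illustration, not code from any particular library): a single decoder state s_t acts as a query over all the encoder vectors h and mixes them into one context vector, which is then used to produce the output Y_t.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Encoder outputs: one vector h_i per input token (4 tokens, dimension 8)
h = np.random.randn(4, 8)
# Current decoder state s_t, acting as the query
s_t = np.random.randn(8)

# Score every encoder vector against the decoder state (dot-product attention)
scores = h @ s_t            # shape (4,): one score per input token
weights = softmax(scores)   # attention weights, non-negative and summing to 1
context = weights @ h       # shape (8,): weighted mix of ALL encoder vectors

print(weights)              # which input tokens matter for producing Y_t
print(context.shape)        # (8,)
```

Notice that nothing in this computation favors nearby tokens: the decoder can pull information from the first input element just as easily as from the last one.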

What’s the big deal? Previously, if you were using recurrent architectures like LSTMs, your architecture had a strong bottleneck: the input sequence had to be encoded into a single summary vector, and the decoder had to pass its information along during subsequent decoding steps. This meant that the amount of information we could propagate was severely limited, and the window of retained information was much shorter than it is in the case of attention.

Now the window of information is virtually unlimited (that is, limited only by your hardware capabilities), because you can access information from any element of the input sequence.

Self-attention

Self-attention, on the other hand, refers to the ability of a transformer model to attend to different parts of the input sequence itself when making predictions. The name comes from the fact that, unlike “regular” attention, self-attention attends to the same sequence that is currently being encoded.

Encoder with a self-attention mechanism replacing recurrence. Each input t gets encoded into a vector h_t
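
Sketched in the same spirit (again my own simplified illustration, assuming single-head scaled dot-product self-attention as in the original Transformer), the key difference from the previous snippet is that the queries, keys, and values are all computed from the one sequence being encoded:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 4, 8
x = np.random.randn(seq_len, d)   # the one sequence being encoded

# Learned projections (random here) map the SAME inputs to queries, keys, values
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every position scores every other position of the same sequence
weights = softmax(Q @ K.T / np.sqrt(d))   # shape (4, 4)
h = weights @ V                           # one output vector h_t per input t

print(weights.shape)   # (4, 4): each row is one token attending to all tokens
print(h.shape)         # (4, 8)
```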

The breakthrough is similar to that of attention: in recurrent architectures, the encoder had to compress all the information needed further down the line into a set of vectors that were passed along by the recurrent cells. You might already see where this is going: this setup was also prone to “forgetting” some facts if the window of information was too large.

Self-attention allows us to look at the whole context of our sequence while encoding each of the input elements. No forgetting occurs here, because our window of retained information is exactly as large as we need it to be (at least in the base version of self-attention).
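
Continuing the sketch above, you can see the “whole context” directly: the weight matrix has one row and one column per token, so every encoding h_t can draw on every input position. (This full pairwise view is also why the base version scales quadratically with sequence length, which is exactly what more efficient self-attention variants try to relax.)

```python
# Row t of `weights` shows how much token t looked at every other token
# while computing its encoding h_t; nothing but hardware limits this window.
for t in range(seq_len):
    print(f"token {t} attends to:", np.round(weights[t], 2))
```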

Conclusion

In summary, self-attention allows a transformer model to attend to different parts of the same input sequence, while attention allows a transformer model to attend to different parts of another sequence. Both mechanisms are important for the performance of transformer models in NLP tasks, as they allow the model to understand the relationships between different elements in the input and output sequences and make more accurate predictions.

Self-attention and attention are two key concepts in the transformer architecture, a powerful neural network model used in NLP tasks. Understanding the difference between these two mechanisms and how they work can help you appreciate the capabilities and limitations of transformer models and use them effectively in your own NLP projects.


Witold Wydmanski

PhD candidate in ML at GMUM UJ, bioinformatician at MCB UJ