Attention Mechanism Demystified
Unravelling The Mysteries Behind Transformers
This article is part of a series about the Transformer architecture. If you haven’t read the others, refer to the introductory article here.
When you look at the Transformer architecture, it is clear that it draws on components from several Machine Learning disciplines, each contributing to the model’s robustness. One component fundamental to its success, however, is the Attention Mechanism, or formally, Self-Attention.
In this article, we walk through the mechanism step-by-step using a combination of visual representations and hands-on application. Let’s get started!
Where It’s Used
Firstly, to avoid unnecessary confusion, let’s revisit the architecture to recall where attention is used. As shown in Figure 1.1, the Attention Mechanism appears once inside the encoder (left block) and twice in the decoder (right block). You may have noticed that two types of attention are used: Multi-Headed Attention and a masked variant. Multi-Headed Attention runs several Self-Attention operations (also known as Scaled Dot-Product Attention (Vaswani et al., 2017)) in parallel, each over a lower-dimensional subset of the representation, and the masked variant is the same as the standard one with an optional step in the middle…
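Before diving into the details, it may help to see the computation in miniature. Below is a minimal NumPy sketch of Scaled Dot-Product Attention with the optional masking step mentioned above; the function and variable names are illustrative, not taken from the paper, and real implementations add batching, multiple heads, and learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal Scaled Dot-Product Attention sketch.
    Q, K: (seq_len, d_k); V: (seq_len, d_v); mask: boolean, True = attend."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # The optional masking step: blocked positions get a large negative
        # score so they vanish after the softmax
        scores = np.where(mask, scores, -1e9)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted sum of the value vectors
    return weights @ V

# Toy example: 3 tokens with a 4-dimensional representation.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)
```

In self-attention, as used in the encoder, Q, K, and V are all derived from the same input sequence; the decoder’s masked variant passes a lower-triangular mask so each position can only attend to earlier ones.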