Attention Mechanism Demystified
Unravelling The Mysteries Behind Transformers
This article is part of a series about the Transformer architecture. If you haven’t read the others, refer to the introductory article here.
When you look at the Transformer architecture, it is clear that it draws on components from several Machine Learning disciplines, each contributing to the model’s robustness. One component fundamental to its success, however, is the Attention Mechanism, or formally, Self-Attention.
In this article, we walk through the mechanism step-by-step using a combination of visual representations and hands-on application. Let’s get started!
Where It’s Used
Firstly, to avoid unnecessary confusion, let’s revisit the architecture to recall where attention is used. As shown in Figure 1.1, the Attention Mechanism appears once inside the encoder (left block) and twice in the decoder (right block). You may have noticed that two types of attention are used: Multi-Headed Attention and a masked variant. Multi-Headed Attention runs several Self-Attention operations (also known as Scaled Dot-Product Attention (Vaswani et al., 2017)) in parallel, each over a lower-dimensional subset of the representation, and the masked variant is the same as the standard one with an optional step in the middle…
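Before diving into the details, it may help to see the computation in miniature. Below is a minimal NumPy sketch of Scaled Dot-Product Attention with the optional masking step mentioned above; the function and variable names are illustrative, not taken from the paper, and real implementations add batching, multiple heads, and learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal Scaled Dot-Product Attention sketch.
    Q, K: (seq_len, d_k); V: (seq_len, d_v); mask: boolean, True = attend."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # The optional masking step: blocked positions get a large negative
        # score so they vanish after the softmax
        scores = np.where(mask, scores, -1e9)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted sum of the value vectors
    return weights @ V

# Toy example: 3 tokens with a 4-dimensional representation.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)
```

In self-attention, as used in the encoder, Q, K, and V are all derived from the same input sequence; the decoder’s masked variant passes a lower-triangular mask so each position can only attend to earlier ones.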