Understanding “Attention is All You Need”

Shashank Agarwal
Apr 12, 2024


Take a quick dive into the important concepts behind Transformers. Let’s understand why they matter and how they make the model work so smoothly.

Fig. 1: Transformers Architecture

Intuition behind Tokenization

Machine learning models are just big statistical calculators that work on numbers, not words. Therefore, before feeding the words of a sentence as input to the transformer model, we must tokenize them, as shown in the figure below.

Fig. 2: Word tokenization

Simply put, this converts the words into numbers, with each number representing the word’s position in a dictionary of all the possible words the model can work with. There are other tokenization techniques as well; for example, we could use different tokens for different parts of a word, as shown in the figure below.

Fig. 3: Tokenization for different parts of words
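
To make this concrete, here is a minimal sketch of word-level tokenization in Python, using a tiny made-up vocabulary (the words and indices are purely illustrative; real tokenizers such as BPE or WordPiece work on subword units, as in Fig. 3):

```python
# A minimal sketch of word-level tokenization with a toy, made-up vocabulary.
vocab = {"<unk>": 0, "the": 1, "teacher": 2, "taught": 3, "student": 4, "a": 5}

def tokenize(sentence: str) -> list[int]:
    """Map each word to its index in the vocabulary (0 for unknown words)."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("The teacher taught the student"))  # [1, 2, 3, 1, 4]
```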

Once the words are converted into tokens, they are sent as input to the embedding layer of the model, shown in the transformer architecture in Fig. 1.

Intuition behind Embedding Layer

The embedding layer is a trainable high dimensional vector space, where each token is represented as a vector and occupies a unique position within that space.

Fig. 4: Input embedding layer, which takes the tokens as input and creates a high-dimensional vector for each of them.

The intuition is that these vectors learn to encode the meaning and context of individual tokens in the input sequence. Word2Vec and GloVe are examples of such learnable embeddings.

As an example, imagine that the embedding layer maps the tokens into a 3-dimensional space where different words (represented as tokens) are positioned, as shown in the figure below. The embedding layer is trained to learn how closely related different words in the input sequence are, using these high-dimensional vectors. It is this ability that gives the model the power to mathematically understand language.
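
As a rough PyTorch sketch (the vocabulary size and embedding dimension are illustrative, and the vectors below are untrained):

```python
import torch
import torch.nn as nn

# Toy embedding layer: 6 tokens in the vocabulary, each mapped to a 3-d vector.
# (Real models use hundreds or thousands of dimensions; values here are untrained.)
embedding = nn.Embedding(num_embeddings=6, embedding_dim=3)

token_ids = torch.tensor([1, 2, 3, 1, 4])      # output of the tokenizer above
token_vectors = embedding(token_ids)           # shape: (5, 3)

# After training, semantically related tokens end up close together,
# which we can measure, e.g., with cosine similarity.
sim = torch.cosine_similarity(token_vectors[1], token_vectors[4], dim=0)
print(token_vectors.shape, sim.item())
```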

Intuition behind Positional Encoding

In transformer models, positional encoding is crucial for enabling the model to understand the order, or position, of tokens within a sequence. This is particularly important because transformer models don’t inherently understand sequential order (unlike recurrent neural networks); instead, they process all tokens in parallel. Therefore, the model adds the token embeddings generated in the previous step to the positional encodings of the corresponding tokens, as shown below.

Token embeddings combined with positional encodings.

Let’s see how positional encodings work…

An example of positional encoding of two words in the sequence.

Positional encoding makes use of sinusoidal waveforms to position the tokens in the input sequence. Based on where each word falls on the sine/cosine waveforms of different frequencies, we can tell whether two words are close to each other or far apart.

E.g., if one word is on the peak of the 4th waveform (i=4) and the other lies at the bottom of the 4th waveform, the two words tend to be far apart. Whereas, if both words lie on the peaks of the 4th and 2nd waveforms but not of the 0th waveform, it means the two words are close to each other.
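
Here is a minimal NumPy sketch of the sine/cosine positional encodings from the paper (the sequence length and model dimension below are just example values):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings as in 'Attention is All You Need':
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)    # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
# The encoding is simply added to the token embeddings of the same shape:
# x = token_embeddings + pe
```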

Finally, once the positional encodings are combined with the token embeddings, the resulting vectors are fed to the self-attention module, which is responsible for computing the query, key and value matrices.

Intuition behind Multi-headed Self-attention

Once the vector consisting of the positional encoding and token embedding is generated, it is fed to the attention module. This module attends to every vector in parallel and tries to learn how much attention each token pays to every other token.

Multi-headed self-attention

The important thing to note here is that the transformer architecture uses multiple heads of such self-attention modules. The intuition behind using multiple heads is that each head tries to understand a different aspect of the language. For example, one head might learn about the people mentioned in the sentence, while another tries to learn whether there are any rhyming words in the sentence, and so on.
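
A quick sketch using PyTorch’s built-in multi-head attention module (dimensions are illustrative; the query, key and value inputs are explained in the next section, and in self-attention they all come from the same sequence):

```python
import torch
import torch.nn as nn

# Minimal sketch: 8 self-attention heads over a toy sequence, using PyTorch's
# built-in module (each head internally works on d_model / num_heads = 64 dims).
d_model, num_heads, seq_len = 512, 8, 10
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)              # (batch, sequence, embedding) after pos. encoding
out, attn_weights = mha(query=x, key=x, value=x)  # self-attention: Q, K, V all come from x

print(out.shape)           # torch.Size([1, 10, 512])
print(attn_weights.shape)  # torch.Size([1, 10, 10]): attention of each token to every other
```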

Intuition behind Key, Query and Value

1. We have a bunch of Key (K) vectors in a high-dimensional space. Each Key is associated with one Value (V).

Key vectors in high-dimensional space with their associated Values.

2. When a Query (Q) is raised, a dot product is performed between that Query and each Key.

Perform dot product of Query vector with every Key vector.

3. Apply softmax to normalize the dot products. This softmax score determines how much each word will be expressed at this position.

NOTE: The vectors can be quite long. In such cases, it’s highly probable that the dot products become huge, leading to unstable gradient computation. To avoid this, the dot products are divided by a constant scalar (the square root of the vector dimension) before the softmax operation is applied. (A small code sketch of these steps follows the list below.)

4. The transformer architecture uses attention at three different places in the network.

The encoder attention block produces a set of Key-Value pairs for the input sequence, and the decoder attention block produces a set of Query vectors. Together they act as the input (Query-Key-Value) to the second attention block of the decoder.

The Value vectors from the encoder blocks capture the interesting (useful) information from the input sequence, and their corresponding Key vectors are a way to index those Values.

On the other hand, the Query vectors coming from the decoder request the interesting (useful) information from the encoder attention block.
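
As promised above, here is a minimal PyTorch sketch of steps 1-3, i.e., scaled dot-product attention (the shapes are illustrative, and in practice Q, K and V are learned linear projections of the input vectors):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Steps 1-3 above: dot products of Q with every K, scaling by sqrt(d_k),
    softmax, then a weighted sum of the Values using the softmax scores."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len) raw similarities
    weights = F.softmax(scores, dim=-1)             # how much each token attends to every other
    return weights @ V, weights

# Toy example: 4 tokens, 64-dimensional queries/keys/values.
Q = torch.randn(4, 64)
K = torch.randn(4, 64)
V = torch.randn(4, 64)
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # torch.Size([4, 64]) torch.Size([4, 4])
```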

Intuition behind LayerNorm

Layer Normalization is applied prior to the self-attention and feedforward blocks. This positioning of LN is key to managing the gradient scales, which in turn supports the training process. The normalization ensures that the feature values within each token have a mean of 0 and a standard deviation of 1.

BatchNorm v/s LayerNorm

For each feature, batch normalization computes the mean and variance of that feature in the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.

Layer normalization normalizes the input across the features, whereas batch normalization normalizes the input features across the batch dimension.
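
A quick PyTorch sketch of the difference (the tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10, 512)   # (batch, sequence positions, features)

# LayerNorm: statistics computed over the feature dimension, separately for
# every token at every position; no dependence on the rest of the batch.
layer_norm = nn.LayerNorm(normalized_shape=512)
y = layer_norm(x)              # each of the 32*10 token vectors has mean ~0, std ~1

# BatchNorm (shown only for contrast): statistics computed per feature
# across the batch, which couples different examples and positions together.
batch_norm = nn.BatchNorm1d(num_features=512)
z = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (N, C, L)
```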

Why LayerNorm in Transformers?

In transformer models, each position in the sequence is processed independently of others. Batch normalization, which calculates statistics (mean and variance) over the entire batch, would introduce dependencies between different positions, violating the independence assumption. Layer normalization, on the other hand, computes statistics independently for each position in the sequence, making it more suitable for sequential data.

Intuition behind the Transductive nature of Transformers

Transformers are transductive in nature, unlike neural networks, which are inductive in nature.

Transductive learning is like practicing a specific level of the game over and over until you’ve mastered it. You learn all the tricks and patterns for that level, so you can beat it easily. But if you’re thrown into a different level, you might struggle because you’ve only learned how to beat that specific one.

Inductive learning, on the other hand, is like learning the basic strategies and skills that can help you tackle any level of the game. Instead of just memorizing one level, you understand the game mechanics and learn how to adapt to different challenges. So, when you encounter a new level, you can apply what you’ve learned to figure out how to beat it, even if it’s different from the levels you’ve practiced before.

Note: Even transformers have some inductive bias due to residual connections, but it is still much weaker than that of CNNs.

Time complexity of Transformers

Since transformers process all inputs in parallel, their path length is shorter than that of RNNs, which process the inputs one at a time.

If you have a sequence of length n, a transformer can access each element with O(1) sequential operations, whereas a recurrent neural network needs up to O(n) sequential operations to access an element.

Why is it so?

RNNs have to ‘remember’ the information from previous time steps, which leads to information loss. In Transformers, by contrast, each step has direct access to all the other steps via self-attention, which practically leaves no room for information loss. In RNNs, in fact, very long sequences also cause problems with exploding and vanishing gradients because of the chain rule in backpropagation.

Applications of Transformers Architecture

The transformer architecture consists of two main blocks: the encoder and the decoder. However, these two blocks can also be split and used individually, depending on the use case.

Encoder-only models are used for classification tasks like sentiment analysis. Encoder-decoder models are used for machine translation, whereas decoder-only models are the most commonly used these days for generative AI, such as ChatGPT.
