Attention is All You Need: Demystifying the Transformer Revolution in NLP

Archit Saxena
Published in Analytics Vidhya
10 min read · Feb 11, 2024

Introduction

Today, everyone is buzzing about ChatGPT. It’s no longer just for tech-savvy folks — nearly everyone is using it and experiencing the power of AI firsthand. But what’s really behind its incredible abilities? Let’s take a closer look at the secret sauce: Transformers.

Do you recall when we talked about Long Short-Term Memory (LSTM) networks before? We looked at how these special structures changed the way we handle long sentences in language tasks. LSTMs are important in NLP, but they still have some issues: for example, they struggle with very long sequences and with complicated long-range connections between words.

Enter the Transformer, introduced in 2017 with the groundbreaking paper Attention is All You Need. This clever model moved away from the step-by-step processing of RNNs and instead focused on a revolutionary idea: Attention.

In this article, we delve deeper into the concept of attention and explore how Transformers have transformed the landscape of NLP. We’ll examine their architecture, advantages over traditional models like LSTMs, and their impact on various NLP tasks. Let’s uncover the transformative power of attention and its role in shaping modern language processing models.

We will be using the figures from the paper and Jay Alammar’s blog.

Transformers Architecture

(left) The Transformer — model architecture; Source: paper. (right) Highlighted encoder-decoder sections

Transformers utilize an encoder-decoder structure, with the encoder positioned on the left (orange colour-coded rectangle) and the decoder on the right (green colour-coded rectangle) in the architecture diagram above.

Each encoder layer consists of 2 sublayers: a self-attention layer and a fully connected feed-forward network. Each decoder layer, on the other hand, consists of 3 sublayers: two attention layers and one fully connected feed-forward network.

Encoder

(left) Encoder representation; Source: paper. (right) Simplified encoder representation; Source: blog

The encoder comprises a stack of N = 6 identical layers. These layers share the same structure but each has its own learned weights.

Initially, the input, with each word represented as a vector of size 512, passes through a self-attention mechanism, enabling the encoder to analyze relationships between words in the input sentence while encoding the entire sequence. This self-attention mechanism is further elaborated upon later in this article.

Subsequently, the outputs from the self-attention layer are processed by a feed-forward neural network, which independently operates on each position in the input sequence.
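For reference, the paper defines this position-wise feed-forward network as two linear transformations with a ReLU activation in between:

FFN(x) = \max(0,\; xW_1 + b_1)W_2 + b_2

The input and output have dimensionality d_model = 512, while the inner layer has dimensionality d_ff = 2048. The same network is applied to each position separately, but the parameters differ from layer to layer.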

Encoder working; Source: blog

The Power of Attention

Attention allows models to dynamically focus on different parts of the input sequence, assigning different weights or importance to each part based on its relevance to the current task. Instead of treating the entire input sequence equally, attention mechanisms enable the model to attend to specific input portions more effectively.

Self-attention

Self-attention, also known as intra-attention, is an attention mechanism that examines various positions within a single sequence to generate a representation of the entire sequence.

Say the following sentence is an input sentence we want to translate:

“The artist painted a picture, but she didn’t like it.”

What does “it” in this sentence refer to? Is it referring to the “artist” or the “picture”? It’s a simple question to a human, but not as simple to an algorithm.

In this sentence, when the model processes the word “it”, the self-attention mechanism allows it to associate “it” with either “picture” or “artist” because both are mentioned earlier in the sentence. The model relies on the context provided by the surrounding words to determine which noun “it” refers to.

Scaled Dot-Product Attention; Source: paper

The mechanism operates on 3 vectors, each derived from the input embedding —

  1. Query vector (Q)
  2. Key vector (K)
  3. Value vector (V)

Formula for self-attention; Source: paper
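The formula shown in that figure, reconstructed here, is:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V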

Let’s break the equation and try to visualise and understand it better —

  • Step 1 — Calculate attention scores — take the dot product of the query vector (Q) and the key vector (K).
  • Step 2 — Divide the attention scores by √dₖ (i.e., √64 = 8). The paper sets the key vector dimension dₖ to 64.
  • Step 3 — Apply softmax — the input words we want to focus on get values close to 1, and the remaining words get values close to 0.
  • Step 4 — Multiply each softmax score by the value vector (V). This preserves the values of the words we want to focus on and diminishes the influence of irrelevant words by multiplying them by very small numbers. (A NumPy sketch of these steps follows the figure below.)
Representation of self-attention computation; Source: blog
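To make these four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function and variable names are illustrative, not from any reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                      # key dimension (64 in the paper)
    scores = Q @ K.T                       # Step 1: attention scores
    scores = scores / np.sqrt(d_k)         # Step 2: scale by the square root of d_k
    weights = softmax(scores, axis=-1)     # Step 3: softmax over each row
    return weights @ V                     # Step 4: weighted sum of the values

# Toy example: 4 tokens with d_k = d_v = 64
Q = np.random.randn(4, 64)
K = np.random.randn(4, 64)
V = np.random.randn(4, 64)
Z = scaled_dot_product_attention(Q, K, V)  # shape (4, 64)
```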

Now, let’s examine the matrix-level computation. When implemented, the earlier calculation is conducted in matrix form to enhance processing speed.

Matrix-level calculation; Source: blog

In the above figure, each row in the X matrix represents a word in the input sentence.

Representation of formula for Self-attention; Source: blog

Here, Z is the output of the attention mechanism.
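In code, the matrix-level version boils down to three projections of X followed by the attention function sketched above (W_Q, W_K, and W_V would be learned in a real model; they are random here purely for illustration):

```python
d_model, d_k = 512, 64

# Learned projection matrices in a real model; random here for illustration
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

X = np.random.randn(4, d_model)            # 4 input tokens, each a 512-dim embedding

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
Z = scaled_dot_product_attention(Q, K, V)  # output of the self-attention layer
```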

Multi-Head Attention

The paper introduced a refinement to the self-attention layer called “multi-headed” attention, which brings two significant advantages:

  1. Enhanced Positional Focusing: Multi-headed attention broadens the model’s capability to concentrate on different positions within the input sequence. In self-attention, the attention vector may prioritize the relation of each word with itself, which is important but not sufficient. Multi-headed attention enables the model to consider the relationships between each word and other words more effectively.
  2. Diverse Representation Subspaces: Multi-headed attention provides the attention layer with multiple “representation subspaces”. With this approach, the Transformer utilizes not just one, but multiple sets of Query/Key/Value weight matrices (8 parallel attention layers, or heads). The outputs of all heads are then concatenated and projected with an additional weight matrix to form the final attention vector for each word. This multi-head approach enhances the model’s ability to capture complex relationships and patterns in the input data.

Performing the same self-attention calculation 8 times with different weight matrices yields 8 different Z matrices.

8 different attention heads; Source: blog

Now, as mentioned in the second point, the 8 Z matrices are concatenated and multiplied by an additional weight matrix W_O (which is also learned) to condense them into a single matrix (see the code sketch below).

Single attention matrix out of 8 different attention heads; Source: blog
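A minimal NumPy sketch of the whole multi-head computation, with 8 heads of dimension 64 whose outputs are concatenated and projected back to 512 dimensions (the weights are random here; in a real model they are learned):

```python
n_heads, d_model = 8, 512
d_k = d_model // n_heads   # 64 dimensions per head

# One (W_Q, W_K, W_V) triple per head, plus the output projection W_O
heads = [(np.random.randn(d_model, d_k),
          np.random.randn(d_model, d_k),
          np.random.randn(d_model, d_k)) for _ in range(n_heads)]
W_O = np.random.randn(n_heads * d_k, d_model)

X = np.random.randn(4, d_model)   # 4 input tokens

# Run scaled dot-product attention independently in each head
Z_heads = [scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
           for W_Q, W_K, W_V in heads]

# Concatenate the 8 head outputs and project back to d_model
Z = np.concatenate(Z_heads, axis=-1) @ W_O   # shape (4, 512)
```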

Putting it all together, we can visualise the entire mechanism in the figure below —

Attention mechanism at a glance; Source: blog

Positional Encoding

Consider the following sentences:

  • Sentence 1: “I like to eat an apple a day.”
  • Sentence 2: “I work at apple.”

These sentences highlight how the word “apple” can take on different meanings depending on its context. The attention mechanism on its own, however, has no notion of word order: if we rearranged the words of the input sentence, the per-word attention outputs would stay the same.

To address this, we generate a representation of each word’s position in the sentence and add it to the word embedding. This is what positional encoding does: it injects information about each word’s position, providing essential context for interpretation.

To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed.

Source: paper

2 positional encoders in transformers; Source: paper

So, the transformer model incorporates a vector into each input embedding. These vectors adhere to a specific pattern that the model learns. This pattern aids the model in determining the position of each word and the distance between different words in the sequence. The idea behind this approach is that these additional values, when added to the embeddings, yield meaningful distances between the embedding vectors during dot-product attention, further enhancing the model’s ability to accurately interpret words based on their specific positions in different sentences.

The paper uses sine and cosine functions of different frequencies to calculate the positional encoding.

The formula for Positional Encoding; Source: paper
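Reconstructed from the paper, the two formulas are:

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

where pos is the position of the token in the sequence and i indexes the embedding dimensions.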

Understanding the formula —

For each token at position ‘pos’ in the output of the ‘input embedding’ layer (depicted by the pink box in the diagram above), we produce a position vector of dimension 512 (equivalent to d_model, the embedding dimension for each token). For every even embedding index (0, 2, 4, …, 510), we apply the first (sine) formula; for every odd index (1, 3, 5, …, 511), we use the second (cosine) formula.
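A minimal NumPy sketch of this computation (the sequence length of 10 is arbitrary, just for illustration):

```python
def positional_encoding(seq_len, d_model=512):
    # pe[pos, i] holds the encoding for position `pos` and embedding index `i`
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    i = np.arange(0, d_model, 2)                   # even embedding indices
    angle = pos / np.power(10000, i / d_model)     # shape (seq_len, d_model / 2)
    pe[:, 0::2] = np.sin(angle)                    # even indices: sine formula
    pe[:, 1::2] = np.cos(angle)                    # odd indices: cosine formula
    return pe

X = np.random.randn(10, 512)        # embeddings for a 10-token sentence
X = X + positional_encoding(10)     # add position information element-wise
```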

In the vector representation, this is how positional encodings look in transformers —

Representation of positional encoding in transformers architecture; Source: blog

But why is the output shifted right? We will understand this once we delve into decoders later in the article.

Residuals

A residual connection is applied around each of the two sub-layers, followed by layer normalization. This involves adding the input to the output of the sub-layer and then applying layer normalization to ensure smoother information flow through the network. As described in the paper, all sub-layers in the model, as well as the embedding layers, generate outputs of dimension d_model = 512 to facilitate these residual connections.
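In the paper’s notation, the output of each sub-layer is therefore:

\text{LayerNorm}(x + \text{Sublayer}(x))

where Sublayer(x) is the function implemented by the sub-layer itself (self-attention or the feed-forward network).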

Detailed residual representation at the encoder side; Source: blog

Residual connections are implemented within each sub-layer of both the encoder and decoder stacks.

Residual representation is present on both the encoder and decoder sides. Here, only 2 stacked encoders and decoders are represented for easy understanding; Source: blog.

Decoder

(left) Decoder representation; Source: paper. (right) Simplified decoder representation; Source: blog

Just like the encoder, the decoder also comprises a stack of N = 6 identical layers.

Similar to the encoder, residual connections are employed around each sub-layer in the decoder, followed by layer normalization. Additionally, adjustments are made to the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the offset of the output embeddings by one position, ensures that predictions for position i can only depend on the known outputs at positions less than i.
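In the paper, this masking is implemented inside scaled dot-product attention by setting the scores of “future” positions to −∞ before the softmax, so that their attention weights become 0. A minimal sketch, extending the attention function from earlier (illustrative only):

```python
def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    # Causal mask: position i may only attend to positions <= i
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # block "future" positions
    return softmax(scores, axis=-1) @ V                 # weights for future tokens are 0
```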

The encoder begins by processing the input sequence. The resulting output from the top encoder is then converted into a set of attention vectors K and V. Each decoder utilizes these vectors in its attention layer, enabling the decoder to concentrate on relevant areas within the input sequence.

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case); Source: blog

The primary objective of employing a Decoder is to sequentially infer the tokens of the output sequence. This is accomplished by leveraging:

  1. Attention mechanism.
  2. Previously predicted output tokens.

The decoder working till we reach the EOS token; Source: blog

But why is the output shifted right?

The decoder is trained to predict the next word in the sequence from the preceding tokens and the attention over the Encoder output. The first token, however, has no preceding tokens to condition on, which would make it hard to predict consistently. Consequently, the output sequence is shifted one position to the right, and a ‘BOS’ (Beginning of Sentence/Sequence) token is inserted at the start. When predicting the first token, this ‘BOS’ serves as the preceding token in the output sequence.
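Concretely, for a made-up training example (using ‘BOS’ and ‘EOS’ as the special start and end tokens):

```python
target        = ["I", "like", "apples", "EOS"]   # what the decoder must predict
decoder_input = ["BOS", "I", "like", "apples"]   # the same sequence shifted right
# At step t, the decoder sees decoder_input[: t + 1] and is trained to predict target[t].
```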

Final Linear and Softmax Layer

The output from the decoder is a vector of floating-point numbers. Our next task is to convert this vector into a word, which is achieved by the final linear layer followed by a softmax layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector, with one score per word in the vocabulary.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
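As a rough sketch of these last two steps (the vocabulary size and weights below are placeholders; in a real model the linear layer’s weights are learned):

```python
vocab_size = 30000                          # placeholder vocabulary size
W_vocab = np.random.randn(512, vocab_size)  # linear projection to vocabulary scores
b_vocab = np.zeros(vocab_size)

z = np.random.randn(512)                    # output vector from the decoder stack
logits = z @ W_vocab + b_vocab              # one score (logit) per vocabulary word
probs = softmax(logits)                     # probabilities, all positive, summing to 1
next_word_id = int(np.argmax(probs))        # index of the most probable word
```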

Source: blog

The process depicted in this figure begins with the vector generated as the output of the decoder stack. Subsequently, this vector is transformed into an output word; Source: blog

Conclusion

The transformer model, introduced with the groundbreaking paper ‘Attention is All You Need’, has revolutionized NLP by shifting the paradigm from sequential processing to parallel attention mechanisms. By leveraging self-attention and multi-head attention mechanisms, Transformers have demonstrated remarkable capabilities in analyzing relationships between words in input sequences, thereby overcoming the limitations of traditional models like LSTMs. Additionally, the incorporation of positional encoding and residual connections further enhances the Transformer’s ability to understand and interpret language contextually. With its transformative power, the Transformer architecture has paved the way for significant advancements in various NLP tasks, offering a promising direction for future research and applications in the field of artificial intelligence.

Furthermore, the Transformer architecture has served as the foundation for state-of-the-art NLP models such as ChatGPT, BERT (Bidirectional Encoder Representations from Transformers), and more. These models build upon the principles introduced by the Transformer, enhancing them with additional innovations and modifications to achieve even higher levels of performance in tasks such as language generation, question answering, and sentiment analysis. As the field progresses, the impact of the Transformer continues to be significant, fueling ongoing advancements and innovative developments in NLP.

If you like it, please leave a 👏.

Feedback/suggestions are always welcome.
