Attention Is All You Need: Understanding Transformers

Ritik Nandwal
Sep 3, 2024


Introduction

Before 2017, models like RNNs and LSTMs handled sequences of data but often struggled with long-range dependencies and were slow to train. The “Attention is All You Need” paper introduced the Transformer model, which relies entirely on attention mechanisms, eliminating the need for recurrence. This innovation not only addressed the limitations of earlier models but also transformed NLP, enabling powerful models like GPT and BERT.

In this article, I will walk through a summary and a simple explanation of the research paper that introduced the Transformer and its attention-based architecture.

The Transformer Architecture

Figure: The Transformer architecture

Component Breakdown

The Transformer consists of two main components:

Encoder

It is composed of 6 identical layers; each layer has two sub-layers (a minimal code sketch follows below).
|- sub-layer 1 → Multi-Head Self-Attention, followed by Add & Norm
|- sub-layer 2 → Position-Wise Feed-Forward Network, followed by Add & Norm

Figure: Encoder block
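To make this concrete, here is a minimal, illustrative PyTorch sketch of a single encoder layer (my own simplification, not the paper's code; d_model=512, 8 heads, and d_ff=2048 follow the paper's base configuration):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + feed-forward, each wrapped in Add & Norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        # Sub-layer 1: multi-head self-attention, then Add & Norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, then Add & Norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# The full encoder stacks 6 of these layers:
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))              # → (2, 10, 512)
```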

Decoder

It is composed of 6 identical layers; each layer has three sub-layers (a minimal code sketch follows below).
|- sub-layer 1 → Masked Multi-Head Self-Attention, followed by Add & Norm
|- sub-layer 2 → Multi-Head Attention over the Encoder's Output (cross-attention), followed by Add & Norm
|- sub-layer 3 → Position-Wise Feed-Forward Network, followed by Add & Norm

Figure: Decoder block
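A matching, illustrative sketch of a single decoder layer with the same assumed dimensions; note the causal mask in the first sub-layer and the cross-attention over the encoder output in the second:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, cross-attention over encoder output, feed-forward."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out):                  # x: decoder input, enc_out: encoder output
        # Causal mask so each position only attends to earlier positions (True = blocked)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        # Sub-layer 1: masked multi-head self-attention + Add & Norm
        a, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norms[0](x + self.dropout(a))
        # Sub-layer 2: cross-attention (queries from decoder, keys/values from encoder) + Add & Norm
        a, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norms[1](x + self.dropout(a))
        # Sub-layer 3: position-wise feed-forward + Add & Norm
        return self.norms[2](x + self.dropout(self.ff(x)))
```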

Attention

Multi-Head Attention

The heart of the Transformer lies in the attention mechanism, which computes a scaled dot product over three vectors: Query, Key, and Value.

Scaled Dot Product Attention
  • Query → a vector representing the current word, for which we want to calculate the attention weights.
  • Key → a vector that acts as an identifier, helping to determine whether a part of the sequence is relevant to what the query is looking for.
  • Value → a vector that carries the actual information used to build the next layer’s representation.
Figure: Scaled dot-product attention
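In formula form, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where d_k is the dimension of the key vectors. Below is a minimal, illustrative PyTorch sketch of this computation (my own code, not the paper's; the optional mask argument is an assumption used to block positions, e.g. in the decoder):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # block disallowed positions
    weights = F.softmax(scores, dim=-1)                   # attention weights
    return weights @ v, weights                           # weighted sum of values
```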

Multi-head attention runs h scaled dot-product attention blocks ("heads") in parallel, then concatenates their outputs and projects them back to the model dimension.

Figure: Multi-head attention
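A minimal sketch of multi-head attention in PyTorch, assuming d_model = 512 and h = 8 as in the paper's base model (the names w_q, w_k, w_v, w_o for the projections are my own, purely illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """h parallel attention heads; outputs are concatenated and projected back to d_model."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        B = query.size(0)
        # Project, then split d_model into h heads of size d_k: (B, h, seq, d_k)
        q, k, v = [w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
                   for w, x in ((self.w_q, query), (self.w_k, key), (self.w_v, value))]
        # Scaled dot-product attention, computed for every head in parallel
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = F.softmax(scores, dim=-1) @ v
        # Concatenate the heads and apply the final output projection
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)
```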

What Does Self-Attention Represent?

Self-attention, often called intra-attention, allows each word in a sequence to focus on different words in the same sequence, helping the model understand relationships and dependencies between words.

For example, consider the sentence: “The animal didn’t cross the street because it was too tired.” Here’s how self-attention works:

  1. Identify Relationships: The word “it” could refer to different entities in the sentence. Self-attention helps in identifying that “it” refers to “the animal” by evaluating the context provided by all other words in the sentence.
  2. Contextual Relevance: Self-attention calculates how much focus each word should give to every other word. In the case of “it”, the model uses self-attention to assign higher weights to the word “animal” compared to other words, making the relationship clear.

Position-Wise Feed-Forward Network

The attention sub-layer in both the encoder and the decoder is followed by a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.
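In formula form, FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal sketch, assuming the paper's base dimensions (d_model = 512, inner dimension d_ff = 2048):

```python
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear transformation
            nn.ReLU(),                  # ReLU activation in between
            nn.Linear(d_ff, d_model),   # second linear transformation
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```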

Layer Normalization

In both the encoder and the decoder, the multi-head attention layer and the feed-forward layer are each followed by an Add & Norm step.

The add operation is the residual connection (it carries forward the previous information): the sub-layer's input is added to the sub-layer's output, and the result is then layer-normalized.
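A minimal sketch of this Add & Norm pattern as a small reusable module (an illustrative simplification; dropout is applied to the sub-layer output before the addition, as in the paper):

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # 'sublayer' is any callable applied to x, e.g. an attention or feed-forward block
        return self.norm(x + self.dropout(sublayer(x)))
```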

Positional Embedding

Transformers treat the input sequence as a set rather than an ordered sequence: self-attention by itself is permutation-invariant.

Figure: Adding the positional embedding (PE) to the input embedding

The positional embedding is added to the input embedding to give the model information about the position of each token in the sequence.

Note: This is not a learned model parameter; it is computed from fixed sine and cosine functions.

We can calculate the PE for a given position pos and dimension index i using:

Figure: Sine and cosine functions for calculating PE
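For reference, the formulas from the paper are:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

And a minimal sketch that precomputes the PE table (illustrative; max_len = 512 is an arbitrary assumption here):

```python
import math
import torch

def positional_encoding(max_len=512, d_model=512):
    """Precompute the (max_len, d_model) sinusoidal positional-encoding table."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    # 1 / 10000^(2i / d_model) for each even dimension index 2i
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions use cosine
    return pe

# The table is added (not learned) to the token embeddings:
# x = token_embedding + positional_encoding()[:seq_len]
```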

Let’s understand this with an example

Sentence 1: “She wore a beautiful red dress.”

Sentence 2: “She has a red Ferrari.”

In both sentences, the word “red” appears. Using only word embeddings (without positional embeddings), the embedding for “red” would be similar in both sentences, as it reflects the semantic meaning of the word itself. However, the context in which “red” appears is different; to make sense of these different contexts, the model needs to consider the position of “red” in each sentence.

Conclusion

I have tried to keep the article short and explanatory; I would highly recommend reading the original research paper for deeper insights. In the next part, we will implement a Transformer from scratch and train it on a neural machine translation (NMT) task. Do give it a read for a deeper understanding of the implementation.

Transformer From Scratch In Pytorch.
