A Simplified Explanation of the Transformer Block [Must-Read Blog for NLP Enthusiasts]
A Detailed Look at the Different Components of the Transformer Blocks
For a broad overview and information on seq2seq RNNs, see this blog.
Seq2Seq RNNs with Attention: A Quick Intuition for Understanding the Basics
Introduction
ChatGPT is an AI chatbot (created by OpenAI) that performs impressively at a human level on tasks including question-answering, dialogue, and writing essays, emails, and even code.
The key innovation of the transformer architecture is the use of self-attention mechanisms, which allow the model to efficiently process input sequences of variable length and to learn relationships between elements in the sequence. This makes transformers particularly well-suited for tasks that involve processing long-range dependencies, such as language translation, where the meaning of a word may depend on the context in which it appears several words earlier in the input sequence.
Paper — https://arxiv.org/pdf/1706.03762.pdf
The paper describes the overall transformer model and attention mechanism at a fast pace, so it can be overwhelming to understand and to get the intuition right on a first reading.
In this blog, I will go over each transformer component individually. We will subsequently compile all of our knowledge to create the final transformer block.
Input to model
We have a language translation task. We will discuss the working and concept with a single example X (no batch feeding). Let's assume we have already done all the text pre-processing needed.
First things first, we have a tokenized vector as input. This input passes through an embedding matrix followed by positional encoding (why we need position information will be discussed further below) to convert it into dense vectors. The input X fed to the transformer block therefore has shape [T x Dmodel], where T is the length of the word sequence and Dmodel is the size of the feature vector for each word.
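As a minimal sketch in PyTorch (the vocabulary size, T = 6, and Dmodel = 512 below are illustrative assumptions, and the positional encoding is left as a placeholder until the "Positional encoding" section):

```python
import torch
import torch.nn as nn

T, vocab_size, d_model = 6, 10000, 512           # sequence length, vocab size, model dim (illustrative)

token_ids = torch.randint(0, vocab_size, (T,))   # stand-in for a real tokenized sentence
embedding = nn.Embedding(vocab_size, d_model)

X = embedding(token_ids)                         # [T, d_model] dense word vectors
positional_encoding = torch.zeros(T, d_model)    # placeholder; see "Positional encoding" below
X = X + positional_encoding                      # this is the input fed to the transformer block
print(X.shape)                                   # torch.Size([6, 512])
```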
Key, Query, Value and multi-head concept
This concept is new to the field of deep learning. We pass the input through 3 different linear layers, each with Dk output nodes. These outputs are called the "key K", "query Q", and "value V", and each has shape [T x Dk].
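A rough sketch of these three projections, assuming the same illustrative sizes (T = 6, Dmodel = 512) and Dk = 64:

```python
import torch
import torch.nn as nn

T, d_model, d_k = 6, 512, 64                 # illustrative sizes

X = torch.randn(T, d_model)                  # stands in for the embedded + position-encoded input

W_q = nn.Linear(d_model, d_k, bias=False)    # three independent linear layers
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_q(X), W_k(X), W_v(X)             # query, key, value -- each of shape [T, d_k]
```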
View I
Let's talk in matrix terms.
- We first take the dot product of the query and key matrices. The resulting QK matrix has shape [T x T].
During training, the transformer encoder learns the weight matrices Wq and Wk so that Q and K build an inquiry system that answers the question "What is the key k for the word q?".
Also, this dot product followed by normalization resembles the Pearson correlation formula with zero mean. In other words, we are effectively computing a correlation matrix between each word and every other word in the input.
This attention weight matrix QK lets the model understand the context around words. Take the intuition from the example below: notice how the word "NLP" relates to the other words in the sequence.
- We then multiply the QK attention matrix by the "Value V" matrix to get the final intermediate output X of shape [T x Dv] (NOTE: Dk = Dv = Dq), as in the sketch below.
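Putting the two steps of View I together, here is a minimal sketch of this scaled dot-product attention, reusing the Q, K, V tensors from the sketch above (the division by √Dk is the normalization mentioned earlier):

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # [T, d_k] x [d_k, T] -> [T, T] attention score matrix, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ V                       # [T, T] x [T, d_v] -> [T, d_v]

out = scaled_dot_product_attention(Q, K, V)  # intermediate output of one attention head, [T, d_v]
```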
Multi-Head Attention
- The independent stack of self-attention computations described in the points above is known as the multi-head attention system.
We need multiple stacked self-attention modules because it might be the case that our model learns "why we use NLP" in one self-attention block and "what is a sub-part of ML" in another block.
View II
- The transformer takes in a sequence of input vectors and applies an attention mechanism to compute a weighted sum of the input vectors.
- The attention mechanism consists of three components: the query, the key, and the value.
- The query is a vector that represents the current position in the input sequence. It is used to determine the relationship between the current position and the other positions in the input sequence.
- The key is a vector that represents each position in the input sequence. It is compared against the query to determine how much attention (weight) each position receives.
- The value is a vector associated with each position; the attention-weighted sum of the values forms the output of the attention mechanism.
To compute the attention, the transformer takes the dot product of the query vector with the key vectors for each position in the input sequence. These dot products are then normalized (via a softmax) into the weights for each value vector, which are used to compute the weighted sum of the value vectors.
Note: all the operations described above are linear in nature.
Attention mask and Non-linearity in model [complex understanding]
Before multiplying the QK matrix by the "Value V" matrix to get the intermediate value, note that there is an option named "Mask". We have a fixed input length, and the attention mask corresponds to this input sequence: "1" for positions containing an actual word and "0" for padded positions.
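A small sketch of how such a padding mask can be applied to the attention scores before the softmax, reusing Q, K, V from the earlier sketch (the mask values here are purely illustrative):

```python
import math
import torch

# Suppose T = 6 but only the first 4 positions hold real words; the last 2 are padding.
mask = torch.tensor([1, 1, 1, 1, 0, 0], dtype=torch.bool)

scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # [T, T], as before
scores = scores.masked_fill(~mask, float("-inf"))          # padded positions get -inf scores
weights = torch.softmax(scores, dim=-1)                    # softmax now puts ~0 weight on padding
out = weights @ V
```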
The outputs of all the stacked self-attention modules (multi-head attention) are then concatenated. The resulting shape (h x [T x Dv]) is [T x h·Dv]. This is followed by a linear operation (weight matrix of shape [h·Dv x Dmodel]) with a non-linear activation. The resulting output shape at this stage is [T x Dmodel].
To conclude, the Multi-Head Attention block takes 3 inputs K, Q, and V, all of the same shape, i.e. [T x Dk], and outputs a matrix of shape [T x Dmodel].
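Here is a compact, illustrative PyTorch module tying the pieces together: per-head projections, scaled dot-product attention in each head, concatenation, and a final linear layer back to Dmodel (the sizes and head count are assumptions, not taken from the post):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch: h independent heads, concat, final linear."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.d_k = d_model // num_heads
        self.h = num_heads
        self.W_q = nn.Linear(d_model, d_model)   # packs all heads' projections into one layer
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # maps concatenated heads back to d_model

    def forward(self, x):                        # x: [T, d_model]
        T = x.size(0)
        # project, then split into h heads of size d_k: [h, T, d_k]
        Q = self.W_q(x).view(T, self.h, self.d_k).transpose(0, 1)
        K = self.W_k(x).view(T, self.h, self.d_k).transpose(0, 1)
        V = self.W_v(x).view(T, self.h, self.d_k).transpose(0, 1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)    # [h, T, T]
        out = F.softmax(scores, dim=-1) @ V                       # [h, T, d_k]
        out = out.transpose(0, 1).reshape(T, self.h * self.d_k)   # concat heads -> [T, h*d_k]
        return self.W_o(out)                                      # [T, d_model]

mha = MultiHeadAttention()
print(mha(torch.randn(6, 512)).shape)   # torch.Size([6, 512])
```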
Compiling all Blocks
NOTE: the input and output of the block have the same shape [T x Dmodel], so we can stack this block "N" times.
The input is added to the multi-head attention output via a skip connection, followed by layer norm. The result is then passed through a non-linear feed-forward layer, again with a skip connection and layer norm.
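As a sketch, one encoder block can then be written as follows (it reuses the MultiHeadAttention module from the sketch above; the feed-forward width d_ff = 2048 and the stack depth N = 6 are illustrative choices):

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Sketch of one encoder block: attention + skip + layer norm, then FFN + skip + layer norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the sketch above
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: [T, d_model]
        x = self.norm1(x + self.attn(x))     # residual (skip) connection around attention
        x = self.norm2(x + self.ffn(x))      # residual (skip) connection around feed-forward layer
        return x                             # still [T, d_model] -> blocks can be stacked N times

blocks = nn.Sequential(*[TransformerEncoderBlock() for _ in range(6)])   # stack N = 6 blocks
print(blocks(torch.randn(6, 512)).shape)   # torch.Size([6, 512])
```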
Positional encoding
The self-attention mechanism does not have any information about the relative or absolute position of the input tokens in the sequence.
Positional encoding is a way of representing the position of each token in the input sequence as a vector, which is then added to the input embedding of the token.
There are several ways to create positional encodings, but one common approach is to use sinusoidal functions of different frequencies to create a set of vectors that can be added to the input embeddings. The sinusoidal functions are chosen so that the resulting vectors have different periodicities, which allows them to encode information about the position of the tokens in the sequence.
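A minimal implementation of the sinusoidal encoding from the paper might look like this (the dimensions are illustrative):

```python
import torch

def sinusoidal_positional_encoding(T, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    position = torch.arange(T).unsqueeze(1).float()                              # [T, 1]
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)  # [d_model/2]
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions: cosine
    return pe                                      # added to the [T, d_model] input embeddings

pe = sinusoidal_positional_encoding(T=6, d_model=512)
```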
Positional encoding is important because it allows the transformer to capture the relationships between the tokens in the sequence and their relative positions, which can be useful for tasks such as language translation and language modeling.
SOTA architectures that use the transformer block
The transformer is a versatile architecture that is well-suited to many different types of tasks. It has become a popular choice for many state-of-the-art models in the field.
1. BERT (Bidirectional Encoder Representations from Transformers)
BERT is called a bidirectional model because it is trained to consider the context on both the left and the right sides of each word in the input text (masked language modeling).
"Encoder Representations" means it has several stacked transformer encoder blocks that generate deep contextual embeddings of the input.
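For illustration only, a pre-trained BERT encoder can be loaded through the Hugging Face transformers library to obtain such deep contextual embeddings (the checkpoint name below is just one common choice):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Why we use NLP?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # [1, T, 768] deep contextual embeddings
```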
2. GPT (Generative Pre-trained Transformer)
GPT is a model that is trained to generate human-like text. It can be fine-tuned for a wide range of language generation tasks, such as machine translation, summarization, and dialog systems.
GPT is trained using a technique called “unsupervised language modeling” in which the model is given a large dataset of text and is required to predict the next word in a sequence given the context of the previous words. It’s a decoder-type transformer model.
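Again purely as an illustration, GPT-2 (a small, publicly available decoder-type model from the same family) can be used for next-word generation through the Hugging Face transformers library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)   # predict the next words one by one
print(tokenizer.decode(output_ids[0]))
```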
Open to any help and suggestions.
Thanks