The Sorcery behind GPT— Comprehensive Deconstruction of LLMs!

Freedom Preetham · Published in Autonomous Agents · Apr 7, 2023

LLMs have been hot ever since ChatGPT was released to the general public. While most applied AI folks know how to use the APIs, there is often a gap in the mathematical understanding of why they work. Here is a comprehensive deconstruction of LLMs in general, and Transformers in particular, from a mathematical perspective.

NOTE: This is a longer-than-usual blog as a one-stop destination to understand the recipe of how and why LLMs work. You can easily skip sections that you already have a good understanding of. You can skip the math equations and still be able to understand quite a lot.

Heat map showing the proportion of attention that parts of speech receive.

At the highest level, the dominant neural transduction models, such as GPT-4 and Google Bard, are primarily based on the encoder-decoder architecture using Transformers and Attention. Understanding the inner workings and math behind Transformers is fundamental to understanding LLMs.

In the Transformer architecture, the encoder takes a sequence of token representations (x1, …, xn) and transforms it into a corresponding sequence of continuous representations (z1, …, zn), called the context vector. The decoder then takes the context vector and generates a series of output tokens (y1, …, ym). The input sequence length n and the output sequence length m do not have to be the same.

The beauty is that these sequences of tokens can be anything. For example, they can be:

  • Words that form sentences
  • Musical motifs that form a melody
  • Pixels from an image
  • Time series from the stock market
  • Chemical structures of medicine
  • Phonemes or speech sounds in spoken language
  • Data points in a table
  • Gestures or brain waves for controlling interfaces.
  • Items in a shopping cart or purchase history
  • Steps in a recipe or procedure
  • Polypeptide biological sequences for protein formation

How can a single architectural pattern like Transformer and Attention assimilate, understand and encode the context, meaning, essence, and discourse of a sequence of tokens? Why does it work?

Caveat: the math and treatment of each of these sequence types have to be quite different, though. It is the architectural pattern that can remain the same for the most part.

What is this Sorcery?

To begin with, the Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

High-level architecture.

I do not intend to cover what is already explained in the Vaswani et al. paper, “Attention Is All You Need.” You can learn that by reading the paper directly. Instead, I will tease out the most critical questions and points of confusion in the concepts and math. I wrote this blog because, after talking to many people who work in applied AI, I realized they are still confused about, or need clarification on, how and why things work.

Recipe for Attention

Transformers use Attention to learn about the context, meaning, and essence of each word, sentence, and paragraph. I have written about Attention here: “What is all the fuss about Attention in Generative AI?”

Attention works by computing a set of weights that indicate the relative importance of different elements for predicting each element. This way, the model can selectively focus on the parts of the sequence that are most relevant for making predictions. This is particularly useful in tasks where there are long-range dependencies between different parts of the sequence.

Let’s use this figure as a guide to help us understand the step-by-step process required to arrive at Attention.


The breadcrumbs we shall go through to understand how LLMs work are as follows:

Tokenization > Input Embedding > Positional Encoding > Q, K, V > Attention Score > Multi-Head Attention > (Decoder) Masked Attention > (Decoder) K, V imports.

Step 1 — Tokenization

Let’s deconstruct the inner workings a bit. Attention does not directly work on words. Instead, it works on word embeddings.

INSIGHT for LLM (or why do I need to know this): There are 171,146 words in current use in the Oxford English Dictionary and millions more technical terms in English overall. Instead of working with such a large corpus of whole words, you can compress it into a fixed set of tokens that can represent all the words and their extensions using a tokenization scheme.

The embeddings are built on top of a tokenization scheme that is particularly suited to the sequence at hand (language vs. pixels vs. genes). Spoken languages can be subjected to the following tokenization schemes:

  • Whitespace tokenization: This simple method splits the text based on whitespace characters (spaces, tabs, and newlines). It is fast but may not be suitable for languages without clear word boundaries or when dealing with punctuation.
  • Rule-based tokenization: This method uses language-specific rules and regular expressions to split the text into tokens. Common rule-based tokenizers include the Penn Treebank Tokenizer for English and the Moses Tokenizer for multiple languages. Rule-based tokenizers can handle punctuation and contractions but may struggle with out-of-vocabulary words, slang, or language variations.
  • Statistical tokenization: Statistical tokenizers learn token boundaries based on the frequency of character sequences in a large text corpus. They can adapt to language variations and handle out-of-vocabulary words to some extent. Examples of statistical tokenization methods include SentencePiece and the Viterbi algorithm.
  • Subword tokenization: Subword tokenizers split the text into smaller units, such as subwords or characters, allowing the model to handle out-of-vocabulary words and generalize better. Common subword tokenization methods include Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model.
  • Morphological tokenization: This method splits text into morphemes, which are the smallest meaningful units in a language. Morphological tokenization can be particularly helpful for languages with complex morphology, such as Turkish or Finnish. Tools like the Stanford Morphological Analyzer or the Morfessor algorithm can be used for morphological tokenization.

BPE Tokenization: The most widely used scheme is Byte Pair Encoding (BPE). BPE is a data compression algorithm that has been adapted for use in natural language processing as a subword tokenization method. In the context of language models, a BPE-based target vocabulary refers to a vocabulary constructed using the BPE algorithm to represent text with a fixed-size vocabulary of subword units.

BPE works by iteratively merging the most frequent pairs of characters or subword units in the training data. By doing this, BPE learns a set of subword units that can be used to represent any text in the training data. This allows the model to handle rare or out-of-vocabulary words by decomposing them into smaller subword units present in the BPE vocabulary.

A BPE-based target vocabulary can consist of individual characters, common subword units (e.g., ‘ing’, ‘pre’, ‘ation’), and whole words that frequently appear in the training data. This type of vocabulary helps language models generalize better, as it reduces the problem of data sparsity and enables the model to handle a wide range of linguistic variations.

For example, OpenAI’s GPT-3 uses a BPE-based target vocabulary with 50,257 tokens, which allows it to handle a diverse range of text inputs and generate meaningful outputs across various domains.

Here are some examples to illustrate how BPE tokenization works: Let’s consider the following sentence: “ChatGPT is an AI language model developed by OpenAI.”

  • A potential BPE-based tokenization of this sentence could be: [“Chat”, “G”, “PT”, “ is”, “ an”, “ AI”, “ language”, “ model”, “ developed”, “ by”, “ Open”, “AI”, “.”]
  • Note how the word “ChatGPT” has been broken down into three subword units: “Chat”, “G”, and “PT”.
  • Now let’s consider a sentence with a rare or out-of-vocabulary word: “The quokka is a small marsupial native to Western Australia.”
  • A potential BPE-based tokenization of this sentence could be: [“The”, “ qu”, “ok”, “ka”, “ is”, “ a”, “ small”, “ mars”, “up”, “ial”, “ native”, “ to”, “ Western”, “ Australia”, “.”]
  • In this case, the rare word “quokka” has been decomposed into the subword units “qu”, “ok”, and “ka”.

Recreating Outputs: To recreate an output sequence from BPE (Byte Pair Encoding) tokens, you need to reverse the tokenization process that was used to encode the original sequence.

Here are the steps to recreate an output sequence from BPE tokens (a minimal sketch follows the list):

  • Create a dictionary that maps each BPE token to its corresponding byte pair.
  • Iterate through the BPE tokens in the output sequence.
  • For each BPE token, look up its corresponding byte pair in the dictionary and replace the BPE token with the byte pair.
  • Concatenate the resulting byte pairs to form the output sequence.
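To make this concrete, here is a minimal, illustrative sketch of the encode/decode round trip. The merge rules below are invented purely for the example (a real tokenizer such as GPT-3’s learns tens of thousands of merges from data and handles spaces specially), so the resulting splits will differ from a production tokenizer:

```python
# Minimal BPE-style encode/decode sketch.
# The merge rules are hypothetical, chosen only to illustrate the idea.

MERGES = [("q", "u"), ("qu", "o"), ("k", "k"), ("kk", "a")]  # learned merge pairs, in priority order

def bpe_encode(word: str) -> list[str]:
    """Start from characters and greedily apply merges in order."""
    tokens = list(word)
    for left, right in MERGES:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i : i + 2] = [left + right]   # merge the adjacent pair into one token
            else:
                i += 1
    return tokens

def bpe_decode(tokens: list[str]) -> str:
    """Decoding is just concatenation, since tokens are substrings of the text."""
    return "".join(tokens)

tokens = bpe_encode("quokka")
print(tokens)               # ['quo', 'kka'] with the toy merges above
print(bpe_decode(tokens))   # 'quokka'
```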

Step 2 — Input Embeddings

Input embedding is a technique used in Natural Language Processing (NLP) and other fields to transform discrete tokens into continuous vector representations that machine learning models can more easily process. The basic idea behind input embedding is to map each token in the input sequence to a high-dimensional vector in a continuous embedding space.

INSIGHT: Encoding the tokens to a high-dimensional vector space does 2 magical things for LLMs.

- First, it provides a fixed-size address into the high-dimensional vector space, which can be a floating-point vector of size 512.

- Second, words that co-occur and are close in semantic meaning cluster together in the high-dimensional vector space, where the distance between them is small. So the floating-point vectors of two similar words will also be close to each other.

The second point is very important for understanding how tokens get a “numerical” value and how this numerical value carries semantic meaning (the values are not random). So when you calculate a score based on these values, you are manipulating the high-dimensional vector space by contextually navigating it based on the sequence of words.

The attention mechanism uses these numerical values to understand the contextual relationships of the words in a sequence.
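As a toy illustration of the second point, here is a small sketch of cosine similarity between word vectors. The 4-dimensional embeddings are hand-picked purely for the example (real embeddings have hundreds of dimensions and are learned from data):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, chosen so that "king" and "queen"
# point in similar directions while "banana" points elsewhere.
embeddings = {
    "king":   np.array([0.8, 0.6, 0.1, 0.0]),
    "queen":  np.array([0.7, 0.7, 0.2, 0.0]),
    "banana": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # close to 1.0
print(cosine_similarity(embeddings["king"], embeddings["banana"]))  # much smaller
```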

These models are called Distributional Semantic Models or DSM for short. Distributional semantic models are a class of models that represent the meaning of words based on the distributional properties of their contexts in a large corpus of text. Here are some examples of distributional semantic models:

  • Word2Vec: Word2Vec is a neural network-based model that learns dense, continuous vector representations (embeddings) for words based on their co-occurrence patterns. Word2Vec embeddings are widely used in sentiment analysis and machine translation.
  • GloVe: GloVe (Global Vectors) is a count-based model that learns vector representations for words based on the co-occurrence statistics of words. GloVe embeddings have been shown to be effective for text classification and named entity recognition.
  • FastText: FastText is a neural network-based model that extends Word2Vec by representing words as bags of character n-grams, in addition to their full-word embeddings. This allows FastText to capture information about subword units, such as prefixes and suffixes, which can be useful for handling out-of-vocabulary words and morphologically rich languages.
  • Latent Semantic Analysis (LSA): LSA is a count-based model that uses SVD and learns low-dimensional vector representations for words based on the co-occurrence patterns of words. LSA is a linear model, meaning that it assumes that the meaning of a word is a linear combination of the meanings of its context words. LSA has been used for information retrieval.
  • Hyperspace Analogue to Language (HAL): HAL is a DSM that uses SVD and has been used for lexical substitution and word sense disambiguation.

The type of input embedding used after BPE depends on the specific task and the requirements of the model. However, in general, Transformers use fixed embeddings such as Word2Vec. Word2Vec comes in two flavors: Skip-gram and Continuous Bag of Words (CBOW).

Skip-gram is generally considered to be better than CBOW for training word embeddings when working with large amounts of text data or when the target words are rare, because it can handle more noise in the data and can capture more nuanced relationships between words.

Skip Gram: The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence. Take the text sequence “problems”, “turning”, “into”, “banking”, “crises”, “as” as an example. Let’s choose “into” as the center word and set the context window size to 2.


Given the center word “into”, the skip-gram model considers the conditional probability for generating the context words that are no more than 2 words away from the center word. This can be expressed mathematically as:

P(′problems′,′turning′,′banking′,′crises′ ∣ ′into′)

Under the assumption that the context words are independently generated given the center word (i.e., conditional independence), the above conditional probability can be rewritten as:

P(′problems′∣′into′) × P(′turning′∣′into′) × P(′banking′∣′into′) × …

Thus, the conditional probability of generating any context word given the center word can be modeled by a softmax operation on vector dot products, where the vocabulary index set is V = {0, 1, …, |V|−1}.

Consider a text sequence of length T, where the word at time step t is denoted as w(t). Assuming that context words are independently generated given any center word, the likelihood function of the skip-gram model for a context window size of m is the probability of generating all context words given any center word, which can be expressed mathematically as:

∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w(t+j) | w(t))

where j is the index of the context word relative to the center word, and the probability P(w(t+j) | w(t)) is given by the softmax function:

P(w(t+j) | w(t)) = exp(v_{w(t+j)} · u_{w(t)}) / Σ_{i ∈ V} exp(v_i · u_{w(t)})

Here, v_{w(t+j)} and u_{w(t)} are the embedding vectors for the context word w(t+j) and the center word w(t), respectively. The sum in the denominator is taken over all words in the vocabulary, denoted by the index set V.
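Here is a minimal numerical sketch of that softmax, using tiny made-up embedding matrices (U for center words, V_ctx for context words, both random) just to show how the conditional probabilities and the window likelihood are computed under the independence assumption:

```python
import numpy as np

vocab = ["problems", "turning", "into", "banking", "crises", "as"]
d = 8                                       # toy embedding size
rng = np.random.default_rng(0)
U = rng.normal(size=(len(vocab), d))        # center-word embeddings u_w
V_ctx = rng.normal(size=(len(vocab), d))    # context-word embeddings v_w

def p_context_given_center(context: str, center: str) -> float:
    """P(context | center) = softmax over the vocabulary of v_o . u_c."""
    u_c = U[vocab.index(center)]
    scores = V_ctx @ u_c                    # dot product with every context embedding
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[vocab.index(context)])

# Probability of one context word given the center word "into"
print(p_context_given_center("banking", "into"))

# Likelihood of the whole window, assuming conditional independence
window = ["problems", "turning", "banking", "crises"]
print(np.prod([p_context_given_center(w, "into") for w in window]))
```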

Concurrent Learning Process

Integrated Training: An interesting aspect of LLM training is that the input embedding space is constantly adjusted. The embeddings for tokens are learned as part of the model’s overall training process. The embedding matrix is one of the model’s parameters and is updated through backpropagation as the model learns from its training data.

Initialization to Optimization: Initially, the embedding matrix is either randomly initialized or cold-started with embedding techniques like GloVe. As training progresses, each training example provides information not just about the high-level task (e.g., predicting the next word, text classification) but also about how individual tokens relate to each other and to the task at hand. The optimization algorithm (such as Stochastic Gradient Descent or Adam) updates the embedding weights alongside all other model weights to minimize the loss function.

End-to-End Learning: This means that the learning of token embeddings is not a separate pre-training step but occurs end-to-end during the training phase of the model. The embeddings evolve and improve as the model gets better at its task(s), capturing nuanced semantic and syntactic relationships between tokens based on their contexts within the training data.

Continuous Adjustment

Throughout Training: The embeddings are adjusted continuously throughout the model’s training process. Every batch of data and every epoch can contribute to refining these embeddings, making them more representative of the tokens’ roles and meanings in the language.

Implications

Pre-Defined Embeddings: Pre-defined embeddings are not strictly required, though in my experience they give a great boost to the downstream training of an LLM. Unlike approaches that use pre-trained embeddings (e.g., Word2Vec, GloVe) as a fixed input layer, LLMs like GPT-3 or GPT-4 learn their embeddings from scratch and refine them throughout the training process. This allows the embeddings to be highly customized to the specific dataset and tasks the model is trained on.

Adaptive Learning: This integrated learning approach ensures that token embeddings are adaptive and evolve as the model learns, capturing complex patterns that are specific to the model’s training data and objectives.

Step 3 — Positional Encoding

Unlike RNNs, where the tokens in a sequence are fed to the model one after another, Transformer-based generative AI models do NOT process the sequence of tokens in the order it is sent. Instead, Transformers process the tokens in parallel. Since the tokens are processed in parallel, the architecture requires an indicator of the order of tokens; in other words, the model needs a clever way to know the “position” of each token with respect to all other tokens. This is where Positional Encoding comes into play. (Note that recurrent or convolutional models do not need this.)

There are different types of positional embedding schemes used in Transformers, some of which are:

  • Learned positional embeddings: In this scheme, the model learns a separate embedding for each position in the input sequence. These embeddings are learned along with the other parameters of the model during training.
  • Fixed positional embeddings: In this scheme, the positional embeddings are fixed and do not change during training. The embeddings can be precomputed based on some function of the position, such as sine and cosine functions.
  • Hybrid positional embeddings: In this scheme, a combination of learned and fixed positional embeddings is used. For example, the model might use fixed embeddings for positions up to a certain threshold and learned embeddings for positions beyond that threshold.
  • Relative positional embeddings: In this scheme, the positional embeddings capture the relative positions of the tokens in the input sequence. This allows the model to generalize to longer sequences than the ones seen during training.
  • Convolutional positional embeddings: In this scheme, convolutional neural networks are used to compute the positional embeddings. The convolutional layers allow the model to capture local dependencies between nearby tokens, while the positional embeddings capture the global position information.

The Vaswani et al. paper talks about positional encoding which is a vector that is added to each input embedding (after step-2). These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.

Note: many current architectures no longer use the positional encoding from the Vaswani et al. paper. There have been further innovations in positional encodings that are significant improvements over sinusoidal functions.

The traditional encoding is based on sinusoidal functions of different frequencies and phases that represent the relative positions of the tokens in the input sequence.

Here is the visualization of the token position to its embedding dimension using traditional positional encoding.

Mathematically, the positional embeddings can be represented as a matrix E ∈ R^(L×d), where L is the maximum sequence length and d is the dimensionality of the embedding vectors. Each row of the matrix represents the embedding vector for a specific position in the sequence.

To incorporate the positional embeddings into the input representation for a given position pos in the sequence, the word embedding vector x_pos and the corresponding positional embedding vector p_pos are added element-wise, resulting in a new vector z_pos:

z_pos = x_pos + p_pos

z_pos will be the output of step 3.
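For reference, here is a short sketch of the original sinusoidal scheme from Vaswani et al. (PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))) and of the element-wise addition to the token embeddings; the sequence length and embedding dimension below are arbitrary example values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]                   # (L, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # (1, d/2)
    angles = positions / np.power(10000.0, dims / d_model)    # (L, d/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

L, d = 10, 512                                 # arbitrary example sizes
x = np.random.randn(L, d)                      # token embeddings from step 2
z = x + sinusoidal_positional_encoding(L, d)   # z_pos = x_pos + p_pos, the output of step 3
print(z.shape)                                 # (10, 512)
```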

I have written a detailed, step-by-step explanation of how positional encoding plays out during inference here: “Math Behind Positional Encoding.”

Architecture Insight Prior to Step 4 — Inputs and Outputs

At this point, it is important to learn the distinction between inputs that are sent to the encoder layer of the transformer and the inputs that are sent to the decoder layer. Let’s consider a machine translation task where the goal is to translate a sentence from English to French using a Transformer model.

English sentence (input to the encoder): “I love learning languages.”

French translation (target sequence): “J’aime apprendre les langues.”

INSIGHT: The inputs to the encoder are sent as a batch of samples. A sample can be thought of as a paragraph, and each sample is made of tokens (or words). A batch might contain, say, 64 samples (an arbitrary number), and each sample is set to a predetermined token length.

If a paragraph is shorter than the sample length, it is padded with blank tokens to make up the length. If a paragraph is longer, it gets broken into multiple samples.

During training, the target sequence is used as input to the decoder, but it is shifted by one position. We also add special start-of-sequence (SOS) and end-of-sequence (EOS) tokens to the target sequence.

Target input to the decoder (shifted): [SOS] J’aime apprendre les langues. [EOS]

Now, let’s see how the decoder processes this input during training:

  • The decoder receives the shifted target sequence as input along with the encoder’s output (which represents the English sentence).
  • The decoder generates a probability distribution over the French vocabulary for each position in the output sequence. These distributions represent the model’s prediction for the next token in the sequence at each position.
  • The actual target output sequence, including the EOS token, is used to calculate the loss: “J’aime apprendre les langues. [EOS]”

Suppose the model’s predicted output sequence after training is: “J’aime apprendre les langues.”

During inference (post training), when translating a new English sentence, the decoder starts with the SOS token and generates the output sequence one token at a time based on the previous tokens and the encoder’s output.

For example, let’s translate a new sentence:

English sentence (input to the encoder): “She enjoys studying history.”

The decoder starts with the SOS token and generates the French translation step by step:

  • Input: [SOS]
  • Output: “Elle”
  • Input: [SOS] “Elle”
  • Output: “aime”
  • Input: [SOS] “Elle” “aime”
  • Output: “étudier”
  • Input: [SOS] “Elle” “aime” “étudier”
  • Output: “l’histoire”
  • Input: [SOS] “Elle” “aime” “étudier” “l’histoire”
  • Output: [EOS]

Final generated French translation: “Elle aime étudier l’histoire.”

Step 4 — Q K V Matrix in Encoder

This step is critical to understand. Each input token is forked into Query, Key, and Value (Q, K, and V) vectors by multiplying the same token with three different weight matrices. Initially, the weight matrices are randomly initialized.

INSIGHT: We maintain Wq, Wk, and Wv as three distinct weight matrices. They are learned parameters, shared across the entire training corpus, that are optimized using backpropagation and gradient-descent optimization. The weight matrices get updated only at the end of each batch. Each token X is multiplied by the weight matrices Wq, Wk, and Wv to get the Q, K, and V vectors.

The weight matrices Wq, Wk, and Wv need not all have the same dimensions. You can use a smaller dimension for Wq and Wk and a larger dimension for Wv.

During each training iteration, the model takes a batch of input sequences and performs a forward pass through the transformer layers, which involves computing the self-attention scores using the Q, K, and V matrices, followed by feedforward network layers. The model's output is then compared to the ground truth labels using a loss function, and the gradients of the loss with respect to all the model parameters, including the Q, K, and V matrices, are computed using backpropagation.

Since the input embeddings depend on the specific input sequence, the Q, K, and V vectors derived from these embeddings will also be unique for each sample. However, the Wq, Wk, and Wv weight matrices are shared across all samples, allowing the model to generalize across different input sequences.

In other words, Q, K, and V are retained only at the sample level and are unique for every sample in a batch (but Wq, Wk, and Wv are shared across the entire corpus).

But why is this needed? What are we accomplishing by creating Q, K, or V vectors?

We need to brush up on a tiny bit of math to understand the implication of this. Let's recap our understanding of the dot products of two vectors.

  1. Geometric interpretation: The dot product of two vectors a and b is equal to the magnitude of a times the magnitude of b times the cosine of the angle between them. This means that the dot product measures the degree of alignment between the two vectors. If the vectors are parallel, the dot product is positive and equal to the product of their magnitudes; if they are anti-parallel, the dot product is negative and equal to the negative of their product; if they are perpendicular, the dot product is zero.
  2. Projection interpretation: The dot product of two vectors a and b is equal to the length of the projection of a onto b times the magnitude of b. This means that the dot product measures the component of a that is parallel to b. If a and b are unit vectors (i.e., vectors with length 1), the dot product is equal to the cosine of the angle between them, which represents the degree of similarity between the two vectors.
  3. Algebraic interpretation: The dot product of two vectors a and b is equal to the sum of the products of their corresponding components. This means that the dot product measures how much the two vectors “overlap” in each dimension. Components with the same sign contribute positive terms, components with opposite signs contribute negative terms, and zero components contribute nothing.

So, now imagine that every word is a floating-point vector with 512 dimensions (an input embedding into a 512-dimensional vector space). This means you need a weight matrix of size 512 × M to do the matrix multiplication. Let us say M is 64.

So we have X, which is 1 × 512, and Wq, which is 512 × 64. The resultant vector Q will be of size 1 × 64.
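Concretely, a single token’s projections look like the following sketch (shapes only; the random matrices are stand-ins for the learned Wq, Wk, Wv):

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

x = rng.normal(size=(1, d_model))        # one token's input embedding, 1 x 512
W_q = rng.normal(size=(d_model, d_k))    # learned during training; random stand-in here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))    # W_v may use a different output size in practice

q, k, v = x @ W_q, x @ W_k, x @ W_v
print(q.shape, k.shape, v.shape)         # (1, 64) (1, 64) (1, 64)
```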

You now have a fixed vector X of size 512 from the input embedding space and initially a random weight matrix Wq. When you multiply the vector into a matrix, you effectively find the dot product of X with every column of Wq (Note that Wq is for the entire corpus).

To visualize this, think of many small vectors that you are tuning in Wq that are trying to amplify, nullify or negate the fixed vector to find a better position for X from the input distributional semantic space into a lower dimensional space prepared to be curated by Attention.

The key insight is that, since Wq is a learned parameter, you are asking the neural network to find the optimal vector in every column of the weight matrix which, when multiplied by a given word X, gives you the right query Q, such that Q retains the essence of X’s 512-dimensional context in a 64-dimensional lower space that is then re-learned by the downstream Attention mechanism.

INSIGHT: In effect, we compress the representation from 512 dimensions to a 64-dimensional space and “re-arrange” the words further based on the downstream Attention mechanism. Here the weights are learned through backpropagation. Hence, during this transformation, we move from a distributional semantic space to a space that is acutely curated by the Attention mechanism.

Also, note that we will have several Wq, Wk, and Wv matrices across the deep-learning layers (and across different Attention heads), not just one. The combination of all these lower-dimensional weight matrices collectively represents different subspaces. You can visualize Wq as a collection of vectors across multiple layers as follows:

Wq representation across layers.

Each row in the above diagram can be considered as a weight matrix Wq from a single layer. Multiple rows represent Wq stacked across multiple layers. So imagine 64 columns, with each column containing a 512-dimensional vector. Each of these vectors will be carefully shaped to point in a 512-dimensional space by gradient optimization to represent the context of the entire corpus!

Step 5 — Score

In this step, we multiply the lower-dimensional query vector Q with the transpose of every K (the set of key vectors) in the sample. This serves the purpose of computing scores that measure the similarity between the query and each key.

Specifically, when we compute the attention score between a query vector Q and the transpose of every key vector K, we are computing, for each query, a weight against every key.

Multiplying Q with every K ensures we compute a weighted score for each Key, which is necessary to obtain the weights and the weighted sum of Values. Without this step, we would be unable to compute the attention scores, and the attention mechanism would not work.

So why do we do this? Remember that Q represents a lower-dimensional index for the input embedding of a word from a 512-dimensional space into a 64 dimension. Also, remember that Q and K were created from the same fixed input embedding X by multiplying them with the weight matrix Wq and Wk, respectively.

So when you compute the dot product of Q with every K, you are asking to amplify, nullify or negate the numerical value of a word in a sentence with all other words in a given sample (including itself) based on the context of the word.

Words are semantically related to each other based on their meaning and also on their proximity in the sentence. At the highest level, the semantic relationships can be represented as follows:

  • Synonyms: leap/spring/bound/hop/ bounce
  • Antonyms: fast/slow
  • Categories: Fruit/Banana
  • Connotations: Young vs Childlike, Passionate vs Volatile
  • Homophones: Suite vs Sweet, Serial vs Cereal
  • Homographs: Tear (drop) vs Tear (the paper)
  • Homonyms: Bank (river) vs Bank (Money)

Words also differ in meaning based on usage as follows:

  • Conceptual meaning: This refers to the basic dictionary definition of a word, which describes the category or concept that the word represents. For example, the conceptual meaning of “dog” is a domesticated mammal that is commonly kept as a pet or used for hunting.
  • Connotative meaning: This refers to the associations or emotions that a word may evoke beyond its conceptual meaning. Connotative meaning is often influenced by cultural and personal factors, and can vary between individuals and contexts. For example, the word “dog” may have positive connotations for someone who loves animals, but negative connotations for someone who has been bitten by a dog.
  • Collocative meaning: This refers to the patterns of words that tend to occur together with a given word, which can influence the meaning of the word in context. For example, the word “strong” may have different collocative meanings depending on the words that accompany it, such as “strong coffee” vs. “strong wind”.
  • Affective meaning: This refers to the emotional or attitudinal connotations of a word, which may be positive, negative, or neutral. Affective meaning can be influenced by the speaker’s tone of voice, body language, and other contextual factors. For example, the word “love” may have a positive affective meaning, while the word “hate” may have a negative affective meaning.
  • Social meaning: This refers to the social and cultural associations of a word, which may reflect social norms, values, and power dynamics. Social meaning can be influenced by factors such as age, gender, ethnicity, and social class. For example, the word “boss” may have different social meanings depending on the speaker’s relationship to the person in question.
  • Reflected meaning: This refers to the associations or implications that a word may have based on its use in a particular context, even if those associations are not part of its conceptual meaning. Reflected meaning can be influenced by the speaker’s intentions, the audience’s expectations, and the larger cultural and historical context. For example, the word “wall” may have reflected meanings related to division, protection, or exclusion, depending on the context.
  • Thematic meaning: This refers to the underlying or implicit themes or messages that are conveyed through the use of language. Thematic meaning can be influenced by factors such as genre, discourse structure, and narrative conventions. For example, a news article about a political protest may have thematic meanings related to power, resistance, and social change.

While distributional semantics can capture a good portion of the relationships between words based on their co-occurrence, it can never do justice to all the semantic relationships and differing meanings a word can have.

The dot product of Q with every K is the best way to fix this, not just because it amplifies, nullifies, or negates the informational content represented by the embedding, but also because Wq and Wk are extremely dynamic, learned parameters, trained across the entire corpus along with the attention weights.

INSIGHT: Words placed opposite in the embedding space will negate each other's context. Orthogonal words will nullify each other's context, and words that are similar in context will amplify each other.

Step 6 — Multi-Head Attention

Finally, we are here. Now, let's pay attention to how Attention helps in capturing the semantic relationship and differing meaning of a word based on situational context and also the influence of words in sentences far away.

Interpreting the use of Attention in Transformers can be tricky, as there are many variations and nuances to the mechanism. In general, however, Attention in Transformers aims to allow the model to selectively attend to different parts of the input sequence, based on the current context and the task at hand.

This can help the model to better capture long-range dependencies and complex relationships between different parts of the sequence, and to produce more accurate and informative outputs.

The scope of attention is at the sample level. This is very important to understand. Unlike Wq, Wk, and Wv, which are all globally scoped for the entire corpus, the attention mechanism operates at the sample level within each batch.

We compute attention scores between pairs of tokens in an input sequence. Each sample has its own attention matrix, which is specific to the relationships among the tokens in that sample’s input sequence. There will be N attention matrices (with single-head attention) for N samples in a batch. Note that the attention matrices get reset after every batch; they are activations, not persistent parameters.

Now, let’s take a look at how attention is calculated:

We already discussed how we derive the dot product of each Q with every K (the transpose) and the reason for doing so. The Attention we use is called scaled self-attention. Let’s understand why we scale it with the sqrt of the dimension of the Key vector.

When you take the dot product of Q and K, the magnitude of the dot product can vary with the dimensionality of the vectors. In particular, as the dimensionality increases, the dot products tend to grow larger, which can cause numerical instability and make it difficult to optimize the model. To mitigate this issue, the attention mechanism includes a scaling factor that divides the dot product by the square root of the dimensionality.

Specifically, if the components of Q and K are roughly independent with zero mean and unit variance, their dot product has variance d_k; dividing by √d_k brings the variance back to approximately 1, regardless of the dimensionality. This keeps the attention scores in a similar range across different dimensions, prevents the softmax from saturating, and makes the scores more interpretable and easier to compare.

Why should we apply a Softmax?

  • The Softmax function ensures that the attention weights sum to 1. This means that the context vector is a weighted sum of all the Query-Key pairs(q1*k1, q1*k2, q1*k3…), where the weights add up to 1. This ensures that the model attends to all the relevant information in the input sequence, and down-weights or ignores irrelevant or noisy information.
  • The Softmax function produces a probability distribution over the Query-Key pairs. This allows the model to learn a distribution of importance for the different parts of the input sequence. By learning a probability distribution, the model can capture more complex relationships and dependencies between different parts of the input sequence.
  • The Softmax function is differentiable, which makes it easy to compute the gradients during backpropagation. This makes it possible to train the model using gradient-based optimization methods, such as stochastic gradient descent (SGD).
  • The Softmax function is a natural choice for modeling probability distributions. By using the Softmax function, we can interpret the attention weights as the probabilities of selecting each Query-Key pair.

Why should we take the dot product to the value vector V?

  • The dot product of the softmaxed attention scores with the value vectors V ensures that the model incorporates the meaning of each attended word, weighted by its softmax score.
  • It is important to notice that the softmax part of the attention score expresses how the current word is semantically and logically relevant to all other words in the sample.
  • The dot product with the values then combines this probability distribution with the actual “meaning” (the value vectors) of the words at hand.

The context of the i-th query, which is the context vector z_i, can be denoted as follows:

z_i = Σ_j softmax_j( (q_i · k_j) / √d_k ) · v_j
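Putting steps 4 to 6 together, here is a compact sketch of single-head scaled dot-product attention for one sample (random matrices stand in for the projected Q, K, V; sizes are illustrative):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query with every key
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights                      # context vectors Z and the attention matrix

n_tokens, d_k = 6, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

Z, A = scaled_dot_product_attention(Q, K, V)
print(Z.shape, A.shape)          # (6, 64) (6, 6)
print(A.sum(axis=-1))            # every row of attention weights sums to 1
```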

Now, let’s understand a bit about Multi-head Attention. Multi-head attention is a variant of the attention mechanism that involves splitting the Q, K, and V vectors into multiple smaller vectors and computing the attention scores and context vectors separately for each subset of vectors. The resulting context vectors are then concatenated and passed through a linear projection to obtain the final output.

The main advantage of multi-head attention is that it allows the model to attend to different information in parallel and from multiple perspectives. The input matrix is split column-wise for multi-head attention, and each split is sent to a different head.

Input is split column-wise and sent to different attention heads.

So Xh1, Xh2, Xh3… Xhn will be the splits, based on the number of heads on the same layer of the deep neural network. Each head gets a smaller weight matrix, but the total size of Q, K, and V does not change. So the computational cost of multi-head attention is similar to that of a single head with full dimensionality.

Each attention head will have its own Q, K, and V.

Why do we do this? IMO, the answer lies in mathematics again. Here is my take.

  • Imagine that your input vector X is orthogonal to a column of Wq. In this case, when you take the dot product, that component of your query vector becomes zero! In this scenario, you may have learned nothing about the context or meaning of the word along that direction (this may be intentional).
  • But instead, if you took a columnar slice of the orthogonal vector X (Xh1, Xh2… Xhn), then there is information in that slice that can result in a non-zero value (the slice may not be orthogonal) that can be used to compute better attention scores.
  • The probability of all N slices being orthogonal is phenomenally lower.
  • With a single attention head, averaging the information subspace inhibits the opportunity to learn about the input embedding more intimately. This is important.
  • Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

There are also engineering and empirical reasons why this is useful.

  • With multi-head attention, you get more attention heads per layer than with a single head, so a shallower network can expose the same total number of attention heads.
  • A 384-layer, single-head model has as many attention heads as a 48-layer, 8-head model; fewer layers with more attention heads per layer can perform as well as more layers with one attention head each.
  • Multi-head attention reduces the number of layers while retaining the same number of attention heads, and fewer layers result in a more stable network.

The output of each attention head is a context vector: for each token, the sum of the value vectors weighted by the attention scores 𝜶.

The context vectors now contain the context of all tokens in a sample, ready to be sent to the feed-forward layer. You concatenate the context vectors from all heads to stitch together the multi-head attention output as follows:

Here is an example illustration.


The final step shows how all the per-head context vectors are combined into a single context vector. Wo here is a learned weight matrix that projects the concatenated head outputs back to the model dimension. The resulting context vector is fed to the feed-forward neural network.
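Here is a sketch of that flow, following the column-wise split described above (head count and sizes are illustrative; many implementations instead project the full input with per-head matrices, which serves the same purpose):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

n_tokens, d_model, n_heads = 6, 512, 8
d_head = d_model // n_heads                      # 64 columns of X go to each head
rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))

# One smaller weight matrix per head (learned in practice, random stand-ins here)
W_q = rng.normal(size=(n_heads, d_head, d_head))
W_k = rng.normal(size=(n_heads, d_head, d_head))
W_v = rng.normal(size=(n_heads, d_head, d_head))
W_o = rng.normal(size=(d_model, d_model))        # learned output projection

heads = []
for h in range(n_heads):
    X_h = X[:, h * d_head : (h + 1) * d_head]    # column-wise split of the input for head h
    heads.append(attention(X_h @ W_q[h], X_h @ W_k[h], X_h @ W_v[h]))

multi_head_out = np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, then project with Wo
print(multi_head_out.shape)                      # (6, 512), ready for the feed-forward layer
```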

To summarize the encoder architecture.

  • The encoder is composed of a stack of N identical layers.
  • Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise, fully connected feed-forward network.
  • A residual connection around the two sub-layers is used, followed by layer normalization.
  • The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

A high-level overview of the Decoder

The decoder side has almost the same architecture as the encoder side except for two changes.

  1. A masked attention layer.
  2. Import of key and value vector from the encoder side.

Let’s first understand the inputs and outputs of the decoder.

Inputs to the decoder:

  • Target sequence (shifted): During training, the target sequence (e.g., the translated sentence in a machine translation task) is provided to the decoder as input, but it is shifted by one position. This means that the decoder receives all tokens in the target sequence except for the last one. This is done to ensure that the model learns to predict the next token in the sequence given the previous tokens.
  • Encoder output (K, V): The hidden states produced by the encoder, which represent the input sequence and from which the K and V vectors are derived. These are used in the encoder-decoder attention mechanism within the decoder layers.

It is very important to understand one key distinction about the decoder.

If the input sequence length is different from the target sequence length, then the number of Keys (and Values) will differ from the number of Queries fed to the decoder (since the target sequence is different). Please spend a bit of time understanding this very important nuance.

Let’s say the input sequence length for the encoder is 10 and the target sequence length fed to the decoder is 20. Then you will have 10 (K, V) pairs and 20 Q vectors.

The softmax in the decoder will compute the attention score of each of the 20 Qs with respect to each of the 10 Ks (q1 * k1, q1 * k2, …, q1 * k10) all the way through (q20 * k1, …, q20 * k10).

So the way you compute the context of the i-th query, the attention vector A_i, still does not change (a shape-level sketch follows the list below):

A_i = Σ_{j=1}^{10} Attention(Q_i, K_j) · V_j

where,

  • Ai​ represents the aggregated attention vector for the i-th query vector in the decoder’s output sequence.
  • Qi​ is the i-th query vector in the decoder, and there are 20 such vectors (i ranges from 1 to 20).
  • Kj​ and Vj​ are the key and value vectors in the encoder, respectively, with 10 pairs of these (j ranges from 1 to 10).
  • Attention(Qi​,Kj​) is the attention score between the i-th query vector and the j-th key vector, and these scores are used as weights for the corresponding value vectors Vj​.
  • The summation sums these weighted value vectors for each query vector Qi​, resulting in the final attention vector Ai​.
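Here is a shape-level sketch of that computation, with the 20 decoder queries and 10 encoder key/value pairs from the example above (random values stand in for the real vectors):

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
Q_dec = rng.normal(size=(20, d_k))    # 20 query vectors from the decoder side
K_enc = rng.normal(size=(10, d_k))    # 10 key vectors imported from the encoder
V_enc = rng.normal(size=(10, d_k))    # 10 value vectors imported from the encoder

scores = Q_dec @ K_enc.T / np.sqrt(d_k)            # (20, 10): every Q_i against every K_j
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)          # softmax over the 10 keys for each query
A = weights @ V_enc                                # (20, 64): one attention vector A_i per query
print(scores.shape, A.shape)
```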

Outputs of the decoder:

  • The decoder outputs a probability distribution over the target vocabulary for each position in the output sequence. Remember that the target vocabulary for GPT3 is 50,257 BPE tokens. So the logit size for decoder output is 50,257 dimensions as well. These probability distributions represent the model’s prediction for the next token in the sequence at each position.
  • During training, the predicted probability distributions are compared to the true target sequences, and a loss function (usually cross-entropy loss) is calculated. This loss is minimized during training to improve the model’s ability to generate correct output sequences.
  • During inference (i.e., when the model is used to generate output sequences for new inputs), the decoder’s input is the previously generated tokens.
  • The process starts with a special start-of-sequence token, and at each step, the model predicts the next token based on the previous ones. This can be done using greedy decoding, beam search, or other decoding strategies to generate the final output sequence.

During both training and inference, the output of the decoder is a probability distribution over the target vocabulary (50,257) for each position in the output sequence. These distributions represent the logits after passing through the final linear layer and softmax activation in the decoder.

In transformer models, the logits are not unique tokens themselves. Instead, they are vectors of raw scores produced for each position in the sequence, representing the model’s confidence in each possible token in the vocabulary. The softmax function is applied to these logits to produce a probability distribution over the vocabulary for each position.

Logits Matrix: For a vocabulary size V and sequence length L = 4, the logit layer produces a logits matrix Z with dimensions L×V:

Z = [Z_1; Z_2; Z_3; Z_4], with each Z_i ∈ R^V

Each Z_i is a vector of length V, containing the raw scores for all tokens in the vocabulary at the i-th position.

However, the process of selecting the predicted tokens from these probability distributions is different during training and inference.

During training, the model predicts probability distributions of the logits for all positions in the output sequence simultaneously, and the loss function is calculated based on the difference between the model’s predictions and the true target tokens. The model is not explicitly required to produce a single token prediction for each position during training, as the entire probability distribution is used for loss calculation and backpropagation.

During inference, the model generates the output sequence one token at a time. At each step, the decoder predicts a probability distribution over the target vocabulary for the next position in the output sequence. From this distribution, a single token is selected as the prediction for that position, using a decoding strategy such as:

  • Greedy decoding: Choose the token with the highest probability.
  • Beam search: Maintain a fixed number of candidate sequences (beams) and expand them based on the token probabilities. Prune less likely beams at each step and continue with the most probable ones.
  • Top-k sampling or top-p sampling: Sample a token from the top-k or top-p most probable tokens, introducing some randomness into the decoding process.

So, while the decoder’s output remains a probability distribution over the target vocabulary in both training and inference, the way these probability distributions are used to generate the final output sequence differs, with single token predictions being produced during inference.

To understand this better, let’s consider a machine translation task where the goal is to translate a sentence from English to French using a Transformer model. Suppose we have the following English sentence as input:

English sentence (input to the encoder): “I love learning languages.”

During inference, the decoder generates the French translation one token at a time. We’ll demonstrate this process using greedy decoding, beam search, and top-k sampling.

Greedy decoding:

The decoder starts with the start-of-sequence (SOS) token and generates the output sequence step by step.

  • Input: [SOS]
  • Output probabilities: [0.01, 0.03, 0.05, …, 0.4, …, 0.01] (assume the highest probability, 0.4, corresponds to “J’aime”). Again, the size of this logit will be 50,257.
  • Selected token: “J’aime”
  • Input: [SOS] “J’aime”
  • Output probabilities: [0.01, 0.02, 0.03, …, 0.5, …, 0.01] (assume the highest probability, 0.5, corresponds to “apprendre”)
  • Selected token: “apprendre”

This process continues until the end-of-sequence (EOS) token is generated or a maximum output length is reached.

Beam search (assume beam size = 2):

The decoder maintains two candidate sequences (beams) at each step, and expands them based on the token probabilities.

Step 1:

  • Beam 1: [SOS] “J’aime” (probability: 0.4)
  • Beam 2: [SOS] “Je” (probability: 0.3, assume “Je” is the second most probable token)

Step 2:

  • Beam 1: [SOS] “J’aime” “apprendre” (cumulative probability: 0.4 * 0.5 = 0.2)
  • Beam 2: [SOS] “Je” “adore” (cumulative probability: 0.3 * 0.4 = 0.12, assume “adore” is the most probable token after “Je”)

The process continues with the most probable beams until the EOS token is generated or a maximum output length is reached.

Top-k sampling (assume k = 3):

The decoder samples a token from the top-k most probable tokens at each step, introducing randomness.

  • Input: [SOS]
  • Top-3 output probabilities: [(“J’aime”, 0.4), (“Je”, 0.3), (“J’adore”, 0.2)]
  • Sampled token: “J’aime” (chosen randomly based on their probabilities)
  • Input: [SOS] “J’aime”
  • Top-3 output probabilities: [(“apprendre”, 0.5), (“étudier”, 0.3), (“pratiquer”, 0.1)]
  • Sampled token: “étudier” (chosen randomly based on their probabilities)

The process continues until the EOS token is generated or a maximum output length is reached.

In each of these decoding strategies, the decoder generates output tokens one at a time based on the probability distributions over the target vocabulary.
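The strategies differ only in how a token is picked from each step’s distribution. Here is a sketch of greedy and top-k selection over a toy distribution (the vocabulary and probabilities are invented for the illustration; beam search additionally keeps several partial sequences alive at once):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["J'aime", "Je", "J'adore", "apprendre", "étudier", "[EOS]"]
# Toy next-token probabilities for one decoding step (a real model emits ~50k of them)
probs = np.array([0.40, 0.30, 0.20, 0.05, 0.04, 0.01])

# Greedy decoding: always pick the most probable token
greedy_token = vocab[int(np.argmax(probs))]

# Top-k sampling (k = 3): renormalize the k most probable tokens and sample one
k = 3
top_k = np.argsort(probs)[-k:]
top_k_probs = probs[top_k] / probs[top_k].sum()
sampled_token = vocab[rng.choice(top_k, p=top_k_probs)]

print(greedy_token)    # "J'aime"
print(sampled_token)   # one of "J'aime", "Je", "J'adore"
```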

Note: The size of the logit layer in a decoder depends on the size of the target vocabulary. In the case of GPT-3, BPE tokenization reduces the overall vocabulary to 50,257 tokens, and this is the size of the logit layer.

NOTE of Efficiency: Recomputing the entire sequence for each new token can be inefficient. In practice, transformer models often use caching to avoid redundant computations. Here’s how:

  1. Cache Key and Value States:
  • During each step of generation, the model caches the key and value states computed for the self-attention and encoder-decoder attention layers.
  • For generating the (n+1)-th token, the model only computes the new states for the (n+1)-th position and reuses the cached states for the previous n positions (see the sketch below).
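Here is a minimal sketch of the caching idea, with a toy per-step attention over cached key/value states. The shapes and the dummy “projections” (simple slices of the token embedding) are illustrative only, not an actual transformer layer:

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
k_cache, v_cache = [], []                 # grows by one entry per generated token

def generate_step(new_token_embedding: np.ndarray) -> np.ndarray:
    """Compute K/V only for the newest position; reuse the cache for older ones."""
    # Hypothetical stand-in projections; a real model applies its learned W_k, W_v
    k_new = new_token_embedding[:d_k]
    v_new = new_token_embedding[d_k:]
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)                 # (t, d_k) -- old positions are never recomputed
    V = np.stack(v_cache)
    q = new_token_embedding[:d_k]         # query for the newest position only
    scores = K @ q / np.sqrt(d_k)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V                          # context vector for the new token

for _ in range(5):                        # pretend to generate 5 tokens
    out = generate_step(rng.normal(size=2 * d_k))
print(len(k_cache), out.shape)            # 5 cached K/V entries, context of shape (64,)
```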

Step 7 — The Masked Attention Layer

The masked attention layer on the decoder side works by masking out the future positions in the input sequence.

This is done to prevent the decoder from attending to future positions, which would allow it to “cheat” and see the next word in the sequence. The masking is done by setting the pre-softmax attention scores for future positions to a very large negative value, such as -1e9, so that they receive near-zero weight after the softmax. This effectively prevents the decoder from attending to those positions.

The masked attention layer is important for two reasons.

  • First, it prevents the decoder from overfitting the training data. If the decoder were able to attend to future positions, it would be able to memorize the training data and repeat it back during testing. This would not be a very useful model.
  • Second, the masked attention layer helps the decoder to focus on the current context. By preventing the decoder from attending to future positions, the masked attention layer forces the decoder to focus on the words that have already been generated. This helps the decoder to generate more accurate and fluent text.

How does Masked-Attention work?

Assume we have an input sequence of length N, which has been encoded into a sequence of key vectors K with dimensionality d_k, and a sequence of value vectors V with dimensionality d_v. We also have a decoder that generates an output sequence of length M, where M is typically shorter than N. The decoder generates the output sequence one token at a time, by attending to the encoded input sequence and the previously generated tokens.

During decoding, we need to ensure that the decoder attends only to the previously generated tokens, and not to the future tokens that have not yet been generated. To achieve this, we use a mask matrix that has dimensions (M, N) and is filled with ones and negative infinity. The mask matrix is used to mask out the future positions in the attention scores, by setting their values to negative infinity.

NOTE: You cannot have zeroes to denote a mask as zero is considered as information.

Here is a depiction of self-attention where the transduction sequence is learning on itself (this is not the translation use-case).

The masked attention layer can be defined mathematically as follows:

MaskedAttention(Q, K, V) = softmax( mask(Q·K^T) / √d_k ) · V

where mask(·) sets the scores of all future positions to negative infinity (for example, by combining the scores element-wise with the mask matrix described above), so that those positions receive zero weight after the softmax. Q is the query vector of the current token (the input embedding X from the target sequence multiplied by Wq in the decoder); K and V are the keys and values being attended to (in the decoder’s masked self-attention these come from the target sequence itself, while the encoder’s K and V are used in the next sub-layer, described in Step 8).

NOTE: You can also add the mask instead of applying it element-wise. In that case, you use zeros for pass-through positions (instead of ones) and retain negative infinity for blocking.

You can notice with the following illustration how the mask will block the Query from looking ahead.
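Here is a sketch of a causal mask for a 4-token sequence and its effect on the attention weights, using the additive form mentioned in the note above (sizes and values are illustrative):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # raw Q.K^T / sqrt(d_k) scores for 4 tokens

# Additive causal mask: 0 where attending is allowed (j <= i), -inf for future positions
mask = np.triu(np.full((n, n), -np.inf), k=1)
masked_scores = scores + mask

weights = np.exp(masked_scores - masked_scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

print(np.round(weights, 2))
# Upper-triangular entries are exactly 0: token i never attends to tokens j > i
```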

Step 8 — Decoder side Q with K, V imports.

While the remaining layers of the decoder are exactly the same as the encoder’s, an important distinction to understand is that in the encoder-decoder attention sub-layer the decoder does not generate its own Key or Value vectors. Instead, it borrows the K and V from the encoder side.

The decoder’s attention mechanism only needs to attend to the encoder’s output sequence and the previously generated tokens of the output sequence. Therefore, the decoder uses the same K and V vectors generated by the encoder for all the tokens in the output sequence. This is different from the encoder, which generates a unique set of K and V vectors for each token in the input sequence.

By sharing the K and V vectors across all the tokens in the output sequence, the Transformer model reduces the computational cost of the attention mechanism and allows for faster training and inference.

Additionally, sharing the K and V vectors across all the tokens allows the model to attend to the same parts of the input sequence regardless of the current output token, which can improve the overall coherence and consistency of the output sequence.

INSIGHT: Think of the decoder side as a query lookup mechanism. The decoder creates a query vector Q for every token in the sample that it needs to decode. The decoder scores the query Q from decoder side against the keys K from the encoder. This provides insight to the decoder on the most useful context to attend from the input sequence fed to the encoder. Once the weights are scored in conjunction with the key from the encoder, the decoder needs to find the value vector that serves the most semantically and contextually useful meaning for decoding.

End Game

The remaining layers of the decoder are exactly the same as the encoder. Here are some important things to know.

The end-to-end process for training:

  • Data preparation: Collect a parallel corpus of source and target language sentences (e.g., English-French sentence pairs). Tokenize the sentences into words or subwords using a suitable tokenizer. Add special start-of-sequence (SOS) and end-of-sequence (EOS) tokens to the target sentences.
  • Batching: Organize the tokenized sentences into mini-batches for more efficient training. Each mini-batch contains a set of source and target sentence pairs.
  • Masking: Create masks for the input sequences (source and target) to prevent the model from attending to padding tokens.
  • Encoder input: Pass the source sentences through the Transformer’s encoder. The encoder consists of multiple layers with multi-head self-attention, layer normalization, position-wise feedforward networks, and another layer normalization. The encoder outputs a sequence of hidden states representing the input sequence.
  • Decoder input: Pass the target sentences (shifted by one position) through the Transformer’s decoder. The decoder also has multiple layers, each with multi-head self-attention, layer normalization, encoder-decoder attention, another layer normalization, position-wise feedforward networks, and final layer normalization.
  • Encoder-decoder attention: During the encoder-decoder attention step in the decoder layers, the decoder attends to the hidden states from the encoder output. This helps the decoder to focus on relevant parts of the source sentence when generating the target sentence.
  • Output probabilities: The output of the final decoder layer is passed through a linear layer followed by a softmax activation to produce a probability distribution over the target vocabulary for each position in the output sequence. These distributions represent the model’s predictions for the next token in the sequence at each position.
  • Loss calculation: Compare the predicted output sequence with the actual target sequence (including the EOS token) using a suitable loss function, such as cross-entropy loss. The loss measures the difference between the predicted probabilities and the true target token probabilities.
  • Backpropagation: Compute the gradients of the loss with respect to the model’s parameters (weights and biases) using the backpropagation algorithm.
  • Optimization: Update the model’s parameters using an optimization algorithm, such as Adam or SGD, based on the computed gradients. This step adjusts the model’s parameters to minimize the loss.
  • Repeat: Repeat steps above for multiple epochs or until a stopping criterion is met (e.g., the loss converges, or the model’s performance on a validation set stops improving). Process different mini-batches in each iteration.
  • Validation: Periodically evaluate the model’s performance on a separate validation dataset to monitor its generalization ability and prevent overfitting. Adjust hyperparameters or apply early stopping based on the validation performance.

How does the transformer deal with input and output sequences with differing lengths?

  • In the encoder-decoder attention mechanism, the decoder queries (Q) are used to attend to the keys (K) and values (V) from the encoder.
  • This process allows the decoder to compute context-sensitive weighted sums of the encoder’s value vectors based on the similarity between the decoder’s query vectors and the encoder’s key vectors. The attention mechanism does not rely on the assumption that the input and output sequence lengths are the same. It calculates the attention scores and context vectors independently for each position in the output sequence, based on the relationships between the decoder’s queries and the encoder’s keys.
  • Remember that the attention is calculated for a given Q on all the [K, V] pairs. So the decoder does not care about the size of the input or the output sequence as it is computing the current Q on all the [K, V] pairs from the encoder.
  • So, even though the number of keys and values in the encoder and decoder may differ due to different input and output sequence lengths, the Transformer can still effectively calculate the encoder-decoder attention and generate appropriate output sequences.

Advantages of Multi-head attention attending to different subspaces:

  • Adaptability: The attention mechanism can adapt to the specific structure and dependencies in each input sequence, allowing the model to learn and represent complex relationships between tokens. This adaptability enables the model to better handle varying input sequences and generalize to unseen data.
  • Context-aware representation: Different attention matrices for each sample enable the model to focus on different parts of the input sequence when generating the context vectors. This context-aware representation allows the model to better capture the meaning of each token in the context of the entire input sequence, improving the output quality, especially in tasks such as machine translation or text summarization.
  • Parallelism: In multi-head attention, each attention head can learn different relationships between tokens simultaneously, resulting in multiple attention matrices for each sample. This parallelism allows the model to capture different types of relationships (e.g., syntactic, semantic, or long-range dependencies) more effectively, enhancing its overall performance.
  • Interpretability: Different attention matrices for each sample can provide some level of interpretability by visualizing the attention weights. This can help in understanding how the model is attending to different parts of the input sequence and might reveal insights into the learned relationships between tokens.

Phew, that’s it for now, folks. This blog has run its course. I will dive deep into more math in the next set of blogs. Happy training.
