PART 1 — The First GAI-LLM: The Original “Transformer” Embodied the Foundational Concepts of Modern GAI (LLM/LVM/LAM/LTM), RAG, Prompt Tuning, and More

Trishul Chowdhury
17 min read · Aug 2, 2024


Created by the Author

You can refer to my other two articles in this series to gain a better understanding and grasp of this world of GAI:

  1. PART 2 — The “Attention-Based” Model Chronicles: A Comedic Breakdown, link: https://medium.com/@trishulchowdhury.23/part-2-the-attention-based-model-chronicles-a-comedic-breakdown-389adb869d25
  2. PART 3 — Decoding Attention: The Magic Behind Generative & Contextual AI Models (Transformer, Llama, GPT, BERT, etc.), link: https://medium.com/@trishulchowdhury.23/part-3-decoding-attention-the-magic-behind-generative-contextual-ai-models-transformer-8dc92d35d8e0

Content:

🌐 Understanding the current edge of GAI (Generative AI) and CAI (Contextual AI) in the context of Narrow AI (ANI), General AI (AGI), and Super AI (ASI)

🧩 Demystifying Common Misconceptions

  • 🤔 Contextual versus generative nature of LLMs
  • 🤔 LLM/LVM/LAM/LTM can be both GAI & CAI — Don’t be confused
  • 📚 Retrieval-Augmented Generation (RAG), and prompt tuning fundamentally rooted in the original Transformer architecture
  • 🔄 Why Is the Original Transformer Considered Both a GAI and an LLM?
  • ✍️ Prompt Engineering is not Prompt Tuning

📘 INTUITIVE Training Process of the Original Transformer with an Example

  • 🔗 Connect the dots with Retrieval-Augmented Generation (RAG)
  • 🔗 Connect the dots with prompt tuning
  • 🔄 Intuitive understanding of Cross-Attention (Encoder-Decoder Attention)
  • 💭 Intuition of Causal Language Model (CLM)

“All generative LLMs are part of generative AI, but not all generative AI models are LLMs. LLMs can be either generative or focused on understanding, depending on their architecture and training tasks. This distinction helps in choosing the right model for specific applications, whether it’s for generating human-like text or understanding and analyzing language”

“If we broadly divide current AI in terms of capability into three categories — Narrow AI (ANI), General AI (AGI), and Super AI (ASI) — then BERT and Transformer models certainly fall under ANI, while models like LLaMA 3 and GPT-3.5/4 fall under AGI when considering their scale and versatility. However, when we consider their generative (GAI) and contextual (CAI) nature with respect to large language models (LLMs), the Transformer can be seen as the first Generative and Contextual AI. Based on the Transformer’s foundational principle (attention), various large models for vision, audio, and time series have been developed.”

Demystifying Common Misconceptions

In this article, I aim to address some prevalent misconceptions within the AI community regarding Generative AI (GAI), the Large Language Model (LLM) ecosystem, and the contextual versus generative nature of LLMs. We will provide a detailed mapping of the Transformer’s attention mechanisms to current GAI advancements.

The AI community often heralds concepts like Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and prompt tuning as groundbreaking innovations. However, these advancements are fundamentally rooted in the original Transformer architecture introduced by Vaswani et al. in 2017, the first true Generative AI-Large Language Model (GAI-LLM). The Transformer’s encoder-decoder framework established the core mechanisms of attention, context understanding, and sequence generation that underpin these modern advancements.

As we all know, the original Transformer model, introduced in the groundbreaking paper “Attention Is All You Need,” revolutionized natural language processing by incorporating self-attention (full and masked) and cross-attention mechanisms. This innovation allowed the model to process sequences in parallel, significantly improving both performance and scalability.

Before we dive in, here are a few key concepts to remember. I know they are basic, but they often get mixed up.

🔍 Attention: An algorithm for understanding the context of language.

  1. Encoder Self-Attention: This mechanism allows each token in the input sequence to attend to all other tokens, capturing contextual relationships effectively. This type of attention is referred to as full attention.
  2. Decoder Self-Attention: In the decoder, self-attention is masked to ensure that each position can only attend to previous positions, not future ones. This is crucial for autoregressive tasks like language modeling and is known as masked self-attention.
  3. Encoder-Decoder Attention: This mechanism enables the decoder to attend to the encoder’s output, allowing the model to incorporate information from the entire input sequence while generating each token in the output sequence. This is also called cross-attention.

🔄 Encoder and Decoder: Components/models that use the attention mechanism differently to understand the context of language.

  • Models: Transformers, BERT, GPT, LLaMA, etc., use these encoder and decoder components innovatively. Some focus solely on contextual understanding (BERT), while others also add generative power (LLaMA, GPT, original Transformer)

Why Is the Original Transformer Considered Both a GAI and an LLM?

The original Transformer can be viewed as both a Generative AI (GAI) and a Large Language Model (LLM) due to its ability to both generate and understand language effectively: the original model “generates” a translation of a given input text.

Let’s now dive deep into the training and inference process of the original Transformer model and understand why it’s the first Generative AI LLM (GAI-LLM).

Training Process of the Original Transformer with an Example

The training of the original Transformer’s encoder and decoder involves a translation task in a supervised manner.

Example:

  • 🇬🇧 Input: “I love science.” (English)
  • 🇪🇸 Output: “Amo la ciencia.” (Spanish)

During the training process of this deep neural network, the model learns the contextual understanding of the language by adjusting its weights based on these input-output pairs.

Created by the Author

Step-by-Step Process with an example

1. Tokenization

· Encoder Input Tokenization

  • Input: “I love science.”
  • Tokenized: [“I”, “love”, “science”, “.”]
  • Converted to IDs : [ID1, ID2, ID3, ID4]

Intuition: Tokenization breaks down text into smaller pieces (tokens), which are then converted to numerical IDs that the model can process.

· Decoder Input Tokenization:

  • Input: “Amo la ciencia.”
  • Tokenized: [“Amo”, “la”, “ciencia”, “.”]
  • Converted to IDs: [ID5, ID6, ID7, ID8]

Intuition: Similar to the encoder, the decoder input is tokenized and converted to IDs to make it understandable for the model.

** We have to use the same tokenizer for the tokenization and detokenization processes during both training and inference.
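
To make this concrete, here is a minimal sketch of word-level tokenization with a made-up vocabulary; real Transformer implementations use learned subword tokenizers (e.g., BPE), so the token splits and IDs below are purely illustrative.

```python
# Hypothetical word-level vocabularies; actual IDs depend on the trained tokenizer.
src_vocab = {"<pad>": 0, "I": 1, "love": 2, "science": 3, ".": 4}
tgt_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "Amo": 3, "la": 4, "ciencia": 5, ".": 6}

def tokenize(text):
    # Naive whitespace + punctuation split, for illustration only.
    return text.replace(".", " .").split()

def encode(tokens, vocab):
    # Map each token string to its integer ID.
    return [vocab[tok] for tok in tokens]

src_ids = encode(tokenize("I love science."), src_vocab)   # e.g. [1, 2, 3, 4]
tgt_ids = encode(tokenize("Amo la ciencia."), tgt_vocab)   # e.g. [3, 4, 5, 6]
print(src_ids, tgt_ids)
```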

2. Embedding

· Encoder Embedding:

  • Converts token IDs to dense vectors.
  • Input IDs: [ID1, ID2, ID3, ID4]
  • Output: [Emb1, Emb2, Emb3, Emb4]
  • Mathematically: Emb_i = W_E[ID_i] × √d_model, where W_E is the learned embedding matrix of shape (vocab_size × d_model).

Intuition: Embeddings transform sparse token IDs into dense vectors, capturing semantic meaning and relationships between words.

· Decoder Embedding:

  • Converts token IDs to dense vectors.
  • Input IDs: [ID5, ID6, ID7, ID8]
  • Output: [Emb5, Emb6, Emb7, Emb8]
  • Mathematically: Emb_j = W_E[ID_j] × √d_model, the same lookup applied to the target-side token IDs.

Intuition: Similarly, the decoder embeddings provide meaningful dense representations of the target tokens.
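
As a small illustration, the sketch below performs the embedding lookup with a tiny, made-up vocabulary size and d_model; the embedding matrix W_E would be learned during training, and the √d_model scaling follows the original paper.

```python
import numpy as np

d_model, vocab_size = 8, 10                      # tiny illustrative sizes
rng = np.random.default_rng(0)
W_E = rng.normal(size=(vocab_size, d_model))     # learned embedding table (random stand-in)

token_ids = np.array([1, 2, 3, 4])               # [ID1, ID2, ID3, ID4]
embeddings = W_E[token_ids] * np.sqrt(d_model)   # scale by sqrt(d_model), as in the paper
print(embeddings.shape)                          # (4, 8): one d_model-dimensional vector per token
```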

3. Positional Encoding

· Encoder Positional Encoding:

  • Adds positional information to embeddings.
  • Input: [Emb1, Emb2, Emb3, Emb4]
  • Output: [EncEmb1, EncEmb2, EncEmb3, EncEmb4]
  • Mathematically: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); EncEmb_i = Emb_i + PE(pos_i).

Intuition: Positional encoding provides the model with information about the position of each token in the sequence, enabling it to capture order and structure.

· Decoder Positional Encoding:

  • Adds positional information to embeddings.
  • Input: [Emb5, Emb6, Emb7, Emb8]
  • Output: [DecEmb5, DecEmb6, DecEmb7, DecEmb8]
  • Mathematically: DecEmb_j = Emb_j + PE(pos_j), using the same sinusoidal formulas as in the encoder.

Intuition: The positional encoding in the decoder serves the same purpose, helping to maintain the positional context of each token.
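
The sketch below implements the sinusoidal positional encoding from the paper and adds it to the token embeddings; the sequence length and d_model are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

d_model = 8
embeddings = np.random.default_rng(0).normal(size=(4, d_model))  # [Emb1..Emb4] (random stand-in)
enc_inputs = embeddings + positional_encoding(4, d_model)        # [EncEmb1..EncEmb4]
```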

4. Encoder Processing

Multi-Head Self-Attention:

For each token’s combined embedding, the process is as follows:

  • Input: [EncEmb1, EncEmb2, EncEmb3, EncEmb4]

Intuition: Multi-head self-attention allows the model to focus on different parts of the input sequence simultaneously, capturing various aspects of the context.

  • Perform linear transformations on the input embeddings to obtain Queries (Q), Keys (K), and Values (V) for each head.
  • Compute scaled dot-product attention scores, letting every token attend to every other token in the sequence.
  • The outputs from all heads are concatenated and passed through a final linear layer.
  • The result is a set of self-attended embeddings for each token in the sequence.
  • These token-level embeddings contain rich contextual information, enhancing the model’s ability to understand and generate text (a single-head sketch follows below).
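
As a rough illustration of what one attention head computes, here is a single-head sketch of the scaled dot-product self-attention; real models use multiple heads with separate learned projections, and the weights below are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # every token scores every other token (full attention)
    weights = softmax(scores, axis=-1)       # attention weights per token
    return weights @ V                       # context-enriched token vectors

d_model = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))            # [EncEmb1..EncEmb4] (random stand-in)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
attended = self_attention(X, W_q, W_k, W_v)  # (4, d_model) self-attended embeddings
```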

*** 🛤️ Little Digression 🛤️

Connect the dots with Retrieval-Augmented Generation (RAG)

We know that in RAG, the retriever fetches relevant documents or pieces of information from a knowledge base based on the input query.

🌟 Analogy: The Transformer encoder generates contextual representations from the input sequence using the self-attention mechanism (Q, K, V). The retrieved information in RAG is akin to the encoder’s output (K and V) consumed by the encoder-decoder attention layer, providing contextual data to enhance the generation process.

Connect the dots with prompt tuning

Philosophically, prompt tuning can be traced back to the mechanisms described in the original Transformer architecture by Vaswani et al. (2017). The core innovation of the Transformer was its self-attention mechanism, which allowed each token in the input sequence to attend to every other token, creating rich contextualized representations.

🌟 Analogy: Prompt tuning involves integrating specific prompts into the input sequence to steer the behavior of a pre-trained model towards a particular task. In the same way, it leverages the Transformer’s self-attention mechanism, which processes the combined sequence of the original input and the prompt, influencing the model’s predictions.

In essence, while prompt tuning as a technique was not explicitly described in the original Transformer paper, the foundational principles of using context to guide model behavior were inherently present. The self-attention mechanism, with its Q, K, and V matrices, set the stage for prompt tuning by demonstrating how contextual information could be integrated into model predictions, effectively steering the model’s outputs.
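
To picture the mechanic, here is a minimal sketch (not any specific library’s API) of the core idea behind prompt tuning: a few learnable “soft prompt” vectors are prepended to the frozen token embeddings, and self-attention then conditions every token on them.

```python
import numpy as np

d_model, n_prompt = 8, 2
rng = np.random.default_rng(0)

soft_prompt = rng.normal(size=(n_prompt, d_model))   # trainable soft-prompt embeddings
token_embs = rng.normal(size=(4, d_model))           # frozen embeddings of the actual input text

# The combined sequence is what the (frozen) Transformer layers see; during prompt
# tuning, only soft_prompt would receive gradient updates.
model_input = np.concatenate([soft_prompt, token_embs], axis=0)
print(model_input.shape)                             # (2 + 4, d_model)
```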

“Prompt Engineering is not Prompt Tuning”

The Transformer model introduced training on specific tasks, paving the way for fine-tuning, which adapts pre-trained models to specific downstream tasks.

The Newer Fine-Tuning Techniques

🔧 PEFT (Parameter-Efficient Fine-Tuning): Techniques that optimize only a subset of parameters for enhanced efficiency.

  • 🧩 LoRA (Low-Rank Adaptation): Reduces the number of parameters needed by focusing on low-rank adaptations.
  • 🧮 QLoRA (Quantized Low-Rank Adaptation): Combines quantization and low-rank adaptation for even more efficient fine-tuning.
  • 🌀 GaLore (Gradient Low-Rank Projection): Utilizes gradient-based low-rank projections to optimize parameter updates.

Technically, yes, you can fine-tune BERT using Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA, QLoRA, and GaLore. But the real question is: do you need to? And will it enhance or degrade accuracy?

🧠 Think Through

*** 🎯 Refocusing to our Encoder processing step 🎯

Add & Normalize after Multi-Head Attention

  • Adds the input embeddings to the attention output and normalizes.
  • Input: [EncEmb1, EncEmb2, EncEmb3, EncEmb4]
  • Normalized Outputs: [Norm1, Norm2, Norm3, Norm4]
  • Mathematically: Norm_i = LayerNorm(EncEmb_i + MultiHeadAttention(EncEmb)_i).

Intuition: The normalization step stabilizes the training process by ensuring consistent scaling of the inputs.
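
A minimal sketch of this residual-plus-LayerNorm step is shown below, with the learnable gain and bias omitted for brevity and a random stand-in for the attention output.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # [EncEmb1..EncEmb4]
sublayer_out = rng.normal(size=(4, 8))   # stand-in for the multi-head attention output
normed = layer_norm(x + sublayer_out)    # [Norm1..Norm4] = LayerNorm(x + Sublayer(x))
```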

Feed-Forward Neural Network:

  • Applies a feed-forward network to the normalized output.
  • Input: [Norm1, Norm2, Norm3, Norm4]
  • Feed-Forward Outputs: [FF1, FF2, FF3, FF4]
  • Mathematically: FF_i = max(0, Norm_i · W1 + b1) · W2 + b2.

Intuition: The feed-forward network processes each token’s contextual information independently, applying a non-linear transformation.
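
Here is a minimal sketch of the position-wise feed-forward network, FFN(x) = max(0, x·W1 + b1)·W2 + b2, with tiny illustrative sizes (the paper uses d_model = 512 and d_ff = 2048).

```python
import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # ReLU expansion to d_ff, then projection back to d_model, applied per token.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

normed = rng.normal(size=(4, d_model))   # [Norm1..Norm4] (random stand-in)
ff_out = ffn(normed)                     # [FF1..FF4]
```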

Add & Normalize after Feed Forward

  • Adds the normalized input to the feed-forward output and normalizes.
  • Input: [FF1, FF2, FF3, FF4]
  • Final Encoder Outputs: [EncOut1, EncOut2, EncOut3, EncOut4]
  • Mathematically: EncOut_i = LayerNorm(Norm_i + FF_i).

Intuition: This step further refines the output, ensuring stability and consistency.

The encoder consists of multiple stacked layers to progressively build richer and more complex representations of the input sequence. Each encoder layer applies self-attention and feed-forward neural networks, allowing the model to capture intricate relationships and dependencies in the data.

After the final encoder layer, the output is a sequence of contextually enriched token embeddings. These embeddings are then passed to the decoder, where they serve as the context for generating the target sequence. The decoder uses this context to attend to relevant parts of the input while generating each token in the output sequence.

In the original Transformer model, the encoder’s final output provides the key (K) and value (V) components for the decoder. The query (Q) component in the encoder is used internally within each encoder layer for self-attention, enabling each token to attend to other tokens in the sequence.

5. Decoder Processing

First, let’s discuss the processing of the decoder’s own input, i.e., “Amo la ciencia.” (Spanish). The following steps occur after tokenization and embedding:

Masked Multi-Head Self-Attention:

  • Input: [DecEmb5, DecEmb6, DecEmb7, DecEmb8]
  • Masked to prevent attending to future tokens.

Intuition: Masked self-attention ensures that the decoder generates each token based only on previously generated tokens, preserving the autoregressive property.
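
The sketch below shows how the causal mask is applied: attention scores above the diagonal are set to -inf before the softmax, so each token can attend only to itself and earlier tokens. The projection weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, seq_len = 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))                        # [DecEmb5..DecEmb8] (random stand-in)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)  # True above the diagonal (future tokens)
scores[mask] = -np.inf                                         # block attention to future positions
masked_attention_out = softmax(scores, axis=-1) @ V
```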

Add & Normalize:

  • Adds the input embeddings to the masked attention output and normalizes.
  • Input: [DecEmb5, DecEmb6, DecEmb7, DecEmb8]
  • Output: [DecNorm1, DecNorm2, DecNorm3, DecNorm4]
  • Mathematically: DecNorm_j = LayerNorm(DecEmb_j + MaskedMultiHeadAttention(DecEmb)_j).

Intuition: This step ensures that the added and transformed outputs maintain stability in scale.

Now let’s discuss the input from the encoder (key and value) and the output of the decoder’s masked self-attention layer (query).

Cross-Attention (Encoder-Decoder Attention):

  • Allows the decoder to attend to the encoder’s output.
  • Input: [DecNorm1, DecNorm2, DecNorm3, DecNorm4] and [EncOut1, EncOut2, EncOut3, EncOut4]

Intuition: Cross-attention allows the decoder to leverage information from the entire input sequence, aiding in generating accurate translations.

In the context of the Transformer model, Cross-Attention (Encoder-Decoder Attention) is neither full attention nor masked attention. It is specifically designed to allow the decoder to attend to the encoder’s outputs, incorporating information from the entire input sequence. Let’s clarify this in detail:

*** 🛤️ Little Digression 🛤️

Cross-Attention (Encoder-Decoder Attention)

In the decoder, the Q component comes from the previously generated tokens in the target sequence. During each decoding step, these queries are used to attend to the encoder’s keys and values, helping the model generate the next token in the output sequence based on the encoded input context and the already generated output.

Cross-Attention allows each position in the decoder to attend to all positions in the encoder. This enables the decoder to utilize the contextual information from the input sequence (processed by the encoder) to generate the output sequence.
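
A minimal sketch of this cross-attention step, with random placeholder weights: queries come from the decoder states, keys and values come from the encoder output, and no mask is applied.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 8
rng = np.random.default_rng(0)
dec_states = rng.normal(size=(4, d_model))   # [DecNorm1..DecNorm4]  -> queries
enc_out = rng.normal(size=(4, d_model))      # [EncOut1..EncOut4]    -> keys and values
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = dec_states @ W_q
K, V = enc_out @ W_k, enc_out @ W_v
scores = Q @ K.T / np.sqrt(d_model)          # (decoder positions) x (encoder positions), unmasked
cross_attended = softmax(scores, axis=-1) @ V
```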

Characteristics of Cross-Attention:

  • Not Full Attention: Full attention typically refers to self-attention within the encoder or decoder, where each position in the sequence attends to all other positions in the same sequence.
  • Not Masked Attention: Masked attention is used in the decoder’s self-attention mechanism to ensure that each position only attends to previous positions, maintaining the autoregressive property.
  • Cross-Attention: Specifically, in cross-attention, each position in the decoder can attend to all positions in the encoder’s output. There is no masking involved because the decoder is supposed to utilize the full context from the encoder.

Cross-encoders extend the transformer model by jointly encoding pairs of sentences or queries and documents. This allows the model to consider both inputs simultaneously, leading to more accurate and context-aware representations.

They are particularly effective for tasks like text ranking and re-ranking, where the relationship between two pieces of text is crucial. The principles and mechanisms of the Transformer architecture directly influenced the development of cross-encoder models, enhancing their capability to handle complex tasks involving paired inputs. 🧠 Think Through

RAG — Augment and Generation Component

  • Function: Takes the retrieved information (augmented prompt) and uses it to generate a response or output sequence.
  • Analogy: This mirrors how the Transformer decoder uses the encoder’s output to generate the final output sequence, combining retrieved context with the current state of generation.
  • Transformer Decoder: Combines the encoder output (K, V) with the current output sequence (Q) to generate the next token; likewise, the RAG generation component combines the retrieved information with the input query to generate a response.

Example of a RAG-like Process Using Transformer Concepts

Retrieval (Encoder-like): Retrieve relevant information based on the input query, similar to generating contextual embeddings with self-attention.

Augmentation (Combining Encoder Output and Prompt): Combine the retrieved information with the original input to form a rich context, akin to the encoder-decoder attention mechanism.

Generation (Decoder-like): Use this augmented context to generate the final response, akin to how the Transformer decoder produces the output sequence.

*** 🎯 Refocusing to our Decoder processing step 🎯

Add & Normalize:

  • Adds the cross-attention output to the previous output and normalizes.
  • Input: [DecNorm1, DecNorm2, DecNorm3, DecNorm4]
  • Output: [CrossNorm1, CrossNorm2, CrossNorm3, CrossNorm4]
  • Mathematically: CrossNorm_j = LayerNorm(DecNorm_j + CrossAttention(DecNorm, EncOut)_j).
  • Intuition: This ensures the stability and consistency of the combined output.

Feed-Forward Neural Network:

  • Applies a feed-forward network to the normalized cross-attention output.
  • Input: [CrossNorm1, CrossNorm2, CrossNorm3, CrossNorm4]
  • Output: [DecFF1, DecFF2, DecFF3, DecFF4]
  • Mathematically: DecFF_j = max(0, CrossNorm_j · W1 + b1) · W2 + b2.
  • Intuition: This layer applies non-linear transformations to the decoder outputs, refining the generated context.

Add & Normalize:

  • Adds the feed-forward output to the normalized cross-attention output and normalizes.
  • Input: [DecFF1, DecFF2, DecFF3, DecFF4]
  • Output: [DecOut1, DecOut2, DecOut3, DecOut4]
  • Mathematically: DecOut_j = LayerNorm(CrossNorm_j + DecFF_j).
  • Intuition: This step ensures stability and consistency in the final decoder outputs.

6. Final Linear and SoftMax Layer

Linear Layer:

  • Applies a linear transformation to the decoder’s output.
  • Input: [DecOut1, DecOut2, DecOut3, DecOut4]
  • Output: [LinOut1, LinOut2, LinOut3, LinOut4]
  • Mathematically: LinOut_j = DecOut_j · W_out + b_out, where W_out maps from d_model to the vocabulary size.
  • Intuition: The linear layer maps the decoder outputs to the size of the vocabulary, preparing for the final prediction step.

Softmax Layer:

  • Converts the linear output to probabilities for each token in the vocabulary.
  • Input: [LinOut1, LinOut2, LinOut3, LinOut4]
  • Output: [Prob1, Prob2, Prob3, Prob4]
  • Mathematically: Prob_j,k = exp(LinOut_j,k) / Σ_m exp(LinOut_j,m) (a code sketch follows this list).
  • Intuition: The softmax function converts the scores to probabilities, indicating the likelihood of each token being the correct next token.
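
A minimal sketch of this output head, using a tiny illustrative vocabulary: a linear projection to vocabulary size followed by a softmax at each position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 8, 10
rng = np.random.default_rng(0)
W_out, b_out = rng.normal(size=(d_model, vocab_size)), np.zeros(vocab_size)

dec_out = rng.normal(size=(4, d_model))      # [DecOut1..DecOut4] (random stand-in)
logits = dec_out @ W_out + b_out             # [LinOut1..LinOut4]
probs = softmax(logits, axis=-1)             # [Prob1..Prob4], each row sums to 1
```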

Loss Calculation and Backpropagation

· Loss Calculation:

  • Compares the predicted probabilities with the actual target tokens.
  • Calculates the loss using a loss function like cross-entropy.
  • Mathematically: Loss = −Σ_j log Prob_j[target_j], the cross-entropy between the predicted distributions and the true target tokens (a code sketch follows this list).
  • Intuition: The loss function measures the difference between the model’s predictions and the true labels, guiding the optimization process.
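
A minimal sketch of the cross-entropy computation, using uniform placeholder predictions so the expected value is easy to verify.

```python
import numpy as np

probs = np.full((4, 10), 0.1)            # placeholder predictions: uniform over a 10-token vocabulary
target_ids = np.array([5, 6, 7, 8])      # true target token IDs [ID5..ID8]

# Negative log-probability of each correct token, averaged over the sequence.
loss = -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids]))
print(loss)                              # -log(0.1) ≈ 2.303 for the uniform case
```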

· Backpropagation:

  • Computes gradients of the loss with respect to model parameters.
  • Updates the model parameters using an optimization algorithm like Adam.
  • Intuition: Backpropagation adjusts the model parameters to minimize the loss, improving translation accuracy.

The key (K) and value (V) components of the decoder’s masked self-attention layer are used exclusively within the self-attention mechanism of the decoder to ensure that each token can attend to the preceding tokens. They do not directly contribute to the encoder-decoder cross-attention mechanism, which instead uses the K and V components from the encoder’s output to provide contextual information from the input sequence.

The original Transformer model, introduced in the “Attention Is All You Need” paper, carries several sets of learned weights from training. These include:

  • Self-attention layers: capture dependencies between tokens in the input sequence.
  • Encoder-decoder attention layers: enable the model to focus on relevant parts of the input during decoding.
  • Feed-forward layers: transform the input representations through learned non-linear mappings, enhancing the model’s capability to process and generate sequences effectively.

Inference Process of the Original Transformer Model

Created by the Author

1. Tokenization and Embedding (Same as Training)

  • Tokenization and Conversion: Convert the encoder input and initial decoder input into token IDs.
  • Embedding and Positional Encoding: Apply embedding and positional encoding to both the encoder and decoder inputs.
  • Intuition: These steps prepare the input data in a format that the model can process.

2. Encoder Processing (Same as Training)

  • Encoder Self-Attention:
  • Perform full self-attention over the input sequence.
  • Generate contextualized representations for each token.
  • Feed-Forward Neural Network: Process the contextualized representations.
  • Add & Normalize: Apply add and normalize layers.
  • Intuition: These steps generate contextualized embeddings for the input sequence.

3. Decoder Processing (Iterative Steps)

For inference, these steps are performed iteratively for each token generation:

Initial Decoder Input: “<sos>” (Start of Sentence token)

Tokenization and Embedding:

  • Tokenized: [“<sos>”]
  • Token ID: [ID_sos]
  • Embedding: [Emb_sos]

Positional Encoding:

  • Positional Encoded Embedding: [PE_sos]
  • Combined Embedding: [Emb_sos + PE_sos]

Masked Multi-Head Self-Attention:

  • For each token’s combined embedding (starting with “<sos>”):
  • Perform linear transformations to obtain Queries (Q), Keys (K), and Values (V) for each head.
  • Compute attention scores with masking to ensure no future tokens are attended to.
  • Concatenate the outputs from all heads and pass through a final linear layer.
  • This results in the self-attended embeddings for the tokens.

Add & Normalize:

  • Add the original combined embedding to the attention output and normalize.
  • Normalized Output: [DecNorm_sos]

Cross-Attention (Encoder-Decoder Attention):

  • Allows the decoder to attend to the encoder’s output:
  • Perform linear transformations to obtain Queries (Q) from the decoder’s normalized output and Keys (K) and Values (V) from the encoder’s output.
  • Compute attention scores.
  • Concatenate the outputs from all heads and pass through a final linear layer.
  • This results in the cross-attended embeddings.

Add & Normalize:

  • Add the normalized output to the cross-attention output and normalize.
  • Cross-Normalized Output: [CrossNorm_sos]

Feed-Forward Neural Network:

  • Apply a feed-forward network to the cross-normalized output.
  • Feed-Forward Output: [DecFF_sos]

Add & Normalize:

  • Add the feed-forward output to the cross-normalized output and normalize.
  • Final Decoder Output: [DecOut_sos]

4. Linear Transformation

  • Apply a linear transformation to convert the output to token probabilities.
  • Use softmax to convert logits to probabilities.

5. Token Selection:

  • Select the token with the highest probability (greedy search) or use a more complex search strategy (e.g., beam search).
  • For example, if the selected token is “Amo,” its token ID is [ID_amo].

6. Update Decoder Input:

  • Append the selected token to the decoder input sequence.
  • Token sequence now: [“<sos>”, “Amo”]
  • Repeat the decoder processing (embedding, positional encoding, attention, and output layers) with the updated token sequence until an end-of-sequence token (“<eos>”) is generated or the maximum sequence length is reached.

By iterating through these steps, the model translates “I love science.” to “Amo la ciencia.” in a step-by-step manner, generating one token at a time based on the previously generated tokens and the encoder’s output.
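
The loop below is a pseudocode-style sketch of this greedy autoregressive decoding; encode and decode_step are hypothetical stand-ins for the trained encoder and decoder rather than a real library API, and beam search could replace the argmax.

```python
def greedy_translate(src_ids, encode, decode_step, sos_id, eos_id, max_len=50):
    # encode and decode_step are hypothetical callables wrapping the trained model.
    enc_out = encode(src_ids)                    # run the encoder once over the source IDs
    generated = [sos_id]                         # start the decoder input with <sos>
    for _ in range(max_len):
        probs = decode_step(generated, enc_out)  # next-token probabilities given context
        next_id = int(probs.argmax())            # greedy choice of the most likely token
        if next_id == eos_id:                    # stop when <eos> is produced
            break
        generated.append(next_id)
    return generated[1:]                         # drop <sos>; detokenize these IDs afterwards
```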

Summary for the Original Transformer Model

  • Training: Involves masked self-attention to ensure that each token can only attend to previous tokens, preserving the autoregressive property necessary for Causal Language Modeling. The model learns to predict each next token based on the previous tokens.
  • Inference: The model generates sequences in an autoregressive manner, predicting one token at a time and using the previously generated tokens to predict the next one.

This autoregressive approach ensures that the model generates coherent and contextually relevant text, one token at a time, based on both the encoder’s context and the previously generated tokens.

Ya! So, we are done!

Key Highlights

🚀Key understanding of GAI :

  • Transformer: The first LLM (with Attention Mechanism) with both GAI (Generative) and CAI (Contextual) capabilities.
  • With models like GPT and LLaMA2 having a significantly higher number of parameters, BERT might be considered a Small Language Model (SLM) in comparison. However, as we progress, we may encounter even larger models, potentially termed Giant Language Models (GLM) or Very Large Language Models (VLLM).
  • LLMs are characterized by both their contextual and generative capabilities.
  • The Attention Algorithm is the crux of all LLMs, LVMs, LAMs, LTMs, etc.

🚀Encoder Self-Attention:

  • Description: A full attention mechanism where each token in the input sequence can attend to all other tokens in the sequence
  • Significance: Captures relationships within the input data, foundational to many modern techniques

🚀Decoder Self-Attention:

  • Description: A masked attention mechanism ensuring that each token can only attend to previous tokens (and itself) to maintain the autoregressive property
  • Example: Central to models like GPT

🚀Encoder-Decoder Attention:

  • Description: Cross-attention allowing the decoder to attend to the encoder’s output, integrating contextual information from the entire input sequence.
  • Significance: Crucial for sequence-to-sequence tasks like translation but not typically used in purely generative tasks (e.g., GPT-3).

🚀Masked Self-Attention in the Decoder:

  • Ensures that the model cannot peek at future tokens during both training and inference, maintaining the sequence generation order.

🚀Conceptual Links to RAG:

  • The self-attention mechanism in the original Transformer encoder can be linked to the prompt-retrieval process in RAG, as this context is used in the decoder generation process.

🚀Cross Encoder:

  • Conceptually derived from the encoder-decoder cross-attention calculation (key and value from the encoder’s processing of the prompt, query from the decoder’s masked self-attention output).

🚀Causal Language Modeling (CLM):

  • An approach where the model generates text in an autoregressive manner, predicting the next token based on previously generated tokens. Example: Central to models like GPT (Generative Pretrained Transformer).

🚀Prompt Tuning:

  • Fine-tuning a pre-trained model by optimizing prompts or prefix tokens to enhance its performance on specific tasks.
  • Process: Adjusts a small set of learnable prompt (prefix) parameters, typically keeping the base model’s weights frozen, to improve responses for the target task.

🚀Prompt Engineering:

  • Description: The manual process of crafting and designing prompts to elicit desired outputs from a pre-trained model without altering its underlying parameters.
  • Strategy: Relies on understanding the model’s behavior and strategically phrasing prompts to achieve optimal results.

“If you find this article helpful, a clap and following my profile will be highly appreciated.” Cheers!
