The Evolution of NLP: From Embeddings to Transformer-Based Models

A Deep Dive into the Transformer Architecture, Attention Mechanisms, and the Pre-Training to Fine-Tuning Workflow

Dina Bavli
8 min read · Jul 8, 2024

The transformer architecture, introduced in 2017, has revolutionized NLP by leveraging attention mechanisms and advanced embedding techniques. This blog post explores the evolution of embeddings and provides a detailed understanding of transformer-based models. These concepts are fundamental for advanced NLP systems like RAG. For an introductory overview, refer to “Intuitive Insights into Data Science, NLP, and Large Language Models,” and for practical RAG implementation, see “RAG Basics: Basic Implementation of Retrieval-Augmented Generation (RAG).”

The Evolution of Embeddings

For computers to understand natural language, text must first be converted into numerical representations, known as embeddings, that machines can process. The way NLP represents text numerically has evolved significantly over time.

Numerical Representation by Individual Words

Initially, techniques like Bag-of-Words and One-Hot Encoding were used. These methods, although simple, had limitations as they did not capture the context of words within sentences.
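To make the limitation concrete, here is a minimal sketch using scikit-learn's CountVectorizer together with a hand-rolled one-hot mapping; the two example sentences are invented for illustration. Notice that the word "apple" receives exactly the same vector whether it refers to the company or the fruit.

```python
# Bag-of-Words and One-Hot Encoding on two toy sentences (illustration only).
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Apple released a new iPhone",
             "a refreshing snack of apple slices and grapes"]

# Bag-of-Words: each sentence becomes a vector of raw word counts.
bow = CountVectorizer()
counts = bow.fit_transform(sentences)
print(bow.get_feature_names_out())   # the learned vocabulary
print(counts.toarray())              # one count vector per sentence

# One-Hot Encoding: each word becomes a vector with a single 1.
vocab = sorted(bow.vocabulary_)
one_hot = {word: [1 if i == j else 0 for j in range(len(vocab))]
           for i, word in enumerate(vocab)}
print(one_hot["apple"])  # the same vector in both sentences, regardless of context
```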

Numerical Representation by Frequency

To address this, TF-IDF (Term Frequency-Inverse Document Frequency) was introduced. It weighed the importance of words based on their frequency across documents, offering a more nuanced understanding compared to simple counts. However, TF-IDF still treated words independently, without considering their context.
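As a quick, hedged illustration with scikit-learn's TfidfVectorizer (the three-document corpus below is made up), words spread across many documents receive a lower inverse document frequency than rare, distinctive ones:

```python
# TF-IDF on a toy corpus: words that appear in many documents are down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat drank the milk",
    "the cat sat on the mat",
    "a rocket launched into orbit",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)  # one weighted vector per document

# "the" and "cat" occur in two of the three documents, so their IDF is lower
# than that of words like "rocket" or "milk" that occur in only one.
print(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_.round(2))))
print(weights.shape)                   # (3 documents, vocabulary size)
```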

Numerical Representation by Context

The advent of Word2Vec marked a significant leap. This technique generated dense vector representations for words, capturing semantic relationships by training on large corpora of text. Words used in similar contexts had similar vectors, offering a more meaningful representation. Subsequent advancements led to ELMo (Embeddings from Language Models), which generated word embeddings that considered the entire sentence, thus capturing context more effectively.
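The gensim library exposes Word2Vec behind a small API; the sketch below trains on a tiny invented corpus purely to show the interface, so the resulting vectors and similarities are noisy rather than meaningful (real models are trained on millions of sentences).

```python
# Word2Vec with gensim on a toy corpus (API illustration only).
from gensim.models import Word2Vec

sentences = [
    ["apple", "released", "a", "new", "iphone"],
    ["samsung", "released", "a", "new", "phone"],
    ["a", "snack", "of", "apple", "slices", "and", "grapes"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200, seed=1)

print(model.wv["apple"][:5])                   # a dense vector for "apple"
print(model.wv.similarity("apple", "iphone"))  # similarity learned from co-occurrence
```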

Numerical Representation by Order

Word order was addressed by models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), which processed sequences of words while carrying contextual information forward in a hidden state. These models recognized that the sequence in which words appear carries significant meaning, enhancing the understanding of context.
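A minimal PyTorch sketch (untrained weights, made-up token ids) shows what this sequential processing looks like: the LSTM reads one position at a time and carries a hidden state forward, so each output depends on everything that came before it.

```python
# An LSTM processes a sentence in order, carrying context in its hidden state.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

token_ids = torch.tensor([[12, 45, 7, 300, 9]])    # one toy sentence of 5 token ids
outputs, (h_n, c_n) = lstm(embedding(token_ids))

print(outputs.shape)  # (1, 5, 64): one context-aware state per position, in order
print(h_n.shape)      # (1, 1, 64): final hidden state summarizing the whole sequence
```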

Numerical Representation by Attention

The latest step in this evolutionary chain is the transformer, which introduced the groundbreaking concept of attention mechanisms, allowing the model to weigh the importance of different words in a sentence dynamically. This is further enhanced by positional encoding, which explicitly adds information about the order of words, making the model even more adept at understanding context. Building on this architecture, Sentence Transformers create embeddings for entire sentences, maintaining context and semantic meaning comprehensively.
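In practice, sentence-level embeddings are easy to try with the sentence-transformers library; the sketch below assumes the public all-MiniLM-L6-v2 checkpoint, which is downloaded from the Hugging Face Hub on first use.

```python
# Sentence embeddings: whole sentences become vectors that can be compared directly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small public checkpoint

sentences = [
    "Apple released a new iPhone",
    "A refreshing snack of apple slices and grapes",
    "The tech company announced a new smartphone",
]
embeddings = model.encode(sentences)

# The two phone-related sentences should be closer to each other than to the snack.
print(util.cos_sim(embeddings[0], embeddings[2]))
print(util.cos_sim(embeddings[0], embeddings[1]))
```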

Understanding Attention Mechanism: The Gravitational Pull Analogy

The attention mechanism is a transformative concept in natural language processing, enabling models to dynamically focus on different parts of the input sentence when generating an output. This process can be visualized with the following analogy and examples:

Image by the author: a four-panel illustration of the gravitational-pull analogy (Top-Left, Top-Right, Bottom-Left, and Bottom-Right), described below.

Initial State (Top-Left)

  • Imagine we have two sentences: “Apple released a new iPhone” 📱 and “A refreshing snack of Apple slices and grapes.” 🍏🍇
  • Each word is converted into a numerical vector in a high-dimensional space.
  • The position of each word in this space is represented by coordinates, e.g., “Apple” might be [2,4] when it means the company, and [4,8] when it refers to the fruit.

Contextual Representation with Gravitational Pull (Top-Right)

  • When processing the sentence, the model must decide which meaning of “Apple” to use based on the context.
  • Attention functions like gravitational pull, where important words exert a stronger gravitational force, drawing the relevant meanings of “Apple” closer.
  • For instance, in “Apple released a new iPhone,” the word “iPhone”📱 exerts a strong pull on “Apple,” ensuring the model understands it as the company ([2,4]).
  • Conversely, in “A refreshing snack of Apple slices and grapes,” the words “slices” and “grapes” 🍇 exert a strong pull, ensuring “Apple” is understood as the fruit 🍏([4,8]).

Dynamic Adjustment with Gravitational Forces (Bottom-Left)

  • The attention mechanism dynamically adjusts the pull (or weight) of each word based on context.
  • For example, in the sentence “Apple released a new iPhone,” the gravitational pull from “released” and “iPhone” adjusts “Apple” to the appropriate vector ([2,4]).
  • This is similar to how gravitational forces in space pull objects closer based on their mass and distance.

Positional Encoding and Enhanced Understanding (Bottom-Right)

  • Positional encoding, combined with attention, allows the model to distinguish between meanings effectively.
  • In “A refreshing snack of Apple slices and grapes,” the context provided by “snack” and “grapes”🍇 pulls “Apple” 🍏to the fruit meaning ([4,8]), much like how planets exert gravitational forces that define their positions in space.

Additional Example:

To further illustrate the attention mechanism, consider the sentence: “The black cat drank white milk.” In this case, attention helps establish connections between related words:

  • “Cat”🐱 and “black”🖤: There is a strong connection here because the adjective “black” describes the “cat.”
  • “Milk”🥛 and “white”🤍: Similarly, “white” describes “milk,” establishing a strong connection.
  • “Cat”🐱 and “milk”🥛: There is a meaningful connection here as well, as the cat is performing the action of drinking the milk.

However, the connection between “black”🖤 and “milk”🥛 or “cat”🐱 and “white” 🤍 is weaker because they are less contextually related in this specific sentence.

The attention mechanism, akin to gravitational forces, revolutionizes NLP by allowing models to dynamically consider the relevance of each word in context. This leads to more accurate and contextually appropriate representations and responses, significantly improving the model’s understanding and generation capabilities.
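The mechanism behind the analogy is scaled dot-product attention: softmax(QKᵀ / √d) V. The sketch below uses tiny made-up 2-D vectors (echoing the [2,4] and [4,8] coordinates above) and sets Q = K = V for simplicity; a real transformer learns separate query, key, and value projections.

```python
# Scaled dot-product attention on toy vectors: each word "pulls" on the others
# in proportion to the softmax of its scaled dot product with them.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # similarity of every query to every key
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V, weights            # each output is a weighted mix of values

# Made-up 2-D embeddings for "Apple released a new iPhone" (illustration only).
words = ["Apple", "released", "a", "new", "iPhone"]
X = np.array([[2.0, 4.0], [1.0, 1.0], [0.1, 0.1], [0.5, 0.2], [2.2, 4.1]])

out, weights = attention(X, X, X)
# "Apple" attends most strongly to "iPhone" (and to itself) because their
# vectors are the most similar: the "gravitational pull" from the analogy.
print(dict(zip(words, weights[0].round(2))))  # attention paid by "Apple" to each word
print(out[0])                                 # "Apple" nudged toward its context
```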

Transformer Architecture and Transformer-Based Models

https://factored.ai/transformer-based-language-models/

The Transformer Architecture

At the heart of the transformer model are two main components: the encoder and the decoder. Each plays a crucial role in processing and generating text.

Encoder:

  • Multi-Head Attention: Allows the model to focus on different parts of the input sequence simultaneously, capturing dependencies and contextual relationships.
  • Feed Forward Neural Network: Processes the output of the attention mechanism, adding non-linearity.
  • Add & Normalize: Applies residual connections and layer normalization to stabilize training.
  • Positional Encoding: Adds information about the position of words, enabling the model to understand the order of the sequence.
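The original transformer uses fixed sinusoidal functions for positional encoding; a minimal sketch of that scheme (with a toy sequence length and model dimension) is shown below.

```python
# Sinusoidal positional encoding: every position gets a distinct pattern of
# sines and cosines, which is added to the token embeddings to convey word order.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]             # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even embedding dimensions
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # sine on even indices
    pe[:, 1::2] = np.cos(angles)                         # cosine on odd indices
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)       # (6, 8): one encoding vector per position
print(pe[0], pe[1])   # nearby positions get similar but distinguishable patterns
```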

Decoder:

  • Masked Multi-Head Attention: Ensures the model attends only to previous positions in the sequence, maintaining the autoregressive nature of generation.
  • Multi-Head Attention (Cross-Attention): Attends over the encoder’s output, aligning it with the ongoing generation process.
  • Feed Forward Neural Network: Similar to the encoder, processes combined information from attention layers.
  • Add & Normalize: Applies residual connections and layer normalization.
  • Positional Encoding: Similar to the encoder, adds positional information to the input embeddings.
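To see how these pieces fit together, here is a rough sketch of a single encoder layer built from PyTorch building blocks (using nn.MultiheadAttention rather than a from-scratch implementation), followed by the causal mask that gives the decoder its masked attention. Dimensions and inputs are toy values.

```python
# One encoder layer: Multi-Head Attention -> Add & Normalize -> Feed Forward
# -> Add & Normalize, plus the causal mask used by the decoder's masked attention.
import torch
import torch.nn as nn

d_model, n_heads, d_ff, seq_len = 64, 4, 256, 5
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

x = torch.randn(1, seq_len, d_model)          # toy embeddings (plus positional encodings)

attn_out, _ = self_attn(x, x, x)              # queries, keys, and values all come from x
x = norm1(x + attn_out)                       # residual connection + layer normalization
x = norm2(x + feed_forward(x))                # residual connection + layer normalization
print(x.shape)                                # (1, 5, 64): one contextual vector per token

# Decoder-style masked attention: a causal mask hides future positions,
# so position i may only attend to positions 0..i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
_, masked_weights = self_attn(x, x, x, attn_mask=causal_mask)
print(masked_weights[0])                      # the upper triangle is zero
```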

Types of Transformer-Based Models

Encoder-Only Models

Encoder-only models focus on understanding and representing text, making them suitable for tasks like text classification and question answering; a short example follows the task list below.

https://medium.com/bright-ml/nlp-deep-learning-models-difference-between-bert-gpt-3-f273e67597d7

Examples:

  • BERT (Bidirectional Encoder Representations from Transformers): Uses bidirectional context to understand the text.
  • RoBERTa, ALBERT, DistilBERT: Variants of BERT optimized for different trade-offs in performance and efficiency.

Tasks:

  • Text classification
  • Named entity recognition
  • Sentence similarity
  • Question answering (QA)
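As a hedged illustration, the Hugging Face transformers pipeline exposes the masked-token objective BERT is pre-trained on; this assumes the library is installed and downloads the public bert-base-uncased checkpoint on first run.

```python
# Fill-in-the-blank with BERT: the model predicts the masked token from both sides.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The black cat drank white [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```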

Decoder-Only Models

Decoder-only models excel at generating text based on a given context (see the example after the task list).

Examples:

  • GPT (Generative Pre-trained Transformer): Generates coherent and contextually relevant text.

Tasks:

  • Text generation
  • Summarization
  • Translation
  • Creative writing
  • Question answering (QA)
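A hedged example of open-ended generation, assuming the transformers library and the public gpt2 checkpoint (a small decoder-only model downloaded on first run); sampled continuations will vary from run to run.

```python
# Decoder-only generation: GPT-2 continues a prompt one predicted token at a time.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Apple released a new iPhone, and", max_new_tokens=30,
                   num_return_sequences=1)
print(result[0]["generated_text"])
```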

Encoder-Decoder Models

Encoder-decoder models are versatile, handling both understanding and generating text; an example of the text-to-text framing follows the task list below.

Examples:

  • T5 (Text-To-Text Transfer Transformer): Treats every NLP problem as a text-to-text problem.
  • BART (Bidirectional and Auto-Regressive Transformers): Effective for text generation and translation.
  • mBART: A multilingual version of BART for translation tasks.

Tasks:

  • Machine translation
  • Text summarization
  • Question answering (QA)
  • Text-to-text transformation
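A hedged example with the public t5-small checkpoint, showing the text-to-text framing: the task itself is written into the input string, and the model responds with output text.

```python
# Encoder-decoder, text-to-text: the prefix tells T5 which task to perform.
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The cat drank the milk.")[0]["generated_text"])
print(t5("summarize: The transformer architecture, introduced in 2017, relies on "
         "attention mechanisms instead of recurrence to model context.")[0]["generated_text"])
```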

Comparison of Models for Question Answering (QA)

QA is a task that can be performed by all types of models, but they approach it differently:

  • Encoder-Only Models (e.g., BERT): Predict the start and end of an answer span within a given passage. Suitable for extractive QA.
  • Decoder-Only Models (e.g., GPT): Generate answers by predicting the next word based on context. Suitable for open-ended and generative QA.
  • Encoder-Decoder Models (e.g., T5, BART): Combine understanding and generation, handling both extractive and abstractive QA tasks.
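A hedged sketch of that contrast, assuming the public distilbert-base-cased-distilled-squad and t5-small checkpoints; the context sentence is invented for illustration.

```python
# Extractive QA (encoder-only) vs. generative QA (encoder-decoder).
from transformers import pipeline

context = "The transformer architecture was introduced in 2017 and relies on attention."
question = "When was the transformer architecture introduced?"

# Encoder-only style: predict an answer span inside the given passage.
extractive = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(extractive(question=question, context=context)["answer"])

# Encoder-decoder style: generate the answer as new text from the same input.
generative = pipeline("text2text-generation", model="t5-small")
print(generative(f"question: {question} context: {context}")[0]["generated_text"])
```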

Pre-training and Fine-Tuning Process

Transformer models undergo a multi-stage training process to optimize their performance for specific tasks.

General Pre-training

https://x.com/TsinghuaNLP/status/1377804676943372288

  • Data: General unsupervised data (e.g., Wikipedia, web text).
  • Technique: Random masking of tokens. The model learns general language patterns by predicting the masked tokens.
  • Outcome: Broad language understanding.

Task-Guided Pre-training

  • Data: In-domain unsupervised data relevant to specific tasks.
  • Technique: Selective masking of tokens tailored to the task.
  • Outcome: Improved domain-specific understanding.

Fine-Tuning

  • Data: Downstream supervised data with labeled examples.
  • Technique: Fine-tuning on task-specific data, adjusting model parameters.
  • Outcome: High performance on specific tasks.
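A hedged sketch of what the fine-tuning stage can look like with the Hugging Face Trainer API, assuming the transformers and datasets libraries are installed and the public imdb dataset and distilbert-base-uncased checkpoint are available; the subset size and hyperparameters are illustrative, not tuned.

```python
# Fine-tuning a pre-trained encoder on labeled sentiment data (downstream task).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# Small labeled subset so the sketch runs quickly; real fine-tuning uses far more data.
train_data = (load_dataset("imdb")["train"]
              .shuffle(seed=42).select(range(2000))
              .map(tokenize, batched=True))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-sentiment",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()  # adjusts the pre-trained weights using the labeled examples
```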

Summary

This blog post traces the development of NLP from early methods like Bag-of-Words to advanced transformer models. It delves into the importance of embeddings, attention mechanisms, and the transformer architecture, providing a deep understanding of these crucial concepts. This knowledge is essential for anyone looking to understand or implement RAG systems.

Transformers, with their encoder-decoder structure and innovative attention mechanisms, have set new standards in NLP. The multi-stage training process — comprising general pre-training, task-guided pre-training, and fine-tuning — ensures that these models are both versatile and proficient in various applications. Encoder-only models like BERT excel at understanding text, decoder-only models like GPT are powerful generators, and encoder-decoder models like T5 and BART are highly adaptable. This comprehensive approach allows transformers to handle a wide range of natural language processing tasks with exceptional performance.


