Implementation of GPT from scratch
In recent years, the landscape of artificial intelligence has been dramatically reshaped by the advent of Large Language Models (LLMs). These powerful models, exemplified by OpenAI’s GPT-3 and GPT-4, have demonstrated remarkable capabilities in understanding and generating human-like text. From composing essays to coding, and even engaging in meaningful conversations, LLMs have revolutionized the way we interact with technology.
Yet, the underlying mechanics of these language models often remain shrouded in mystery. How do they work? What makes them capable of producing such human-like text? In this article, we will demystify the inner workings of LLMs and explore the foundational principles that enable them to generate text with such proficiency. Central to their operation is a sophisticated neural network architecture known as the Transformer. This architecture, pioneered by researchers at Google, has become a pivotal innovation in the realm of natural language processing (NLP).
Join us as we unravel the complexities behind these formidable models and provide a comprehensive guide to building your very own GPT from scratch. Whether you are an AI aficionado, a developer, or simply intrigued by the potential of AI, this exploration will equip you with the insights and knowledge to embark on your own journey into the world of Large Language Models.
Disclaimer: I want to acknowledge that this article was inspired by Andrej Karpathy’s tutorial on implementing GPT from scratch.
What is a Language Model?
A language model is a statistical model of the distribution of sequences of words, or more generally, sequences of discrete symbols (such as phonemes or words), in a natural language. It is defined by a probability function p that assigns a probability p(s) to a word or a sequence of words s in a given language. With a language model, one can, for example, estimate the probability of any given sequence of words or predict the next word in a sequence. Language models fall into two main groups: statistical models and models based on deep neural networks. Statistical models, such as N-gram models, predict the probability of a word based on the previous words in the sequence and often rely on smoothing techniques to improve their estimates. Deep neural network-based models, including Recurrent Neural Networks (RNNs), Transformers, and pre-trained models like BERT and GPT, use more expressive architectures to capture long-range dependencies and complex patterns in text, offering more powerful and flexible capabilities.
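To make the count-based idea concrete, here is a toy sketch of a bigram estimate of the next-word probability. The tiny corpus and the helper function are purely illustrative and are not part of the model we build later:
from collections import Counter

# Toy corpus (hypothetical), used only to illustrate count-based probabilities
corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams and the unigrams that start them
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def next_word_prob(prev, word):
    """Estimate P(word | prev) from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(next_word_prob("the", "cat"))  # 2/3: 'the' is followed by 'cat' twice and by 'mat' once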
Fundamentals: Understanding the Basics
To build a GPT model that generates Shakespeare-esque text, you need a substantial amount of Shakespearean text. To gather this text, run the provided line of code in your Jupyter notebook.
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Now that we have our input text, we run the following code:
# Read the downloaded text into a string
with open('input.txt', 'r', encoding='utf-8') as f:
    content = f.read()

chars = sorted(list(set(content)))
vocab_size = len(chars)
The first block reads the downloaded file into the string content. The chars line then extracts the unique characters from the text and sorts them, giving a sorted list of distinct characters that represents the text's vocabulary; vocab_size is the size of this vocabulary, i.e. the total number of unique characters.
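As a quick sanity check, you can print the vocabulary and its size; for the Tiny Shakespeare file, the character-level vocabulary comes out to 65 unique characters:
print(''.join(chars))   # all unique characters in the corpus, in sorted order
print(vocab_size)       # 65 for the Tiny Shakespeare dataset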
After obtaining the unique characters, we need to map each character to an index and each index to a character. This can be achieved by running the following lines of code:
def create_mappings(unique_chars):
char_to_int = {char: idx for idx, char in enumerate(unique_chars)}
int_to_char = {idx: char for idx, char in enumerate(unique_chars)}
return char_to_int, int_to_char
char_to_int, int_to_char = create_mappings(chars)
These dictionaries map each unique character to an index (`char_to_int`) and each index back to its character (`int_to_char`). This conversion to numerical representations is crucial for neural networks, which require numeric input. The mappings transform text into numerical sequences, enabling the network to learn patterns and dependencies during training. Additionally, these numeric representations can be processed through an embedding layer in models like GPT, enhancing the model's ability to understand and generalize from the input text.
With the mappings established, we can define two functions: an encoding function and a decoding function. Here is the encoding function:
def encode(text, char_to_int):
try:
return [char_to_int[char] for char in text]
except KeyError as e:
raise ValueError(f"Character {e} not found in the character set.") from e
The `encode` function converts a string into a list of integers using the `char_to_int` dictionary. This numeric representation is crucial for neural networks in natural language processing, enabling efficient learning of text patterns and relationships. For example, with a `char_to_int` mapping of {'h': 0, 'e': 1, 'l': 2, 'o': 3}, the string "hello" becomes [0, 1, 2, 2, 3]. This conversion facilitates further processing, such as through an embedding layer, enhancing the model's understanding and generalization of the text.
Here is the decoding function:
def decode(indices, int_to_char):
try:
return ''.join([int_to_char[idx] for idx in indices])
except KeyError as e:
raise ValueError(f"Index {e} not found in the integer-to-character mapping.") from e
The `decode` function converts a list of integers back into a human-readable string using the `int_to_char` dictionary. Each integer in the list corresponds to a character in the vocabulary, effectively reconstructing text from the numerical indices a neural network works with. For example, with the input list [8, 6, 0, 4, 7, 2, 5, 6, 3, 1, 6] and an `int_to_char` mapping of {0: 'a', 1: 'e', 2: 'f', 3: 'm', 4: 'n', 5: 'o', 6: 'r', 7: 's', 8: 't'}, the function outputs the string "transformer". This function is essential for interpreting the neural network's output in natural language processing tasks, turning numeric predictions back into readable text.
What is the context?
import torch
data = torch.tensor(encode(content, char_to_int), dtype=torch.long)
This snippet converts the text into a PyTorch tensor: the text is first encoded into integers and the result is stored as a tensor of long integers. This tensor is the numerical representation of the text that we will feed to the neural network for training.
Each number in the tensor is simply the integer index of one character, so the tensor preserves the exact order of the original text while allowing mathematical operations and machine learning techniques to be applied. The next step is to split the data into a training set (90%) and a validation set (10%), as shown below.
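A minimal way to perform this split, keeping the first 90% of the tensor for training (the names train_data and val_data are the ones the get_batch function below expects):
n = int(0.9 * len(data))   # number of training tokens (90% of the data)
train_data = data[:n]      # first 90% used for training
val_data = data[n:]        # remaining 10% held out for validation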
Context refers to using a sequence of previous values to predict what comes next. For example, in the sequence [18, 47, 56, 57, 58, 1, 15, 47, 58], the context (18, 47, 56, 57) suggests that 58 should follow. In natural language processing, context is crucial for interpreting numerical representations of text. Machine learning models use context to identify patterns and make predictions based on sequences. By training on diverse sequences, models learn to understand and predict language patterns effectively.
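To make this concrete, the following small sketch (using a hypothetical block_size of 8) prints every context/target pair contained in the first chunk of our data tensor:
block_size = 8                    # illustrative context length
x = data[:block_size]             # inputs
y = data[1:block_size + 1]        # targets, shifted by one position
for t in range(block_size):
    context = x[:t + 1]           # everything up to and including position t
    target = y[t]                 # the token that should come next
    print(f"when the context is {context.tolist()} the target is {target.item()}")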
Next, we define the following function to sample random batches of such context/target pairs from the training or validation data:
def get_batch(split):
"""
Generate a batch of data with input sequences and target sequences.
Args:
split (str): Indicates which dataset to use ('train' or 'val').
Returns:
tuple: A tuple containing two tensors:
- x (inputs): Tensor of shape (batch_size, block_size) with input sequences.
- y (targets): Tensor of shape (batch_size, block_size) with target sequences.
"""
# Select dataset based on split
data = train_data if split == 'train' else val_data
# Randomly select starting indices for the sequences
ix = torch.randint(len(data) - block_size, (batch_size,))
# Create input sequences (x) and target sequences (y)
x = torch.stack([data[i:i + block_size] for i in ix])
y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
return x, y
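get_batch assumes that the hyperparameters batch_size and block_size have already been defined. Here is a quick usage sketch with illustrative values:
batch_size = 4   # number of independent sequences processed in parallel
block_size = 8   # maximum context length for the predictions

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([4, 8]) torch.Size([4, 8])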
Bigram Language Model:
A bigram language model is a foundational approach in natural language processing that models the probability of a word based only on the immediately preceding word. It relies on the Markov assumption: the probability of a word depends only on a limited history of preceding words, just one word in the case of bigrams. This assumption simplifies the model by focusing on the immediately preceding word rather than the entire sequence, which captures some local structure in language but cannot handle longer-range dependencies.
Let's implement this model:
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
def __init__(self, vocab_size):
"""
Initialize the Bigram Language Model.
Args:
vocab_size (int): The size of the vocabulary (number of unique tokens).
"""
super().__init__()
# Initialize an embedding table where each token directly maps to its logits
self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
def forward(self, idx, targets=None):
"""
Perform a forward pass of the model.
Args:
idx (torch.Tensor): Tensor of shape (B, T) containing token indices.
targets (torch.Tensor, optional): Tensor of shape (B, T) containing target indices.
Returns:
logits (torch.Tensor): Logits for the next token of shape (B, T, C).
loss (torch.Tensor, optional): Cross-entropy loss, if targets are provided.
"""
# Compute logits for the next token
logits = self.token_embedding_table(idx) # Shape: (B, T, C)
if targets is not None:
# Reshape logits and targets for loss calculation
B, T, C = logits.shape
logits = logits.view(B * T, C)
targets = targets.view(B * T)
# Compute cross-entropy loss
loss = F.cross_entropy(logits, targets)
else:
loss = None
return logits, loss
def generate(self, idx, max_new_tokens):
"""
Generate a sequence of tokens given an initial context.
Args:
idx (torch.Tensor): Tensor of shape (B, T) containing initial token indices.
max_new_tokens (int): The number of new tokens to generate.
Returns:
idx (torch.Tensor): Tensor of shape (B, T + max_new_tokens) with the generated sequence.
"""
for _ in range(max_new_tokens):
# Get logits and loss (loss is ignored here)
logits, _ = self(idx)
# Focus on the logits of the last time step
logits = logits[:, -1, :] # Shape: (B, C)
# Compute probabilities using softmax
probs = F.softmax(logits, dim=-1) # Shape: (B, C)
# Sample the next token indices
idx_next = torch.multinomial(probs, num_samples=1) # Shape: (B, 1)
# Append the sampled index to the sequence
idx = torch.cat((idx, idx_next), dim=1) # Shape: (B, T + 1)
return idx
Overview
This class defines a Bigram Language Model using PyTorch’s neural network module (`nn.Module`). The model predicts the next token in a sequence based on the current token, utilizing a bigram approach.
Components
1. Initialization (`__init__`):
— vocab_size: The size of the vocabulary (number of unique tokens).
— token_embedding_table: An embedding table where each token maps directly to its logits. It is initialized to have dimensions `(vocab_size, vocab_size)`, meaning each token has a vector of size `vocab_size` representing logits for the next token.
2. Forward Pass (`forward`):
— Inputs:
— idx: A tensor of shape `(B, T)` containing token indices, where `B` is the batch size and `T` is the sequence length.
— targets: (Optional) A tensor of shape `(B, T)` containing the target token indices.
— Outputs:
— logits: A tensor of shape `(B, T, C)` representing the logits for the next token, where `C` is the vocabulary size.
— loss: (Optional) Cross-entropy loss if targets are provided.
— Process:
— The `idx` tensor is passed through the embedding table to get logits of shape `(B, T, C)`.
— If `targets` are provided, logits and targets are reshaped to `(B * T, C)` and `(B * T)`, respectively, for calculating cross-entropy loss.
3. Generate (`generate`):
— Inputs:
— idx: A tensor of shape `(B, T)` containing initial token indices.
— max_new_tokens: The number of new tokens to generate.
— Output:
— idx: A tensor of shape `(B, T + max_new_tokens)` containing the generated sequence.
— Process:
— Iteratively generate new tokens by:
— Getting the logits for the last time step.
— Applying softmax to get probabilities.
— Sampling the next token index.
— Appending the sampled index to the sequence.
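Before moving on to the Transformer, here is a minimal sketch of how this bigram model could be trained and sampled, using the get_batch function defined earlier. The hyperparameters are illustrative, and the complete, tested code is in the accompanying notebook:
model = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(10000):
    xb, yb = get_batch('train')              # sample a batch of contexts and targets
    logits, loss = model(xb, yb)             # forward pass returns the cross-entropy loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                          # backpropagate
    optimizer.step()                         # update the embedding table

# Generate 300 new tokens starting from a single token with index 0
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(model.generate(context, max_new_tokens=300)[0].tolist(), int_to_char))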
Transformer architecture:
The Transformer architecture, introduced by researchers at Google Brain in the 2017 paper “Attention Is All You Need,” revolutionized NLP by surpassing the capabilities of earlier RNNs. Transformers have set new benchmarks in tasks such as machine translation, sentiment analysis, question answering, and text summarization, ushering in the era of large language models (LLMs).
To understand how transformers work, imagine trying to comprehend a lengthy story. Instead of reading sequentially and forgetting earlier parts, you focus on important sections and ignore less significant ones. Similarly, transformers:
1. Pay Attention: Focus on important words in a sentence.
2. Understand Context: Consider all words together, not sequentially, to understand dependencies.
3. Weigh Relationships: Determine how words are related within the context.
4. Combine Insights: Synthesize this information to grasp the overall meaning.
5. Predict Next Steps: Use this understanding to predict subsequent words.
The core innovation of transformers is the self-attention mechanism, allowing each word to consider the entire sentence’s context. This is like a person varying their attention to different parts of a conversation, enabling a deeper understanding of the sequence.
The transformer architecture comprises two main components: the encoder and the decoder. The encoder processes the input sequence into meaningful representations, while the decoder generates the output sequence from these representations, such as a translation or text continuation.
Before processing text, it must be converted into numerical tokens through embedding. Embedding involves:
1. Word to Vector Conversion: Assigning each word a unique numerical vector representing its meaning and context. These vectors are often pre-trained on large text corpora to capture semantic relationships.
2. Embedding Layer: This layer acts as a lookup table, mapping each word to its corresponding vector.
Positional encoding provides information about the position or order of words in a sequence, addressing the transformer’s lack of inherent word order understanding. Positional encoding works by:
1. Assigning Positions and Preserving Order Information: Each word in the input sequence gets a unique positional encoding vector representing its position.
2. Incorporating into Word Embeddings: These vectors are added to the word embeddings to include positional information, allowing the model to distinguish between identical words in different positions.
The combined token embeddings and positional encodings are then fed to the self-attention layer in the transformer model.
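As a minimal sketch of this step (with illustrative sizes, and using learned positional embeddings as GPT-style models do, rather than the sinusoidal encoding described in the original paper):
import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 8, 32                     # illustrative sizes

token_embedding_table = nn.Embedding(vocab_size, n_embd)       # token identity -> vector
position_embedding_table = nn.Embedding(block_size, n_embd)    # position index -> vector

idx = torch.randint(0, vocab_size, (4, block_size))            # dummy batch of token indices, shape (B, T)
tok_emb = token_embedding_table(idx)                           # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, n_embd)
x = tok_emb + pos_emb                                          # broadcast add -> (B, T, n_embd)
print(x.shape)                                                 # torch.Size([4, 8, 32])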
Let's discuss what really happens in the self-attention layer:
The self-attention layer is a pivotal element of the transformer architecture, enabling it to capture relationships and dependencies between words in a sequence effectively.
# Define dimensions
B, T, C = 4, 8, 32 # Batch size, time steps, channels
# Create random input tensor
x = torch.randn(B, T, C)
# Define the head size for attention
head_size = 16
# Define linear transformations for key, query, and value
key_linear = nn.Linear(C, head_size, bias=False)
query_linear = nn.Linear(C, head_size, bias=False)
value_linear = nn.Linear(C, head_size, bias=False)
# Compute key, query, and value matrices
k = key_linear(x) # (B, T, head_size)
q = query_linear(x) # (B, T, head_size)
v = value_linear(x) # (B, T, head_size)
# Compute dot-product attention scores
attention_scores = torch.bmm(q, k.transpose(1, 2)) * head_size ** -0.5  # (B, T, hs) x (B, hs, T) => (B, T, T), scaled by 1/sqrt(head_size) to keep the softmax well-behaved
# Create a lower triangular mask to enforce causal (autoregressive) property
mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0) # (1, T, T)
# Apply mask to attention scores
attention_scores = attention_scores.masked_fill(mask == 0, float('-inf'))
# Normalize attention scores with softmax
attention_weights = F.softmax(attention_scores, dim=-1) # (B, T, T)
# Compute the weighted sum of values
output = torch.bmm(attention_weights, v) # (B, T, T) x (B, T, head_size) => (B, T, head_size)
# Output has the shape (B, T, head_size)
print(output.shape)
1. Create Queries, Keys, and Values: For each word in the input sequence, the self-attention layer generates three types of vectors:
— Query (Q): Focuses on the current word.
— Key (K): Represents other words to evaluate their relevance to the Query.
— Value (V): Contains the information from words that match the Query.
2. Derive Attention Scores: Scores are computed between Query (Q) and Key (K) vectors to assess how much attention each word should receive. Higher scores indicate greater relevance.
3. Calculate Weighted Sum: The attention scores are used to compute a weighted sum of Value (V) vectors, with higher-scoring words contributing more to the final representation.
4. Apply Multiple Attention Heads: Multiple sets of Queries, Keys, and Values (attention heads) allow the model to simultaneously focus on different aspects of the input sequence.
5. Generate Output: The output consists of contextualized word representations, which reflect their relationships with other words and are used as input for the Feed-Forward layer.
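The snippet above implements a single attention head. Here is a hedged sketch of how several such heads could be wrapped into modules and run in parallel, in the spirit of Karpathy's tutorial; the names Head and MultiHeadAttention are illustrative, not a fixed API:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (causal) self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)           # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                                # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated and projected."""
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        head_size = n_embd // num_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        return self.proj(out)
With n_embd = 32 and num_heads = 4, for instance, each head attends within an 8-dimensional subspace, and concatenating the heads recovers the original width before the output projection.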
Now comes the easier part, the feed-forward layer:
1. Linear Transformation: Applies a linear transformation to the input, projecting it into a different (often higher-dimensional) space using learnable weights and bias.
2. Activation Function: A non-linear activation function (e.g., ReLU) is applied element-wise to introduce non-linearity, allowing the model to capture complex patterns.
3. Second Linear Transformation: A subsequent linear transformation further adapts the data using different weights and bias, refining the representation for further processing.
4. Residual Connection and Layer Normalization:
— Residual Connection: Adds the original input to the transformed output of the layer, helping the network blend initial and refined information and mitigating challenges like vanishing gradients.
— Layer Normalization: Normalizes the values flowing through the layer so that each token's features have zero mean and unit variance (followed by a learnable scale and shift), keeping activations consistent and stabilizing training.
These components enhance the model’s ability to process and extract meaningful information from sequences.
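The description above follows the original paper's post-norm ordering (residual addition first, then LayerNorm); GPT-style implementations, including Karpathy's, usually apply LayerNorm before each sub-layer instead. Here is a hedged sketch of a Transformer block in that pre-norm style, reusing the MultiHeadAttention module sketched above (module names are illustrative):
class FeedForward(nn.Module):
    """Linear -> ReLU -> Linear, applied independently at every position."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand to a higher-dimensional space
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back down
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """A Transformer block: self-attention followed by a feed-forward network,
    each wrapped in a residual connection with layer normalization."""
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, num_heads, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual connection around attention
        x = x + self.ffwd(self.ln2(x))  # residual connection around feed-forward
        return x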
Now, let's return to the Encoder process:
If there are more Encoder blocks, the output of the current Encoder block (after layer normalization) is passed to the next Encoder block. If it’s the final Encoder block, its output is sent to the Decoder, where it serves as the key and value for the Decoder’s Multi-Head Attention mechanism. The following steps describe how the Decoder processes this input.
Next, let's break down the decoder process in the Transformer architecture:
1. Input Embedding: The target sequence (e.g., partial translation) is embedded into continuous vectors and combined with positional encodings to provide order information.
2. Masked Multi-Head Self-Attention: The decoder uses masked self-attention, which ensures that each position can only attend to earlier positions (or itself), preserving the autoregressive property. During training this mask prevents the model from peeking at future target tokens; at inference time the target is generated one token at a time, so future tokens simply do not exist yet.
3. Residual Connection & Layer Normalization (post Self-Attention): The output of the masked self-attention layer is added to its input via a residual connection, followed by layer normalization to stabilize and scale activations.
4. Multi-Head Attention over Encoder’s Output: This attention mechanism uses the encoder’s output as keys and values, and the self-attention output from the decoder as queries. It helps the decoder focus on relevant parts of the source sequence while generating the target sequence.
5. Residual Connection & Layer Normalization: The output from the encoder-based multi-head attention is added to its input through a residual connection, followed by another layer normalization.
6. Position-wise Feed-Forward Network: Each position’s representation is processed through a position-wise feed-forward network, consisting of two linear layers with a ReLU activation in between. This adds expressive power to the decoder.
7. Residual Connection & Layer Normalization: The output of the feed-forward network is combined with its input via a residual connection and normalized.
8. Stacking of Decoder Blocks: Multiple decoder blocks are stacked, with the output from one block serving as the input to the next, allowing for deeper processing and refinement.
9. Output Linear Layer & Softmax: After passing through all decoder blocks, the final linear layer maps the output to the target vocabulary size, and a softmax function generates the probability distribution over possible output tokens, producing the final sequence.
In summary, the decoder processes the target sequence through a series of self-attention and cross-attention mechanisms, along with feed-forward networks, to generate a coherent and contextually accurate output sequence.
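GPT itself is a decoder-only Transformer: it keeps the token and positional embeddings, the masked multi-head self-attention, and the feed-forward blocks, but drops the encoder and the cross-attention step. As a hedged sketch, here is how the modules from the previous sections (in particular the Block sketched above) could be assembled into such a model; the hyperparameters are illustrative, and the complete implementation is provided in the accompanying notebook:
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd=64, num_heads=4, n_layer=4, block_size=32):
        super().__init__()
        self.block_size = block_size
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            *[Block(n_embd, num_heads, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)              # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # project back to vocabulary logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                                     # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))   # (T, n_embd)
        x = self.blocks(tok_emb + pos_emb)                                            # (B, T, n_embd)
        logits = self.lm_head(self.ln_f(x))                                           # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]         # crop the context to block_size
            logits, _ = self(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
            idx = torch.cat((idx, torch.multinomial(probs, num_samples=1)), dim=1)
        return idx
Training such a model follows the same pattern as the bigram model: sample batches with get_batch, compute the loss from the forward pass, and update the parameters with AdamW, only with a larger block_size.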
Conclusion:
In this article, we built our own GPT model from scratch, exploring the intricacies of transformer architecture and implementing self-attention mechanisms, multi-head attention, and feed-forward layers. We walked through the process of creating a neural network capable of understanding and generating human-like text, achieving a fundamental understanding of the key components that make up a GPT model.
To further support your learning and experimentation, I have provided a Jupyter Notebook that includes the complete implementation of our GPT model along with details of the development process. Feel free to explore the code, run experiments, and modify the model to fit your needs. I hope this hands-on approach enhances your understanding and inspires you to dive deeper into the fascinating realm of natural language processing.
Happy coding!