This post presents a detailed architectural diagram of GPT-2 that shows how input data transforms as it flows through the model. The diagram is meant to help you trace the steps the model takes from input to output, while I’ll also give a brief explanation of how text is prepared before it’s fed into the model.
Text Preparation
Before we get to the diagram, let’s first understand how text input is prepared. The raw text has to be tokenized, which means converting words into integers that map to indices in the model’s vocabulary. GPT-2 uses a process called Byte-Pair Encoding (BPE) for tokenization, which breaks text down into subwords.
We will use tiktoken, a fast BPE tokenizer used in OpenAI’s models. Here is an example:
import tiktoken
enc = tiktoken.get_encoding("gpt2") # Load the GPT-2 tokenizer
text = """In a remote mountain village, nestled among the towering peaks and pine forests, \
a solitary storyteller sat by a crackling bonfire. Their voice rose and fell like a melodic river, \
weaving tales of ancient legends and forgotten heroes that captivated the hearts of the villagers gathered around. \
The stars above shone brightly, their twinkling adding a celestial backdrop to the mesmerizing stories, \
as the storyteller transported their audience to worlds of wonder and imagination. """
tokens = enc.encode(text) # Tokenize the text
tokens.append(enc.eot_token) # Add the end of text token
print(tokens[:10])
[818, 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262]
This turns the text into a list of integers representing subwords or words from GPT-2’s vocabulary. We also append the special <|endoftext|> token to mark the end of the text sequence.
Next, we will rearrange the tokens into a format the GPT model can process. The model expects the input data to be a batch of sequences: a tensor of shape (B, T), where:
- B is the batch size (how many sequences we process in parallel).
- T is the sequence length (the number of tokens in each sequence).
To demonstrate, we will create a batch of 5 sequences, each containing 10 tokens, using the first 50 tokens of the input:
import torch
B, T = 5, 10
data = torch.tensor(tokens[:50+1])
x = data[:-1].view(B, T) # Input tensor
y = data[1:].view(B, T) # Target tensor for next token prediction
print(x)
print(y)
tensor([[ 818, 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262],
[38879, 25740, 290, 20161, 17039, 11, 257, 25565, 1621, 660],
[ 6051, 3332, 416, 257, 8469, 1359, 5351, 6495, 13, 5334],
[ 3809, 8278, 290, 3214, 588, 257, 7758, 29512, 7850, 11],
[44889, 19490, 286, 6156, 24901, 290, 11564, 10281, 326, 3144]])
tensor([[ 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262, 38879],
[25740, 290, 20161, 17039, 11, 257, 25565, 1621, 660, 6051],
[ 3332, 416, 257, 8469, 1359, 5351, 6495, 13, 5334, 3809],
[ 8278, 290, 3214, 588, 257, 7758, 29512, 7850, 11, 44889],
[19490, 286, 6156, 24901, 290, 11564, 10281, 326, 3144, 30829]])
Here, x contains the input tokens, and y contains the target tokens, shifted by one position so that the model can learn to predict the next token in the sequence. During training, these tensors are fed into the model, and the target tensor y is used to compute the cross-entropy loss.
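To make the loss step concrete, here is a minimal sketch of how cross-entropy is computed from the model’s logits. The logits are random stand-ins for a real model’s output of shape (B, T, V); the flattening is needed because PyTorch’s cross_entropy expects (N, V) logits against (N,) targets:

```python
import torch
import torch.nn.functional as F

B, T, V = 5, 10, 50257             # batch size, sequence length, vocab size

logits = torch.randn(B, T, V)      # stand-in for the model's output
y = torch.randint(0, V, (B, T))    # stand-in for the shifted target tokens

# Flatten (B, T, V) -> (B*T, V) and (B, T) -> (B*T,) for cross_entropy
loss = F.cross_entropy(logits.view(B * T, V), y.view(B * T))
print(loss.item())
```

With random logits the loss lands near ln(V) ≈ 10.8, which is a useful sanity check at the start of training: an untrained model should be close to uniform over the vocabulary.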
GPT-2 Architectural Diagram
Finally, let’s take a look at the GPT-2 architectural diagram to better understand how the input data moves through the model. Below are some key configuration parameters for GPT-2:
- Vocabulary Size (V): 50,257
- Maximum Sequence Length (T): 1024
- Embedding Dimensionality (C): 768
- Number of Heads (h): 12
- Number of Layers (N): 12
- Batch Size (B): 512
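These parameters are often bundled into a small config object. A sketch below uses field names following a common convention (e.g. Karpathy’s nanoGPT), not an official API:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257   # V
    block_size: int = 1024    # T, maximum sequence length
    n_embd: int = 768         # C, embedding dimensionality
    n_head: int = 12          # h, number of attention heads
    n_layer: int = 12         # N, number of transformer blocks

config = GPTConfig()
print(config)
```

Note that n_embd must be divisible by n_head, since each head works on a slice of the embedding: here 768 / 12 = 64 dimensions per head.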
The diagram and configuration will help you trace how the data transforms at each stage, from the input to the output.
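As a rough companion to the diagram, the shape transformations can be traced with a toy forward pass. This is a simplified sketch with untrained weights, not GPT-2 itself: it shows only the embedding lookup, a single causal attention layer (via PyTorch’s MultiheadAttention, which differs internally from GPT-2’s implementation), and the language-model head, omitting the MLPs, layer norms, and residual connections:

```python
import torch
import torch.nn as nn

B, T, C, V, h = 5, 10, 768, 50257, 12

tokens = torch.randint(0, V, (B, T))       # (B, T) token ids

wte = nn.Embedding(V, C)                   # token embedding table
wpe = nn.Embedding(1024, C)                # positional embedding table
x = wte(tokens) + wpe(torch.arange(T))     # (B, T, C)

# Causal mask: position t may only attend to positions <= t
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
attn = nn.MultiheadAttention(C, h, batch_first=True)
x, _ = attn(x, x, x, attn_mask=mask)       # (B, T, C)

lm_head = nn.Linear(C, V, bias=False)
logits = lm_head(x)                        # (B, T, V)

print(tokens.shape, x.shape, logits.shape)
```

The key pattern to carry into the diagram: token ids (B, T) become embeddings (B, T, C), every transformer block maps (B, T, C) to (B, T, C), and the final head projects to (B, T, V) logits over the vocabulary.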
References
- Vaswani, Ashish, et al. “Attention Is All You Need.” ArXiv, 2017.
- Brown, Tom B., et al. “Language Models Are Few-Shot Learners (GPT-3).” ArXiv, 2020.
- Better Language Models and Their Implications, OpenAI Blog, 2019.
- Karpathy, Andrej. “Let’s Reproduce GPT-2 (124M).” YouTube.
- Karpathy, Andrej. “nanoGPT” repo.