This post presents a detailed architectural diagram of GPT-2 that shows how input data transforms as it flows through the model. The diagram is meant to help you trace the steps the model takes from input to output, while I’ll also give a brief explanation of how text is prepared before it’s fed into the model.
Text Preparation
Before we get to the diagram, let’s first understand how text input is prepared. The raw text has to be tokenized, which means converting words into integers that map to indices in the model’s vocabulary. GPT-2 uses a process called Byte-Pair Encoding (BPE) for tokenization, which breaks text down into subwords.
We will use tiktoken, a fast BPE tokenizer used in OpenAI’s models. Here is an example:
import tiktoken
enc = tiktoken.get_encoding("gpt2") # Load the GPT-2 tokenizer
text = """In a remote mountain village, nestled among the towering peaks and pine forests, \
a solitary storyteller sat by a crackling bonfire. Their voice rose and fell like a melodic river, \
weaving tales of ancient legends and forgotten heroes that captivated the hearts of the villagers gathered around. \
The stars above shone brightly, their twinkling adding a celestial backdrop to the mesmerizing stories, \
as the storyteller transported their audience to worlds of wonder and imagination. """
tokens = enc.encode(text) # Tokenize the text
tokens.append(enc.eot_token) # Add the end of text token
print(tokens[:10])
[818, 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262]
This turns the text into a list of integers representing subwords or words from GPT-2’s vocabulary. We also append the special <|endoftext|> token to mark the end of the text sequence.
Next, we will rearrange the tokens into a format the GPT model can process. The model expects the input data to be a batch of sequences: a tensor of shape (B, T), where:
- B is the batch size (how many sequences we process in parallel).
- T is the sequence length (the number of tokens in each sequence).
To demonstrate, we will create a batch of 5 sequences, each containing 10 tokens, using the first 50 tokens of the input:
import torch
B, T = 5, 10
data = torch.tensor(tokens[:50+1])
x = data[:-1].view(B, T) # Input tensor
y = data[1:].view(B, T) # Target tensor for next token prediction
print(x)
print(y)
tensor([[ 818, 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262],
[38879, 25740, 290, 20161, 17039, 11, 257, 25565, 1621, 660],
[ 6051, 3332, 416, 257, 8469, 1359, 5351, 6495, 13, 5334],
[ 3809, 8278, 290, 3214, 588, 257, 7758, 29512, 7850, 11],
[44889, 19490, 286, 6156, 24901, 290, 11564, 10281, 326, 3144]])
tensor([[ 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262, 38879],
[25740, 290, 20161, 17039, 11, 257, 25565, 1621, 660, 6051],
[ 3332, 416, 257, 8469, 1359, 5351, 6495, 13, 5334, 3809],
[ 8278, 290, 3214, 588, 257, 7758, 29512, 7850, 11, 44889],
[19490, 286, 6156, 24901, 290, 11564, 10281, 326, 3144, 30829]])
Here, x contains the input tokens, and y contains the target tokens, shifted by one position so that the model can learn to predict the next token in the sequence. During training, these tensors are fed into the model, and the target tensor y is used to compute the cross-entropy loss.
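To make the loss step concrete, here is a minimal sketch of how cross-entropy is computed from the model’s logits. The logits are random stand-ins for a real model’s output of shape (B, T, V); the flattening is needed because PyTorch’s cross_entropy expects (N, V) logits against (N,) targets:

```python
import torch
import torch.nn.functional as F

B, T, V = 5, 10, 50257             # batch size, sequence length, vocab size

logits = torch.randn(B, T, V)      # stand-in for the model's output
y = torch.randint(0, V, (B, T))    # stand-in for the shifted target tokens

# Flatten (B, T, V) -> (B*T, V) and (B, T) -> (B*T,) for cross_entropy
loss = F.cross_entropy(logits.view(B * T, V), y.view(B * T))
print(loss.item())
```

With random logits the loss lands near ln(V) ≈ 10.8, which is a useful sanity check at the start of training: an untrained model should be close to uniform over the vocabulary.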
GPT-2 Architectural Diagram
Finally, let’s take a look at the GPT-2 architectural diagram to better understand how the input data moves through the model. Below are some key configuration parameters for GPT-2:
- Vocabulary Size (V): 50,257
- Maximum Sequence Length (T): 1024
- Embedding Dimensionality (C): 768
- Number of Heads (h): 12
- Number of Layers (N): 12
- Batch Size (B): 512
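These parameters are often bundled into a small config object. A sketch below uses field names following a common convention (e.g. Karpathy’s nanoGPT), not an official API:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257   # V
    block_size: int = 1024    # T, maximum sequence length
    n_embd: int = 768         # C, embedding dimensionality
    n_head: int = 12          # h, number of attention heads
    n_layer: int = 12         # N, number of transformer blocks

config = GPTConfig()
print(config)
```

Note that n_embd must be divisible by n_head, since each head works on a slice of the embedding: here 768 / 12 = 64 dimensions per head.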
The diagram and configuration will help you trace how the data transforms at each stage, from the input to the output.
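As a rough companion to the diagram, the shape transformations can be traced with a toy forward pass. This is a simplified sketch with untrained weights, not GPT-2 itself: it shows only the embedding lookup, a single causal attention layer (via PyTorch’s MultiheadAttention, which differs internally from GPT-2’s implementation), and the language-model head, omitting the MLPs, layer norms, and residual connections:

```python
import torch
import torch.nn as nn

B, T, C, V, h = 5, 10, 768, 50257, 12

tokens = torch.randint(0, V, (B, T))       # (B, T) token ids

wte = nn.Embedding(V, C)                   # token embedding table
wpe = nn.Embedding(1024, C)                # positional embedding table
x = wte(tokens) + wpe(torch.arange(T))     # (B, T, C)

# Causal mask: position t may only attend to positions <= t
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
attn = nn.MultiheadAttention(C, h, batch_first=True)
x, _ = attn(x, x, x, attn_mask=mask)       # (B, T, C)

lm_head = nn.Linear(C, V, bias=False)
logits = lm_head(x)                        # (B, T, V)

print(tokens.shape, x.shape, logits.shape)
```

The key pattern to carry into the diagram: token ids (B, T) become embeddings (B, T, C), every transformer block maps (B, T, C) to (B, T, C), and the final head projects to (B, T, V) logits over the vocabulary.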
References
- Vaswani, Ashish, et al. “Attention Is All You Need.” ArXiv, 2017.
- Brown, Tom B., et al. “Language Models Are Few-Shot Learners (GPT-3).” ArXiv, 2020.
- Better Language Models and Their Implications, OpenAI Blog, 2019.
- Karpathy, Andrej. “Let’s Reproduce GPT-2 (124M).” YouTube.
- Karpathy, Andrej. “nanoGPT” repo.