GPT-2 Detailed Model Architecture

Henry Wu
Sep 15, 2024


This post presents a detailed architectural diagram of GPT-2 that shows how input data transforms as it flows through the model. The diagram is meant to help you trace the steps the model takes from input to output. I'll also give a brief explanation of how text is prepared before it's fed into the model.


Text Preparation

Before we get to the diagram, let's first understand how the text input is prepared. The raw text has to be tokenized, i.e., converted into integers that map to indices in the model's vocabulary. GPT-2 uses Byte-Pair Encoding (BPE) for tokenization, which breaks text down into subword units.

We will use tiktoken, a fast BPE tokenizer used with OpenAI's models. Here is an example:

import tiktoken

enc = tiktoken.get_encoding("gpt2") # Load the GPT-2 tokenizer

text = """In a remote mountain village, nestled among the towering peaks and pine forests, \
a solitary storyteller sat by a crackling bonfire. Their voice rose and fell like a melodic river, \
weaving tales of ancient legends and forgotten heroes that captivated the hearts of the villagers gathered around. \
The stars above shone brightly, their twinkling adding a celestial backdrop to the mesmerizing stories, \
as the storyteller transported their audience to worlds of wonder and imagination. """


tokens = enc.encode(text) # Tokenize the text
tokens.append(enc.eot_token) # Add the end of text token

print(tokens[:10])
[818, 257, 6569, 8598, 7404, 11, 16343, 992, 1871, 262]

This turns the text into a list of integers representing subwords or words from GPT-2’s vocabulary. We also append a special <|endoftext|> token to mark the end of the text sequence.
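To see the subword splits, you can decode each token id back into its text piece with the same encoder. This is just a quick sanity check, not part of the preprocessing itself:

# Decode each of the first ten token ids individually to inspect the pieces
pieces = [enc.decode([t]) for t in tokens[:10]]
print(pieces)
# Note how a less common word like "nestled" is not a single entry in
# GPT-2's vocabulary, so BPE splits it across two subword tokens.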

Next, we will rearrange the tokens into a format the GPT model can process. The model expects the input data to be in the form of a batch of sequences. The input tensor should have the shape (B, T), where:

  • B is the batch size (how many sequences we process in parallel).
  • T is the sequence length (number of tokens in each sequence).

To demonstrate, we will create a batch of 5 sequences, each containing 10 tokens, from the first 50 tokens of the text (plus one extra token so the targets can be shifted by one position):

import torch

B, T = 5, 10
data = torch.tensor(tokens[:50+1]) # 50 tokens plus one extra for the shifted targets

x = data[:-1].view(B, T) # Input tensor
y = data[1:].view(B, T) # Target tensor for next token prediction

print(x)
print(y)
tensor([[  818,   257,  6569,  8598,  7404,    11, 16343,   992,  1871,   262],
        [38879, 25740,   290, 20161, 17039,    11,   257, 25565,  1621,   660],
        [ 6051,  3332,   416,   257,  8469,  1359,  5351,  6495,    13,  5334],
        [ 3809,  8278,   290,  3214,   588,   257,  7758, 29512,  7850,    11],
        [44889, 19490,   286,  6156, 24901,   290, 11564, 10281,   326,  3144]])

tensor([[  257,  6569,  8598,  7404,    11, 16343,   992,  1871,   262, 38879],
        [25740,   290, 20161, 17039,    11,   257, 25565,  1621,   660,  6051],
        [ 3332,   416,   257,  8469,  1359,  5351,  6495,    13,  5334,  3809],
        [ 8278,   290,  3214,   588,   257,  7758, 29512,  7850,    11, 44889],
        [19490,   286,  6156, 24901,   290, 11564, 10281,   326,  3144, 30829]])

Here, x contains the input tokens, and y contains the target tokens, shifted by one position so that the model can learn to predict the next token in the sequence. During training, these tensors are fed into the model, with the target tensor y used to compute the cross-entropy loss.
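As a quick illustration of that loss computation, here is a minimal, self-contained sketch. The random logits stand in for what a real forward pass of the model would produce on x; they are only there so the snippet runs on its own:

import torch
import torch.nn.functional as F

V = 50257  # GPT-2 vocabulary size

# In practice these logits come from the model's forward pass on x;
# random values are used here purely as a stand-in.
logits = torch.randn(B, T, V)  # (B, T, V)

# Flatten batch and time so each position becomes one classification
# over the vocabulary, then compare against the shifted targets y.
loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
print(loss.item())  # large for random logits; training drives this down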

GPT-2 Architectural Diagram

Finally, let's take a look at the GPT-2 architectural diagram to better understand how the input data moves through the model. Below are the key configuration parameters for GPT-2 (these values correspond to the smallest GPT-2 variant):

  • Vocabulary Size (V): 50,257
  • Maximum Sequence Length (T): 1024
  • Embedding Dimensionality (C): 768
  • Number of Heads (h): 12
  • Number of Layers (N): 12
  • Batch Size (B): 512

The diagram and configuration will help you trace how the data transforms at each stage, from the input to the output.

GPT-2 Architectural Diagram
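If you'd like to follow the diagram in code, here is a minimal sketch that traces tensor shapes through the main stages using the configuration above. The module names (wte, wpe, blocks, ln_f, lm_head) are my own choice, and nn.TransformerEncoderLayer is only a shape-compatible stand-in for GPT-2's actual pre-norm, causally masked decoder block:

import torch
import torch.nn as nn

# Configuration values from the list above (smallest GPT-2 variant).
V, T_max, C, n_head, n_layer = 50257, 1024, 768, 12, 12

# Shape-compatible stand-ins for the real GPT-2 modules.
wte = nn.Embedding(V, C)                 # token embedding table
wpe = nn.Embedding(T_max, C)             # positional embedding table
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=C, nhead=n_head, batch_first=True)
    for _ in range(n_layer)
])                                       # NOT GPT-2's exact block; shapes only
ln_f = nn.LayerNorm(C)                   # final layer norm
lm_head = nn.Linear(C, V, bias=False)    # projection back to the vocabulary

B, T = 5, 10
idx = torch.randint(0, V, (B, T))        # token ids, shape (B, T)

h = wte(idx) + wpe(torch.arange(T))      # embeddings: (B, T, C)
for block in blocks:
    h = block(h)                         # each block preserves (B, T, C)
h = ln_f(h)                              # (B, T, C)
logits = lm_head(h)                      # (B, T, V)

print(logits.shape)                      # torch.Size([5, 10, 50257])

The (B, T, V) logits are what the cross-entropy loss shown earlier is computed from; at inference time, only the logits at the last position are needed to sample the next token.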
