# Everything GPT-2: 2. Architecture In-depth

Oct 20 · 4 min read

Feel the burn …

The existing resources for GPT-2’s architecture are very good, but they are written for researchers, so I will provide you with a tailored concept map of all the areas you will need to know prior to jumping in.

Areas that the reader should already know, i.e. areas I won’t specify the resource for:

1. Linear Algebra — specifically matrix multiplication, vectors, and projections from one space onto another space with fewer dimensions
2. Statistics — specifically probability distributions
3. General Machine Learning Concepts — specifically supervised learning and unsupervised learning
4. Neural Nets — general information about how each part works
5. Training Neural Nets — general information about how training works, specifically gradient descent, optimizers, backpropagation, and updating weights.

Areas that the reader should learn before proceeding, i.e. areas I will specify a resource for:

1. Activation Functions (1/2 hr) — For each neuron, given an input and a weight (something applied to that input), there must be a way to decide whether the neuron fires or not. The current best activation functions are softmax (typically used only for the output layer), ReLU, and Swish, because their gradients are efficient to compute.
2. Softmax Function (1 hr) — The softmax function maps a vector of real numbers to a probability distribution. On the linked page, look at the intro and examples sections.
3. Normalization (1 hr) — The act of controlling the mean and variance to make learning (training) more effective, though the exact mechanics are not well understood. The intuition is that it makes the loss surface smoother and thus easier to navigate in a consistent way. There are different types of normalization, including batch, layer, instance, group, and others. The transformer architecture uses layer normalization.
4. Cross-Entropy Loss (1/2 hr) — Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, so predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
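To make items 2–4 concrete, here is a minimal NumPy sketch of softmax, layer normalization, and cross-entropy loss. The function names and toy values are my own illustration, not GPT-2's actual implementation (which, among other things, learns a scale and bias for layer normalization):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and (roughly) unit variance.
    # The transformer also applies a learned scale and bias, omitted here.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def cross_entropy(probs, label):
    # Log loss: negative log of the probability assigned to the true class.
    return -np.log(probs[label])

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)            # a probability distribution over 3 classes
print(probs.sum())                 # sums to 1
print(cross_entropy(probs, 0))     # low loss: the true class is the most likely
```

Note how cross-entropy rewards confident correct predictions: if `probs[label]` were .012 instead, the loss would be about 4.4 rather than near 0.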

For the truly dedicated reader, the best approach is to learn about the transformer class of NLP models (of which GPT-2 is one) and then proceed from there to GPT-2. The reader would start with the first in the class, detailed in Attention is All You Need. The architecture described there, the transformer, vastly simplified the existing state-of-the-art architecture by removing recurrent neural nets and convolution. They would then proceed to OpenAI’s paper for GPT, Improving Language Understanding by Generative Pre-Training. OpenAI simplifies the architecture even more by replacing the encoder-decoder blocks with decoder blocks only. Finally, they would read OpenAI’s paper for GPT-2, Language Models are Unsupervised Multitask Learners. There are very few changes between GPT and GPT-2; GPT-2 mostly showcases what a transformer can do when deployed with many decoder blocks applied sequentially. For the pragmatic reader, it is enough to read from the abstract through the approach section and skim the results section of Language Models are Unsupervised Multitask Learners.

Transformers / GPT-2 Specific areas the reader should learn before proceeding:

1. The Illustrated Transformer (8 hr) — This is the original transformer described in Attention is All You Need. Pay special attention to self-attention. The author addresses embedding with a time signal in two separate sections; remember that this actually comes after tokenization.
2. The Illustrated GPT-2 (2 hr) — This describes GPT-2 in detail.
3. Temperature Sampling, Top K Sampling, Top P Sampling — Ignore the specific implementations in the transformers library and focus on the explanations for the different types of sampling.
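For a feel of the self-attention mechanism emphasized in item 1, here is a hedged single-head scaled dot-product attention sketch in NumPy. The weights are random and the dimensions are toy-sized; this is purely illustrative, not GPT-2's multi-head, masked implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model). Project into queries, keys, and values,
    # score every token against every other token, then mix the values.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot product
    return softmax(scores) @ v                # (seq_len, d_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, d_model = 8
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)                              # one mixed vector per token
```

Each row of `softmax(scores)` sums to 1, so every output vector is a weighted average of the value vectors — that weighting is what "attention" refers to.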
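The three sampling strategies in item 3 can be combined into one small NumPy sketch. The function below is my own illustration of the ideas, not the transformers library's implementation:

```python
import numpy as np

def sample_logits(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    # Temperature: divide logits before softmax (lower => sharper distribution).
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # token ids, most likely first
    probs = probs[order]
    keep = len(probs)
    if top_k > 0:
        keep = min(keep, top_k)                # top-k: keep the k most likely
    if top_p < 1.0:
        # top-p (nucleus): smallest prefix whose cumulative mass reaches top_p
        keep = min(keep, int(np.searchsorted(np.cumsum(probs), top_p)) + 1)
    probs = probs[:keep] / probs[:keep].sum()  # renormalize the survivors
    rng = rng or np.random.default_rng()
    return int(order[rng.choice(keep, p=probs)])
```

With `top_k=1` this degenerates to greedy decoding (always pick the argmax), which is a handy sanity check when experimenting.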

Well done! You are through the hardest part. Now it’s time to learn some new tools prior to fine-tuning GPT-2.

Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: Production (forthcoming)

All resources for articles in the series are centralized in this Google Drive folder.

(Aside) Think this stuff is pretty tough? So did I! It took me about 14 hours of intensive studying across 2 days to get to about 80 percent, and another month of working with it, rereading the papers, and googling to get 15 percent more. The last 5 percent is for the PhDs.

Have you found this valuable?
If no, then tweet at me with some feedback.
If yes, then send me a pence for my thoughts.
