Feel the burn …
This article is part of a series on GPT-2. It’s best if you start at the beginning. The links are located at the bottom of the page.
The existing resources for GPT-2’s architecture are very good, but they are written for experienced scientists and developers. This article is a concept roadmap to make GPT-2 more accessible to technically minded people who have not had formal schooling in Natural Language Processing (NLP). It contains the best resources I discovered while learning about GPT-2.
- Linear Algebra — specifically matrix multiplication, vectors, and projections from one space onto another space with fewer dimensions
- Statistics — specifically probability distributions
- General Machine Learning Concepts — specifically supervised learning and unsupervised learning
- Neural Nets — general information about how each part works
- Training Neural Nets — general information about how training works, specifically gradient descent, optimizers, backpropagation, and updating weights
The following resources cover concepts that underpin GPT-2 but do not address it specifically. The time listed after each title is how long I spent on that resource.
- Activation Functions (1/2 hr) — Each neuron, given an input and a weight (something to apply to that input), needs a way to decide whether to fire. The current best activation functions are softmax (typically used only for the output layer), ReLU, and Swish, because they are efficient for computing gradients.
- Softmax Function (1 hr) — The softmax function maps a vector of real numbers to a probability distribution. On the linked page, look at the intro and the examples sections.
- Normalization (1 hr) — The act of controlling the mean and variance to make the learning (training) more effective, though the exact mechanics are not well understood. The intuitive explanation is that it makes the loss surface smoother and thus easier to navigate in a consistent way. There are different types of normalization, including batch, layer, instance, and group. The transformer architecture uses layer normalization.
- Cross-Entropy Loss (1/2 hr) — Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
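To make the activation-function descriptions concrete, here is a minimal NumPy sketch of ReLU, Swish, and softmax. This is my own illustration, not code from any of the linked resources:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    # swish(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def softmax(x):
    # subtract the max before exponentiating for numerical stability;
    # the result is mathematically unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # roughly [0.66, 0.24, 0.10]
print(probs.sum())  # 1.0 — a valid probability distribution
```

Note how softmax preserves the ordering of the logits while squashing them into probabilities that sum to 1, which is why it is typically reserved for the output layer.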
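Layer normalization, the variant the transformer uses, can be sketched in a few lines. The `gamma` and `beta` parameters here stand in for the learnable scale and shift; in a real network they are trained, not fixed:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each vector (the last axis) to zero mean and unit
    # variance, then apply a learnable scale (gamma) and shift (beta)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
print(y.mean(), y.std())  # approximately 0.0 and 1.0
```

Unlike batch normalization, the statistics are computed per example across its features, so the result does not depend on the other examples in the batch.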
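The cross-entropy numbers above can be checked directly. This sketch uses the .012 example from the description:

```python
import numpy as np

def cross_entropy(predicted_probs, true_label):
    # loss = -log(probability the model assigned to the true class)
    return -np.log(predicted_probs[true_label])

probs = np.array([0.012, 0.988])  # model's predicted distribution
# model is confident and correct: low loss
print(cross_entropy(probs, true_label=1))  # about 0.012
# true label is 0, but the model gave it probability .012: high loss
print(cross_entropy(probs, true_label=0))  # about 4.42
```

A perfect prediction (probability 1 on the true class) gives -log(1) = 0, matching the claim that a perfect model has a log loss of 0.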
For the truly dedicated, the best approach is to learn about the transformer class of NLP models (of which GPT-2 is one) and then proceed from there to GPT-2. They would start with the first of the class, detailed in Attention is All You Need. The architecture described, the transformer, vastly simplified the existing state-of-the-art architectures by removing recurrent neural nets and convolutions. They would then proceed to OpenAI’s paper for GPT, Improving Language Understanding by Generative Pre-Training. OpenAI simplifies the architecture even more by replacing the encoder-decoder blocks with decoder blocks only. Finally, they would read OpenAI’s paper for GPT-2, Language Models are Unsupervised Multitask Learners. There are very few changes between GPT and GPT-2; GPT-2 mostly showcases what a transformer can do when many decoder blocks are applied sequentially.
For the pragmatic learner, it is enough to read from the abstract through the approach section and skim the results section of Language Models are Unsupervised Multitask Learners.
Transformers and GPT-2 specific explanations and concepts:
- The Illustrated Transformer (8 hr) — This is the original transformer described in Attention is All You Need. Pay special attention to self-attention. Don’t get thrown off by the author addressing embedding with time signal in two separate sections. Remember that embedding with time signal actually comes after tokenization.
- The Illustrated GPT-2 (2 hr) — This describes GPT-2 in detail.
- Temperature Sampling, Top K Sampling, Top P Sampling — Ignore the specific implementations in the transformers library and focus on the explanations for the different types of sampling.
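The self-attention step that The Illustrated Transformer dwells on can be sketched in a few lines of NumPy. The random matrices below are stand-ins for learned projection weights, and the dimensions are arbitrary choices for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # project each token's embedding into a query, key, and value,
    # score every token against every other token, then mix the
    # values weighted by those scores
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product attention
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one mixed vector per token
```

Each output row is a weighted blend of all the value vectors, which is the mechanism that lets every token attend to every other token in one step, with no recurrence.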
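The three sampling strategies compose naturally, since each one just reshapes the output distribution before drawing a token. This is a hedged illustration of the ideas, not the transformers library’s implementation:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng()
    # temperature < 1 sharpens the distribution, > 1 flattens it
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # keep the smallest set of tokens whose cumulative
        # probability reaches top_p (nucleus sampling)
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept
    probs /= probs.sum()  # renormalize over the surviving tokens
    return rng.choice(len(probs), p=probs)

token = sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2)
print(token)  # always 0 or 1: the two weakest tokens are masked out
```

With `top_k=2` the two lowest-probability tokens can never be sampled, no matter how the random draw falls, which is the whole point of truncated sampling.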
Well done! You are through the hardest part. Now it’s time to learn some new tools prior to fine-tuning GPT-2.
Articles in the series:
Everything GPT-2: 0. Intro
Everything GPT-2: 1. Architecture Overview
Everything GPT-2: 2. Architecture In-Depth
Everything GPT-2: 3. Tools
Everything GPT-2: 4. Data Preparation
Everything GPT-2: 5. Fine-Tuning
Everything GPT-2: 6. Optimizations
Everything GPT-2: 7. Production
All resources for articles in the series are centralized in this Google Drive folder.
(Aside) Think this stuff is pretty tough? So did I! I spent about 14 hours spread out across 2 days to understand about 80 percent of it, and another month of working with it, rereading the papers, and googling to get 15 percent more. I think the last 5 percent is for PhDs.