LLM Reading List: “Improving Language Understanding by Generative Pre-Training”

Malachy Moran
9 min read · Apr 26, 2023


Photo by Julien Tromeur on Unsplash

I recently began a project with a simple question:

Can ChatGPT explain itself?

I asked it to tell me what papers I should read in order to understand how Large Language Models (LLMs), specifically GPT, worked. In return I received a list of 5 important papers. Each week we’ll be reading one. I’ll synthesize the key points of the paper, and try to explain it in a way that’s decently understandable.

This is our second week, and we’ll be covering the paper “Improving Language Understanding by Generative Pre-Training,” a 2018 paper by the OpenAI team. Generative Pre-Training laid out the novel technique of pre-training language models on the vast amount of text available on the internet.

While it’s not required reading, I do highly recommend checking out last week’s article, where we covered the groundbreaking paper “Attention is All You Need” (we’ll be talking about Transformers again today, you can go here to brush up).

Make sure you subscribe so you can follow along as each piece comes out!

Introduction and Significance

There are two reasons why Large Language Models are called “large.” The first, obviously, is that the models themselves are huge. GPT-3 has 175 billion parameters, which is more parameters than stars in the Milky Way Galaxy.

Photo by Greg Rakozy on Unsplash

The second reason they are called large is because of the sheer volume of data they were trained on. GPT-2 was trained on over 8 million documents, composed of over 40GB of text. GPT-3 was trained on 45TB of text.

If you know a little about machine learning you might be asking yourself “how on earth did they collect 45TB of labeled training data?” The answer of course is that they didn’t. The data is unlabeled.

Quickly, for those of you who may not be familiar, labeled training data has correct answers attached to it. For example, an image of a cat along with the label “cat.” For Natural Language Processing (NLP) tasks, this could be a sentence with each of the parts of speech labeled. We could have the sentence “I throw the ball” along with the tags [Pronoun, Verb, Article, Noun].
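
To make the distinction concrete, here is what one labeled example and one unlabeled example might look like as data. This is just a toy illustration in Python, using the sentence and tags from above:

```python
# A labeled example: the "correct answer" comes attached to the input.
labeled_example = {
    "tokens": ["I", "throw", "the", "ball"],
    "tags":   ["Pronoun", "Verb", "Article", "Noun"],  # supplied by a human annotator
}

# An unlabeled example: just raw text, with no annotations at all.
unlabeled_example = "I throw the ball"
```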

In most applications, you have to have labeled data. Just think about it: how could you teach a computer to identify a cat if you didn’t know whether the image it was looking at contained a cat?

Photo by Mikhail Vasilyev on Unsplash

Unfortunately, labeling data takes time. You either need to employ humans to do it, or you need to develop some way to do it automatically. This is why most models are trained on the same few datasets. It takes a lot of time and expense to establish new ones.

This is the problem that Generative Pre-Training was trying to solve. The authors set out to answer a simple question:

Can we learn anything from all the unlabeled text floating around on the internet?

If the answer was yes, this would unlock a nearly limitless source of data, making truly large models possible for the first time.

Background

Leading up to the publication of the Generative Pre-Training paper, the main way we solved NLP problems was with models adapted to a specific task. If we wanted to create a model that was good at Question Answering, we constructed a model that was custom-built for that task and then trained it on a labeled dataset of questions and answers.

While this approach makes sense, it has a pretty obvious problem: a question-answering model doesn’t know how to tag parts of speech, detect if one sentence is a paraphrase of another, or classify movie reviews as positive or negative. It lacks what’s called language understanding.

A model with language understanding can do more than one specific task, because it uses language in a way similar to how humans do. It understands the meaning of words, and how they relate to each other both in the same sentence and in the wider context of a text or a conversation. This is an incredibly difficult task. The same words mean different things in different contexts, and let’s not even think about concepts like metaphor or sarcasm.

If you wanted to capture examples of these kinds of subtleties, you would need a truly enormous quantity of training data: enough text that the model could learn all the possible relationships between words. This amount of text does exist, but it’s unstructured and unlabeled. It can be found in Twitter threads, blog posts, and Reddit responses.

Photo by Alina Grubnyak on Unsplash

The authors of the Generative Pre-Training paper identified two specific problems:

  1. It’s unclear how to most effectively learn useful information from the unstructured data.
  2. Once the information is learned, it’s not obvious how to transfer the knowledge to a specific task.

Previous work on the topic had generally taken one of three forms:

  • Semi-supervised learning, which tried to use the unlabeled data to learn word embeddings that could then be used as features in other models. This is a process we touched on briefly in the article on Attention, in connection with word2vec.
  • Unsupervised pre-training which attempted to use the unlabeled data to find the optimal starting point for a supervised model.
  • Auxiliary training objectives, or giving a model more than one task to do at the same time. The auxiliary task had been shown to improve a model’s understanding of language as a whole, leading to better performance on the primary task.

While all three of these approaches had resulted in improvements, they were limited in the range of their understanding, which usually extended no further than the context of a single sentence or even a single word.

Generative Pre-Training

As the name of the paper suggests, generative pre-training was the authors’ novel method for effectively discovering and using the information in unlabeled text. It’s a pretty simple approach.

First, you pre-train a model to do one basic task: given a sequence of words, predict the next word. The major advantage of training on this task is that the text requires no labeling, since the “correct answer” at every position is simply the word that actually comes next.
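
To make that concrete, here is a minimal sketch of the idea in PyTorch (my own illustration, not the paper’s code). The `model` argument is a placeholder for any network that outputs a score for every word in its vocabulary at every position; the labels are just the input sequence shifted by one.

```python
import torch
import torch.nn.functional as F

def next_word_loss(model, token_ids):
    """Language-modeling loss: predict token t+1 from the tokens up to t.

    token_ids: LongTensor of shape (batch, seq_len), raw text turned into IDs.
    model:     any callable returning logits of shape (batch, seq_len - 1, vocab_size)
               for the (batch, seq_len - 1) input below.
    """
    inputs = token_ids[:, :-1]   # everything except the last token
    targets = token_ids[:, 1:]   # the same sequence shifted one step to the left

    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and positions
        targets.reshape(-1),                  # the "labels" come for free from the text
    )
```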

Photo by Glen Carrie on Unsplash

The specific model they trained on this task was a Transformer, which we discussed last week. Specifically, they used only the decoder side, which takes the embeddings of the preceding words and runs them through the attention mechanism.

If you remember from last week, one of the big advantages of using a transformer is that it can quickly and easily learn long distance relationships between words. So instead of learning the context of a single word in a single sentence, it can learn how to generate the next word based on the context of an entire book.
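
The “decoder side only” part boils down to one key ingredient: a causal mask inside attention, so each position can only look at the words before it when predicting the next one. A minimal sketch (again my own illustration, not the paper’s code):

```python
import torch

def causal_mask(seq_len):
    """True marks the positions a token is NOT allowed to attend to.

    Row t is the query for position t; everything after column t is masked,
    so the model can only use earlier words when predicting the next one.
    """
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```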

This is all well and good, but there is little advantage to doing this kind of pre-training if you can’t use the model you’ve created for anything useful. It is the second step of this process that was the real breakthrough.

The authors of the paper realized that these generatively pre-trained models had taken a huge step towards language understanding. In the process of learning how to guess the next word, the model had learned how words relate to each other in context.

This meant you could take the exact same model from step one and fine-tune it to whatever task you needed. All that was necessary was a limited number of labeled examples and a little tweaking to get everything in the right format. Let’s look at a couple of the tasks they examined.

Classification and Entailment

Two of the simpler tasks are classification and entailment. The quintessential classification task is movie reviews: read a movie review and tell me whether it’s positive or negative. Entailment just means reading two statements, the premise and the hypothesis, and deciding whether the first entails, contradicts, or is neutral toward the second. For example:

Premise: The cat is sleeping on the bed.
Hypothesis: The bed has an animal on it.
Result: Entailment

To get the generatively pre-trained model to work on these tasks, all you needed to do was feed in the appropriate text and run the output of the pre-trained model through a traditional linear neural network layer to produce your prediction.
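
As a rough sketch of what that looks like (the shapes and names here are mine, not the paper’s), the only new piece you train from scratch is a single linear layer on top of the pre-trained transformer’s output:

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A pre-trained transformer plus one new linear layer for the target task.

    `transformer` is assumed to return hidden states of shape
    (batch, seq_len, hidden_size); everything except `classifier`
    comes from the generative pre-training step.
    """
    def __init__(self, transformer, hidden_size, num_classes):
        super().__init__()
        self.transformer = transformer
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        hidden = self.transformer(token_ids)   # (batch, seq_len, hidden_size)
        last = hidden[:, -1, :]                # representation of the final token
        return self.classifier(last)           # (batch, num_classes) logits
```

For entailment, the input would simply be the premise and the hypothesis pasted together, with a delimiter token between them.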

Similarity and Multiple Choice

The tasks of deciding whether two sentences are similar or answering multiple choice questions were only slightly more complicated. For similarity, you paste the two sentences together and run them through the transformer. Since there is no inherent order to the sentences, the paper’s authors chose to run both orderings through the model and then combine the two outputs before making the final prediction (see the sketch below).
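
Here is a hedged sketch of that trick (the function and argument names are my own): run the pair in both orders through the same pre-trained transformer, combine the two resulting feature vectors (the paper adds them element-wise), and feed the result to a linear layer.

```python
import torch

def similarity_logits(transformer, classifier, sent_a, sent_b, delim):
    """Score a sentence pair by trying both orderings.

    sent_a, sent_b: 1-D tensors of token IDs for the two sentences.
    delim:          a length-1 tensor holding a delimiter token.
    classifier:     a linear layer from hidden_size to the output classes.
    """
    order_1 = torch.cat([sent_a, delim, sent_b]).unsqueeze(0)  # "A <delim> B"
    order_2 = torch.cat([sent_b, delim, sent_a]).unsqueeze(0)  # "B <delim> A"

    feats_1 = transformer(order_1)[:, -1, :]   # final-token representation, order 1
    feats_2 = transformer(order_2)[:, -1, :]   # final-token representation, order 2

    # Combine the two orderings (element-wise sum) and classify the pair.
    return classifier(feats_1 + feats_2)
```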

For multiple choice you do something similar. You paste together the context and one of the answers, then run it through the transformer. The output is passed through a linear layer, and the process is repeated for each possible answer. The resulting scores are then compared to decide which answer is correct.
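
And a similar sketch for multiple choice (again with my own names, not the paper’s): score each context–answer pairing separately with the same transformer and a small linear scorer, then compare the scores with a softmax to pick an answer.

```python
import torch
import torch.nn.functional as F

def answer_probabilities(transformer, scorer, context, answers, delim):
    """Return one probability per candidate answer.

    context: 1-D tensor of token IDs for the question or passage.
    answers: list of 1-D tensors, one per candidate answer.
    scorer:  a linear layer mapping a hidden vector to a single score.
    """
    scores = []
    for answer in answers:
        seq = torch.cat([context, delim, answer]).unsqueeze(0)  # "context <delim> answer"
        hidden = transformer(seq)[:, -1, :]                     # final-token representation
        scores.append(scorer(hidden))                           # shape (1, 1)
    return F.softmax(torch.cat(scores, dim=-1), dim=-1)         # shape (1, num_answers)
```

The softmax at the end is just a convenient way to turn the per-answer scores into probabilities that can be compared.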

Results, Significance and Impact

I’m going to be honest: it’s almost unfair how much better this method was than previous approaches. For Entailment, the generatively pre-trained models beat the state-of-the-art on 5 of 6 datasets. On one dataset they showed 5.9% higher accuracy than the previous best. In the Question Answering task they were better than all previous models. For Similarity and Classification tasks they were better in 4 of 6 cases, including over 10 percentage points of improvement on one benchmark classification task.

I must remind you, the models being beaten in this case were specifically trained for these tasks. In fact the state-of-the-art “models” were in many cases actually collections (also called “ensembles”) of 5 to 10 models all working together. If you’re not super familiar with machine learning papers, this is not normal. These improvements were huge.

Photo by Suzanne D. Williams on Unsplash

The ramifications of this paper were huge. Suddenly the massive treasure-trove of data on the internet was usable and useful, not just for one specific task, but potentially for every task. The stage was set for Large Language Models to be born.

Conclusion

If you’ve been paying close attention, you’ll realize we now have both the “GP” (generative pre-training) and the “T” (transformers). Next week we’ll be cracking into the monster itself with the paper that gave us the GPT-3 model, “Language Models are Few-Shot Learners.”

(Hint: It’s out now! Here’s a link to that article.)

I look forward to seeing you there. Don’t forget to subscribe to get notified when I send it out!

The Author

With a Bachelor’s in Statistics and a Master’s in Data Science from the University of California, Berkeley, Malachy is an expert on topics ranging from significance testing, to building custom Deep Learning models in PyTorch, to how you can actually use Machine Learning in your day-to-day life or business.

References

Brown, Tom, et al. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (2020): 1877–1901.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson, 2022.

NASA. “Imagine the Universe!” NASA, December 2015, https://imagine.gsfc.nasa.gov/science/objects/milkyway1.html. Accessed April 25, 2023.

“GPT-3 Explained.” Papers with Code, https://paperswithcode.com/method/gpt-3. Accessed April 25, 2023.

Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” 2018.

Radford, Alec, et al. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1.8 (2019): 9.

