LLM Reading List: GPT-3 and “Language Models are Few Shot Learners”

Malachy Moran
11 min read · May 9, 2023



I recently began a project with a simple question:

Can ChatGPT explain itself?

The bot gave me five important papers to read, and I’ll be covering one each week, synthesizing its key points and explaining them in a way that’s decently understandable.

This week we’ll be talking about “Language Models are Few-Shot Learners,” which is the paper that first introduced GPT-3 to the world. This is gonna be a big one, folks, so hold on to your hats.

In previous weeks we’ve covered the foundations of Transformers in “Attention is All you Need,” and how we can learn from the unstructured text of the internet in “Improving Language Understanding by Generative Pre-Training.” They aren’t required reading, but if you really want to understand where GPT-3 came from, I recommend you check them out.

Make sure you subscribe so you can follow along as each piece comes out!

Introduction

First of all, I hope you all know that I love you, because this paper is long. I’m talking 75 pages long. To be fair, about 30 pages of that is appendices, but to be even more fair, the Attention and Generative Pre-Training papers were only 27 pages put together (including citations).


That’s not to say it isn’t long for good reason. An OpenAI project with 31 authors, the document covers everything from the GPT-3 model architecture, to how much electricity was required to train it, to the ethical ramifications of creating a model that can produce text indistinguishable from a human’s.

While the main thing most people took away from this project was “Wow, GPT-3 sure is powerful!” it’s important to recognize the actual hypothesis that the authors were trying to test. They were in essence attempting to solve one very specific problem, and they were using a pretty basic solution to do so.

If you’ll recall from Generative Pre-Training, the revolutionary strategy they employed in that paper was taking a model, pre-training it on unlabeled and unstructured text, and then fine-tuning it to specific tasks. Fine tuning in that case meant giving the pre-trained model a much smaller, very specific set of labeled training data which demonstrated the task at hand. For example, a set of questions and answers to show the model how to do the Question Answering task.
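To make that concrete, here is a minimal sketch of the pre-train-then-fine-tune recipe. It uses the Hugging Face transformers library rather than anything from the original paper, with GPT-2 standing in for the pre-trained model and a two-sentence toy dataset standing in for the small labeled fine-tuning set:

```python
# A minimal fine-tuning sketch (illustration only, not the authors' code).
# GPT-2 stands in for a generatively pre-trained model; the two labeled
# sentences stand in for the small task-specific dataset.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Tiny hypothetical labeled dataset demonstrating the task (0 = against, 1 = for)
texts = ["The mine will poison the river.", "The mine will bring good jobs."]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):  # a few gradient steps on the small labeled set
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point is that the labeled examples are what teach the model the task; everything else was learned during pre-training.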

But what if we were really lazy, and even providing that small labeled dataset was too much work? I’m being a little cheeky, but the authors of the Few-Shot Learners paper make a few good points about the problem of needing even a little bit of labeled data to fine tune on.

  1. There are a bunch of potential uses for a language model. Requiring a new dataset every time we invent a new use is limiting.
  2. Fine tuning on a labeled sample makes the model too dependent on what it sees in that labeled sample. If you fine tune it to answer Yes/No questions about a dataset, it doesn’t know what to do when it has the options of Yes/No/Maybe.
  3. This isn’t the way that human language understanding works. You don’t need 5000 examples of how to answer a question in order to understand the task.

What would be ideal would be something like what you see on a standard test at school. You open the test booklet, and read the following:

“Please read the paragraph below about a proposed mining project. Do you think the author is for or against the new mine?”

That’s all you get. And yet you know how to complete the task, don’t you? You have used the text of the question itself to learn how to answer it. This is what’s called in-context learning, and the question itself is the natural language prompt. If we could teach language models to handle tasks the same way, we wouldn’t need fine-tuning.
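Here is a tiny sketch (my own illustration, not code from the paper) of what that looks like for a language model: the task description and the input are simply concatenated into a single prompt, and the model completes it with no labeled examples and no weight updates.

```python
# A zero-shot, in-context prompt: the question itself is the only "training"
# the model gets. `generate` is a hypothetical placeholder for whatever
# language model you call, not a real API.
passage = ("The proposed mine would bring short-term jobs, "
           "but it threatens the watershed the whole valley depends on.")

prompt = (
    "Please read the paragraph below about a proposed mining project. "
    "Do you think the author is for or against the new mine?\n\n"
    f"{passage}\n\nAnswer:"
)

# answer = generate(prompt)  # no fine-tuning, no gradient updates -- just completion
print(prompt)
```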

Unfortunately, it turns out this is really hard to do. Previous models that had attempted to use this method, while very interesting, tended to perform poorly compared to fine-tuned models on the same tasks.

Luckily, the authors of Few Shot Learners had a subtle, nuanced solution: What if we just made the model real big?

Few Shot Learning

To really understand the focus of the paper, it’s important that we go over a few definitions first. The approach we discussed above, where we are simply handed the task without any examples and expected to complete it, is called zero-shot learning. If we provide the question and then a single example of what we mean, such as the question below, this is called one-shot learning.

Please fill in the blank with the appropriate word. For example:
Down is to up as left is to right.


The logical extension of this is few-shot learning, where we provide the question or task description plus a few examples of the task. “A few” in this case means “as many as will fit in the model’s context window.” A context window is just the number of tokens (roughly, word pieces) the model can consider at once. GPT-3’s context window is 2,048 tokens, so the task description, all of the examples, and the query have to fit within those 2,048 tokens.
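Here’s a rough sketch of how a few-shot prompt might be assembled under that constraint. This is my own illustration, not the paper’s code, and the whitespace “tokenizer” is a crude stand-in for the real byte-pair encoding:

```python
# Pack as many demonstrations as will fit in the context window, then append
# the unanswered query. Illustration only; real tokenization is more involved.
def count_tokens(text):
    return len(text.split())  # crude whitespace stand-in for a real tokenizer

def build_few_shot_prompt(task, examples, query, context_window=2048):
    prompt = task + "\n"
    budget = context_window - count_tokens(prompt) - count_tokens(query)
    for ex in examples:
        demo = f"{ex['input']} {ex['output']}\n"
        if count_tokens(demo) > budget:
            break  # the context window is full; stop adding demonstrations
        prompt += demo
        budget -= count_tokens(demo)
    return prompt + query

examples = [
    {"input": "Down is to up as left is to", "output": "right."},
    {"input": "Hot is to cold as day is to", "output": "night."},
]
print(build_few_shot_prompt(
    "Please fill in the blank with the appropriate word.",
    examples,
    "Big is to small as fast is to",
))
```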

The advantage of this method over the fine tuning method is that any individual can presumably type out a few examples of what they’re looking for. You don’t have to build a whole dataset around it.

Being Human-Like

You could also ask what’s “closest” to how humans actually interact with the world. This is not as straightforward as it seems. When we address a question such as translation, this is clearly something humans handle as a zero shot task. When you see a sign in a foreign language, there isn’t an example on the sign showing you how translation works.

Figure from “Language Models are Few-Shot Learners” (https://arxiv.org/abs/2005.14165)

But let’s say your boss asks you to create a company branded PowerPoint presentation. You’re going to do a really poor job of it unless you see at least one example of what “company-branded” means. That’s a one shot task. You get an example.

You might be saying to yourself “I would probably do an even better job if I had more than one example.” This would be few-shot learning. The authors consider all of these methods to be “human-like” behavior.

Pros and Cons of Human-Like Behavior

The authors of the Few-Shot paper are not aiming for human-like behavior from their models simply because they think humans do it best. There are a few advantages they are after.

Firstly, it allows the humans who use a model to interact with it via natural language. This just means that you can talk to the model the same way you would talk to a normal human being. Secondly, it allows the model to move fluidly from one type of task to another, for example translating a paragraph and then summarizing it. This is something a more specialized, fine-tuned model simply cannot do.

There are a few downsides to the few-shot method, chiefly in the form of accuracy. We’ll discuss GPT-3’s performance more in the next few sections, but it’s important to note that fine-tuned models fitted to a specific task will likely always outperform few-shot models given the same amount of training data. Few-shot models are also vulnerable to a whole new type of inaccuracy, stemming from poorly worded questions: there is not much the model can do if the task it is given is unclear. This is also true of fine-tuned models, but since they only have one task, it is much less likely to occur.

The Importance of Scale

GPT-3 was a massive undertaking. It truly epitomized “large” both in terms of the amount of data it was trained on and the number of parameters in the model.

Dataset Size

As I discussed thoroughly in a previous article, the “GP” in GPT-3 stands for “Generative Pre-Training,” which is a method to fit a language model to unstructured and unlabeled text. This makes it possible to train on literally anything that is available on the internet. The authors of the Few-Shot Learners paper took advantage of this to use a truly astounding amount of data.

They began with the Common Crawl dataset covering 2016–2019. This is the closest thing to “all the text on the internet” that exists at the moment, comprising nearly a trillion words and 45TB of compressed plaintext. From here they filtered and augmented it in 3 ways.

  1. They compared documents in Common Crawl to a smaller subset of “high quality” writing, taking only those documents that compared favorably
  2. They deduplicated Common Crawl as much as possible to avoid too much repetition
  3. They added 4 additional high quality curated datasets: a collection of websites (WebText), two collections of English language books (Books1 and Books2), and all of English language Wikipedia

Filtering and deduplication brought the size down dramatically: the end result was around 570GB of text, roughly 400 billion tokens once encoded. The authors also chose not to sample everything equally. The “high quality” data sources were sampled more often during training, and the lower-quality ones less often. So some data sources were seen more than once, while others were never even seen in full. The exact mix is given in Table 2.2 of the paper.
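As a rough sketch of what that weighted sampling looks like in practice (my own illustration; the weights below are the approximate training-mix percentages reported in Table 2.2 of the paper):

```python
# Each training document is drawn from a source with probability proportional
# to its weight, so small "high quality" corpora get repeated several times
# while the enormous Common Crawl is never seen in full.
import random

training_mix = {            # approximate weights from Table 2.2 of the paper
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def next_source():
    sources, weights = zip(*training_mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

print([next_source() for _ in range(10)])
```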

Model Size

As we’ve mentioned, models attempting the few-shot method had performed pretty poorly in the past. So why did the authors think that their new model would be different? Well, they had noticed a simple trend: as previous models grew larger, they seemed to perform better. Not only that, but the increase in size seemed to help the models learn more from each example they were given.

Logically, the question becomes whether or not this trend continues. If we make a ridiculously large model, say 175 billion parameters, would it be ridiculously better? Spoiler alert, yes.

The researchers set out to demonstrate this idea by creating 8 different models of varying sizes, from 125 million parameters up to 175 billion, all using roughly the same architecture. All of the models are Transformers, as we’ve discussed a few times before; Table 2.1 of the paper lists the specifications for each one if you’re interested. The largest of the models is what the researchers call GPT-3.
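If you want a feel for where a number like 175 billion comes from, here’s a back-of-the-envelope calculation. It uses the common 12 · n_layers · d_model² rule of thumb for a Transformer’s weights (ignoring biases and layer norms), with the layer counts and widths reported in the paper’s Table 2.1; the arithmetic is mine, not a figure from the paper.

```python
# Rough parameter count for a GPT-style Transformer (back-of-the-envelope only).
def approx_params(n_layers, d_model, vocab_size=50257, context=2048):
    blocks = 12 * n_layers * d_model ** 2          # attention + feed-forward weights
    embeddings = (vocab_size + context) * d_model  # token + position embeddings
    return blocks + embeddings

print(f"GPT-3 Small (12 layers, d_model 768):   {approx_params(12, 768) / 1e6:.0f}M")
print(f"GPT-3       (96 layers, d_model 12288): {approx_params(96, 12288) / 1e9:.0f}B")
```

Run it and you get roughly 125 million and 175 billion parameters for the smallest and largest models respectively.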

As for how they did, check out the chart below. We can see that as the number of parameters in the model increases, so does the accuracy. Additionally, we can see that the growth in accuracy is steeper for one shot or few shot learning than for zero shot. This is evidenced by the widening gap between them.

Figure from “Language Models are Few-Shot Learners” (https://arxiv.org/abs/2005.14165)

Let’s look at this another way to make sure we understand it. The next plot shows us how three differently sized models respond to being given more examples to learn from. The 175-billion-parameter model, represented by the blue line, is GPT-3. You can see immediately that GPT-3 gains the most from a single example.

Figure from “Language Models are Few-Shot Learners” (https://arxiv.org/abs/2005.14165)

Essentially the finding is that model size was a significant bottleneck to accuracy on these few-shot tasks. Making the model bigger not only improved its accuracy, but also increased how much it benefits from additional in-context examples.

Results and Some Limitations

This is where the bulk of the roughly 40 non-appendix pages comes from. GPT-3 was tested for performance on over two dozen different NLP datasets spanning 9 different groups of tasks. I’ll just highlight a few of the interesting points.

  • In a test of understanding long-range dependencies (LAMBADA), GPT-3 was able not only to outperform the state of the art by 18%, but the few-shot method also solved a common problem of models failing to give a one-word answer.
  • While most of GPT-3’s training data was in English, it also contained a small amount of other languages. The model was able to outperform all other unsupervised language models when translating German, French, or Romanian into English. It was still very poor at translating from English, and was also worse than models custom-built (supervised) for the task.
  • GPT-3 performs very poorly at comparison tasks that require it to look at two sentences. For example, it has a difficult time determining whether two words are being used with the same meaning in two different sentences, or whether one sentence entails the other.
  • When asked to generate news articles, GPT-3 at first treated the prompts it was given as tweets and would generate follow-up comments. Once the researchers fixed this behavior (through better examples), GPT-3 produced content that was indistinguishable from real, human-written news articles. People trying to tell the difference between the two did no better than chance.

For all of these, the authors point out, it is nearly impossible to tell if the model is actually learning the task from scratch given the few-shot examples, or if it is simply matching the task description to tasks it saw in training. This is not necessarily a limitation, as we’re not entirely sure which method humans use either, but it is a weakness of the current understanding.

Likewise, as with any really complex model, we don’t know and most likely can’t know why GPT-3 answers the way it does. Explainable AI is the holy grail of the field, but the truth is that as the models get more and more complex, our understanding of the “reasoning” behind their decisions gets less and less.

Conclusion

There is so much more in this paper. While we talked about zero, one, and few-shot prediction, along with the benefits of massively increasing scale, I had to leave out a lot to make this article a reasonable length.

We have not even begun to discuss the amount of compute power required to train the model, or the ethical issues and biases they found in GPT-3. While these are extremely good topics, and I encourage you to read the GPT-3 paper sections on them, we’ll be covering each in depth in the next two weeks.

Specifically, next week we’ll be diving into the nitty gritty of how you make a model large: how long it takes, how much compute it uses, and even how to estimate model performance as the size grows! Join me as we read “Scaling Laws for Neural Language Models,” and don’t forget to subscribe to get updates on everything I write.

The Author

With a Bachelor’s in Statistics and a Master’s in Data Science from the University of California, Berkeley, Malachy is an expert on topics ranging from significance testing, to building custom Deep Learning models in PyTorch, to how you can actually use Machine Learning in your day-to-day life or business.

References

Brown, Tom, et al. “Language Models are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (2020): 1877–1901.

Radford, Alec, et al. “Language Models are Unsupervised Multitask Learners.” OpenAI (2019).

