GPT-1, GPT-2 and GPT-3: All probabilistic generative models!

Ezekiel Zhao
8 min read · Jul 19, 2023


In General:

What is a language model?

A language model is a probability distribution over sequences of words.[1] Given any sequence of words of length m, a language model assigns a probability

P(w_1, …, w_m)

to the whole sequence.
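
GPT-style models estimate this probability autoregressively: the chain rule factors the sequence probability into a product of next-word probabilities, which is exactly the quantity these models learn. A minimal sketch with a toy, made-up distribution (the numbers below are purely illustrative, not from any real model):

```python
import math

# Toy conditional probabilities P(next_word | previous words) -- made up for illustration.
conditionals = {
    ("<s>",): {"it": 0.20},
    ("<s>", "it"): {"is": 0.40},
    ("<s>", "it", "is"): {"nice": 0.05},
}

def sequence_log_prob(tokens, conditionals):
    """Chain rule: log P(w_1..w_m) = sum_i log P(w_i | w_1..w_{i-1})."""
    context = ("<s>",)
    log_prob = 0.0
    for token in tokens:
        log_prob += math.log(conditionals[context][token])
        context = context + (token,)
    return log_prob

print(sequence_log_prob(["it", "is", "nice"], conditionals))  # sum of the three log-probabilities
```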

Without doubt, GPT-1, GPT-2 and GPT-3 are all language models, and they are trained to predict the next word given a sequence of words.

But what are the differences among them?

Let me first explain it in a more intuitive way:

GPTs consist of an immense number of parameters, which we can somewhat equate to a person’s ability to learn. Let’s think of them as students who learn solely from the materials available to them. GPT-1 is like an 8-year-old child who only spends time reading storybooks. GPT-2 can be likened to a 12-year-old who primarily browses high-quality posts on Reddit. GPT-3, on the other hand, represents a 20-year-old student who delves into almost everything available on the internet.

Upon completing their learning, it becomes evident that they possess the ability to predict the next word based on the given sequence. For instance, if you provide the sequence “It is so nice to meet __” to any of the GPTs, they could effortlessly respond with “you.”
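
To see this next-word behaviour concretely, here is a minimal sketch using the publicly released GPT-2 weights through the Hugging Face transformers library (the library and the small "gpt2" checkpoint are my assumptions for illustration, not part of the original papers):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("It is so nice to meet", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r}  {p.item():.3f}")   # " you" should rank near the top
```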

However, let’s delve deeper into their capabilities. It is reasonable to infer that GPT-1 might not be as good at tasks like Q&A or translation as GPT-2 and GPT-3. The reason lies in the analogy: a person who only reads storybooks would find it hard to explain a math formula, because the knowledge required lies outside the scope of their reading material. Similarly, GPT-1’s limited training data might not cover the breadth of information needed for more complex language tasks, which GPT-2 and GPT-3, with their broader exposure to diverse online content, are better equipped to handle.

Indeed, the fact that GPT-2 reads numerous posts on Reddit, where many conversations take place in the form of replies, gives it the unique ability to understand how to engage in conversations with people. As a language model, GPT-2 has been trained to predict the next word based on the input it receives. By learning from a wide range of conversations on Reddit, it becomes familiar with the structure and patterns of dialogues, making it adept at generating responses in a conversational format.

When you provide GPT-2 with a sentence like “Can you explain what is Machine Learning to me?” its ability to predict the next word will prompt it to respond with knowledge about Machine Learning. The reason for this lies in the probability distribution of words within the conversations it has learned from. Since GPT-2 has extensively encountered discussions and information related to Machine Learning in its training data, the words associated with this topic are more likely to have a higher probability of being predicted as the next word.

Same for GPT-3.

In a sense, the difference among these models lies in their knowledge base and number of parameters. GPT-1, being the youngest, has limited exposure and may struggle with tasks beyond its training data. GPT-2, having delved into Reddit conversations, gains the ability to engage in more natural and coherent discussions. GPT-3, with its extensive training on diverse internet data, possesses a broader understanding and can provide more sophisticated and contextually appropriate responses; it also has the ability to tackle more NLP tasks such as translation and Q&A.

In Detail:

GPT-1:

Training consists of two parts:

First: Unsupervised Pre-Training: GPT-1 is pre-trained on about 7,000 unpublished books. The unsupervised pre-training is standard language-model training: given an unlabeled sequence U = {u_1, …, u_n}, the optimization objective is to maximize the likelihood

L1(U) = Σ_i log P(u_i | u_{i-k}, …, u_{i-1}; Θ)

where k is the size of the context window and Θ (theta) denotes the parameters to be trained.
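
In code, maximizing this likelihood amounts to minimizing a next-token cross-entropy loss. A minimal PyTorch sketch (the `model` here is a stand-in module producing vocabulary logits, not GPT-1 itself):

```python
import torch
import torch.nn.functional as F

def language_model_loss(model, token_ids):
    """Next-token prediction loss: -sum_i log P(u_i | u_<i; theta).

    token_ids: LongTensor of shape (batch, seq_len).
    model:     any module mapping token ids to logits of shape
               (batch, seq_len, vocab_size) -- a stand-in for the GPT decoder.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                               # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),             # flatten all positions
        targets.reshape(-1),                             # gold next tokens
    )
```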

Then: Fine-Tuning:
After obtaining the unsupervised pre-trained model, we apply it directly to a supervised task. For a labeled dataset C, each instance has m input tokens {x_1, …, x_m} along with a corresponding label y. The tokens are fed into the pre-trained model to obtain a final feature vector h, and a fully connected layer with weights W produces the prediction for y. Note that the only new parameters introduced at this stage are W and the embeddings for the delimiter tokens; the rest of the pre-trained model is fine-tuned as well. The supervised objective is to maximize

L2(C) = Σ_{(x, y)} log P(y | x_1, …, x_m)

and the overall fine-tuning loss keeps language modeling as an auxiliary term:

L3(C) = L2(C) + λ · L1(C)
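
A minimal sketch of that combined objective, assuming a classifier head on top of a stand-in pre-trained decoder (names like `hidden_states` and `lm_logits` are hypothetical accessors, not from the paper’s code):

```python
import torch
import torch.nn.functional as F

def fine_tune_loss(pretrained_lm, classifier_head, token_ids, labels, lam=0.5):
    """L3 = L2 (supervised) + lambda * L1 (auxiliary language modeling)."""
    hidden = pretrained_lm.hidden_states(token_ids)       # (batch, seq_len, d_model), hypothetical
    logits_cls = classifier_head(hidden[:, -1])           # predict the label from the final token's features
    l2 = F.cross_entropy(logits_cls, labels)              # supervised objective

    lm_logits = pretrained_lm.lm_logits(hidden[:, :-1])   # hypothetical: project features back to the vocabulary
    l1 = F.cross_entropy(                                  # auxiliary language-modeling objective
        lm_logits.reshape(-1, lm_logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )
    return l2 + lam * l1
```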

For different downstream tasks, GPT-1 converts the structured inputs into a single ordered token sequence that the pre-trained model can process (illustrated in the code sketch below):

  1. Classification: [Start; Text; Extract]
  2. Entailment: [Start; Premise; Delim; Hypothesis; Extract]
  3. Similarity: both orderings of the sentence pair are processed, and the two final representations are added element-wise
  4. Question answering / multiple choice: one sequence per candidate answer, [Start; Context; Delim; Answer; Extract], each scored independently and normalized with a softmax
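
As a concrete illustration of the entailment format, here is a small sketch that linearizes a premise/hypothesis pair with special tokens (in GPT-1 these are extra entries added to the BPE vocabulary; the token strings below are my own placeholders):

```python
# Hypothetical special-token strings standing in for the Start, Delim and Extract tokens.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def format_entailment(premise_tokens, hypothesis_tokens):
    """Linearize an entailment example as [Start; Premise; Delim; Hypothesis; Extract]."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

example = format_entailment(
    ["A", "man", "is", "playing", "guitar"],
    ["A", "person", "is", "making", "music"],
)
print(example)
```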

Dataset: GPT-1 utilizes the BooksCorpus dataset, which consists of 7,000 unpublished books. There are two reasons why the authors chose this dataset: 1) The dataset provides longer context dependencies, allowing the model to learn longer-term dependencies. 2) Since these books are unpublished, they are less likely to be encountered in downstream datasets.

Model Details:

  1. Byte pair encoding (BPE) with 40,000 merges
  2. Token embeddings of dimension 768
  3. Learnable positional embeddings; position-wise feed-forward layers with 3,072-dimensional inner states
  4. 12-layer Transformer decoder
  5. GELU activation function
  6. Batch size 64, learning rate 2.5e-4, sequence length 512
  7. Number of parameters: about 117M (these settings are collected in the config sketch below)
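
For reference, the hyperparameters above can be gathered into a small configuration sketch (my own summary of the numbers listed in this post):

```python
from dataclasses import dataclass

@dataclass
class GPT1Config:
    """GPT-1 hyperparameters as reported in the original paper (parameter count approximate)."""
    vocab_merges: int = 40_000      # BPE merges
    n_layers: int = 12              # Transformer decoder blocks
    n_heads: int = 12               # attention heads (768 / 64), as in the paper
    d_model: int = 768              # token embedding / hidden size
    d_ff: int = 3072                # feed-forward inner dimension
    seq_len: int = 512              # context window
    batch_size: int = 64
    learning_rate: float = 2.5e-4
    activation: str = "gelu"
    n_params: str = "~117M"

print(GPT1Config())
```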

GPT-2:

One key difference from GPT-1 is the learning objective: GPT-2 aims to use the unsupervised pre-trained model directly for supervised tasks. The authors believe that a sufficiently good language model learns to perform multiple tasks during training, whereas GPT-1 still needs task-specific fine-tuning.

The goal of GPT-2 was to train a language model with stronger generalization ability. The authors did not redesign the architecture of GPT-1; instead, they added more parameters and used a much larger dataset. GPT-2 is meant to model P(output | input, task), and the authors argue that a language model with enough parameters and data can cover supervised tasks as well, meaning that supervised learning can be seen as a subset of unsupervised language modeling. For example, after training the language model on the corpus “Michael Jordan is the best basketball player in history,” it can also perform a question-answering task like: (Question: “Who is the best basketball player in history?”, Answer: “Michael Jordan”).
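
As a rough illustration of this task conditioning, here is a sketch that phrases the task entirely as text and lets a released GPT-2 checkpoint continue it via the Hugging Face transformers pipeline (my own example, not from the paper; the small "gpt2" checkpoint will often answer imperfectly):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The task is expressed only through the text of the prompt: P(output | input, task).
prompt = (
    "Michael Jordan is the best basketball player in history.\n"
    "Question: Who is the best basketball player in history?\n"
    "Answer:"
)
print(generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"])
```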

NOTICE: In the original paper, Language Models are Unsupervised Multitask Learners, all results are achieved without any fine-tuning.

Dataset: The data for training GPT-2 consists of web pages linked from highly upvoted Reddit posts, collectively known as WebText. The dataset comprises approximately 8 million documents, totaling around 40GB of text. To prevent any overlap with test sets, all Wikipedia documents were removed.

Model Details:

  1. Similarly, byte pair encoding is used, with a vocabulary of 50,257
  2. The context window is 1,024 tokens
  3. Batch size = 512
  4. Layer normalization is moved to the input of each sub-block (pre-norm), and an additional layer normalization is added after the final self-attention block
  5. The weights of residual layers are scaled at initialization by 1/sqrt(N), where N is the number of residual layers (see the sketch after this list)
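
A simplified PyTorch sketch of those two changes, the pre-norm placement of layer normalization and the 1/sqrt(N) residual scaling (a stand-in block, not the actual GPT-2 code; the causal attention mask is omitted for brevity):

```python
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Simplified GPT-2-style block: layer normalization at the input of each sub-block."""

    def __init__(self, d_model=768, n_heads=12, n_residual_layers=24):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale the residual-path projections by 1/sqrt(N) at initialization,
        # where N is the number of residual layers in the whole model.
        scale = 1.0 / math.sqrt(n_residual_layers)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x):
        h = self.ln1(x)                          # pre-norm before self-attention
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                                # residual connection around attention
        x = x + self.mlp(self.ln2(x))            # pre-norm + residual around the MLP
        return x
```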

Compared with GPT-1, GPT-2 has a much more powerful zero-shot ability.

GPT-3:

175 billion parameters.

Roughly 45TB of raw text collected for training (about 570GB remaining after filtering).

No need for fine-tuning: in-context learning is all you need.

In-context learning

The most important notion in GPT-3 is in-context learning. Unlike traditional approaches that require time-consuming fine-tuning for each specific task, GPT-3 can adapt to a new task at inference time simply by reading examples in the prompt, with no gradient updates. This lets a single model handle a wide range of tasks effectively, from language translation and summarization to question answering and code generation.

*In-context learning: you provide a few examples in your prompt, and GPT-3 picks up the pattern and generates the answer you want.
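
Here is a sketch of what such a few-shot prompt looks like (the translation pairs echo the example in the GPT-3 paper; the actual API call to the model is omitted):

```python
# Few-shot prompting: the "training examples" live entirely inside the prompt text.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]

def build_few_shot_prompt(examples, query):
    lines = ["Translate English to French:"]
    for en, fr in examples:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")          # the model is expected to continue with the translation
    return "\n".join(lines)

print(build_few_shot_prompt(examples, "butter"))
```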

(Figure: the results of GPT-3.)

However, GPT-3 still falls short on some tasks: in particular, it can fail to follow the user’s instructions, and its responses may contain unsafe content or inappropriate language.

SUMMARY

In conclusion, GPT-1, GPT-2, and GPT-3 represent significant milestones in the development of language models and natural language processing. While GPT-1 laid the groundwork with a transformer-based architecture and demonstrated the potential of large-scale language models, GPT-2 took a leap forward by increasing the model size and dataset, showcasing improved performance and more advanced generalization capabilities. However, it was GPT-3 that truly pushed the boundaries of what language models could achieve.

Despite their remarkable progress, each iteration of the GPT series has its limitations. GPT-1 lacks the scale to handle complex tasks comprehensively, GPT-2 has constraints in handling certain specific tasks, and while GPT-3 showcases groundbreaking abilities, it may still produce responses with unsafe or inappropriate content.

As we move forward, the development of language models continues to evolve, addressing the challenges and limitations faced by previous versions. GPT-3 has paved the way for even more sophisticated models, and with ongoing research and advancements, we can anticipate even more impressive language models on the horizon. As the journey continues, the potential for these models to revolutionize various fields and applications is boundless, promising exciting opportunities for the future of natural language processing.

References:

  1. *This article used ChatGPT to improve sentence structure.
  2. Radford et al., “Improving Language Understanding by Generative Pre-Training” (GPT-1)
  3. Radford et al., “Language Models are Unsupervised Multitask Learners” (GPT-2)
  4. Brown et al., “Language Models are Few-Shot Learners” (GPT-3)
