Explanation of GPT-1, GPT-2 and GPT-3

Tejpal Kumawat
11 min read · Mar 27, 2023


As a large language model based on the GPT-3.5 architecture, ChatGPT is a striking example of what GPT technology can do. ChatGPT has been trained on a massive corpus of text, allowing it to understand natural language and engage in human-like conversations. All of this technology traces back to the paper "Attention Is All You Need", and every model in the GPT series is built on the decoder part of the architecture introduced in that paper.

  • Generative Pre-trained Transformer (GPT) is OpenAI's family of autoregressive language models.
  • These models are trained with a "generative pre-training" objective: predicting the next token. They showed strong few-shot learning on numerous text-based tasks.
  • Before GPT, language models (LMs) had to be trained on large amounts of carefully labeled data, which was difficult to obtain. Those LMs performed well on the specific supervised task they were trained for, but they were hard to adapt quickly to other tasks.

Let's take a closer look at each of GPT-1, GPT-2, and GPT-3 and how they contributed to natural language processing tasks in the sections below.

GPT-1 paper (Improving Language Understanding by Generative Pre-training).

Prior to this work, most state-of-the-art NLP models were trained with supervised learning for a particular task, such as sentiment classification or textual entailment. Such supervised models have two significant drawbacks:

i. They require a large amount of labeled data to master a specific task, and such data is often hard to find.

ii. They are unable to generalize to tasks they were not explicitly trained for.

This paper proposed pre-training a generative language model on unlabeled data and then fine-tuning it on examples of downstream tasks such as textual entailment, sentiment analysis, and classification.

How Learning Happens in GPT-1

a) Unsupervised Language Modelling (Pre-training): For unsupervised learning, a standard language model objective was used.
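Written out (following the formulation in the GPT-1 paper), this objective maximizes the log-likelihood of each token given the tokens that precede it:

```latex
L_1(T) = \sum_{i} \log P\!\left(t_i \mid t_{i-k}, \ldots, t_{i-1}; \theta\right)
```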

where T is the set of tokens in the unsupervised data {t_1, …, t_n}, k is the size of the context window, and θ are the parameters of the neural network, trained using stochastic gradient descent.

  • OpenAI released GPT-1 in 2018. It had 117 million parameters.
  • Trained on the large BooksCorpus dataset, this generative language model was able to learn long-range dependencies and acquire broad knowledge from a diverse corpus of contiguous text.
  • GPT-1 uses the 12-layer decoder from the original transformer architecture, with masked self-attention (its main settings are summarised in the sketch below).
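For reference, here is a rough summary of the GPT-1 architecture settings as reported in the paper. The dictionary below is purely illustrative; it is not code released by OpenAI.

```python
# Approximate GPT-1 architecture settings, as reported in Radford et al. (2018).
# Illustrative summary only, not official OpenAI code.
gpt1_config = {
    "n_layers": 12,           # transformer decoder blocks
    "n_heads": 12,            # attention heads per block
    "d_model": 768,           # hidden / embedding size
    "d_ff": 3072,             # position-wise feed-forward size
    "context_window": 512,    # maximum sequence length in tokens
    "vocab": "BPE with ~40,000 merges",
    "total_parameters": "~117M",
}
```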

Unsupervised Learning happens in the following fashion:

  • Input embedding: The input sequence is converted into a sequence of fixed-size vectors through an embedding layer.
  • Positional encoding: The positional encoding is added to the input embeddings to give the model information about the order of the tokens in the sequence.
  • Multi-head self-attention: The decoder attends to the input sequence through multiple heads of self-attention, where each head focuses on a different aspect of the input sequence.
  • Feedforward layer: The output of the self-attention layer is fed through a feedforward layer, which applies a non-linear transformation to the data.
  • Normalization: The output of the feedforward layer is then normalized using layer normalization.
  • Output generation: The output sequence is generated one token at a time, with each token being generated based on the previously generated tokens.

During training, the decoder in GPT-1 is optimized using a technique called backpropagation, where the error in the output sequence is propagated back through the network to adjust the model’s parameters. The training process is repeated multiple times with different subsets of the data to improve the model’s accuracy and generalization ability.
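To make these steps concrete, here is a minimal, self-contained PyTorch sketch of a GPT-style decoder. It mirrors the steps listed above (embedding, positional encoding, masked self-attention, feed-forward, layer normalisation, and a projection to next-token logits), but the layer sizes and exact ordering are simplified and are not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn

class TinyGPTBlock(nn.Module):
    """One simplified decoder block: masked self-attention followed by a feed-forward layer."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)      # residual connection + layer normalisation
        x = self.ln2(x + self.ff(x))    # feed-forward + residual + layer normalisation
        return x

class TinyGPT(nn.Module):
    """Token embeddings + positional embeddings + decoder blocks + output projection."""
    def __init__(self, vocab_size=100, d_model=64, n_layers=2, max_len=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(TinyGPTBlock(d_model) for _ in range(n_layers))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1))
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)                           # next-token logits at every position

# One training step: predict token i+1 from tokens up to i, then backpropagate the error.
model = TinyGPT()
tokens = torch.randint(0, 100, (2, 16))                  # a toy batch of token ids
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tokens[:, 1:].reshape(-1))
loss.backward()
```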

b) Supervised Learning

After the model has been trained with the objective described above (maximizing the likelihood), its parameters are adapted to the supervised target task. Each instance of the labeled dataset C consists of a sequence of input tokens x^1, …, x^m together with a label y. The input is passed through the pre-trained model to obtain the final transformer block's activation, which is fed into an additional linear output layer with parameters W_y in order to predict y:
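Written out (reconstructed from the description above and the GPT-1 paper), the label probability and the corresponding fine-tuning objective are:

```latex
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}\!\left(h_l^m W_y\right),
\qquad
L_2(C) = \sum_{(x,\, y)} \log P\!\left(y \mid x^1, \ldots, x^m\right)
```

where h_l^m is the final transformer block's activation for the last input token.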

c) Task Specific Input Transformations

As mentioned above, the model can be fine-tuned directly for applications like text classification. However, tasks with structured inputs, such as textual entailment and question answering, require task-specific input transformations.

Inputs to these downstream tasks are converted into ordered sequences to minimize changes to the model's architecture during fine-tuning. Tokens are reorganized in the following manner:

  • The input sequences include start and end tokens.
  • A delimiter token is added between the different sections of an example so that the input can be passed as a single ordered sequence.
  • For tasks such as question answering (QA) and multiple-choice questions (MCQs), several sequences are sent for each example (illustrated in the sketch below).
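A rough illustration of these input transformations; the special-token names below (`<s>`, `<e>`, `$`) are placeholders for the start, end, and delimiter tokens, not the literal tokens used in the paper.

```python
# Illustrative input layouts for fine-tuning; <s>, <e> and $ stand in for the
# start, end and delimiter tokens described above.

# Text classification: a single ordered sequence.
classification = "<s> the movie was wonderful <e>"

# Textual entailment: premise and hypothesis joined by a delimiter.
entailment = "<s> a man is playing a guitar $ a man is making music <e>"

# Multiple choice / QA: one sequence per candidate answer, each scored separately.
question, context = "What is the capital of France?", "France is a country in Europe."
choices = ["Paris", "Lyon", "Nice"]
mcq_sequences = [f"<s> {context} {question} $ {choice} <e>" for choice in choices]
```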

GPT-1 outperformed specifically trained supervised state-of-the-art models on 9 of the 12 tasks it was compared on.

Another significant achievement of this model was its decent zero-shot performance (using the model's output directly, with no fine-tuning) on various tasks. The paper demonstrated that, thanks to pre-training, zero-shot performance improved on NLP tasks such as question answering, schema resolution, and sentiment analysis.

GPT-2 paper (Language Models are Unsupervised Multitask Learners) and its improvements over GPT-1.

The improvements in GPT-2 mainly involved using a larger dataset and adding more parameters to the model, creating an even more powerful language model.

  • GPT-2, released by OpenAI in February 2019, used a larger dataset and more parameters to create a more reliable language model.
  • With 1.5 billion parameters, GPT-2 was roughly ten times bigger than GPT-1 and was trained on roughly ten times as much data.

Learning Objectives and Concepts: the two key ideas introduced in this work, described below, are task conditioning and zero-shot task transfer.

Task Conditioning:

  • A language model’s training objective is commonly written as P(output|input).
  • With GPT-2, several tasks were to be learned using a single unsupervised model.
  • To do this, the learning objective is changed to P(output|input, task), so the model is conditioned on the task it must perform.
  • Task conditioning means teaching a model to produce different outputs for the same input depending on the task.
  • Some models implement task conditioning at the architectural level, where both the input and the task are fed to the model.
  • For language models, the task, the input, and the output are all natural language sequences.
  • Task conditioning for language models is therefore done by giving the model plain-language instructions (and, optionally, examples).
  • Task conditioning forms the basis for zero-shot task transfer, where the model can perform a task it hasn’t been explicitly trained on.

Zero-Shot Learning and Zero-Shot Task Transfer

  • GPT-2 is capable of zero-shot task transfer, which means it can carry out a task without being given any examples.
  • Zero-shot learning is a special case of zero-shot task transfer in which no examples are offered and only an instruction is given.
  • Instead of rearranging sequences for fine-tuning as in GPT-1, input to GPT-2 was given in a predefined format, and the model was expected to understand the nature of the task and provide the answer.
  • For example, to translate from English to French, the model was given an English sentence followed by the word "French" and a prompt (:), and it was expected to understand that it was a translation task and provide the French counterpart of the English sentence (see the illustrative prompt below).
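A rough illustration of such a zero-shot prompt; the exact wording and formatting here are illustrative, not taken from the paper.

```python
# Zero-shot task transfer: the task is conveyed entirely by the prompt format;
# no examples are shown and no gradient updates take place.
prompt = "The weather is nice today. French:"
# The model is expected to continue the text with the French translation,
# e.g. "Il fait beau aujourd'hui." (illustrative expected output).
```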

Note :

Zero-shot learning: the ability of a model to perform a task without having previously seen any examples of that kind. No gradients are updated during zero-shot learning; the model is expected to understand the task without looking at any examples.
Zero-shot task transfer (meta-learning): the model is given few or no examples and is expected to understand the task from the instructions and whatever examples are provided. The phrase "zero shot" refers to the absence of gradient updates.

GPT-2 achieved state-of-the-art results on 7 out of 8 tested language modeling datasets in the zero-shot setting.

In GPT-2, it was demonstrated that training on a larger dataset and using more parameters enhanced the language model’s capacity to comprehend tasks and outperform the state-of-the-art on many of them in zero-shot scenarios.

GPT-3 paper (Language Models are Few-Shot Learners) and its improvements: one of the most potent models NLP has seen to date.

  • GPT-3 is a large language prediction and generation model created by OpenAI that can produce lengthy passages of original text.
  • GPT-3 eventually emerged as OpenAI's ground-breaking AI language program.
  • GPT-3 can create sentences and paragraphs that essentially sound like they were written by a person.
  • GPT-3 has 175 billion parameters, more than 100 times as many as GPT-2, and was trained on roughly 500 billion tokens drawn largely from the Common Crawl dataset.
  • In addition, GPT-3 is capable of other intelligent tasks, such as writing code snippets like SQL queries. Unfortunately, because of its 175B-parameter size, inference is expensive and inconvenient.
  • The fine-tuning phase required by GPT-3's predecessors, as well as by encoder models like BERT, is no longer necessary.
  • It was trained on a single objective, predicting the next token, and is therefore an unsupervised pre-trained model.
  • The model has a context window of 2,048 tokens, so both the input and the output must fit within this limit (there are ways to increase or modify this number, but we will use it for now). Each token is processed along its own track through the model.

How the dataset is prepared to train GPT-3 for text and code generation

Training data generation for text generation

Training data generation for code generation

This is how GPT-3 produces output: it generates one token at a time, with each predicted token appended to the context and fed back into the model for the next prediction.
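Conceptually, both text and code corpora are turned into the same kind of next-token prediction examples. The sketch below is a rough illustration of this idea; the `make_training_pairs` helper is hypothetical, and the whitespace tokenisation is purely for readability (GPT models actually use byte-pair encoding).

```python
# Sketch: turning a raw corpus (text or code) into next-token training pairs.
def make_training_pairs(corpus: str, context_size: int = 8):
    tokens = corpus.split()                      # naive tokenisation, for illustration only
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        target = tokens[i]                       # the next token the model must predict
        pairs.append((context, target))
    return pairs

text = "the quick brown fox jumps over the lazy dog"
code = "def add ( a , b ) : return a + b"
print(make_training_pairs(text)[:2])
print(make_training_pairs(code)[:2])
```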

Learning Objectives and Concepts: let us discuss the two key concepts introduced in this paper, in-context learning and the few-shot, one-shot, and zero-shot settings.

In-Context Learning

  • Large language models develop pattern recognition and other skills through training on text data.
  • The primary objective of language models is to predict the next word given context words, but they also learn to recognize patterns in the data to minimize loss.
  • This ability to recognize patterns helps the model during zero-shot task transfer, where it can use its past knowledge to perform tasks it has not explicitly been trained on.
  • The power of this capability increases with the number of parameters of the model.

Few-Shot, One-Shot, and Zero-Shot Settings

In a few-shot setting, the model is provided with a task description and as many examples as fit into its context window. In a one-shot setting, the model is provided with exactly one example, and in the zero-shot setting, no example is provided. As the capacity of the model increases, its few-, one-, and zero-shot capabilities also improve.
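A rough illustration of the three settings as prompts; the wording is illustrative rather than copied from the paper.

```python
# Zero-shot: task description only.
zero_shot = "Translate English to French:\ncheese =>"

# One-shot: task description plus a single worked example.
one_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "cheese =>")

# Few-shot: as many examples as fit in the context window. No gradient updates
# happen in any of these settings; the examples are only part of the prompt.
few_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "plush giraffe => girafe en peluche\n"
            "cheese =>")
```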

Performance of GPT-3

A variety of language modelling and NLP datasets were used to evaluate GPT-3. In few-shot or zero-shot settings, GPT-3 outperformed state-of-the-art methods on language modelling datasets such as LAMBADA and Penn Treebank. Although it could not surpass the state of the art on some other datasets, it did improve the zero-shot state-of-the-art performance. On NLP tasks like closed-book question answering, schema resolution, and translation, GPT-3 again performed admirably, frequently matching or approaching fine-tuned models.

Weaknesses of GPT-3 discussed in the paper

  1. Lack of common sense knowledge: Despite its impressive ability to generate coherent text, GPT-3 often lacks common sense knowledge that humans take for granted. This can lead to generating text that is factually incorrect or even nonsensical in certain contexts.
  2. Limited long-term memory: While GPT-3 is capable of processing a large amount of text input, it has limited long-term memory, which can make it difficult for the model to maintain coherence and consistency over longer texts or conversations.
  3. Bias and sensitivity to input data: Like any machine learning model, GPT-3 is sensitive to the data it is trained on and can exhibit bias as a result. For example, if the training data contains a disproportionate amount of text from certain demographics, the model may be more likely to generate biased or insensitive responses.
  4. Lack of control over generated output: While GPT-3 is capable of generating highly coherent and contextually appropriate text, it can also generate responses that are inappropriate, offensive, or otherwise problematic. This lack of control over the generated output is a concern for applications where the quality of the output is critical.
  5. High computational requirements: GPT-3 requires significant computational resources to train and run, which can make it challenging to use in certain contexts or for smaller organizations with limited resources.
  6. Limited ability to understand complex reasoning and logic: GPT-3’s ability to understand complex reasoning and logic is limited, which can make it difficult for the model to generate text that requires a deep understanding of complex concepts or sophisticated arguments.

Note :

Perplexity is the standard evaluation metric for language models. It is the inverse probability of the test set, normalised by the number of words in the test set. Language models with lower perplexity are considered better than ones with higher perplexity.
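Written out, the standard definition for a test set w_1, …, w_N is:

```latex
\mathrm{PP}(W)
  = P(w_1, \ldots, w_N)^{-\frac{1}{N}}
  = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P\!\left(w_i \mid w_1, \ldots, w_{i-1}\right)\right)
```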

Conclusions :

This article summarises the journey and advances of the OpenAI GPT models across the three papers discussed above. These unquestionably powerful language models have revolutionized the field of natural language processing by handling a wide range of tasks from only a few examples and a few instructions. Even though these models cannot understand natural language as well as humans do, they have undoubtedly shown a path toward getting there.

References :

  1. Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training.
  2. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8), p.9.
  3. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A. et al., 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  4. Rei, M., 2017. Semi-supervised multitask learning for sequence labeling. arXiv preprint arXiv:1704.07156.
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).

