OpenAI’s GPT — Part 1: Unveiling the GPT Model

Introducing the GPT language model by OpenAI

Yesha R Shastri
VisionWizard
8 min read · Aug 21, 2020


Language modeling in NLP underpins a number of tasks such as text summarization, speech recognition, Optical Character Recognition (OCR), machine translation, part-of-speech tagging, and many more.

Building on language models for natural language understanding, OpenAI proposed the GPT model in the paper “Improving Language Understanding by Generative Pre-Training” (2018), which achieves state-of-the-art results on 9 out of 12 NLP tasks.

In this series, I will introduce you to the language models GPT, GPT-2, and GPT-3 proposed by OpenAI.

The series will follow the structure below.

1. OpenAI’s GPT — Part 1: Unveiling the GPT Model

2. OpenAI’s GPT — Part 2: Unveiling the GPT-2 Model

3. OpenAI’s GPT — Part 3: Introduction to GPT-3

4. OpenAI’s GPT — Part 4: Deeper Insights into GPT-3

5. OpenAI’s GPT — Part 5: Implementation of GPT-3

1. Background

  • Obtaining large, manually labeled datasets is often not feasible: annotation is expensive and labeled data is scarce. On the other hand, unlabeled text corpora are available in abundance.
Figure 1: Examples of unlabeled data
  • Therefore, to reduce dependence on the purely supervised learning approach, it becomes crucial to extract valuable information from both the unlabeled and the labeled data with the help of semi-supervised learning.
  • At present, little more than word-level or sentence-level information can be effectively obtained from unlabeled text. Extracting information from unlabeled data for more complex tasks faces the following two challenges.

1. It is unclear which optimization objectives are most effective at learning representations that transfer well.

2. There is no consensus on the most effective way to transfer the learned representations to the target task.

  • Moreover, some of the existing techniques require task-specific changes to the model architecture. Hence, no model had been developed that could provide universal representations for a varied range of tasks.

2. Introduction

  • A transformer-based architecture is chosen for the GPT model because it can capture long-range dependencies in language, unlike the LSTM used in previous, similar work on text classification.
  • Owing to this ability to model long-range dependencies, the transformer also performs strongly on NLP tasks like document generation, machine translation, and syntactic parsing.
Figure 2: Transformer model architecture (Source: Link)
  • The proposed model [1] employs a semi-supervised learning approach, consisting of unsupervised pre-training followed by supervised fine-tuning.
  • First, to learn the parameters of a neural network, a language modeling objective is used on the unlabeled data. Second, a supervised objective is used for adapting the learned parameters to a particular target task.
  • The model is termed “task agnostic” because it is not specific to a single NLP task but rather provides a generalizable architecture that can suit different NLP tasks with minimal changes for transfer.

3. Proposed Framework

  • The training is performed in two stages — unsupervised pre-training and supervised fine-tuning.

3.1 Unsupervised pre-training

  • The first stage involves learning a model on a large corpus of text.
  • Given an input corpus of tokens U = {u_1,…,u_n}, the language modeling objective maximizes the log probability of each token u_i conditioned on the preceding tokens.
  • In the objective function, P is the conditional probability modeled by the network, k is the size of the context window, and Θ denotes the parameters of the neural network.
Figure 3: Unsupervised pre-training procedure
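For reference, the objective illustrated in Figure 3 can be written out explicitly, using the notation of [1]:

```latex
% Unsupervised pre-training objective: maximize the likelihood of each
% token u_i given the k preceding tokens, with network parameters \Theta.
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```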
  • The attention outputs and the output probability distribution are calculated as follows.
Figure 4: Attention weights and output probability distribution for the transformer decoder (Source: [1])
  • Here, U is the matrix of context token vectors, W_e is the token embedding matrix, and W_p is the position embedding matrix.
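Written out, the computation shown in Figure 4 is the standard decoder-only forward pass from [1], with n transformer blocks:

```latex
% Input representation: token embeddings plus position embeddings.
h_0 = U W_e + W_p
% Stack of n masked multi-headed self-attention transformer blocks.
h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall\, l \in [1, n]
% Output distribution over tokens, reusing the token embedding matrix.
P(u) = \mathrm{softmax}\left(h_n W_e^{\top}\right)
```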

To know more details about the different transformer components, visit here.

3.2 Supervised fine-tuning

  • The second stage involves fine-tuning, i.e., adapting the pre-trained model to a specific target task.
  • A labeled dataset C is taken as input. The probability of a label y given input tokens x_1,…,x_m is obtained by passing the data through the pre-trained model and an added linear output layer.
  • The log probability of the correct label is then maximized over the labeled dataset.
  • Lastly, including language modeling as an auxiliary objective during fine-tuning was found to improve generalization and accelerate convergence. Therefore, a weighted combination of both objective functions, with weight λ on the auxiliary term, is used.
Figure 5: Supervised fine-tuning procedure
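The fine-tuning procedure in Figure 5 corresponds to the following equations from [1], where h_l^m is the final transformer block’s activation for the input sequence and W_y is the added linear output layer:

```latex
% Label probability from the final transformer activation.
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}\left(h_l^m W_y\right)
% Supervised objective over the labeled dataset C.
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
% Combined objective with the auxiliary language modeling term, weighted by \lambda.
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```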

3.3 Task-specific input transformations

  • To fine-tune the proposed GPT model for certain tasks like textual entailment, semantic similarity, question answering, and commonsense reasoning, certain transformations need to be applied to the input data so that the model can process it.

3.3.1 Textual Entailment

  • Premise p and hypothesis h token sequences are concatenated with a delimiter ($) in between.
Figure 6: Input transformation for textual entailment (Source: [1])

3.3.2 Semantic Similarity

  • Since the two sentences being compared have no inherent ordering, both possible orderings are processed independently, each with a delimiter token between the two texts.
Figure 7: Input transformation for semantic similarity (Source: [1])

3.3.3 Question Answering and Commonsense Reasoning

  • The input comprises a document, a question, and a set of possible answers. For each possible answer, the document and the question are concatenated with that answer, with a delimiter token placed before the answer.
Figure 8: Input transformation for question answering and commonsense reasoning (Source: [1])

The overall proposed framework is shown below, followed by a small code sketch of the input transformations.

Figure 9: (Left) unsupervised pre-training and (right) input transformations for supervised fine-tuning (Source: [1])
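To make the input transformations of Figures 6 to 8 concrete, here is a minimal sketch in Python. The special token strings and helper names are illustrative placeholders, not the tokens or code actually used in [1]:

```python
# Hypothetical special tokens; in the paper these are randomly initialized
# start, extract (end), and delimiter tokens learned during fine-tuning.
START, EXTRACT, DELIM = "<s>", "<e>", "$"

def entailment_input(premise, hypothesis):
    # Textual entailment: premise and hypothesis joined by a delimiter.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text1, text2):
    # Semantic similarity: both orderings are processed independently.
    return [
        [START] + text1 + [DELIM] + text2 + [EXTRACT],
        [START] + text2 + [DELIM] + text1 + [EXTRACT],
    ]

def qa_inputs(document, question, answers):
    # QA / commonsense reasoning: one sequence per candidate answer.
    context = document + question
    return [[START] + context + [DELIM] + answer + [EXTRACT] for answer in answers]
```

Each returned sequence is fed through the pre-trained transformer and the added output layer; for similarity the two sequence representations are added element-wise, and for question answering a softmax over the candidate answers produces the final prediction.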

4. Experimental Details

4.1 Unsupervised pre-training

  • The language model is trained on the BooksCorpus dataset, a large corpus containing over 7,000 unique unpublished books.
  • The books are part of various genres such as adventure, fantasy, and romance.
  • This dataset is a good fit for the language model because it provides long stretches of contiguous text, which help the model learn to make use of long-range information.

Some of the key parameters chosen for training are listed below.

  • 12-layer decoder-only transformer with masked self-attention heads.
  • Adam optimizer with a maximum learning rate of 2.5e-4.
  • Training was done for 100 epochs with mini-batches of size 64 on randomly sampled contiguous sequences of 512 tokens.
  • The Gaussian Error Linear Unit (GELU) activation function was used.
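As a compact summary, the pre-training setup above might be expressed as a configuration object like the following; the field names are illustrative only and not taken from any released code:

```python
from dataclasses import dataclass

# Hypothetical configuration mirroring the pre-training hyperparameters
# reported in section 4 of [1]; field names are illustrative, not official.
@dataclass
class GPTPretrainConfig:
    n_layers: int = 12            # decoder-only transformer blocks
    attention: str = "masked"     # masked self-attention heads
    activation: str = "gelu"      # Gaussian Error Linear Unit
    optimizer: str = "adam"
    max_learning_rate: float = 2.5e-4
    epochs: int = 100
    batch_size: int = 64          # mini-batches of contiguous sequences
    sequence_length: int = 512    # tokens per sampled sequence
```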

For detailed information about the experimental setup, refer to section 4 of [1].

4.2 Supervised fine-tuning

  • Dropout with a rate of 0.1 is added to the classifier.
  • The learning rate is taken as 6.25e-5 and batch size is equal to 32.
  • In most cases, 3 epochs of training were sufficient for the model to converge.
  • A linear learning rate decay schedule with warmup over 0.2% of training is used, as sketched below.
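A minimal sketch of such a schedule (linear warmup over the first 0.2% of updates, then linear decay toward zero) is given below; this illustrates the described schedule under these assumptions and is not the authors’ code:

```python
def finetune_lr(step, total_steps, max_lr=6.25e-5, warmup_frac=0.002):
    """Linear warmup for the first warmup_frac of training, then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps       # ramp up to max_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * max(0.0, 1.0 - progress)            # decay to zero

# Example: the learning rate roughly one third of the way through training.
print(finetune_lr(step=3_333, total_steps=10_000))
```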

5. Results and Inferences

  • Results from the supervised fine-tuning and comparisons of the proposed model in different scenarios are described below.

Fine-tuning was performed on the following four types of supervised tasks: natural language inference, semantic similarity, question answering, and text classification.

5.1 Natural Language Inference

  • In this task, the relationship between a pair of input texts is judged as entailment, contradiction, or neutral.
  • The model is evaluated on five datasets from diverse sources: image captions (SNLI), science exams (SciTail), Wikipedia articles (QNLI), transcribed speech, popular fiction, and government reports (MNLI), and news articles (RTE).
  • The proposed model outperforms the previous state-of-the-art methods on four out of the five given datasets. An improvement of 1.5% on MNLI, 5.8% on QNLI, 5% on SciTail, and 0.6% on SNLI can be witnessed.
Figure 10: Experimental results for natural language inference task in comparison with current state-of-the-art models. (Source: [1])

5.2 Question Answering and Commonsense Reasoning

  • In question answering, the answer is predicted on the basis of the provided document text and the question asked.
  • The RACE dataset, which comprises English passages with associated questions from school exams, is used for the evaluation. This dataset is particularly beneficial because it contains a large number of reasoning-type questions, as opposed to datasets like CNN or SQuAD.
  • The model is also evaluated on the Story Cloze Test, wherein endings to multi-sentence stories are chosen from two options.
  • The model’s ability to handle long-range context effectively is reflected in improvements of 8.9% on Story Cloze and 5.7% overall on RACE over the previous best methods.
Figure 11: Experimental results for question answering and commonsense reasoning in comparison with current state-of-the-art models. (Source: [1])

5.3 Semantic Similarity and Classification

  • Semantic similarity is a challenging task that involves checking whether two sentences are semantically equivalent.
  • Three datasets are used for the evaluation of this task — the Microsoft Paraphrase corpus (MRPC), the Quora Question Pairs (QQP) dataset, and the Semantic Textual Similarity benchmark (STS-B).
  • New results surpass the previous state-of-the-art results on two of the three semantic similarity tasks.
Figure 12: Experimental results for semantic similarity and classification in comparison to current state-of-the-art methods. (Source: [1])
  • For classification, the Corpus of Linguistic Acceptability (CoLA) is used as an evaluation task; it tests the grammatical correctness of a sentence.
  • Stanford Sentiment Treebank (SST-2), a standard binary classification task, is also used for evaluation.
  • On CoLA, the model obtains a score of 45.4 whereas with SST-2 the accuracy obtained is 93.2%. [See figure 12]
  • The model significantly improves on the GLUE benchmark with an overall score of 72.8. [See figure 12]

5.4 Ablation Studies

  • First, the effectiveness of the auxiliary LM objective during fine-tuning is tested. Results indicate that it benefits larger datasets but not smaller ones.
  • Second, the effect of using a transformer is measured by swapping it for a single-layer 2048-unit LSTM, which causes an average score drop of 5.6.
  • Finally, the effectiveness of performing pre-training is evaluated. Without pre-training, the performance decreases by 14.8% across all the tasks.
Figure 13: Results of ablation studies (Source: [1])

6. Conclusion

  • A framework that improves natural language understanding with the help of unsupervised pre-training and supervised fine-tuning is proposed.
  • The proposed task-agnostic model generalizes well to a number of tasks such as natural language inference, semantic similarity, question answering, and text classification.
  • The significant improvement in performance on 9 out of the 12 datasets shows the potential of this model, and of unsupervised pre-training, for improving natural language understanding on future tasks.

7. References

[1] Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” OpenAI (2018).

[2] Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems (2017).
