Autoregressive Models for Natural Language Processing

Zain ul Abideen
7 min read · Jun 26, 2023


The Evolution of GPT: From GPT to GPT-2 to GPT-3

Introduction

In this blog post, I will be discussing autoregressive models, specifically the ones developed by OpenAI. First I will cover the basics of autoregressive modeling that are common to all GPT models, and then move on to the advancements in each successive model. In the previous blog post, Attention Is All You Need: The Core Idea of the Transformer, I discussed the self-attention mechanism and the Transformer architecture. This blog builds on that material, so if you haven't checked out the previous post, go check it out. The original Transformer architecture uses a stack of 6 encoders and 6 decoders, as described in the paper. All the GPT models, in contrast, are decoder-only: multiple decoder blocks are stacked on top of one another, with a linear layer at the end. First of all, let me explain what autoregressive models are:

Autoregressive Models

Autoregressive models are a type of statistical or machine learning model that predicts the next value in a sequence based on the previous values in that sequence. These models assume that the future values in the sequence are dependent on the past values and use this dependency to make predictions. In the context of natural language processing, autoregressive models are often applied to generate text or make predictions based on previous words in a sentence. These models learn the statistical patterns and dependencies in the training data and then use that knowledge to generate coherent and contextually relevant text. Autoregressive language models, such as GPT (Generative Pre-trained Transformer), GPT-2, and GPT-3, have gained significant attention for their ability to generate high-quality text and perform a variety of language-related tasks.
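To make the idea concrete, here is a minimal toy sketch of autoregressive generation. The hard-coded bigram table and greedy decoding are purely illustrative and have nothing to do with a real GPT; the point is only that each new token is chosen conditioned on the tokens produced so far.

```python
# Toy illustration of the autoregressive idea: each new token is drawn
# from a distribution conditioned on the previously generated tokens.
# The "model" is a hard-coded bigram table, not a trained network.

toy_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

def generate(prompt, max_new_tokens=3):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        context = tokens[-1]                 # condition on the previous token
        next_dist = toy_model.get(context)   # P(next | previous)
        if next_dist is None:
            break
        next_token = max(next_dist, key=next_dist.get)  # greedy: most likely token
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("the"))  # "the cat sat down"
```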

Semi-supervised Learning

All of OpenAI's autoregressive models use a semi-supervised learning approach, which is a mixture of supervised and unsupervised learning. Building labeled datasets for language tasks is expensive because it requires professional annotators, so OpenAI came up with an approach of unsupervised pre-training followed by supervised fine-tuning. The training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where the model is adapted to a discriminative task with labeled data.

Unsupervised pre-training:

In unsupervised pre-training, we have an unlabeled corpus of text, and the objective is to maximize the log-likelihood of the next token given the previous tokens. This is simply conditional probability, and the pre-training is unidirectional: the model conditions only on what came before.

Objective for pre-training
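In the notation of the original GPT paper, with an unlabeled corpus U = {u_1, …, u_n}, context window k, and model parameters Θ, this objective is:

```latex
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```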

Here k is the context window: in simple words, the model can look back at k tokens while predicting the (k+1)-th token. During pre-training, a multi-layer Transformer decoder is used. A multi-headed self-attention operation is applied over the input tokens, followed by a position-wise feed-forward network. The output is a distribution over the target tokens.
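In the paper's notation, with token embedding matrix W_e and position embedding matrix W_p, the decoder stack can be summarized as:

```latex
h_0 = U W_e + W_p, \qquad
h_l = \mathrm{transformer\_block}(h_{l-1}) \;\; \text{for } l = 1, \ldots, n, \qquad
P(u) = \mathrm{softmax}\!\left(h_n W_e^{\top}\right)
```

where U is the matrix of context tokens and n is the number of decoder layers.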

Supervised fine-tuning:

In supervised fine-tuning, we have a labeled dataset with inputs x and labels y. The inputs are passed through the pre-trained model, and the output from the final transformer block is fed into an added linear output layer with parameters Wy to predict y. This gives us the following objective to maximize:

Intermediate objective
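Writing h_l^m for the final transformer block's activation at the last input token and W_y for the added output layer, the prediction and the fine-tuning objective over the labeled dataset C are:

```latex
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}\!\left(h_l^m W_y\right), \qquad
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
```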

It was found that using the pre-training objective as an auxiliary objective during fine-tuning improves generalization and accelerates convergence, so L1 was made part of the final objective with a weight λ.

Final objective for fine-tuning
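With the weighting factor λ on the auxiliary language modeling term, the combined objective becomes:

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```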

Now, I will be explaining the models developed by OpenAI based on semi-supervised learning.

Generative Pre-Trained Transformer (GPT)

The GPT model consists of a 12-layer decoder-only transformer with masked self-attention heads (768-dimensional states and 12 attention heads). For the position-wise feed-forward networks, 3072-dimensional inner states were used. The Adam optimizer, byte-pair encoding, and the GELU activation function were used. Pre-training was done on the large BookCorpus dataset (about 7,000 books). The model has roughly 117 million parameters in total.
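Putting those numbers in one place, here is a rough configuration sketch of GPT-1. The dataclass and field names are my own shorthand, not from any actual codebase; only the values come from the description above.

```python
# Rough, illustrative summary of the GPT-1 configuration described above.
from dataclasses import dataclass

@dataclass
class GPT1Config:
    n_layers: int = 12          # decoder-only transformer blocks
    d_model: int = 768          # hidden state dimension
    n_heads: int = 12           # attention heads per block
    d_ff: int = 3072            # inner dimension of the position-wise feed-forward network
    context_len: int = 512      # context window in tokens
    activation: str = "gelu"    # GELU activation
    tokenizer: str = "byte-pair encoding"
    optimizer: str = "adam"
    pretraining_data: str = "BookCorpus (~7,000 books)"

print(GPT1Config())
```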

GPT

The left part of the above figure shows the GPT model architecture; we can see the two training objectives. On the right-hand side are the different transformations applied to the input sequence for the various fine-tuning tasks.
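As a sketch of those input transformations: each task's inputs are flattened into a single token sequence with special start, delimiter, and extract tokens, so the same pre-trained decoder can process them. The token strings and helper names below are placeholders in the spirit of the paper's figure, not an actual implementation.

```python
# Illustrative input transformations for fine-tuning different tasks.
# <s>, <e>, and $ are placeholder special tokens.
START, EXTRACT, DELIM = "<s>", "<e>", "$"

def classification_input(text):
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def multiple_choice_inputs(context, answers):
    # One sequence per candidate answer; each is scored independently.
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]

print(entailment_input("A man is cooking.", "Someone is preparing food."))
```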

Analysis of various model ablations on different tasks

GPT-2

The GPT-2 model has a total of 1.5 billion parameters. It is a unidirectional model, i.e. trained to predict the next word in a sentence. The major changes in GPT-2 compared to GPT-1 are that GPT-2 is a much larger model (the larger the better) and that it is trained on a much larger unlabeled dataset (again, the larger the better). No fine-tuning is done in GPT-2. Instead, the authors introduced the concept of zero-shot transfer. Zero-shot transfer refers to the scenario where a pre-trained model is directly applied to a new task or domain without any additional training on task-specific data. The model leverages the knowledge learned during pre-training to make predictions or generate outputs for the new task. The key characteristic of zero-shot transfer is that the model hasn't seen any examples from the target task during training or fine-tuning; instead, it relies on its general understanding of language and the patterns learned from the pre-training data.
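A minimal sketch of what zero-shot transfer looks like in practice is below. The prompt wording and the `generate` call are illustrative assumptions, not taken from the GPT-2 paper or any real API; the point is that the task is specified entirely through the prompt.

```python
# Zero-shot transfer: the task is described purely in the prompt,
# with no task-specific training or demonstrations.

def zero_shot_translation_prompt(sentence):
    # The model must infer the task from the instruction alone.
    return f"Translate English to French:\n{sentence} ="

prompt = zero_shot_translation_prompt("The weather is nice today.")
print(prompt)
# completion = generate(model, prompt)  # hypothetical call to a pre-trained LM
```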

Zero-shot transfer performance of GPT-2

The architecture of GPT-2 is very similar to GPT-1. It is based on the original Transformer decoder, with only a small rearrangement of the layer norms and residual connections: layer normalization is moved to the input of each sub-block. The vocabulary size has been increased from about 40,000 to 50,257 tokens, and the context size from 512 to 1024 tokens. The model has been trained on WebText, a roughly 40GB corpus of text scraped from millions of web pages linked from Reddit posts, with more emphasis put on dataset quality.
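The layer-norm rearrangement can be sketched schematically as follows. These functions are simplified stand-ins (dropout, the final layer norm GPT-2 adds after the last block, and other details are omitted), not the real implementation.

```python
# Schematic decoder blocks: GPT-1 follows the original "post-norm" Transformer,
# while GPT-2 moves layer normalization to the input of each sub-block ("pre-norm").

def gpt1_block(x, attn, mlp, ln1, ln2):
    x = ln1(x + attn(x))   # post-norm: normalize after the residual addition
    x = ln2(x + mlp(x))
    return x

def gpt2_block(x, attn, mlp, ln1, ln2):
    x = x + attn(ln1(x))   # pre-norm: normalize the sub-block input
    x = x + mlp(ln2(x))
    return x
```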

GPT-3

The GPT-3 model has a total of 175 billion parameters. It is also a unidirectional model. The major change in GPT-3 compared to GPT-2 is that GPT-3 is a much larger model (the larger the better), trained on a much larger unlabeled dataset (again, the larger the better). The authors introduced the concept of few-shot learning at inference time. Few-shot here does not mean fine-tuning on a small labeled dataset: instead, a few demonstrations of the task (typically input-output pairs) are placed directly in the prompt as conditioning, and the model produces an answer for a new query without any gradient updates. The architecture of GPT-3 is very similar to GPT-2, except that it uses alternating dense and locally banded sparse attention patterns, as in the Sparse Transformer. The context size has been increased from 1024 to 2048 tokens, and a larger embedding dimension is used (12,288 instead of 1,600).
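Here is a minimal sketch of few-shot prompting in the GPT-3 sense. The prompt format, the `few_shot_prompt` helper, and the commented `generate` call are illustrative assumptions; what matters is that the demonstrations live in the prompt and the model's weights are never updated.

```python
# Few-shot prompting: the "training examples" are just demonstrations
# placed in the prompt; no fine-tuning or gradient updates are performed.

def few_shot_prompt(task_description, demonstrations, query):
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("cheese", "fromage"), ("good morning", "bonjour")],
    "thank you",
)
print(prompt)
# completion = generate(model, prompt)  # hypothetical call to the language model
```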

In-context learning

During unsupervised pre-training (the outer loop), it was observed that the model implicitly performs in-context learning from the rich information in the text. The model develops a broad set of skills and pattern-recognition abilities, and then uses them at inference time to rapidly adapt to or recognize the desired task. In-context learning is the term used for the inner loop of this process, which occurs within the forward pass on each sequence. It has also been observed that larger models make increasingly efficient use of in-context information.

Datasets used to train GPT-3

Closing Remarks

In conclusion, autoregressive models such as GPT, GPT-2, and GPT-3 have revolutionized the field of natural language processing and demonstrated the power of large-scale language models. These models have showcased remarkable capabilities in generating coherent and contextually relevant text, pushing the boundaries of what is possible in language generation tasks. However, it is crucial to be mindful of the ethical considerations and potential biases associated with autoregressive models. In the next blog post, I will be covering other large language models like BERT, BART, and T5 in detail.

Thank you for reading!

Follow me on LinkedIn!
