The motivation for this article is to share a few amazing feats achieved by the Generative Pre-Training 3 (GPT-3) NLP model and to trace its journey.
Can Machines Build Code?
For many years, people have imagined a no-code approach to building applications. Finally, that dream is closer than ever. Debuild.co has built exactly that: an ML-powered development tool that takes in plain English and outputs runnable code that powers web apps. Unbelievable as this claim may sound, it becomes more convincing when you look under the hood of the machine learning model that makes this code auto-generation possible.
The recently released GPT-3 model from OpenAI is the magician who makes this magic possible. We will look into how this model is built and how its meta-model approach differs from those employed by other NLP models. If you are not up to speed on previous NLP models, my write-up “Attention (Plus) Is All You Need” is a good short pre-read to catch up.
What Else GPT-3 Can Do?
Since its beta release in June 2020, entrepreneurs & hobbyists have produced a plethora of new applications. A few interesting implementations are illustrated below.
Generates Content Indistinguishable From a Human’s
Uses Context For Chatbot Conversation
For more examples visit GPT3 example apps
If you are as excited as I am, you will want to find out what makes GPT-3 so powerful. Below, I discuss how GPT evolved into a much more powerful model, GPT-3.
Note: This write-up is a continuation of my previous blog, so if you have not read that yet, click here.
Yet Another Attention-Plus Model
The GPT family of models uses the Transformer design, as the BERT & XLNet models do. But unlike BERT, which uses the encoder portion of the Transformer, GPT models use the decoder portion. Before we go any further into the design and inner workings, let’s try to understand the authors’ (Radford et al., 2018) motivation for building these models.
Note: Credit is given where due through hyperlinks to the source material. Specifically, most of the content draws on the three papers published by OpenAI on GPT.
Generative Pre-Training (GPT)
Machine learning on NLP tasks typically involved gathering a huge volume of data, labeling that dataset for one or more learning tasks (language translation, question answering, text entailment, etc.), and applying a high-capacity model to that data distribution. This trained, supervised model was then made available to the public, who would further fine-tune it on their own datasets for one or more of its original pre-trained tasks.
The authors (Radford et al., 2018) characterized such models as narrow experts rather than competent generalists, since they only excel at the tasks they are trained for and cannot generalize to a new or different task, or to a change in data distribution. These models also have a huge appetite for data and need it labeled, which limits how much data can be fed into them; labeled data is also scarce for many NLP tasks.
Motivation — How do you build a model that needs less labeled data and is task agnostic?
The authors (Radford et al., 2018) propose a semi-supervised model consisting of two stages. The first stage is unsupervised pre-training of a high-capacity language model on a large corpus of raw text. This is followed by a fine-tuning stage, where the model is trained on specific NLP tasks with small labeled datasets.
In the first stage, the model is trained on a large corpus of unlabeled text to predict the next word in a sentence. From previous model designs we know that the bigger the text corpus and the longer the attention span (the further out we have context for a word), the better the prediction of the next word.
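To make the next-word objective concrete, here is a toy sketch in plain Python. It is not GPT (GPT uses a neural network over subword tokens); it is a simple bigram counter, which is enough to show the idea of learning to predict the next word from raw, unlabeled text:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count word -> next-word transitions from raw, unlabeled text."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent next word seen after `word` in training."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = "the model reads text and the model predicts the next word"
lm = train_bigram_lm(corpus)
print(predict_next(lm, "the"))  # -> "model" ("model" follows "the" twice)
```

GPT replaces the raw counts with a deep Transformer and conditions on a much longer context, but the training signal is the same: the next word in the sequence.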
The intuition for the first stage, therefore, is that the model is learning the language, and as it develops a better understanding it is able to learn discriminative features, which become useful in the subsequent fine-tuning step.
In the first stage, the model develops skills and pattern recognition abilities from raw unlabeled text.
This intuition is akin to how humans read and understand text: they, too, pick up discriminative features, such as answers to questions that could be posed about the text, the similarities or rationale used in the text, the sentiment of the author, and so on.
In the second stage, the model is fine-tuned using small labeled datasets on specific discriminative tasks. These tasks can include sentiment analysis, question answering, classification, similarity, etc.
The intuition for the second stage is that the model is able to take the learnings from the previous unsupervised step, expand on them, and apply them to a specific discriminative task.
In the second stage, the model further develops those skills and applies them to the discriminative task.
Let’s look at a few key takeaways.
Design — Unsupervised Learning (Stage 1): The model largely followed the design of the Transformer (Vaswani et al., 2017), using only a 12-layer decoder with masked self-attention heads. The model was trained for 100 epochs on mini-batches of 64 randomly sampled, contiguous sequences of 512 tokens.
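The “masked self-attention” in the decoder is what makes the model generative: position i may only attend to positions up to i, so the model cannot peek at the word it is trying to predict. A minimal NumPy sketch of this causal masking (dimensions are illustrative, and real GPT adds learned projections, multiple heads, and many layers):

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal (look-back-only) mask,
    the core of the GPT decoder block."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # (seq, seq) similarities
    mask = np.triu(np.ones_like(scores), 1).astype(bool)
    scores[mask] = -np.inf                            # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
out = causal_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Note that the first output position can only attend to itself, so it simply passes its value vector through — the mask guarantees no information flows backwards from the future.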
Supervised Fine-tuning on Discriminative Tasks (Stage 2): The model was trained on textual entailment (entailment, contradiction or neutral), question answering, semantic similarity and text classification tasks. Hyper-parameter settings from the unsupervised step were largely used as-is, and 3 epochs of training were found to be sufficient for most cases.
The model was able to pre-train in the unsupervised step and transfer the learnings to specific supervised discriminative tasks.
Discriminative Tasks — Previous models typically used task-specific architectures (fine-tuned models) on top of generic models/learned representations. This introduced a lot of task-specific customization and additional architecture components. Instead, in this model the data for different tasks were converted into an ordered sequence using delimiter, start, and extract tokens (fitting its 512 contiguous input tokens) to avoid task-specific customization for fine-tuning.
The fine-tuning stage took data in a specific ordered format to avoid task-specific customization of the architecture.
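A sketch of this input transformation, assuming illustrative token strings (the paper uses randomly initialized start, delimiter, and extract tokens rather than these literal strings):

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"  # illustrative token names

def format_entailment(premise, hypothesis):
    # Entailment: premise and hypothesis joined by the delimiter token.
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def format_similarity(text_a, text_b):
    # Similarity has no inherent ordering, so the paper processes both
    # orderings and combines their representations downstream.
    return [f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
            f"{START} {text_b} {DELIM} {text_a} {EXTRACT}"]

print(format_entailment("A man is sleeping.", "A person is asleep."))
```

Every task thus becomes a single token sequence fed through the same pre-trained decoder, with only a small linear head added on top of the extract token’s final representation.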
Layers Transferred — The authors (Radford et al., 2018) analyzed the impact of transferring a variable number of layers from the unsupervised pre-training stage to the supervised tasks. They found that each transferred layer improved performance, by up to 9% on target tasks, indicating that each layer in the pre-trained model contains useful functionality for solving target tasks.
Each layer in the unsupervised pre-trained model contains useful functionality for solving target tasks
Zero-Shot Learning — The authors (Radford et al., 2018) performed a series of tests using the generative model (unsupervised learning stage) without the supervised fine-tuning (second stage) step on a variety of discriminative tasks. They found that performance is stable and steadily increases with training, suggesting that the generative pre-training stage learns a wide range of task-relevant functionality.
The generative pre-training stage learns a wide range of task-relevant functionality and can possibly be employed in a few-shot or zero-shot learning setting.
GPT improved state-of-the-art performance on 9 out of the 12 datasets used in the study.
Unsupervised Multitask Learners (GPT-2)
Task: Can the model be made more task agnostic?
Data: Can a larger, more diverse dataset increase pre-trained learning?
Layers: Can more layers and a larger model size help?
Zero Shot: Can the unsupervised language model support zero-shot learning (perform new tasks with no prior fine-tuning) for different discriminative tasks?
Motivation — Increase general methods of transfer learning from an unsupervised language model to different downstream tasks in zero-shot settings.
Task — In order to generalize the supervised tasks in GPT-1, the input data was arranged in a sequence (512 contiguous input tokens) specific to a task using delimiters, and then fed into the pre-trained model. The training objective of these supervised tasks can be expressed as p(output|input).
If a generic system has to adapt to different tasks (even for the same input), then it has to be conditioned on the task along with the input, and such learning can be expressed as p(output|input, task). This meta-model learning (learning for task adaptation) objective was adopted by the authors (Radford et al., 2019) in GPT-2 to make the system generic and able to adapt to new and different tasks.
By training on the task along with the inputs, the GPT-2 model became more generic and task agnostic.
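In practice, conditioning on the task needs no architectural change at all: since the model consumes a single text sequence, the task specification can simply be part of that sequence. A sketch (the task phrasings are illustrative, not a fixed GPT-2 vocabulary):

```python
def condition_on_task(task, text):
    """Express p(output | input, task) as one text sequence:
    the task description is simply prepended to the input."""
    return f"{task}: {text}"

# The same generic model sees different tasks as different prefixes.
examples = [
    condition_on_task("translate english to french", "the cat sat on the mat"),
    condition_on_task("summarize", "GPT-2 was trained on 40 GB of WebText."),
    condition_on_task("answer the question", "Who wrote the GPT-2 paper?"),
]
for prompt in examples:
    print(prompt)
```

Because WebText naturally contains such task-like patterns (translations, Q&A, summaries embedded in ordinary web pages), the language model picks up the association between task phrasing and expected output during unsupervised training.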
Data — Most prior work trained language models on a single domain of text such as news articles, Wikipedia or fiction books. The authors (Radford et al., 2019) decided to use as diverse a dataset as possible in order to capture as many domains and contexts as possible. They scraped all outbound links from Reddit posts that had received high user approval (karma), and built a dataset out of those contents. This dataset, WebText, contained 40 GB of high-quality text.
By training on data with diverse domains and contexts, the GPT-2 model transferred learning well across different domains and datasets.
Layers — The authors (Radford et al., 2019) experimented with several model sizes in their study. The smallest was GPT-1 sized, with 12 layers, whereas the largest (the GPT-2 model) had 4 times more layers than GPT-1 and 1542M parameters. Even at such a massive size, GPT-2 was found to under-fit the WebText data, suggesting that more layers and parameters would likely improve the model’s performance further.
With the increased capacity, GPT-2 model performed better on many downstream tasks on diverse domain datasets.
Zero Shot — By increasing the capacity of the model, making the input data task agnostic, and using a dataset (WebText) rich in quality and diverse in domains, the authors (Radford et al., 2019) observed that the GPT-2 model learned to infer many tasks from the language sequences themselves, and without any fine-tuning was able to perform at state-of-the-art levels on downstream tasks.
Unsupervised language sequence models with sufficient capacity that are trained on rich data can learn to infer and perform tasks demonstrated in language sequences with little or no prior training.
GPT-2’s zero-shot performance matched the state of the art on 7 out of the 8 language model datasets used in the study.
Few Shot Learners (GPT-3)
Meta-Learning: Can the model recognize patterns in the unsupervised stage and use those abilities at the inference stage, without fine-tuning, given only a few or no task samples?
Data: Can pre-trained learning be further improved by adding quality data?
Under-Fitting: Can the model perform better with more capacity?
Motivation — Rapidly adapt to or recognize the desired task in the inference stage with few or no data samples for the task.
Meta-Learning — The authors (Brown et al., 2020) built a 175-billion-parameter model, GPT-3, more than 10 times larger than any previous non-sparse language model, and tested its meta-learning (in-context learning) abilities. They evaluated GPT-3 in three settings (zero-shot, one-shot and few-shot) on more than a dozen NLP tasks, as well as on several novel tasks designed to test rapid adaptation.
As the number of examples and model size increased, model accuracy increased as well.
Broadly, on NLP tasks GPT-3 showed promising results in the zero-shot and one-shot settings, and in the few-shot setting it sometimes even surpassed state-of-the-art results. Also, the accuracy gap between the zero-, one-, and few-shot settings widens as model size increases, suggesting that larger models are better meta-learners.
Larger models are more proficient meta-learners
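The three settings differ only in how many demonstrations are placed in the prompt; the model’s weights are never updated. A sketch of how such prompts are assembled (the `=>` separator and task wording are illustrative choices, not a GPT-3 requirement):

```python
def build_prompt(task_description, examples, query):
    """Build an in-context prompt: zero-shot when `examples` is empty,
    one-shot with a single example, few-shot with several.
    No gradient updates happen; 'learning' occurs purely at inference time."""
    lines = [task_description]
    for src, tgt in examples:            # demonstrations, if any
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")          # the model completes this line
    return "\n".join(lines)

few_shot = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(few_shot)
```

A zero-shot prompt for the same task would be just the task description and the query; GPT-3’s accuracy typically climbs as demonstrations are added, and climbs fastest for the largest models.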
Data — In an attempt to further improve the quality and quantity of the training data, the authors (Brown et al., 2020) compiled a huge corpus from five different datasets, and within those, sampled the higher-quality datasets more frequently and for more epochs.
With more high-quality data and increased capacity, models can recognize patterns better in the unsupervised stage.
Under-Fitting — To study the dependence of ML performance on model size, (Brown et al., 2020) trained 8 different model sizes, ranging from 125 million to 175 billion parameters.
The accuracy of the model in the few-shot, one-shot, and zero-shot settings increases with the scale of the model, as shown below.
Meta-learning shows strong gains with scale
I can imagine several use-cases for which I could employ GPT-3, but unfortunately, unlike other models, GPT-3 is not open-sourced and available to the public.
How can you build your applications using GPT-3?
GPT-3 is available through an HTTPS API call. The GPT-3 API is in beta, and OpenAI approval is needed to get access. Unfortunately, approval comes with a waitlist, and I have been on that list for some time now.
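Once you have access, a call is an ordinary authenticated HTTPS POST. A minimal sketch using only the standard library, assuming the beta-era completions endpoint and field names (check the current API documentation before relying on them):

```python
import json
import urllib.request

# Beta-era endpoint; verify against the latest OpenAI API docs.
API_URL = "https://api.openai.com/v1/engines/davinci/completions"

def build_completion_request(api_key, prompt, max_tokens=64):
    """Assemble (but do not send) an HTTPS request for the GPT-3 beta API."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = build_completion_request("YOUR_API_KEY", "Q: What is GPT-3?\nA:")
# urllib.request.urlopen(req)  # send only once you have beta access
```

The response is a JSON object containing the model’s completion of the prompt; the exact shape is described in the API documentation you receive with access.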
To get on the waitlist — click here.
As I bid adieu to the year 2020 and reflect on it, I think there were at least two profound ML breakthroughs, protein folding by DeepMind (AlphaFold) and GPT-3 by OpenAI, that showed us what ML is capable of and how much more it can do in the years to come.