GPT-3: Whats, Hows & The Takeaways

Venkata Dikshit · Published in Analytics Vidhya · Jul 28, 2020 · 7 min read
GPT-3 is Massive!!

When I first heard about GPT-3, my first impression was that it must be GPT-2 + more compute + more data. That isn't an unreasonable expectation, given that GPT-2 itself is GPT + more compute + more data + a few smart hacks. It turns out this is largely true of GPT-3 as well. But that doesn't undermine the feat GPT-3 and its predecessors have achieved over the past three years: blindly feeding more into language models stops being beneficial beyond a certain point and attracts a host of engineering, data and compliance challenges to deal with.

This post discusses the following:

  • How GPT-3 is a commendable attempt at achieving generality on text-based tasks, and a hopeful contribution towards AGI, using novel learning paradigms: Transformer architectures and in-context learning (meta-learning)
  • Best practices and cues from the literature that come in handy for ML practitioners working on language modelling for practical applications

Thoughts on GPT-3

Looking beyond the GPT nomenclature and into the titles of the corresponding white papers, we get a sense of how the GPT models have progressed over time and what each contributes compared to its predecessors.

Surprisingly (or not?), the core of the learning component has remained the same across all iterations: a unidirectional language-modelling objective, despite proven recent enhancements to it (for example, BERT's bidirectional masked-language-modelling objective).

Learning the joint probability of a sequence of symbols as a product of conditional probabilities
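In symbols, this is the left-to-right factorization used throughout the GPT papers: the probability of a text is the product of per-token conditional probabilities, each conditioned only on the tokens to its left.

```latex
p(x) = p(s_1, s_2, \ldots, s_n) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
```

Training maximizes the log of this product over the corpus; the strictly left-to-right conditioning is what makes the objective unidirectional.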

Quoting the architectural changes for the sake of completeness:

GPT to GPT-2

The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016), and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.
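A minimal PyTorch sketch of the two quoted changes, pre-activation layer normalization and the 1/√N scaling of residual-path weights at initialization. This is not OpenAI's code; the module layout and the choice of which weights count as "residual layers" are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN transformer block (GPT-2 style): LayerNorm is applied to the
    input of each sub-block rather than after the residual addition."""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):   # pass a causal mask for LM training
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out                     # residual around attention
        x = x + self.mlp(self.ln_2(x))       # residual around MLP
        return x

def scale_residual_weights(model: nn.Module, n_residual_layers: int) -> None:
    """Scale residual-path weights by 1/sqrt(N) at initialization.
    Here we treat the output projections of the attention and MLP branches
    as the residual layers; other implementations make different choices."""
    scale = 1.0 / math.sqrt(n_residual_layers)
    for name, p in model.named_parameters():
        if name.endswith("attn.out_proj.weight") or name.endswith("mlp.2.weight"):
            with torch.no_grad():
                p.mul_(scale)
```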

GPT-2 to GPT-3

Our basic pre-training approach, including model, data, and training, is similar to the process described in GPT-2, with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to GPT-2, but in this work we systematically explore different settings for learning within the context. We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
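The phrase "alternating dense and locally banded sparse attention" is easiest to picture as attention masks. The sketch below builds a dense causal mask and a banded causal mask and alternates them across layers; the band width and the even/odd alternation are made-up illustration values, not GPT-3's actual configuration.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Dense causal mask: each position attends to every earlier position."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def banded_causal_mask(seq_len: int, band: int) -> torch.Tensor:
    """Locally banded causal mask: each position attends only to the
    previous `band` positions (a sliding window), still strictly causal."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - band)

# Illustrative alternation: even layers dense, odd layers locally banded.
n_layers, seq_len, band = 4, 8, 3   # toy sizes; GPT-3 uses a 2048-token context
masks = [causal_mask(seq_len) if layer % 2 == 0 else banded_causal_mask(seq_len, band)
         for layer in range(n_layers)]
print(masks[0].int())   # lower-triangular (dense)
print(masks[1].int())   # narrow band along the diagonal
```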

From the usability standpoint:

  • GPT-2 focused on the zero-shot setting, where the language model's ability to perform a task is tested directly after optimizing for the stated language-modelling objective, with no task-specific examples
  • GPT-3 focuses on the more practical and scalable one-shot and few-shot settings alongside zero-shot (see the prompt sketch below). This has let GPT-3 match or surpass SOTA results on several benchmarks that previously required task-specific fine-tuning
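To make the three settings concrete, here is a loose reconstruction, in the spirit of the paper's illustrations, of prompts for English-to-French translation. The model sees only this text at inference time and produces the continuation; no gradient updates happen.

```python
# Zero-shot: a natural-language task description, no demonstrations
zero_shot = "Translate English to French:\ncheese =>"

# One-shot: the task description plus a single demonstration
one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

# Few-shot: the task description plus K demonstrations (K is typically 10 to 100)
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
```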

More on Meta-learning, zero-shot, few-shot learning in a separate post.

GPT-3 and Generalization

At a high level, task-specific learners are optimized for p(output|input), and generalization is achieved by optimizing for p(output|input, task). Recent architectures achieve this by specifying the input, the output and the task specification all as one sequence of symbols. For example, a translation request can be formulated as (translate to French, English text, French text) and a reading comprehension task as (answer the question, document, question, answer). This is the essence of in-context learning.

Figure: aggregated performance of GPT-3 on 42 labelled tasks, demonstrating its in-context learning capabilities

GPT-3 builds on recent evidence that sufficiently large transformer architectures improve text synthesis and downstream performance across multiple tasks, and that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale. The authors hypothesized, and then demonstrated, that since in-context learning involves absorbing many skills and tasks within the parameters of the model, in-context learning abilities might show similarly strong gains with scale.

Figure: zero-shot, one-shot and few-shot illustrations showing the inference-time behaviour of GPT-3

  • During inference, the model expects a set of examples (context, sample inputs and outputs) to bootstrap it into the current problem setting
  • GPT-3 uses these examples to infer on the new query input without back-propagating any gradients in the process
  • The inference-time behaviour of the model depends on the problem type
  • For example, as the paper describes for multiple-choice tasks: "On tasks that involve choosing one correct completion from several options, we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, by computing P(completion|context)/P(completion|answer context), where answer context is the string 'Answer: ' or 'A: ' and is used to prompt that the completion should be an answer but is otherwise generic."
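A hedged sketch of that scoring recipe, using the open GPT-2 weights as a stand-in since GPT-3 itself is not publicly downloadable: build a few-shot context, score each candidate completion by its per-token log-likelihood under the model, and pick the highest. The prompt, the options and the tokenization-boundary handling are simplifications for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in for GPT-3
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def per_token_logprob(context: str, completion: str) -> float:
    """Average log-probability per completion token, given the context.
    Assumes the context tokenizes the same with and without the completion."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                           # (1, seq, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)     # predicts next token
    targets = ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, ctx_len - 1:].mean().item()           # completion tokens only

# K in-context examples followed by the query, then score each option
context = (
    "Q: What colour is the sky on a clear day?\nA: blue\n"
    "Q: Which planet is known as the Red Planet?\nA:"
)
options = [" Mars", " Venus", " Jupiter"]
print(max(options, key=lambda o: per_token_logprob(context, o)))
```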

Takeaways for ML Practitioners

For Potential GPT-3 Users

While it's not feasible to list every case where GPT-3 can or cannot be used, the following pointers are worth considering while making the call. These suggestions assume that we are not going to get access to GPT-3's feature vectors for input text anytime soon, unless the whole GPT-3 model is open-sourced.

  • If your use case deals mostly with free text available on the internet, and/or the datasets used for GPT-3 training are related to your domain, and the task is similar to those where GPT-3 shows strong results, give it a try
  • If your use case requires open plus proprietary data but is otherwise aligned with the GPT-3 setup, GPT-3's usefulness may be limited or case-specific
  • For use cases driven entirely by proprietary data, out-of-the-box (OOTB) GPT-3 might not be of much practical help

For ML Engineers and hackers

The key takeaway for me is that the core of GPT-3 is still the classical language-modelling objective, and yet it beats the SOTA charts with little or no supervision, which is really encouraging.

  • When you have enough data, fine-tuning GPT-like architectures with a language-model objective can achieve phenomenal results on custom datasets (a minimal fine-tuning sketch follows this list)
  • LM adaptation needs careful babysitting of the fine-tuning process and identification of unsupervised auxiliary tasks that boost the fine-tuned model's performance
  • It helps enormously to establish the correlation between log loss and the target use cases. As observed in GPT-3 and other recent papers, log loss is correlated with most downstream text tasks, which should cover a majority of the problems we address today
  • For use cases where the data is not easily accessible for human inspection, blind fine-tuning of the language model while observing the log loss and performance on auxiliary tasks is a viable sniff test, and a way to leverage the data rather than settling for OOTB language models on restricted data. For help picking auxiliary tasks and reference papers, check the attached posts (a bit dated, but a good starting point)
  • Building a reliable feature extractor is key to most text tasks where labelled data is sparse. GPT-3 and its predecessors outline how this can be achieved: refer to the GPT paper for the approach and to GPT-2/GPT-3 for useful hacks and improvements
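As a starting point for the first bullet, here is a minimal causal-LM fine-tuning sketch using the Hugging Face transformers and datasets libraries. The model choice (GPT-2), the file name corpus.txt and the hyperparameters are placeholders to adapt to your own data and compute budget.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"                          # placeholder; pick a size you can afford
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain-text corpus, one sample per line (hypothetical file name)
raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False keeps the plain left-to-right language-modelling objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="lm-finetune",                # placeholder output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    logging_steps=50,                        # babysit the loss curve
)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()
```

Monitoring validation log loss during such runs, alongside a couple of auxiliary tasks, is the sniff test described above.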

Conclusion

While GPT-3 comes with its own set of limitations and has a lot of scope for improvement, it's a phenomenal feat in terms of model size, training-data size and smart training strategies. While fine-tuning and handling models of this scale is not yet within the reach of most ML practitioners, the key takeaway is that GPT-3 and its predecessors reinforce the fact that the fundamental language-modelling paradigm is in itself sufficient for most downstream tasks where the availability of a labelled dataset is a challenge.

Happy to discuss if you have any use cases that could be addressed with Language Models and fine-tuning.

Happy Learning!!
