Understanding GPT-3: OpenAI’s Latest Language Model

Gaurav Shekhar
Published in The Startup · Sep 1, 2020


1. Introduction

If you have been following recent developments in the NLP space, it would have been almost impossible to avoid the GPT-3 hype over the last few months. It all started with OpenAI researchers publishing their paper “Language Models are Few-Shot Learners”, which introduced the GPT-3 family of models.

GPT-3’s size and language capabilities are breathtaking: it can create fiction, develop program code, compose thoughtful business memos, summarize text and much more. Its possible use cases are limited only by our imagination. What makes it fascinating is that the same model can perform a wide range of tasks. At the same time, there is widespread misunderstanding about the nature and risks of GPT-3’s abilities.

To better appreciate the powers and limitations of GPT-3, one needs some familiarity with the pre-trained NLP models which came before it. The table below compares some of the prominent pre-trained models:

Source: https://medium.com/dataseries/bert-distilbert-roberta-and-xlnet-simplified-explanation-and-implementation-5a9580242c70

Let’s look at some of the common characteristics of the pre-trained NLP models that came before GPT-3:

i) NLP pre-trained models are based on the Transformer architecture

Most of the pre-trained models belong to the Transformer family and use attention mechanisms. These models can be divided into four categories:

Types of Language Models

ii) Different models for different tasks

The focus has been on creating customized models for various NLP tasks, so we have a different pre-trained model for each NLP task like sentiment analysis, question answering, entity extraction etc.

iii) Fine-tuning of the pre-trained model for improved performance

For each task, the pre-trained model needs to be fine-tuned to adapt it to the data at hand. Fine-tuning involves gradient updates to the pre-trained model’s weights, and the updated weights are then stored for making predictions on the respective NLP task.
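To make this concrete, here is a minimal sketch of the fine-tuning pattern using the Hugging Face transformers library. The model name, the toy sentiment dataset and the hyper-parameters are placeholders chosen for illustration, not something prescribed by any particular paper.

```python
# A minimal sketch of task-specific fine-tuning (the pre-GPT-3 recipe):
# gradient updates are applied to the pre-trained weights, and the
# fine-tuned checkpoint is saved for that single task.
# Model name, data and hyper-parameters are illustrative placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. binary sentiment analysis
)

# A tiny labeled dataset stands in for the large custom labeled data
# that fine-tuning normally requires.
texts = ["the movie was wonderful", "a dull and lifeless film"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few passes over the (toy) training data
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()   # gradients w.r.t. the pre-trained weights
    optimizer.step()          # the weights are actually updated
    optimizer.zero_grad()

# The task-specific weights are stored for later inference on this task only.
model.save_pretrained("bert-finetuned-sentiment")
```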

iv) Dependency of fine-tuning on large datasets

Fine-tuning requires the availability of a large, custom labeled dataset. This has been a bottleneck when it comes to extending pre-trained models to new domains where labeled data is limited.

v) Focus was more on architectural improvements than on size

While we saw the emergence of many new pre-trained models in a short span, the larger focus was on bringing architectural improvements or training on different datasets to widen the net of NLP applications, rather than on sheer scale.

2. GPT-3: A Quick Overview

Key facts about GPT-3:

  • Models: GPT-3 has eight different models with sizes ranging from 125 million to 175 billion parameters.
  • Model Size: The largest GPT-3 model has 175 billion parameters. This is roughly 470 times bigger than the largest BERT model (375 million parameters).
  • Architecture: GPT-3 is an autoregressive model that follows a decoder-only Transformer architecture. It is trained with a next-word prediction objective.
  • Learning: GPT-3 learns through few-shot prompting; there are no gradient updates while it learns a new task.
  • Task Data Needed: GPT-3 needs very little task-specific data. It can learn from only a handful of examples, which enables its application in domains having little data.

Sizes, architectures, and learning hyper-parameters of the GPT-3 models

Source: https://arxiv.org/abs/2005.14165

Key design assumptions in the GPT-3 model:

(i) An increase in model size and training on larger data can lead to improvement in performance.

(ii) A single model can provide good performance on a host of NLP tasks.

(iii) The model can infer from new data without the need for fine-tuning.

(iv) The model can solve problems on datasets it has never been trained upon.

3. How GPT-3 learns

Traditionally, pre-trained models have learnt new tasks through fine-tuning. Fine-tuning needs a lot of data for the problem we are solving and also requires updates to the model weights. The existing fine-tuning approach is explained in the diagram below.

Learning Process for earlier Pre-Trained Language Models — Fine Tuning

GPT-3 adopts a different learning approach. There is no need for a large labeled dataset to make inferences on new problems. Instead, it can learn from no examples (zero-shot learning), just one example (one-shot learning) or a few examples (few-shot learning).

Below is a representation of the different learning approaches followed by GPT-3.
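To make the contrast with fine-tuning concrete, here is a minimal sketch of few-shot prompting. The sentiment task, the examples and the engine name are illustrative assumptions, and the call uses the Completion endpoint of the openai Python package as it existed around GPT-3’s release, so details may differ from the current API. Dropping both solved examples from the prompt would make it zero-shot; keeping just one would make it one-shot.

```python
# A minimal few-shot prompt: the task description and a handful of
# solved examples are placed directly in the prompt, and the model
# completes the last line. No gradient updates take place.
# The examples and engine name below are illustrative assumptions.
import openai  # pip install openai; requires an API key from OpenAI

openai.api_key = "YOUR_API_KEY"

prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: The plot was gripping from start to finish. Sentiment: Positive\n"
    "Review: I walked out halfway through. Sentiment: Negative\n"
    "Review: A heartfelt story with brilliant acting. Sentiment:"
)

response = openai.Completion.create(
    engine="davinci",   # the largest GPT-3 model exposed by the API at the time
    prompt=prompt,
    max_tokens=1,       # we only need the single label token
    temperature=0.0,    # deterministic choice of the most likely label
)

print(response.choices[0].text.strip())  # expected: "Positive"
```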

4. How is GPT-3 different from BERT

BERT was among the earliest pre-trained models and is credited with setting the benchmarks for most NLP tasks. Below we compare GPT-3 with BERT on three dimensions:

The things that stand out from the above representation are:

  • GPT-3’s size is the stand-out feature. It’s almost 470 times the size of the largest BERT model.
  • On the architecture dimension, BERT still holds an edge. It is trained on objectives that are better able to capture the latent relationships between pieces of text in different problem contexts.
  • GPT-3’s learning approach is relatively simple and can be applied to many problems where sufficient data does not exist. Thus GPT-3 should have wider application when compared to BERT (see the sketch after this list).
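To make the architectural contrast tangible, here is a small sketch using Hugging Face pipelines. GPT-2 stands in for GPT-3, whose weights are not publicly available, and the example sentences are my own.

```python
# BERT: bidirectional encoder trained to fill in masked words.
# GPT family: decoder-only model trained to predict the next word.
# GPT-2 is used here as a publicly available stand-in for GPT-3.
from transformers import pipeline

# Masked-word prediction (BERT-style objective)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("GPT-3 is a very [MASK] language model.")[0]["sequence"])

# Left-to-right next-word generation (GPT-style objective)
generate = pipeline("text-generation", model="gpt2")
print(generate("GPT-3 is a very large", max_length=15)[0]["generated_text"])
```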

5. Where GPT-3 has really been successful

Applications of NLP techniques have evolved with the progress made in learning better representations of the underlying text corpus. The chart below gives a quick overview of some of the traditional NLP application areas.

NLP Application Areas

Traditional NLP, built on the Bag of Words approach, was limited to tasks like parsing text, sentiment analysis, topic models etc. With the emergence of word vectors and neural language models, new applications like machine translation, entity recognition and information retrieval came into prominence.

In the last couple of years, the emergence of pre-trained models like BERT and RoBERTa, and supporting frameworks like Hugging Face and spaCy Transformers, has made NLP tasks like reading comprehension and text summarization possible, and state-of-the-art benchmarks were set by these models.

The frontiers where pre-trained NLP models struggled were tasks like natural language generation, natural language inference and common-sense reasoning. There was also a question mark over the application of NLP in areas where limited data is available. So the question is: how much impact is GPT-3 able to make on some of these tasks?

GPT-3 has been able to make substantive progress on two fronts: (i) text generation tasks and (ii) extending NLP’s application into domains where there is a lack of training data.

  • Text Generation Capabilities: GPT-3 is very powerful when it comes to generating text. Based on human surveys, it has been observed that very little separates text generated by GPT-3 from text written by humans. This is a great development for building solutions in the space of creative fiction, stories, resumes, narratives, chatbots, text summarization etc. At the same time, the world is taking cognizance of the fact that this power can be used by unscrupulous elements to create and plant fake content on social platforms.
Source: https://old.reddit.com/r/slatestarcodex/comments/hmu5lm/fiction_by_neil_gaiman_and_terry_pratchett_by_gpt3
  • Build NLP solutions with limited data: The other area where GPT-3 models have left a mark is domains where limited data is available. We have seen the open-source community use the GPT-3 API for tasks like generation of UNIX shell commands, SQL queries, machine-learning code etc. All that users need to provide is a task description in plain English and some examples of input/output (a sketch of this pattern follows the source link below). This has huge potential for organizations to automate routine tasks, speed up processes and focus their talent on higher-value work.
Source: https://vimeo.com/427943407/98fe5258a7
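As an illustration of the “plain-English description plus a few input/output examples” pattern, here is a hypothetical sketch of an English-to-SQL prompt. The table schema, the examples and the API parameters are assumptions made for this example, again using the 2020-era Completion endpoint.

```python
# Few-shot prompt for plain-English to SQL: a schema and a couple of
# solved examples are supplied, and the model completes the next query.
# Table/column names and examples are hypothetical.
import openai

openai.api_key = "YOUR_API_KEY"

prompt = (
    "Table: orders(id, customer, amount, order_date)\n"
    "English: total revenue in 2020\n"
    "SQL: SELECT SUM(amount) FROM orders WHERE order_date BETWEEN '2020-01-01' AND '2020-12-31';\n"
    "English: number of orders per customer\n"
    "SQL: SELECT customer, COUNT(*) FROM orders GROUP BY customer;\n"
    "English: the ten largest orders\n"
    "SQL:"
)

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=60,
    temperature=0.0,
    stop=["\n"],  # stop at the end of the generated query
)

print(response.choices[0].text.strip())
# e.g. SELECT * FROM orders ORDER BY amount DESC LIMIT 10;
```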

6. Where GPT-3 struggles

We have seen that GPT-3 is able to make substantive progress on text generation tasks and to extend NLP’s applications to domains that have limited data available. But how does it fare on traditional NLP tasks like machine translation, reading comprehension and natural language inference? It’s a mixed bag, and this is clearly documented in the original research paper.

  • Language Modeling: GPT-3 beat all the benchmarks on pure language modeling tasks.
  • Machine Translation: The model is able to beat benchmark performance on translation tasks that require converting documents into English. But the opposite is not true, and GPT-3 struggles when the translation needs to be done from English into a non-English language.
  • Reading Comprehension: The GPT-3 models’ performance falls well short of the state of the art here.
  • Natural Language Inference: Natural Language Inference (NLI) concerns the ability to understand the relationship between two sentences. The GPT-3 models fall well short on NLI tasks (a small illustrative example follows this list).
  • Common Sense Reasoning: Common-sense reasoning datasets test performance on physical or scientific reasoning skills. The GPT-3 models fall well short on these tasks as well.
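For readers unfamiliar with NLI, here is a tiny illustrative example of the task’s input format. The sentences and labels below are made up and follow the standard premise/hypothesis framing rather than any specific benchmark.

```python
# Natural Language Inference: given a premise and a hypothesis, decide
# whether the hypothesis is entailed, contradicted, or neither (neutral).
# The examples below are illustrative, not taken from any benchmark.
nli_examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A musician is performing.",
     "label": "entailment"},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The stage is empty.",
     "label": "contradiction"},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The concert is sold out.",
     "label": "neutral"},
]

for ex in nli_examples:
    print(f"{ex['premise']} -> {ex['hypothesis']}: {ex['label']}")
```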

7. Road Ahead

Integration Challenges: At the moment GPT-3 has been made available to selected users through OpenAI’s API, and the user community is happily building toy applications with it. Many firms, especially in the financial world, have regulations that prohibit the transfer of data outside the firm. Given GPT-3’s size, if the model needs to be integrated into mainstream applications, it will be a herculean effort to develop the necessary infrastructure to host the model and ingest the data.

Single versus Hybrid Model Debate: The dream of having a single model for all tasks, one that does not need to be re-trained and can learn without much data, is a cherished one. GPT-3 has taken the first steps towards achieving it, but there is still a journey to be made. In its present form, GPT-3 is a mixed bag, and organizations will have to take a horses-for-courses approach. One possibility is to use GPT-3 for tasks like text generation, machine translation and areas where limited data exists, while the existing customized pre-trained models (BERT/RoBERTa/Reformer…) continue to hold ground on traditional tasks like entity recognition, sentiment analysis and question answering.

Concerns over Model Bias and Explainability: Given the sheer size of GPT-3, it will be very difficult for firms to explain the decisions made by the algorithm. There is no way for firms to vet the data that was used to train it. How do we know whether the training data has built-in bias, or whether the algorithm is making its decisions based on false information that has been placed in the public domain? This is further complicated by the ambiguity over how the actual decision is made from observing just a few examples in few-shot learning.

Need for regulations to prevent misuse: Some very valid concerns are being raised about the misuse of GPT-3’s powers if they are not properly regulated. How the AI community comes up with regulation that prevents misuse will govern, to a large extent, how the model gets accepted in organizations where there is increased awareness of Responsible AI.
