Transformers and GPT-3

Sonali Saikia
Published in TechVariable
4 min read · Oct 22, 2020

A short insight into the mighty Generative Pre-Trained Transformer-3 (GPT-3)

Let me begin with a short introduction: a Transformer, in simple terms, is a deep learning model and also the first transduction architecture to rely solely on self-attention to compute representations of its input and output, without using sequence-aligned Recurrent Neural Networks (RNNs).

The term ‘transduction’ here means the conversion of input sequences into output sequences. This novel architecture in Natural Language Processing (NLP) aims to solve sequence-to-sequence problems while handling long-range dependencies with ease. Transformers have therefore proved better than traditional recurrent NLP models on many tasks, in terms of both quality and training speed.
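To make ‘self-attention’ a little more concrete, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer; the token embeddings and projection matrices are random placeholders for illustration, not anything from a trained model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Mix the value vectors according to query/key similarity (softmax-normalised)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how strongly each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output row is a context-aware mixture of values

# Toy "sentence" of 4 tokens, each an 8-dimensional embedding (random placeholders).
np.random.seed(0)
tokens = np.random.randn(4, 8)

# In a real Transformer these projections are learned; here they are random for illustration.
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))
output = scaled_dot_product_attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(output.shape)  # (4, 8): one context-aware vector per token
```

Because every token attends to every other token directly, long-range dependencies do not have to be carried step by step through a recurrent state.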

Now let us speak of OpenAI’s latest and largest model, GPT-3 (Generative Pre-trained Transformer 3), with a humongous capacity of around 175B parameters. It has a remarkable capability to leverage deep learning and generate human-readable text. Along with text generation, it can also produce code, poems, essays, stories and what not, just to begin with!

GPT-3 has surpassed the previous language model (LM) GPT-2, which had about 1.5B parameters. While GPT-2 could produce a convincing stream of text in different styles when prompted with an opening sentence, GPT-3 is a giant leap forward. In general, the more parameters a model has, the more data is required to train it. According to its creators, the OpenAI GPT-3 model was trained on about 45TB of text data drawn from multiple sources, including Wikipedia and books.

In terms of architecture, the significant changes from GPT-2 to GPT-3 are as follows:

  1. Additional decoder layers in each model and a larger, richer training dataset.
  2. The application of parallelism across the layers of the model, as well as in matrix multiplication.
  3. The use of locally banded sparse attention patterns (as in the Sparse Transformer) to manage memory, something the prior GPT-2 model did not use; a small sketch of a banded attention mask follows this list.
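As a rough illustration of what ‘locally banded sparse’ attention means in point 3, the sketch below builds the masks a dense causal layer and a banded sparse layer would use; the sequence length and window size are arbitrary examples, not the values used in GPT-3:

```python
import numpy as np

def dense_causal_mask(n):
    """Every token may attend to itself and all earlier tokens (standard GPT-2-style attention)."""
    return np.tril(np.ones((n, n), dtype=bool))

def banded_causal_mask(n, window):
    """Each token only attends to the previous `window` tokens, reducing memory and compute."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window + 1): i + 1] = True
    return mask

n_tokens = 8
print(dense_causal_mask(n_tokens).sum())             # 36 attended positions
print(banded_causal_mask(n_tokens, window=3).sum())  # 21 attended positions: the banded pattern is sparser
```

GPT-3 alternates dense and locally banded sparse layers, so the banded mask above represents only the sparse half of that pattern.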

OpenAI GPT-3 can perform an enormous range of tasks with very few or no examples/demonstrations (better known as ‘shots’). Before we dive into the numbers, let’s first understand the following task settings with respect to the model:

  1. In the Few-Shot (FS) setting, we give the model a task description along with multiple examples and then prompt it with a new question. This is similar to training a machine learning model, where we feed some inputs and corresponding outputs and then expect the model to handle unseen input. The difference here is that, unlike a normal ML algorithm, the model performs no weight updates; it simply infers from the ‘shots’ it has been fed in the prompt.
  2. The One-Shot (1S) setting is the same as Few-Shot, except that only one example is fed to the model in addition to the task description. In other words, we give one example along with the task description and then prompt a question to GPT-3.
  3. In the Zero-Shot (0S) setting, we only describe the task to GPT-3 and follow it with our question in the prompt; GPT-3 then tries to understand the description and produce a solution. This setting is the toughest, as at times even humans find it difficult to understand a task with no example or demonstration. The sketch after this list shows what the three prompt formats look like.
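Here is a minimal sketch of the three prompt settings for a translation task; the example pairs are loosely based on the translation illustration in the GPT-3 paper, and the exact strings and delimiters here are just for illustration:

```python
task_description = "Translate English to French:"

# Zero-Shot: task description + question, no examples.
zero_shot = f"{task_description}\ncheese =>"

# One-Shot: task description + a single example + question.
one_shot = (
    f"{task_description}\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

# Few-Shot: task description + several examples + question.
few_shot = (
    f"{task_description}\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# Each string is sent to the model as-is; GPT-3 completes the prompt
# (ideally with "fromage") without any gradient updates or fine-tuning.
print(few_shot)
```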

Even GPT-3 has been found to have some limitations. It struggles with common-sense ‘judgement’ questions of the type “If I keep my food in the refrigerator for about a month, will it still remain fresh?”. Also, because the model is autoregressive rather than bidirectional, it performs worse on tasks that benefit from looking at context on both sides, such as fill-in-the-blank tasks, tasks that require reading a long passage and then generating a short answer, WiC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks.

Another limitation associated with models at the scale of GPT-3 is that they are both expensive and inconvenient to perform inference on, regardless of the objective function or algorithm, which presents a challenge for practical applicability of models of this scale in their current form.
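A rough back-of-the-envelope calculation shows why inference is so expensive: merely holding 175B parameters in memory at half precision already requires hundreds of gigabytes, before counting activations or optimizer state. The numbers below are an estimate under that assumption, not an official figure:

```python
# Rough estimate of the memory needed just to store GPT-3's weights at inference time.
n_parameters = 175e9       # ~175 billion parameters
bytes_per_parameter = 2    # assuming 16-bit (half-precision) weights

weight_memory_gb = n_parameters * bytes_per_parameter / 1e9
print(f"~{weight_memory_gb:.0f} GB just for the weights")  # ~350 GB
```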

This mega GPT-3 is here to stay, and for good, in the long run. What is expected of this powerful model next is further advancement in fine-tuning, removing overlaps between test and training data, and distilling such large models down to a manageable size for specific tasks.

Moreover, making a bidirectional model at the scale of GPT-3, and trying to make bidirectional models work with Few- or Zero-Shot learning, is a much-awaited and promising direction for future research.

Thanks for going through my blog. I would love to connect with you on LinkedIn. :)
