Understanding the Technical Architecture of GPT-3 and How it was Made

Bagavan Marakathalingasivam
Published in Geek Culture
3 min read · Jun 14, 2021


Deep learning is revolutionizing the world, from teaching computers to drive to assisting doctors! But the biggest factor behind this success is the vast amount of labelled data available for training.

However, this reliance on labelled data restricts its applicability in many areas of natural language processing (NLP, the language side of AI), mainly because there is far more unlabelled text than labelled text. Annotating the unlabelled data could solve this problem, but that’s extremely time-consuming and costly.

Learning from unlabelled data is known as unsupervised learning. Being able to train a model this way could give it a big performance boost, because it can draw on far more text. However, there are several challenges when trying to build models this way:

  1. It’s unclear what type of optimization objective is most effective for learning text representations that transfer to other tasks.
  2. There is no consensus on the most effective way to transfer these learned representations to the target task.
  3. Even once a way to transfer has been found, existing techniques still require task-specific changes to the model’s architecture, which defeats the whole purpose.

So there’s no way we could do this, right? Wrong.

Introducing GPT (Generative Pre-Training)

GPT’s approach to solving this problem is semi-supervised learning (a mix of unsupervised and supervised learning) to create a model for language understanding.

There are two main steps to this process:

The first step uses unsupervised pre-training to create a broad language understanding model. The second step uses supervised fine-tuning to fit that model to a specific task.

By doing so, GPT learns a universal representation of text that transfers well and requires little adaptation to fit a wide range of tasks.

How will it do this?

GPT’s Framework — Transformers

To do this, GPT uses the transformer architecture, which performs strongly on a variety of tasks. The transformer provides a more structured memory for handling long-term dependencies in text (compared to recurrent neural networks, which struggle as dependencies get longer), and it needs only minimal changes to the architecture when adapting to a specific task.

As I mentioned previously, we need both unsupervised pre-training and supervised fine-tuning, so let’s take a deeper look at how each works.

Unsupervised pre-training

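The unsupervised pre-training stage uses a multi-layer transformer decoder, one particular type of transformer. In the original transformer, the decoder is the part that generates the output text one token at a time, and each position can only look at the tokens that come before it.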
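To make the "only look backwards" idea concrete, here is a minimal sketch of single-head masked (causal) self-attention in plain Python. It is an illustration under simplifying assumptions, not the model's actual code: the toy 2-dimensional vectors are made up, and the learned projections, multiple heads, and feed-forward layers of a real decoder block are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head masked self-attention over a sequence of vectors.

    Position i attends only to positions 0..i, never to later ones.
    """
    d = len(queries[0])
    n = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against the current and earlier positions.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors the position is allowed to see.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        # Later positions get zero weight: this is the "mask".
        all_weights.append(weights + [0.0] * (n - i - 1))
    return outputs, all_weights

# Toy input: 3 tokens, each a made-up 2-dimensional vector.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

The first token can only attend to itself, so its attention weights are all on position 0, while the last token spreads its weights over all three positions.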

The transformer is trained on the BooksCorpus dataset, which consists of over 7,000 unique unpublished books from a wide variety of genres! This data lets the model learn a wide range of text representations, which can then be applied to more specific tasks.
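The reason no labels are needed is that the text supplies its own targets: in the paper, the pre-training objective is to maximize the likelihood of each token given the tokens before it. The sketch below computes that objective (as an average negative log-likelihood) for one toy sentence. The `uniform_model` function and the four-word vocabulary are hypothetical stand-ins for the transformer, used only so the example runs on its own.

```python
import math

def lm_negative_log_likelihood(tokens, predict_next, context_size):
    """Average negative log-likelihood of each token given up to
    `context_size` preceding tokens. Minimizing this value is the same
    as maximizing the likelihood of the text."""
    total = 0.0
    count = 0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        probs = predict_next(context)
        total -= math.log(probs[tokens[i]])
        count += 1
    return total / count

VOCAB = ["the", "cat", "sat", "down"]  # made-up vocabulary

def uniform_model(context):
    """Hypothetical stand-in for the transformer: ignores the context
    and gives every word in the vocabulary the same probability."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

sentence = ["the", "cat", "sat", "down"]
loss = lm_negative_log_likelihood(sentence, uniform_model, context_size=3)
```

A model that has learned something about language would assign higher probability to the actual next word than this uniform guesser does, and so would get a lower loss.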

Supervised Fine-tuning

After pre-training the model, we fine-tune its parameters on labelled data so that it fits the specific task.
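In the paper, fine-tuning adds a single linear output layer on top of the final transformer block's activation and trains with the task's labels. Here is a rough sketch of that idea. The hidden state values, the two-class setup, and the learning rate are made up, and for brevity only the new layer `W_y` is updated here, whereas real fine-tuning also updates the pre-trained weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(hidden, W_y):
    """Linear output layer followed by softmax over the classes."""
    num_classes = len(W_y[0])
    logits = [
        sum(h * W_y[k][c] for k, h in enumerate(hidden))
        for c in range(num_classes)
    ]
    return softmax(logits)

def sgd_step(hidden, W_y, label, lr):
    """One gradient step on the cross-entropy loss, updating only W_y."""
    probs = predict(hidden, W_y)
    for k, h in enumerate(hidden):
        for c in range(len(probs)):
            target = 1.0 if c == label else 0.0
            W_y[k][c] -= lr * h * (probs[c] - target)

# Made-up final hidden state; in practice the pre-trained transformer produces it.
hidden = [0.5, -1.0, 2.0]
# New output layer: 3 hidden dimensions x 2 classes, zero-initialized for illustration.
W_y = [[0.0, 0.0] for _ in hidden]
label = 1

loss_before = -math.log(predict(hidden, W_y)[label])
sgd_step(hidden, W_y, label, lr=0.1)
loss_after = -math.log(predict(hidden, W_y)[label])
```

After one step, `loss_after` is lower than `loss_before`, which is all that fine-tuning is doing at a much larger scale: nudging the parameters so the labelled examples become more likely.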

Pre-training the model first is extremely useful, as it helps improve the generalization of the supervised model and accelerates convergence when moving from the broad language model to the specific task!

GPT In Action!

Alright, let’s now look at how the GPT approach was tested. It was first evaluated on four types of NLP tasks:

  1. Natural Language Inference (NLI)
  2. Question Answering
  3. Semantic Similarity
  4. Text Classification
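One reason the same model can handle all four task types with little change is that the paper converts each task's structured input into a single token sequence, using special start, delimiter, and extract tokens. The sketch below shows roughly what that looks like. The token strings and function names are placeholders I chose for illustration, not the ones used in the original code.

```python
# Placeholder strings for the special tokens (assumed names, for illustration only).
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def format_classification(text):
    """Text classification: the text is simply wrapped in start/extract tokens."""
    return [[START, text, EXTRACT]]

def format_entailment(premise, hypothesis):
    """NLI: premise and hypothesis are joined with a delimiter token."""
    return [[START, premise, DELIM, hypothesis, EXTRACT]]

def format_similarity(text_a, text_b):
    """Similarity has no natural ordering, so both orderings are produced."""
    return [
        [START, text_a, DELIM, text_b, EXTRACT],
        [START, text_b, DELIM, text_a, EXTRACT],
    ]

def format_question_answering(context, question, answers):
    """Question answering: one sequence per candidate answer."""
    return [
        [START, context, question, DELIM, answer, EXTRACT]
        for answer in answers
    ]

nli_input = format_entailment("A man is sleeping.", "A person is awake.")
qa_input = format_question_answering(
    "Tom dropped the glass.", "What happened next?", ["It broke.", "It flew away."]
)
```

Each resulting sequence is fed through the same pre-trained transformer, so only the small output layer differs from task to task.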

After fine-tuning GPT for each task, it performed really well: compared with other models (that weren’t pre-trained in this way), it surpassed almost all of them, with the paper reporting improvements over the previous state of the art on 9 of the 12 datasets studied. This shows how powerful generative pre-training is and how much it can improve performance on many NLP tasks.

This article is based on the research paper “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018).