The AI tool of today: GPT3

Isabela Ferrer
15 min read · Feb 2, 2023


You have probably seen a million other articles talking about this new tool. Well, let me set this straight before you start reading. This is not one of those articles. Why? It’s much better.

Parameters, models, transformer, encoder, decoder, training, natural language processing. All of these terms are constantly used when talking about GPT3, but do people really know what they mean? Does anyone truly understand how GPT works?

From what I see, all the concepts are still pretty abstract and fantastical. Not anymore. This article is actually written in English!

Everyone is talking about GPT3. After reading this article, you will be the master of the conversation.

GPT3 overview

Source (https://decemberlabs.com/blog/openai-gpt3-the-new-ai-that-will-blow-your-mind-might-also-be-a-little-overrated/)

It is the third version of OpenAI's GPT language models.

Language models: tools that find patterns in language to predict words.

Through deep learning, it aims to generate human-like text when given a prompt.

Deep learning: a branch of machine learning that finds patterns in data with little human intervention. In technical terms, a deep learning model is a neural network with three or more layers of neurons/nodes, and it largely learns from the data on its own.

Machine learning: branch of artificial intelligence that trains a computer to mimic how humans think.

(Machine Learning) Model: Predicts events based on data.

For example: “I was very happy this morning while I was having _____”

It is very, very likely that the answer will be “breakfast”.
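To make this concrete, here is a toy Python sketch (nothing like GPT3's real machinery, just an illustration with a made-up mini-corpus) of a model that predicts the next word by counting which word most often follows the previous one:

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus (GPT3's real training data is about 45 TB of text).
corpus = [
    "i was very happy this morning while i was having breakfast",
    "i was having breakfast when the phone rang",
    "we were having breakfast on the balcony",
    "she was having lunch with a friend",
]

# Count which word follows each word in the corpus.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        next_word_counts[current_word][next_word] += 1

# Predict the most frequent continuation of "... I was having ____".
print(next_word_counts["having"].most_common(1)[0][0])  # -> "breakfast"
```

Real language models do something far more sophisticated, but the spirit is the same: learn patterns from text, then predict the most likely next word.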

Not ChatGPT

ChatGPT is an AI model used for chatbot purposes. It creates text that sounds like a conversation. I’m sure you have heard about it before.

Don't get confused: this article is not about ChatGPT. I'm saying this because I have heard these two terms (ChatGPT and GPT3) used interchangeably. GPT3 and ChatGPT are "sister" models. ChatGPT is a variation of GPT3, sometimes even referred to as GPT3.5, because the two models are so similar.

They were both created and trained by OpenAI, and both are large language models. Another thing they have in common is "GPT", which stands for Generative Pre-trained Transformer: a type of AI model architecture.

So how does it work?

Introducing transformer-based models

A transformer is a type of neural network that generates, analyzes, or translates data. Transformers are trained on large, unlabeled datasets to find mathematical patterns in the data.

AI neural networks: a group of algorithms that mimic the way human brains think. Their aim is to process data the way humans do.

Unlabeled dataset: a set of data that has not been tagged with extra information describing it (for example, a photo (the data) tagged with the name of the object it shows (the label)).

This type of model is mainly used in Natural Language Processing (NLP) and Computer Vision (CV) but it is very versatile and people have discovered applications in many other fields.

NLP: branch of AI focused on enabling computers to understand text.

CV: branch of AI focused on enabling computers to understand visual data (images or videos).

The goal of transformer models is to predict the next word. Sounds simple, right?

Well… it is not that simple. The architecture is very unique and its uses are (or seem) much more impactful than simply “predicting the next word”.

Transformers take an input (such as a sentence, an image, or any data that can be represented using numbers) and return an output based on that input. They were originally created for machine translation.

Ok so… why the hype around transformer models? The reason for this is that the entire input is processed simultaneously (known as parallel processing). For example, let's say we are translating from English to French. The entire English sentence is taken in and analyzed at once, not word by word. This allows for two things:

  1. Speed: transformer models run fast compared to previous methods (such as RNNs or LSTMs), which process each word one at a time.
  2. Quality: they also create more coherent outputs because they look at every part of the input when creating the output.

Attention is the mechanism that allows for parallel processing. It is the ability the model has to consider and look at all the parts of the input when creating the output.

Self-attention (one type of attention) is what allows transformers to understand words in the context of the words around them. It is basically applying attention to itself.

The step-by-step

In terms of structure, the transformer models consist of an encoder and a decoder (actually, 6 stacked encoders and 6 stacked decoders, but we’ll keep it simple for now). They are both connected and they do some teamwork to create an output.

Source (https://arxiv.org/abs/1706.03762)

For the sake of this article, I will be explaining the training process of transformers using language translation because this was the initial purpose of this model, but remember this is not its only use.

Step 0 — Input/output embedding: from words to numbers

Words are mapped to vectors so they can be processed by the model. This strategically organizes words in a multi-dimensional space. Words with similar meanings are mapped close to each other. This is called an embedding space.
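As a rough illustration, here is a hypothetical 2-D embedding space (real embeddings are learned during training and have hundreds or thousands of dimensions): words with similar meanings sit close together, so the distance between their vectors is small.

```python
import math

# Made-up 2-D embeddings; real models learn much longer vectors.
embeddings = {
    "breakfast": [0.90, 0.10],
    "lunch":     [0.85, 0.15],
    "car":       [0.10, 0.90],
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(embeddings["breakfast"], embeddings["lunch"]))  # small: similar meanings
print(distance(embeddings["breakfast"], embeddings["car"]))    # large: unrelated meanings
```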

However, this alone is not enough for the model to know the meaning of a word. The position of the word in the sentence is also very important: how can the model make sense of the input without knowing the order of the words?

Positional encoding puts the vectors from the embedding through math functions (cosine or sine functions, for the curious crowd) which add a stamp representing the position of the word in a given sentence.

Visualized as a heatmap with one row per position, every horizontal slice of the positional encoding is a unique combination of colors (each color representing a number). The model knows which combination of numbers belongs to which position.

Adding positional encoding is extremely important because words are entered simultaneously. Therefore, this is the only way the model knows the order of the words.
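For the curious, here is a minimal sketch of that sine/cosine positional "stamp", using a toy embedding size of 8 (real models use hundreds of dimensions). Every position produces a different combination of numbers, which is exactly the "unique combination of colors" described above.

```python
import math

def positional_encoding(position, d_model=8):
    """Return the sine/cosine positional 'stamp' for one position in the sentence."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimensions use sine
        encoding.append(math.cos(angle))  # odd dimensions use cosine
    return encoding

# Each position gets its own stamp, which is added to the word's embedding vector.
print(positional_encoding(0))
print(positional_encoding(1))
```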

Now let’s see how the transformer works.

WARNING! This section may get a bit mathematical and abstract, feel free to skip to the TL;DR below.

Step 1 — Encoder: Understanding the numbers (and producing some more).

Multi-headed attention: (this process is done for EACH INDIVIDUAL WORD).

  • The vector for a word is multiplied by the query (Q), key (K), and value (V) matrices (these are set during the training process) to create the Q, K, and V vectors for that word.
  • Self-attention: the word is rated/compared against all the other words in the same sentence by taking the dot product of its Q vector with the K vector of every other word. This is done because the other words in the sentence are relevant to the meaning of the word. The results are divided by 8 (the square root of the key vectors' dimension, 64) to keep the numbers at a manageable scale.
This is self-attention! It is repeated for each of the words.
  • Each of these scores is passed through a softmax function (which converts the scores into values between 0 and 1 that sum to 1), producing a rating of how relevant each of the other words is to the meaning of the given word. We then multiply the softmax score of each word with its respective V vector (created at the beginning). This way, the words most relevant to the meaning of the word get a higher weight in the result.
  • All of the weighted V vectors are summed, giving one output vector per word; stacked together, these form a matrix in which each row corresponds to one word.
  • This is done 8 times, with 8 different query, key, and value matrices (hence the name multi-head). The 8 resulting matrices are then combined into a single one. This allows the transformer to look at different patterns in the input and extract more complex features.
  • And that's it! That's the output of multi-head attention (a code sketch of this calculation follows below).
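Here is a minimal NumPy sketch of that calculation for a single head, with tiny made-up sizes (a real transformer head uses 64-dimensional keys, which is where the division by 8 = √64 comes from):

```python
import numpy as np

np.random.seed(0)

# Toy setup: 3 "words", each represented by a 4-dimensional vector.
x = np.random.randn(3, 4)

# The Q, K and V matrices are set during training; here they are random stand-ins.
d_k = 4
W_q, W_k, W_v = (np.random.randn(4, d_k) for _ in range(3))

Q = x @ W_q  # query vector for every word
K = x @ W_k  # key vector for every word
V = x @ W_v  # value vector for every word

# Compare every word with every other word (dot products), then scale down.
scores = Q @ K.T / np.sqrt(d_k)  # the original paper uses d_k = 64, so it divides by 8

# Softmax turns each row of scores into weights between 0 and 1 that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each word's output is a weighted sum of all the value vectors.
output = weights @ V
print(output.shape)  # (3, 4): one context-aware vector per word
```

Multi-head attention simply runs this 8 times with different W_q, W_k, and W_v matrices and joins the results.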

Feedforward — The vector for each word goes through a feed-forward neural network, independently and in parallel. (Quick note: feed-forward neural networks are those in which data only flows in one direction, so you enter an input and you get an output.) The output is one vector per word (the feature vector) that is digestible by the decoder.
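The feed-forward step itself is just two layers of numbers with a simple non-linearity in between; a rough sketch with made-up sizes (the original paper uses 512 and 2048):

```python
import numpy as np

np.random.seed(1)

d_model, d_ff = 4, 16  # toy sizes
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def feed_forward(word_vector):
    """Two linear layers with a ReLU in between, applied to one word at a time."""
    hidden = np.maximum(0, word_vector @ W1 + b1)  # ReLU keeps only positive values
    return hidden @ W2 + b2

attention_output = np.random.randn(3, d_model)  # pretend output of the attention step
feature_vectors = np.array([feed_forward(v) for v in attention_output])
print(feature_vectors.shape)  # (3, 4): one feature vector per word
```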

(simplified) TL;DR

The encoder takes in the input vector and performs some calculations so that the output vector for each word has some sort of representation of the words around it (self-attention).

Step 2 — Decoder: Interpreting the numbers.

Even though the encoder processes all the words at once, not all the output is produced at once. It is produced word by word, and as it produces words, they are added to the input and fed back into the model. This way, the transformer predicts the next word until it has produced the whole output.

(If there is still not a first word, then a token representing the start of the sentence is fed instead.)

  • Masked multi-headed attention — This is the same process as multi-headed attention in the encoder: the word vectors are weighted against the other words in the same sentence. The big difference is that the input is not the entire sentence; the model only has access to the part of the sentence that comes before the word being predicted. (During training, the words coming after it are "masked"/hidden.)
  • Multi-headed attention — This is also similar to the encoder's multi-head self-attention, but the words of the decoder are now weighted against the words of the encoder. The output of the encoder is fed into this attention layer along with the output of the masked multi-head attention. This finds the relationship between the words of the input and the words of the output.
  • Feedforward neural network — The same as in the encoder; the output is also called a feature vector for the word.

Because it only predicts the next word, the decoder can be used several times to produce the desired output.
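The "masking" part is easier to see in code. In this rough sketch, the scores for words that come after the current position are set to minus infinity before the softmax, so they end up with zero weight (no peeking at the future):

```python
import numpy as np

np.random.seed(2)

# Pretend attention scores between 4 words (row = the word doing the looking).
scores = np.random.randn(4, 4)

# Hide everything above the diagonal: word i may only look at words 0..i.
mask = np.triu(np.ones((4, 4)), k=1).astype(bool)
scores[mask] = -np.inf

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # the upper triangle is all zeros
```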

TL;DR

The decoder first weighs the words of the input against themselves and then against the words of the encoder.

Step 3: Before we finish…

Linear layer — another feed-forward layer that converts the output of the decoder into a (huge) vector with one number for every word in the language's vocabulary. This is called the logits vector.

Softmax — all the numbers of the logits vector are passed through the softmax function to output a probability per word. This is how likely a word is to be the next word in the sentence. The word with the highest probability gets selected (this is the output of the model).
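A tiny sketch of this final step, with a made-up five-word vocabulary (real vocabularies have tens of thousands of entries):

```python
import numpy as np

vocabulary = ["breakfast", "lunch", "dinner", "car", "cloud"]

# Pretend logits from the linear layer: one number per word in the vocabulary.
logits = np.array([4.1, 2.3, 2.0, -1.5, -2.0])

# Softmax turns the logits into probabilities that sum to 1.
probabilities = np.exp(logits) / np.exp(logits).sum()

for word, p in zip(vocabulary, probabilities):
    print(f"{word}: {p:.3f}")

# The most probable word is the model's output.
print("next word:", vocabulary[int(np.argmax(probabilities))])  # -> "breakfast"
```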

Uses of transformer models:

Transformer models were created for translating text; however, many other uses have been discovered, for example with images (pixels instead of words) or proteins (amino acids instead of words). Uses for the encoder and the decoder on their own have also been found.

  • Sequence-to-sequence models (both encoder and decoder): One sequence is input and another is output. The model has to be trained with examples. It is commonly used for translation or summarization.
  • Encoder-only models: Used to understand or analyze an input. It is used for masked language modeling — predicting a hidden word, and classifying sequences — such as sentiment analysis.
  • Decoder-only models: Used to generate data.

Wow, we went down a rabbit hole there. Let’s get back to GPT3.

GPT3

In GPT3, the prompt is fed into the decoder side of the transformer, which generates the most appropriate output word by word (GPT models are decoder-only). The way the model learned to do this is through the training process.

Training: the process in which a machine learning model uses data to find the best parameters (numbers) for producing the desired outcome. Remember that machine learning works with numbers; the numbers that are set in the model when we use it were decided during training.
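To make "finding the best parameters" concrete, here is a toy training run that has nothing to do with GPT3's scale: a one-parameter model y = w · x, where gradient descent nudges w until it fits the data.

```python
# Toy training run: find the parameter w so that y = w * x fits the data y = 2x.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]

w = 0.0  # the single parameter, starting from a bad guess
learning_rate = 0.01

for step in range(200):
    # Gradient of the squared error with respect to w, averaged over the data.
    gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * gradient  # nudge w in the direction that reduces the error

print(round(w, 3))  # close to 2.0: training "found" the right parameter from the data
```

GPT3 does essentially this, except with 175 billion parameters and 45 terabytes of text instead of one parameter and four data points.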

GPT3 is a massive NLP model:

  • One of the reasons GPT3 has become so popular is the enormous amount of data used to train it. It was trained on 45 terabytes (a huge amount) of unsupervised (unlabeled) data from books, encyclopedias such as Wikipedia, and text found on the internet. The training is estimated to have cost around 14 million dollars. Having this huge amount of data makes the model useful in many contexts.
  • The model has 96 attention blocks! This means that the model goes through the attention process 96 times, so the level of complexity in interpreting the input and producing the output is really deep.
  • GPT3 is made up of 175 billion parameters (GPT2 had 1.5 billion). This huge number allows the model to learn more complicated functions and relationships between the input and the output.

Parameters: the values/numbers inside the model that determine how it turns an input into an output. These are what training finds.
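For a sense of scale, even a single small fully connected layer already has about a million parameters; a hypothetical back-of-the-envelope count:

```python
# Rough parameter count for one fully connected layer: weights + biases.
inputs, outputs = 1000, 1000
weights = inputs * outputs  # one weight per input/output pair
biases = outputs            # one bias per output
print(weights + biases)     # 1,001,000 parameters for this single layer

# GPT3 stacks much wider layers across its 96 blocks to reach 175 billion in total.
print(175_000_000_000 // (weights + biases))  # roughly 175,000 of these small layers
```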

Compared with other NLP models of its time, GPT3's parameter count was in a league of its own.

Other cool ideas about GPT:

  • Everything GPT3 outputs is based on the dataset it used in training: it uses examples to learn patterns. However, it is said that it has somehow learned how to learn. For example, it is likely that the training data contains many repetitions of "2 + 2 = 4". Then, if you input "2 + 2 =", it will output 4 because it has memorized this pattern, not because it is performing the operation. Supposedly. Yet its accuracy on very specific examples (e.g. 2739852 / 8239: why would this appear anywhere in the training data?) is good enough that it seems like it has learned to actually solve and understand problems.
  • Even though GPT3 wasn't explicitly trained for it, it is capable of in-context learning. This happens when you feed the model an example (or several examples) or an instruction and it understands what it is supposed to do, without any of the model's parameters being modified. Sounds crazy, I know. This is what allows users to interact with the model and give it instructions using natural language (a hypothetical few-shot prompt is sketched below).
Source (https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api)
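Here is a hypothetical example of what such a "few-shot" prompt might look like (the translation examples are in the style of the GPT-3 paper): the examples are simply part of the prompt text, and none of the model's parameters change.

```python
# A hypothetical few-shot prompt: the model infers the task (English -> French)
# purely from the examples included in the prompt itself.
prompt = """Translate English to French:
sea otter => loutre de mer
cheese => fromage
breakfast =>"""

# The whole string is sent to the model as-is; a good completion would be " petit déjeuner".
print(prompt)
```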

Using GPT3

When you get access to GPT3, you can choose one of the four GPT-3 models that have been created, each of which is optimized for a different trade-off. The models differ in speed, output quality, cost, and suitability for specific tasks (a usage sketch follows the list below).

  • Davinci — most expensive, slowest, highest quality.
  • Curie — cheaper, faster, and still high quality.
  • Babbage — cheap, very fast, intermediate quality.
  • Ada — very cheap, fastest, lower quality.
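As a rough illustration, here is how one of these models could be called with the 2023-era `openai` Python package (this is a sketch: the package, model names, and pricing have changed over time, and the API key below is a placeholder):

```python
import openai  # pip install openai  (the pre-1.0, 2023-era client)

openai.api_key = "YOUR_API_KEY"  # placeholder: real keys come from the OpenAI dashboard

response = openai.Completion.create(
    model="text-davinci-003",  # the Davinci tier: slowest, most expensive, highest quality
    prompt="Explain transformers to a high-school student in two sentences.",
    max_tokens=100,   # limit on the length of the generated output
    temperature=0.7,  # see the note on temperature below
)

print(response["choices"][0]["text"])
```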

When using one of the models, you can also set a value called the "temperature", a number between 0 and 1 that controls how random the output is. A temperature close to 0 makes the model stick to the most likely words (more focused and predictable), while a temperature close to 1 makes the model more creative, so the outcome can be more random.
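Roughly speaking, the temperature rescales the probabilities before the next word is picked. A small illustrative sketch (a big simplification of what the real model does):

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["breakfast", "lunch", "dinner"]
logits = np.array([3.0, 1.5, 1.0])

def next_word_probabilities(temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = logits / max(temperature, 1e-6)  # avoid dividing by zero at temperature 0
    scaled = scaled - scaled.max()            # numerical stability before exponentiating
    return np.exp(scaled) / np.exp(scaled).sum()

print(np.round(next_word_probabilities(0.01), 3))  # ~[1, 0, 0]: always picks "breakfast"
print(np.round(next_word_probabilities(1.0), 3))   # flatter: other words get a real chance

# Sampling with a higher temperature occasionally produces the less likely words.
print(rng.choice(vocabulary, p=next_word_probabilities(1.0)))
```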

Not-so-good things about GPT3

  • To use GPT3 you have to access the OpenAI API. The pricing varies depending on the number of tokens that you use, and this cost makes GPT3 less accessible.
1000 tokens = (about) 750 words
  • Another problem is that GPT3 can only work with 2048 tokens, or around 1500 words (shared between the input and the output). This limits its ability to create longer content.
  • Ethically, there are some concerns about the creation of biased content. This is caused by biases in the training data; the reality is that a lot of the content found on the internet can be offensive. Because of this, GPT3 can generate racist and sexist outputs.
(One well-known example of such biased output was created using GPT2, but similar issues persist in GPT3.)
  • GPT3 can also contain outdated information. It still thinks that Donald Trump is the president of the United States. This is because it memorized the data that it was trained on, but it doesn’t receive any new information.
  • Even though it was trained on a huge amount of information, GPT3 doesn’t know it all. The problem is that many times it can come up with false or inaccurate information. This lowers the trustworthiness of its outputs.

OK, enough about the downsides. Despite having various things to improve on, it is still very useful and helpful.

Uses and businesses that have incorporated GPT3

DALL·E

DALL·E is another creation of OpenAI: an amazing tool based on a modified version of GPT3. It generates images from scratch when given a prompt, or modifies existing images based on a prompt.

Like GPT3, it was trained once and requires minimal fine-tuning. Earlier models typically needed either task-specific training or extensive fine-tuning; GPT-style models were among the first to require neither.

CopySmith

CopySmith is a content creation software aiming to increase productivity and creativity. The content is meant to be used for marketing. It writes product descriptions, creates posts, and more. It incorporated GPT3 into its software.

Replier.ai

Replier.ai replies to customer reviews using GPT3. It provides unique and custom responses depending on the business style. To ensure accuracy, it “cleans” the output of GPT3 before posting the reply.

Jasper AI

Jasper is a writing assistant based on GPT3. It helps create plagiarism-free content for marketing and for repurposing previous content. It can be used to create captions, scripts, posts, images, emails, and more.

Debuild.co

Debuild helps people build apps and websites using GPT3. It doesn't require users to write any code, and it creates all the visual layouts. It is claimed to write software at the level of experienced software engineers.

Copilot

OpenAI and Microsoft's GitHub created Copilot to help users write better code. It shows suggestions to add to your code based on the context. It uses OpenAI's Codex (a variation of GPT3 trained on billions of lines of code) to analyze text and generate coherent code.

Summary

  • GPT stands for Generative Pre-trained Transformer; GPT3 is the third version, and it is based on a model architecture called the transformer.
  • The transformer model consists of an encoder that takes in the whole input at once and a decoder that predicts the next word (one at a time) based on the context of the sentence.
  • Even though it has setbacks, GPT3 has endless possibilities and applications. Platforms that use GPT3 can be used to code, create original content, create apps and websites, and even create art pieces.

I hope that you enjoyed learning about GPT3 and that you understood all the buzzwords around it. The next time that someone brings up this topic of conversation, you are going to master it.

If you want to see more content around AI, climate change, and innovation, make sure to follow me!

References

Johnson, J. (2020, April 6). What is a language model? BMC Blogs. Retrieved February 1, 2023, from https://www.bmc.com/blogs/ai-language-model/

Bhattacharyya, S. (2022, July 6). Commercial applications of GPT-3 that are already live. Analytics India Magazine. Retrieved February 1, 2023, from https://analyticsindiamag.com/commercial-applications-of-gpt-3-that-are-already-live/

Business applications for GPT-3. Width.ai. (n.d.). Retrieved February 1, 2023, from https://www.width.ai/post/business-applications-for-gpt-3

GPT-3: Language models are few-shot learners (paper explained). YouTube. (2020, May 29). Retrieved February 1, 2023, from https://youtu.be/SY5PvZrJhLE

Machine learning vs deep learning. YouTube. (2022, March 31). Retrieved February 1, 2023, from https://youtu.be/q6kJ71tEYqM

OpenAI. (2021, November 12). Pricing. OpenAI. Retrieved February 1, 2023, from https://openai.com/api/pricing/#faq-search-pricing

Perrigo, B. (2021, August 23). Artificial Intelligence wrote a play. it may contain racism. Time. Retrieved February 1, 2023, from https://time.com/6092078/artificial-intelligence-play/

Romero, A. (2021, June 12). Top 5 GPT-3 successors you should know in 2021. Medium. Retrieved February 1, 2023, from https://towardsdatascience.com/top-5-gpt-3-successors-you-should-know-in-2021-42ffe94cbbf

This text generation AI is insane (GPT-3). YouTube. (2020, June 12). Retrieved February 1, 2023, from https://youtu.be/lQnLwUfwgyA

Transformer neural networks: A step-by-step breakdown. Built In. (n.d.). Retrieved February 1, 2023, from https://builtin.com/artificial-intelligence/transformer-neural-network

Why we don’t use GPT-3. Article Forge Blog. (2022, November 22). Retrieved February 1, 2023, from https://www.articleforge.com/blog/why-we-dont-use-gpt-3/

Isabela Ferrer

I am a 16-year-old passionate about learning. I enjoy exploring emerging technologies that have an impact!