What the GPT?! — Understanding your favourite Chatbot in 5 Minutes.

Marcelo Chaman Mallqui
QMIND Technology Review
7 min read · Apr 3, 2023
An oil painting of a transformer in a suit eating a cookie on a couch with a cream white background

How in the world does ChatGPT work? Is it sentient? Is it smart? Can it cook me dinner?

For the last 3 questions: No, yes, I wish — but I’ll explain the first one.

Today, I write for 2 reasons. Firstly, I realize that not everyone knows how ChatGPT works, and more importantly, what a GPT is.

Secondly…

Following a conversation about using a GPT model for note summary

My best friend doesn’t know that ChatGPT isn’t the only GPT model out there.

After reading this article, you will be able to explain how your favourite chatbot, ChatGPT, works and the AI model behind it. The language in this article is not overly technical, so all you need is 5 minutes, a brain, and scrolling fingers.

What the GPT?!

GPT stands for Generative Pre-trained Transformer.

  • Generative means that the model can generate unique human-like output by building a sentence word by word, predicting what word would most logically (and statistically) come next (see the toy sketch after this list).
  • Pre-trained means that the model is trained on large datasets in an unsupervised manner. It isn’t explicitly told to complete a certain task, but over time it learns the statistical patterns and relationships in the text data, enabling the user to then fine-tune it to serve a specific purpose.
  • Transformer refers to the architecture of the neural network used in this model (How the AI is structured behind the scenes)… and this is what I’ll be explaining today.
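To make “generative” a little more concrete, here is a toy Python sketch. The probability table is made up for illustration, and a real GPT conditions on the entire context so far, not just the last word:

```python
import random

# Toy next-word probabilities, standing in for a real trained model.
# A real GPT scores every word in its vocabulary given the whole context.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 1.0},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(prompt: str) -> str:
    """Build a sentence word by word, sampling a likely next word each time."""
    words = prompt.split()
    while True:
        probs = NEXT_WORD_PROBS.get(words[-1], {"<end>": 1.0})
        next_word = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_word == "<end>":  # the model predicts the sentence is over
            return " ".join(words)
        words.append(next_word)

print(generate("the"))  # e.g. "the cat sat"
```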

How it Works… Generally

It starts with an input which is fed to an encoder. The encoder will process the text, often more than once, creating a set of context-aware data representation of the input text. This data is then fed into a decoder, which generates the output text, one word at a time.
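As a rough sketch of that loop (with stand-in functions, not a real model), it looks something like this in Python:

```python
# A high-level sketch of the encoder-decoder flow described above.
# Both functions below are hypothetical stand-ins for real neural layers.

def encoder(text: str) -> list[str]:
    # A real encoder outputs context-aware vectors; a word list stands in here.
    return text.split()

def decoder(representation: list[str], so_far: list[str]) -> str:
    # A real decoder predicts the next output word; this stub just echoes.
    if len(so_far) < len(representation):
        return representation[len(so_far)]
    return "<end>"

def transformer(input_text: str) -> str:
    representation = encoder(input_text)   # process the whole input first
    output_words = []
    while (word := decoder(representation, output_words)) != "<end>":
        output_words.append(word)          # generate one word at a time
    return " ".join(output_words)

print(transformer("hello there world"))  # "hello there world"
```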

The Transformer (Robots in Disguise)

On my 14-hour flight to Japan in February, I got a text:

The minutes prior to my mental confusion… mental confussiiionnnn

… two seconds into the paper, I was on my knees on a United Airlines plane, defeated by the words in the abstract. Today, I walk out victorious, and so do you.

The (Brief) History

In 2017, eight Google researchers published a paper titled “Attention Is All You Need” that described a new architecture for machine translation called the Transformer. The Transformer uses an ‘attention mechanism’ to understand the context of a sentence, and it is the basis of what we know as GPT-1.

  • The team consisted of Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.

This new architecture processed words in parallel rather than sequentially (as previous architectures like RNNs did), so users could speed up its execution simply by adding more GPU power.

  • This meant faster and cheaper usage with better results.

We Are Attention Mechanism

When we read a sentence, we don’t just look at the individual words to understand what it means. Instead, we consider the order of the words and build an understanding of the sentence as we take in each new word. The ability to understand context is what enables us to respond appropriately.

For example, consider these Daredevil lines:

  • “You failed this city.” vs “This city failed you.”

These two sentences have opposite meanings, even though they contain the same words. Attention mechanisms work the same way, using complex math to determine which words are most important to consider when predicting the next one.
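That “complex math” is mostly matrix multiplication. Here is a minimal NumPy sketch of the scaled dot-product attention from the original Transformer paper, with random toy matrices standing in for learned values:

```python
import numpy as np

# Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V
# Q, K, V are learned projections of the word vectors; here they are random
# toy matrices for a 4-word sentence with dimension 8.
rng = np.random.default_rng(0)
d = 8
Q = rng.normal(size=(4, d))  # queries: what is each word looking for?
K = rng.normal(size=(4, d))  # keys: what does each word offer?
V = rng.normal(size=(4, d))  # values: the information actually passed along

scores = Q @ K.T / np.sqrt(d)  # how relevant is each word to each other word?
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
output = weights @ V           # each word becomes a weighted mix of values

print(weights.round(2))  # each row sums to 1: one attention pattern per word
```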

Watch the breakdown!

To follow along, we’ll break down this diagram, using ChatGPT (a GPT-3-based model) as the example:

Don’t be scared, we’ll go piece by piece…quickly

Input Embedding

When you type into ChatGPT, the computer doesn’t understand words; it takes each one of those words and assigns it a value.

What value? Good question.

Imagine you could give each word a value according to its position in an English dictionary. Aardvark would be 1, the word after would be 2, and so forth. Unfortunately, this value would say nothing about a word’s meaning. Instead of using an English dictionary, the embedding layer uses a special dictionary where words with similar meanings have similar vectors (a toy version appears after the examples below).

For example:

  • Using an English dictionary — Cat and Car would have a very similar value. Cat and Dog would have a much larger difference.
  • Using the special dictionary — Cat and Dog would have a very similar value. Cat and Car would have a much larger difference.
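Here is a tiny illustration of that idea, with made-up three-dimensional vectors (real embeddings have hundreds of dimensions and are learned from data):

```python
import numpy as np

# Made-up embedding vectors, purely for illustration.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.85, 0.75, 0.15]),
    "car": np.array([0.1, 0.2, 0.95]),
}

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means "similar meaning".
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(similarity(embeddings["cat"], embeddings["dog"]))  # high, ~1.0
print(similarity(embeddings["cat"], embeddings["car"]))  # much lower
```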

Positional Encoding

As each word is read by the model, a value that is unique to its position in the sentence is added to its embedding. This helps the model form context of what the sentence is about. Once this step is completed, the input is finally ready to be fed into the aforementioned attention mechanism.
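For the curious, here is the sinusoidal positional encoding from the original paper in NumPy. Each row is a unique “position fingerprint” that gets added to the corresponding word’s embedding:

```python
import numpy as np

def positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    # Each position gets a unique pattern of sines and cosines.
    positions = np.arange(num_positions)[:, None]  # 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(num_positions=4, d_model=8)
# embedded_input = word_embeddings + pe  # same word, different position, different vector
```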

The Encoder

The encoder takes the input and processes it through two main components: Self-Attention (labeled Multi-Head Attention), and Feed Forward.

Self-Attention

Using the input data, the machine looks at the position and vector value of each word to form an understanding of how the words relate to one another. After training, this layer is what understands that a pronoun in a sentence likely refers to a previously mentioned noun.

For example:

  • “Can you ask for the check?” vs. “Can you check the time?”

Both sentences begin with “Can you” and include “check”, but they have drastically different meanings. The self-attention step is what lets the model tell them apart, and it determines which information is passed to the ‘Feed Forward’ layer.

Feed Forward

This layer refines the data given to it by the self-attention layer, removing some of the ‘noise’, or irrelevant information, based on previous training. The result is then passed on to the next step.
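In code, this layer is just two learned matrix multiplications with a non-linearity in between, applied to each word’s vector independently. A toy NumPy sketch with random weights (real ones are learned during training):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 8, 32  # toy sizes; real GPT models use thousands of dimensions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x: np.ndarray) -> np.ndarray:
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU zeroes out "irrelevant" signals
    return hidden @ W2 + b2              # project back to the model dimension

x = rng.normal(size=(4, d_model))  # 4 word vectors from the self-attention layer
print(feed_forward(x).shape)       # (4, 8): same shape, refined content
```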

The entire encoder component is repeated multiple times depending on the model.

The Decoder

Originally, the decoder took the encoder’s output and decoded it into a new language. However, the decoder can be fine-tuned to do other things, such as answering questions the way ChatGPT does.

Embedding & Encoding… Again

It begins with the same embedding and positional encoding steps as before, so we won’t explain those again.

The Masked Attention

To prevent the model from attending to future words in the sentence, this layer restricts the model’s attention to only the words that came before. The masked attention layer helps the model predict the next word in the sentence, one at a time, until the most likely next ‘word’ is simply a special token that ends the sentence.
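The “mask” is literal: before the softmax, every score that would let a word peek at a later word is set to negative infinity, which becomes an attention weight of zero. A toy NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Q = rng.normal(size=(4, d))  # toy queries and keys for a 4-word sentence
K = rng.normal(size=(4, d))

scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((4, 4), dtype=bool), k=1)  # True above the diagonal = "future"
scores = np.where(mask, -np.inf, scores)          # hide future words
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

print(weights.round(2))  # word 1 sees only itself; word 2 sees words 1-2; and so on
```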

The remaining components of the decoder function similarly to the encoder. This time, however, the result is generated in the target language (or task) that the decoder was designed for.

Final Words.

Needless to say, the GPT model has come a long way. Behind every generative AI tool is a model similar to OpenAI’s GPT, built on the foundation of transformers. As our world becomes increasingly technical, it is essential to understand the technology that surrounds us. Even if you don’t have a technical job, the more you understand these tools, the better.

The evolution of generative AI has opened up opportunities for the monetization of its applications. These tools are a great benefit to the tech industry, and many companies have begun to focus on developing more generative tools. This means that other areas of tech may end up being neglected, opening more opportunities for those who don’t follow the trend. Will you build for the hype, or build elsewhere?

This article was written for QMIND, Canada’s largest undergraduate organization on AI.

If you have any questions, comments, or concerns – or want me to break down another topic, please email me at marcelo@qmind.ca!
