The Journey and Architecture of ChatGPT: A Look Behind the Scenes

Margaux Vander Plaetsen
7 min read · Mar 8, 2023


What is behind ChatGPT and how did we get here? That’s what I would like to cover in this blog. For many of us, this is still a black box. While there are plenty of resources on the impressive capabilities of ChatGPT, I will focus on the AI journey that got us here.

ChatGPT Visualized as a personal assistant

Introduction

Demonstrated by the generated image above: ChatGPT is like having your personal assistant who’s always by your side, ready to work through any task or topic with you, day or night. It can understand and respond to human inputs like never before. But of course, creating this model did not happen overnight. Therefore, to fully understand the architecture and workings of ChatGPT, it is important to take a step back and talk about the evolution we have gone through.

Before we begin, let’s establish what GPT is. Do we know what the three letters stand for?

  • G = Generative — meaning that the model can generate new content that is similar to the data it has been trained on
  • P = Pre-trained — meaning that the model has been trained on vast amounts of text data
  • T = Transformer — referring to a type of neural network architecture for natural language processing tasks

Once upon a time, Artificial Intelligence (AI) was just a futuristic idea. Back then, to move from input to output, one needed a well-defined algorithm. But as technology advanced, AI became a reality and Machine Learning (ML) came into the picture, a subset of AI that allows computers to learn from data. With Deep Learning (DL) the game changed even more.

As a quick intermezzo and reminder, let’s briefly reflect on the differences between Automation, Traditional ML, and Generative AI by applying each of them to the topic of Fraud Detection:

Deep Learning

Deep Learning makes use of neural networks, which are inspired by the human brain, to learn patterns from data. It is called DEEP learning because the neural networks are composed of multiple layers of interconnected nodes, called artificial neurons. These neurons process and transmit information.

neural network

This network is defined through the following parameters:

  • The connections between the neurons are represented by weights. They are like knobs that the network can use to adjust the importance of different features in the input data.
  • Each neuron also has a bias, which determines the activation threshold of the neuron. It is like a baseline value that the network can adjust to make sure that the output is properly centered.

As you can see above, neural networks take the input (e.g. a picture of a cat) and process it through the layers of nodes. Each node applies a mathematical function to the input it receives: it takes the weighted sum of the neuron’s inputs, adds the bias, and passes the result through an activation function, which determines whether and how strongly the neuron fires. The neuron then sends its output to the next nodes in the network. The output layer makes the final prediction (e.g. is it a cat or a dog?).
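To make this concrete, here is a minimal sketch of what a single artificial neuron computes. The input values, weights, bias, and the sigmoid activation are arbitrary placeholders chosen for illustration, not anything from a real trained network:

```python
import numpy as np

def sigmoid(z):
    # A common activation function: squashes any number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs, plus the bias...
    z = np.dot(weights, inputs) + bias
    # ...passed through the activation function to produce the neuron's output
    return sigmoid(z)

# Toy example: 3 input features, arbitrary weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(neuron_output(x, w, b))  # a value between 0 and 1
```

A full network simply stacks many of these neurons into layers and feeds the outputs of one layer as the inputs of the next.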

Now, if we input a picture of a cat without having trained the network, the resulting output is likely to be rubbish. We first need to learn the right weights and biases, meaning that we need to train our network by feeding it training data (e.g. labeled pictures of cats and dogs). This is where backpropagation and optimization algorithms like gradient descent come into play. In short, we want to minimize the mistakes our network makes, i.e. the difference between the predicted and observed label (e.g. the model predicts the picture to be a dog while it is actually a cat). The cost function measures this difference, and its gradient or slope tells us which direction to move in. Backpropagation calculates these gradients, and gradient descent uses them to tweak the weights and biases, leading to more accurate predictions. The process can be visualized as follows:

source
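As a rough illustration of this training loop, here is a sketch using a single-neuron “network” on made-up data; the toy dataset, learning rate, and number of epochs are all arbitrary stand-ins for a real multi-layer model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: 2 features per example, label 0 ("cat") or 1 ("dog")
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Parameters of a single-neuron "network": weights and a bias
w, b = np.zeros(2), 0.0
lr = 0.1  # learning rate: how big a step to take along the gradient

for epoch in range(200):
    # Forward pass: predictions between 0 and 1
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Cost: how far the predictions are from the true labels (cross-entropy)
    cost = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # "Backpropagation": gradients of the cost w.r.t. the weights and bias
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Gradient descent: nudge the parameters in the direction that lowers the cost
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final cost: {cost:.3f}")
```

In a deep network the same idea applies, only the gradients are propagated backwards through many layers instead of one.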

Neural networks first learn easy features (e.g. edges) and gradually learn to recognize more complex patterns (e.g. tail, ears) by adding more layers of neurons.

Generative AI

More recently, the world was introduced to Generative AI. The goal of Generative AI is to develop algorithms that can learn the underlying probability distribution of a given dataset and use this knowledge to generate new examples that are similar to the examples in the dataset.

Getting back to our example of cats and dogs, generative models help answer the question of what the “cat itself” or the “dog itself” is. They learn a general idea of what those animals look like. As a consequence, they can recreate images of cats and dogs, even ones that were not in the training set. On the other side of the spectrum, we have discriminative models, which learn the differences, or the boundary, between cats and dogs, without trying to understand what a cat or a dog actually is.

discriminative models vs. generative models (source)
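As a small, hedged illustration of this contrast (not how image generators actually work): scikit-learn’s LogisticRegression is a purely discriminative classifier, while GaussianNB is a simple generative classifier that models each class as a distribution, which also lets us sample brand-new examples from it. The 2D blob data below is a toy stand-in for “pictures of cats and dogs”:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy 2D stand-in for "cat vs. dog" pictures: two clusters of points
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Discriminative model: only learns the boundary between the two classes
disc = LogisticRegression().fit(X, y)

# Generative model: learns what each class itself looks like (a distribution per class)
gen = GaussianNB().fit(X, y)

# Both can classify...
print(disc.predict(X[:5]), gen.predict(X[:5]))

# ...but only the generative model lets us sample new "class 0" examples
# from its learned per-class distribution (mean theta_ and variance var_)
new_examples = np.random.default_rng(0).normal(
    loc=gen.theta_[0], scale=np.sqrt(gen.var_[0]), size=(3, 2)
)
print(new_examples)
```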

Transformers

Until the rise of transformer-based architectures, generative models for NLP were typically based on recurrent neural networks (RNNs) or variants of RNNs, such as long short-term memory (LSTM) networks. However, these models had limitations in their ability to capture long-term dependencies and context in the input sequence. Transformers came to the rescue. These models transform one sequence into another and follow the underlying encoder-decoder architecture using a self-attention mechanism.

💡
Self-attention in transformers is like a way for a computer to understand the relationships between words in a sentence. Just like how we pay attention to each word when we read a sentence to understand what it means, the computer uses self-attention to do the same thing.

transformer architecture

The encoder extracts features and creates, for each word, a representation that captures its relationship to all of the other words. The decoder uses these representations to generate an output sequence. Coincidence or not, one of the most well-known transformer-based models happens to be GPT.
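For the curious, here is a minimal numpy sketch of the scaled dot-product self-attention at the heart of a transformer. The word embeddings and projection matrices are random placeholders; a real model learns them, adds multiple heads, positional information, and many stacked layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each word vector into a query, a key, and a value
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Attention scores: how much each word should "pay attention" to every other word
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output is a weighted mix of all the value vectors
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8          # e.g. a 4-word sentence, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8): one context-aware vector per word
```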

GPT vs. ChatGPT

GPT-3, the best-known iteration of the GPT series, is one of the largest and most powerful language models to date, backed by a massive amount of data. With its 175 billion parameters and a decoder-only transformer architecture, the model uses deep learning to produce human-like text. It is trained to predict what the next token is: given an initial text as a prompt, it will produce text that continues the prompt.
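GPT-3 itself is only available through OpenAI’s API, but the same “predict the next token, append it, repeat” principle can be seen with its openly available predecessor GPT-2 via the Hugging Face transformers library. A minimal sketch (the prompt and generation settings are arbitrary):

```python
# Requires: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Once upon a time, Artificial Intelligence was"
inputs = tokenizer(prompt, return_tensors="pt")

# The model repeatedly predicts a likely next token and appends it to the prompt
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```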

Its successor, GPT-3.5, continues to push the boundaries and is an improved version of GPT-3. GPT-3.5 implemented safety mitigations, such as reinforcement learning from human feedback (RLHF), to address concerns about harmful and untruthful outputs. Fine-tuning with humans in the loop helps improve safety and reliability and limits toxic outputs. For instance, text-davinci-003 is an improvement on text-davinci-002.

Finally, ChatGPT is fine-tuned using transfer learning from a model in the GPT-3.5 series, teaching the model a new task/structure. ChatGPT is designed to have interactive dialogs. To achieve this, 3 different methods were used:

  • supervised fine-tuning on conversational data written by human trainers
  • a reward model, trained on model outputs ranked by human labelers (see the sketch further below)
  • reinforcement learning (PPO), optimizing the model against that reward model

Find more details here or in the picture below:

source
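To make the reward-model step a bit more concrete, here is a rough PyTorch sketch of the pairwise ranking idea: the reward model is pushed to score the human-preferred answer higher than the rejected one. The tiny network and the random “answer embeddings” are placeholders, not the real architecture or data:

```python
import torch
import torch.nn as nn

# Placeholder reward model: maps an answer embedding to a single scalar score
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend embeddings of human-preferred answers and rejected answers
preferred = torch.randn(8, 16)   # batch of 8 "better" answers
rejected = torch.randn(8, 16)    # batch of 8 "worse" answers

for step in range(100):
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    # Pairwise ranking loss: reward the model for scoring the preferred answer higher
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final ranking loss: {loss.item():.3f}")
```

The trained reward model then serves as the feedback signal during the reinforcement learning step.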

Important to know is that fine-tuning does not shrink the model: the parameter count stays the same, it is only the values of those parameters that get adjusted. The often-quoted figure of around 1.5 billion parameters actually refers to GPT-2; OpenAI has not disclosed the exact size of ChatGPT, which, as a fine-tuned GPT-3.5 model, remains of the same order of magnitude as its parent 😉.

To conclude, we now understand what the three letters of GPT stand for, and the story behind them. The evolution of AI has brought us remarkable advancements in the field of deep learning and generative AI. The introduction of Transformers played a crucial role in the development of powerful language models like GPT-3 and ChatGPT.

🍭 If you found this article even slightly useful, I’d be happier than a kid in a candy store if you could give it a clap and follow me on Medium.

Thank you!
