Behind-the-Scenes: The Tech Behind ChatGPT 🤬 (super friendly ver.)

Anix Lynch
6 min read · Feb 5, 2023

--

If ChatGPT were to be compared to an animal, it would be a hungry little caterpillar! 🐛 Just like caterpillars love to munch on leaves and grow bigger, LLMs love to gobble up HUGGGGEEE amounts of text data and use it to get smarter and bigger. The more text data LLMs consume, the better they understand language and the relationships between words. And just like caterpillars turn into beautiful butterflies 🦋, LLMs turn into powerful language models that can understand and generate human-like responses.

Language models are trained to predict the next word in a sequence, and there are two common ways to do this: next-token prediction and masked-language modeling.

Next-token prediction example:

“The cat sat on the…”

A next-token-prediction model would be trained to predict the next word after “The cat sat on the”. Given that input, the model may predict “mat”, “couch”, or “chair” 😸
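To make that concrete, here is a tiny, purely illustrative sketch in Python. The words and counts are made up (they are not from any real model); the point is just that a next-token predictor assigns a probability to every candidate word and favors the most likely ones.

```python
# Toy next-token prediction: made-up counts standing in for what a real
# LLM learns from huge amounts of text.
from collections import Counter

# Pretend these are continuations we "saw" after "The cat sat on the ..."
observed_continuations = Counter({"mat": 50, "couch": 20, "chair": 15, "moon": 1})

total = sum(observed_continuations.values())
probabilities = {word: count / total for word, count in observed_continuations.items()}

prompt = "The cat sat on the"
for word, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{prompt} {word}: p = {p:.2f}")

# A real LLM does the same thing, but over tens of thousands of tokens,
# with probabilities computed by a neural network instead of simple counts.
```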

Masked-language modeling example:

The quick brown [MASK] jumps over the lazy dog. 🐶

In this case, the model would try to predict the missing word, most likely “fox”. It does this by using the context of the words around it, such as “quick”, “brown”, and “jumps”, to understand the relationships between them and make a prediction. The goal of masked-language modeling is to train the model to fill in missing words in a sentence in a way that makes sense and is grammatically correct ✍️
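If you want to try this yourself, the Hugging Face transformers library exposes masked-language modeling as a fill-mask pipeline (this assumes transformers and a backend like PyTorch are installed, plus a one-time model download; the exact scores will vary by model):

```python
# Masked-language modeling demo; assumes `pip install transformers torch`
# and an internet connection to download the BERT-style model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT-style models use the literal [MASK] token for the blank.
for prediction in fill_mask("The quick brown [MASK] jumps over the lazy dog."):
    print(prediction["token_str"], round(prediction["score"], 3))

# Words like "fox" should land near the top: the model uses the surrounding
# context ("quick", "brown", "jumps") to fill in the blank.
```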

Sequential modeling using Long Short-Term Memory (LSTM) models is just one way to predict words in a sequence. But it has its limitations.

For example, the model can’t give different weights to different words in the context, even though sometimes, like in the case of our cute kitty, one word might matter more than another. And the input data is processed sequentially, one step at a time, which limits how well the model can capture relationships between words that are far apart. (oh no! 😰)
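Here is a minimal PyTorch sketch of that step-by-step processing (assuming torch is installed; the sizes are arbitrary toy values I picked for illustration). Everything the LSTM remembers about earlier words has to be squeezed into one hidden state that it carries forward.

```python
# Sequential processing with an LSTM; sizes are arbitrary toy values.
import torch
import torch.nn as nn

embedding_dim, hidden_dim, seq_len = 8, 16, 5
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

# One "sentence" of 5 word vectors (random stand-ins for word embeddings).
words = torch.randn(1, seq_len, embedding_dim)

hidden = None
for t in range(seq_len):
    # The LSTM sees one step at a time; everything it knows about earlier
    # words is compressed into the hidden state it carries forward.
    _, hidden = lstm(words[:, t : t + 1, :], hidden)

h_n, c_n = hidden
print(h_n.shape)  # torch.Size([1, 1, 16]): the whole past, squeezed into one vector
```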

That’s why in 2017, some smart people at Google Brain came up with transformers. Transformers are different from LSTMs because they can process all the input data at once! They use a cool thing called self-attention, which means the model can give different weights to different parts of the input data. This makes the relationships between words more complex and the meaning richer 😱

GPT and Self-Attention

GPT-1 was the first Generative Pre-trained Transformer model, made by OpenAI in 2018. It evolved over the next few years, becoming GPT-2, GPT-3, InstructGPT, and finally ChatGPT.

Before humans started giving feedback to ChatGPT, the biggest change in the GPT models was scale: improvements in computational efficiency allowed them to be trained on more and more data. This made them more knowledgeable and able to do a wider range of tasks, yay! 🎉

GPT models, like ChatGPT, are built from transformers. The original transformer architecture has an “encoder” to process the input and a “decoder” to generate the output, but GPT models actually use only the decoder part, stacked many layers deep. The decoder uses (masked) multi-head self-attention to understand the relationships between words and produce more accurate responses.

Multi-head self-attention is like giving the robot superpowers to pay attention to multiple things at once! 😎

Self-attention works by creating query, key, and value vectors for each token, comparing each query against every key, and then using a softmax function to turn those scores into normalized weights that say how much each token should pay attention to every other token. The multi-head attention mechanism performs self-attention several times in parallel, allowing the model to grasp complex relationships within the input data. Good job softmax! 😇
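Here is a compact, from-scratch sketch of single-head scaled dot-product self-attention in PyTorch (the sizes are toy values I chose for illustration; this is the general mechanism, not OpenAI's actual code). Multi-head attention simply runs several of these in parallel on smaller slices of the vectors and concatenates the results.

```python
# Scaled dot-product self-attention, written out by hand (toy sizes).
import math
import torch
import torch.nn as nn

d_model = 16                       # size of each token vector (arbitrary)
tokens = torch.randn(6, d_model)   # 6 tokens, e.g. "the quick brown fox ..."

# Learned projections that turn each token into a query, a key, and a value.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(tokens), W_k(tokens), W_v(tokens)

# Every query is compared against every key; softmax turns the scores into
# normalized weights: "how much should token i pay attention to token j?"
scores = Q @ K.T / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
attended = weights @ V                    # weighted blend of the value vectors

print(weights.shape, attended.shape)      # (6, 6) and (6, 16)

# Multi-head attention repeats this with several independent W_q/W_k/W_v sets,
# each working on a slice of d_model, and concatenates the outputs.
```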

Although GPT-3 brought amazing advancements in natural language processing, it still has its limits when it comes to aligning with users’ intentions. It may produce outputs that lack helpfulness, hallucinate non-existent or incorrect facts, lack interpretability, or even contain toxic or biased content, like a drunk little bot 🤬

Enter ChatGPT, a spinoff of InstructGPT, which introduced a new way of incorporating human feedback into training to better align the model’s outputs with users’ intent.

Reinforcement Learning from Human Feedback (RLHF) was detailed in OpenAI’s 2022 InstructGPT paper, and it’s simplified here:

Step 1: Supervised Fine-Tuning (SFT) Model 📻

The first step was to fine-tune the GPT-3 model by hiring 40 contractors to create a supervised training dataset, where each input has a known output for the model to learn from. Prompts were collected from real user entries into the OpenAI API, and the labelers wrote an appropriate response to each one to create a known output. The GPT-3 model was then fine-tuned on this new dataset to become GPT-3.5, also known as the SFT model.

To maximize diversity in the prompts dataset, only 200 prompts were allowed per user ID and any prompts with long common prefixes were removed. Also, all prompts with personally identifiable information were removed for privacy reasons.

Labelers were also asked to create sample prompts for categories with minimal real sample data, including:

  • Plain prompts: any random ask.
  • Few-shot prompts: instructions with multiple query/response pairs.
  • User-based prompts: correspond to specific use cases requested for the OpenAI API.

When generating responses, labelers tried their best to infer the user’s instruction. The paper says that prompts request information in three main ways:

  • Direct: “Tell me about…”
  • Few-shot: Write another story about the same topic with given examples.
  • Continuation: Finish the start of a story.

With prompts from the OpenAI API and hand-written by labelers, the supervised model had 13,000 input/output samples to work with!
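Here is a hedged, miniature sketch of what supervised fine-tuning boils down to (the tiny vocabulary, toy model, and single prompt/response pair below are all made up for illustration, standing in for GPT-3 and the real 13,000 samples): concatenate the prompt with the labeler-written response and train with the ordinary next-token cross-entropy loss.

```python
# Supervised fine-tuning (SFT) in miniature: a toy "language model" trained
# on one prompt/response pair with the ordinary next-token loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy vocabulary and tokenizer (made up for illustration).
vocab = {w: i for i, w in enumerate(
    "tell me about caterpillars they eat leaves and become butterflies".split())}
def tokenize(text): return [vocab[w] for w in text.split()]

prompt = "tell me about caterpillars"
labeler_response = "they eat leaves and become butterflies"
ids = torch.tensor([tokenize(prompt + " " + labeler_response)])

# A stand-in for GPT-3: an embedding plus a linear layer predicting the next token.
model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(50):
    inputs, targets = ids[:, :-1], ids[:, 1:]           # predict each next token
    logits = model(inputs)                              # (1, seq_len - 1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final SFT loss: {loss.item():.3f}")

# The real SFT step does the same thing, but with GPT-3 and ~13,000
# prompt/response pairs collected from the API or written by labelers.
```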

Step 2: Reward Model

In step 2 of the process, the model is given a treat! 🍪 We train a reward model so it can learn what makes the best responses to user prompts. This reward model takes a prompt and a response as inputs and gives us a cute little scalar value called a reward as an output. And with this reward model, we can do Reinforcement Learning and make the model even more awesome!
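As a rough sketch (the shapes and names here are my own illustrative assumptions, not OpenAI's code), a reward model is basically a language-model-style network with a tiny head bolted on top that squeezes a whole (prompt, response) pair down to one scalar:

```python
# A minimal reward-model head; sizes and values are illustrative assumptions.
import torch
import torch.nn as nn

hidden_dim = 32
# Stand-in for the transformer's final hidden states over prompt + response.
hidden_states = torch.randn(1, 10, hidden_dim)   # (batch, tokens, hidden)

# The "reward head": one linear layer that maps the last token's hidden
# state to a single scalar score for the whole (prompt, response) pair.
reward_head = nn.Linear(hidden_dim, 1)
reward = reward_head(hidden_states[:, -1, :])    # shape (1, 1)

print(f"reward for this (prompt, response): {reward.item():.3f}")
```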

Reinforcement Learning is like a game of “good job” or “try again” with ChatGPT! The model gets a reward when it gives a correct response and gets some feedback on what it did wrong when it doesn’t. This feedback helps it learn and improve its responses over time. It’s like giving ChatGPT a pat on the back every time it does a good job!

Aww…

Reinforcement Learning

To train the reward model, we ask some lovely labelers to rank several of the SFT model’s outputs for the same prompt from best to worst. 🔢 And we put all these rankings together to train the model, so it doesn’t get too confused by all the information.

The paper notes that including each comparison as a separate data point caused overfitting. Overfitting is like a kid that only wants to play with its own toy and won’t pay attention to anything else. To prevent overfitting, all the comparisons from a single prompt are grouped together into one batch element, so the model learns to be more flexible and adaptable to new situations!
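Concretely, the reward model is trained on those pairwise comparisons: the response the labeler ranked higher should get the bigger scalar reward. Here is a hedged sketch of that ranking loss in the InstructGPT style, -log sigmoid(r_better - r_worse), with made-up reward values:

```python
# Pairwise ranking loss for the reward model: for every comparison from one
# prompt, the labeler-preferred response should score higher.
import torch
import torch.nn.functional as F

# Made-up scalar rewards for 4 ranked responses to the SAME prompt
# (best to worst), as if produced by the reward model.
rewards = torch.tensor([2.1, 1.3, 0.4, -0.8])

# All pairs "i beats j" from the labeler's ranking, handled together as one
# batch element so the model doesn't overfit to individual comparisons.
loss_terms = []
for i in range(len(rewards)):
    for j in range(i + 1, len(rewards)):
        loss_terms.append(-F.logsigmoid(rewards[i] - rewards[j]))

loss = torch.stack(loss_terms).mean()
print(f"ranking loss for this prompt: {loss.item():.3f}")
```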

Step 3: Reinforcement Learning Model

In the third step, it’s time for the Reinforcement Learning model to shine! The model is given a prompt and it wags its tail to generate a response. The response is made with the help of a “policy”, which starts out as the SFT model from step 1. The policy is like a secret strategy the model keeps refining to get more treats (aka maximize reward). Then, a reward is given to the model based on the reward model built in step 2. This reward helps the model grow and evolve its policy, just like how treats make a doggy happy!

In 2017, some smart people named Schulman et al. introduced a fun way to update the model’s policy called Proximal Policy Optimization (PPO). It uses a per-token Kullback–Leibler (KL) penalty 🄵 against the SFT model. KL divergence measures how different two probability distributions are (like comparing two different treats), and the penalty makes sure the policy’s responses don’t drift too far from the SFT model trained on human demonstrations, so the model doesn’t get too distracted chasing its own tail!
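Here is a very rough sketch of just the per-token KL penalty idea (not the full PPO algorithm, and every number below is invented): the reward the policy actually optimizes is the reward-model score minus a penalty for straying too far from the SFT model's token probabilities.

```python
# Per-token KL penalty, sketched with invented numbers (not the full PPO loop).
import torch

beta = 0.02                              # KL penalty coefficient (illustrative value)
reward_model_score = torch.tensor(1.7)   # scalar reward for the whole response

# Log-probabilities each model assigned to the tokens the policy generated.
logprobs_policy = torch.tensor([-1.1, -0.6, -2.3, -0.9])
logprobs_sft    = torch.tensor([-1.3, -0.7, -1.8, -1.0])

# Per-token penalty: how much more likely the policy found its own tokens
# than the SFT model did. Big gaps mean the policy is drifting away.
per_token_kl = logprobs_policy - logprobs_sft

total_reward = reward_model_score - beta * per_token_kl.sum()
print(f"reward after KL penalty: {total_reward.item():.3f}")

# PPO then nudges the policy's weights to make high-reward responses more
# likely, while the penalty keeps the policy close to the SFT model.
```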

The model was evaluated on held-out data it never saw during training. The test data was used to see whether the new model is better than the old one, GPT-3. They checked how helpful it was, how truthful it was, and how much it avoided being mean. They found that people preferred its outputs about 85% of the time, and that it was more truthful and less mean when it was told to be nice. But when it was told to be mean, it was meaner than GPT-3.

Voila! And that’s how your cute little friend ChatGPT was made! 🄳


Anix Lynch

I enjoy making difficult AI concepts visually skimmable with analogies and funny examples. ✨