ChatGPT and GPT-3 Detailed Architecture Study - Deep NLP Horse

Yashu Gupta · Nerd For Tech · Mar 2, 2023

A detailed look at the intuition and methodology behind the GPT and ChatGPT language models.

Transformers: a detailed architecture intuition can be found at this link.

From Transformers (Attention Is All You Need) to ChatGPT (Generative Pre-trained Transformers)

Transformers are taking the NLP world by storm because they are a powerful engine for understanding context. These incredible models are breaking multiple NLP records and pushing the state of the art. They are used in many applications such as machine translation, NER, summarization, conversational chatbots, and even powering better search engines. In my recent post on Transformers (Attention Is All You Need), we covered the detailed intuition and methodology behind Transformers. In this post we focus on the intuition and methodology behind the GPT-3 architecture and the latest ChatGPT language model architecture.

GPT-3 Language Model

GPT-3 (Generative Pre-trained Transformer 3) is a language model created by OpenAI. The 175-billion-parameter deep learning model is capable of producing human-like text and was trained on large text datasets containing hundreds of billions of words.

GPT uses the Transformer decoder with one modification: the encoder-decoder attention sublayer is removed, since there is no encoder. We can see this visually in the diagrams above. GPT, GPT-2, and GPT-3 are built from Transformer decoder blocks; BERT, on the other hand, uses Transformer encoder blocks. GPT-3 was trained on huge Internet text datasets, 570 GB in total. When it was released, it was the largest neural network, with 175 billion parameters (over 100x GPT-2). GPT-3 has 96 decoder blocks, each containing 96 attention heads.
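For reference, here is a rough sketch of the published GPT-3 (175B) hyperparameters written out as a plain Python dictionary. The values come from the GPT-3 paper; this is a reference summary only, not runnable model code.

```python
# Rough summary of the published GPT-3 (175B) hyperparameters as a plain dict.
# Values come from the GPT-3 paper ("Language Models are Few-Shot Learners");
# this is a reference sketch only, not runnable model code.
GPT3_175B_CONFIG = {
    "n_layers": 96,          # decoder blocks stacked on top of each other
    "n_heads": 96,           # attention heads inside every block
    "d_model": 12288,        # hidden size (96 heads x 128 dims per head)
    "d_head": 128,           # dimension of each attention head
    "context_length": 2048,  # maximum number of input tokens
    "vocab_size": 50257,     # byte-level BPE vocabulary size
}
```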

How GPT-3 Actually Works: Pretraining

GPT-3 uses the same decoder-only architecture we discussed in the Transformer post. GPT-3 was trained with a next-word prediction objective, a kind of unsupervised (self-supervised) training in which the model predicts the next word in a sentence. The input sequence is fixed at 2048 tokens (for GPT-3). We can still pass shorter sequences as input: we simply fill all extra positions with "empty" (padding) values. The GPT output is not just a single guess; it is a sequence of 2048 guesses (a probability distribution over likely words), one for each "next" position in the sequence. But when generating text, we typically only look at the guess for the last word of the sequence.
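To make the "sequence of guesses, but we only use the last one" idea concrete, here is a minimal sketch of greedy autoregressive decoding. The `model` callable is hypothetical: it is assumed to take a fixed-length array of 2048 token ids and return one probability distribution per position.

```python
import numpy as np

def generate(model, prompt_ids, n_new_tokens, context_length=2048, pad_id=0):
    """Greedy autoregressive decoding sketch.

    `model` is a hypothetical callable that takes a (context_length,) array of
    token ids and returns a (context_length, vocab_size) array of probabilities,
    one distribution per position.
    """
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        # Pad the sequence up to the fixed context length with "empty" values.
        window = ids[-context_length:]
        padded = window + [pad_id] * (context_length - len(window))
        probs = model(np.array(padded))
        # The model predicts a next token for every position; when generating,
        # we only care about the prediction at the last real position.
        next_id = int(np.argmax(probs[len(window) - 1]))
        ids.append(next_id)
    return ids
```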

Encoding: but wait a second, GPT cannot actually understand words. The first step is to keep a vocabulary of all tokens, which lets us give each token a numeric id: yashu is 0, shobham is 1, and so on (GPT has a vocabulary of 50257 tokens). GPT-3 builds this vocabulary with a subword algorithm, byte-level Byte Pair Encoding (BPE), so even a word that is not in the model's dictionary can be split into known subword pieces. Each subword id is then mapped to an embedding vector, and positional encodings are added so the model knows where each token sits in the sequence.
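As a small illustration of byte-level BPE, the sketch below uses the `tiktoken` package (assumed to be installed), which ships the same 50257-token GPT-2/GPT-3 vocabulary.

```python
# A small illustration of byte-level BPE tokenization, assuming the `tiktoken`
# package is installed (it ships the 50257-token GPT-2/GPT-3 vocabulary).
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # byte-level BPE encoder
ids = enc.encode("yashu is learning transformers")
print(ids)                                 # a list of subword token ids
print([enc.decode([i]) for i in ids])      # the subword pieces, one per id
print(enc.n_vocab)                         # 50257
```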

Multi-Head Attention:

  1. Once encoding is done, we compute key, query, and value vectors for each token of our sequence, and a query-key dot product (matrix multiplication) produces a score matrix. The score matrix determines how much focus a word should put on the other words, so each word gets a score against every other word in the sequence. The higher the score, the more attention.
  2. Then the scores are scaled down by dividing by the square root of the dimension of the query and key vectors. This gives more stable gradients, since multiplying large values can have exploding effects. We then take the softmax of the scaled scores to get the attention weights, probability values between 0 and 1. This lets the model be more confident about which words to attend to (a minimal code sketch of this step follows below).

For how each component works, please have a look at my detailed Transformers post.
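Here is the minimal NumPy sketch of the scaled dot-product attention step promised above, for a single head. The learned projection matrices that produce Q, K, and V from the token embeddings are assumed to exist and are not shown.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention sketch for a sequence of length T.

    Q, K, V: (T, d_k) matrices obtained by multiplying the token embeddings
    with learned weight matrices W_q, W_k, W_v (not shown here).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) score matrix
    # Softmax over each row turns the scores into attention weights in [0, 1].
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # weighted sum of the value vectors
```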

GPT-3 Training

A dataset of 300 billion tokens of text is used to generate training examples for the model. For example, three training examples can be generated from a single sentence, as the sketch below shows.

Since the pretraining objective is next-word prediction, the model is presented with an example in which we only show the features (the context so far) and it must predict the next word.
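A minimal sketch of how such (context, next-word) training pairs can be built from one tokenized sentence; the sentence here is purely illustrative.

```python
def make_training_examples(token_ids, n_examples=3):
    """Build (context, next-token) training pairs from a single token sequence.

    Every prefix of the sentence becomes the features, and the word that
    follows it becomes the label the model must predict.
    """
    examples = []
    for i in range(1, min(n_examples, len(token_ids) - 1) + 1):
        context = token_ids[:i]   # what the model is shown
        label = token_ids[i]      # the next word it should predict
        examples.append((context, label))
    return examples

# e.g. for the (illustrative) tokens of "a robot must obey the":
print(make_training_examples(["a", "robot", "must", "obey", "the"], n_examples=3))
# [(['a'], 'robot'), (['a', 'robot'], 'must'), (['a', 'robot', 'must'], 'obey')]
```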

GPT-3 Pretraining

In the above GIF, we are passing tokens to the model, and the model predicts one token at a time, since it is an autoregressive model. At first, the model's predictions will be wrong. We calculate the error by comparing against the correct output and update the model parameters, so that next time it makes a better prediction. This process is repeated many times.

The untrained model starts with random parameters, and training finds parameter values that lead to better predictions. Every time the model outputs a token, that token is appended to the input sequence to predict the next one, until an END token is produced. During training, the future tokens in the input sequence are masked, and the target sequence is the input shifted by one token (the label at each position is simply the next token). Below is the detailed intuition.
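A small sketch of the two ingredients just described: a causal mask that hides future tokens, and targets obtained by shifting the sequence by one position. The example tokens are illustrative.

```python
import numpy as np

T = 6  # sequence length for this toy example

# Causal (look-ahead) mask: position i may only attend to positions <= i.
# The masked entries are added to the attention scores before the softmax,
# so future tokens receive (effectively) zero attention weight.
mask = np.triu(np.ones((T, T)), k=1) * -1e9

# Targets are the inputs shifted by one position: the label at position i
# is simply the token that appears at position i + 1 in the sequence.
tokens = ["<s>", "Okay", "human", "I", "will", "comply"]
inputs, targets = tokens[:-1], tokens[1:]
print(list(zip(inputs, targets)))
```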

The above GIF represents the input and the response ("Okay human") inside GPT-3. Notice how every token flows through the entire layer stack. We do not care about the outputs for the prompt words; once the input has been consumed, we start caring about the output, and we feed every generated word back into the model. This is how GPT-3 works.

ChatGPT, Explained Step by Step!!

ChatGPT is a variant of the popular GPT-3 (Generative Pre-trained Transformer 3) model discussed above, which has been trained on a massive amount of text data to generate human-like responses to a given input. ChatGPT was modified and improved using both supervised learning and reinforcement learning from human feedback (RLHF), with the assistance of human trainers. ChatGPT is built on a model of roughly the same scale as the 175-billion-parameter GPT-3. The training includes three steps:

  1. Supervised fine-tuning of the GPT-3.5 model
  2. Reward model
  3. Proximal Policy Optimization (PPO)

Supervised Fine-Tuning (Step 1)

In the first step, a pretrained GPT-3 model is fine-tuned with the help of labelers, who create a supervised dataset. Input queries were collected from actual user entries, and the model generated different responses to those input prompts; the labelers then wrote an appropriate response to each input prompt (how they would want that prompt to be answered). The GPT-3 model was then fine-tuned on this new supervised dataset to create the GPT-3.5 (SFT) model.

Reward Model (Step 2)

After the SFT (GPT-3.5) model is trained in step 1, it generates better responses to input prompts. In this step the SFT model is used: different input prompts are fed to the fine-tuned model, and several responses (4 to 7) are generated for every prompt. A labeler then determines a reward for each of these outcomes, proportional to the quality of the response with respect to the initial prompt: the labelers rank the outputs in order from best to worst. We can then use this data to train a reward model. The input to the reward model is the user prompt together with one of the generated responses, and the output is a scalar value that quantifies the quality of that response with respect to the prompt. The rankings collected from the labelers are what we use to train this reward model.
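A minimal sketch of what such a reward model can look like: a transformer body (here a hypothetical `transformer_body` module, assumed to return hidden states) whose language-model head is replaced by a single linear head that emits one scalar per prompt-response pair.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch of a reward model: a (hypothetical) pretrained SFT transformer
    body with its language-model head replaced by a single scalar head."""

    def __init__(self, transformer_body, hidden_size):
        super().__init__()
        self.body = transformer_body                  # assumed to return hidden states
        self.value_head = nn.Linear(hidden_size, 1)   # scalar "quality" score

    def forward(self, prompt_plus_response_ids):
        hidden = self.body(prompt_plus_response_ids)      # (batch, seq, hidden)
        last_hidden = hidden[:, -1, :]                     # state at the final token
        return self.value_head(last_hidden).squeeze(-1)    # (batch,) scalar rewards
```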

Step 2 The reward model

But, but… a question can come up here: how can GPT generate different outputs for one input prompt/query?

Let us understand this in detail.

GPT Input and Output

As we can see in the above image, we feed an input sequence to the GPT model. At every timestep it generates one word, and that word is appended to the input sequence for generating the next word, since it is an autoregressive model.

At t=0 the output will be "Today".

At t=1 the output will be "we".

At t=5 there can be different outputs, drawn from the probability distribution p(w_5 | w_0:4). GPT uses different decoding strategies such as nucleus (top-p) sampling, temperature sampling, and top-k sampling; based on these strategies it can generate different outputs at every timestep. These are parameters that can be tuned further.

At t=5 GPT has generated multiple candidate outputs, and based on the decoding strategy the model will choose one of them.
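A small sketch of temperature plus top-k sampling over next-token logits. Because sampling is stochastic, the same prompt can yield different continuations on every call, which is what lets the model produce several candidate responses to one prompt.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=np.random):
    """Illustrative temperature + top-k sampling over next-token logits."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Keep only the k most likely tokens; discard the rest.
    top_ids = np.argsort(logits)[-top_k:]
    top_logits = logits[top_ids]
    # Softmax over the surviving tokens, then sample one of them.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))
```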

The other question that can come up is: how do labelers quantify the quality of responses, as mentioned in the original ChatGPT architecture?

Every labeler is shown a screen on which they provide ratings and also give the psychological reasoning behind their rating of each response. The final rating is picked based on similar reasoning given across labelers, and the reward model is trained on these ratings.

LIKERT SCALE

Reward Model Training

While training the reward model, we start from the same supervised fine-tuned model we got in step 1, but the input to the model is the user prompt plus one of the responses, and the output is a reward (much like a Siamese network setup).

The loss function updates the parameters so that the model assigns better reward values: responses the labelers ranked higher should receive higher rewards than responses they ranked lower.
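A minimal sketch of the pairwise ranking loss commonly used for RLHF reward models (as described in the InstructGPT paper): for every pair taken from a labeler's ranking, the higher-ranked response should get the larger scalar reward.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_better, reward_worse):
    """Pairwise ranking loss for the reward model:
        loss = -log( sigmoid( r(prompt, better) - r(prompt, worse) ) )
    """
    return -F.logsigmoid(reward_better - reward_worse).mean()

# Hypothetical scalar rewards for two responses to the same prompts:
better = torch.tensor([1.7, 0.4])
worse = torch.tensor([0.3, -0.9])
print(reward_ranking_loss(better, worse))  # small loss -> rankings are respected
```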

We can then use this reward model in step 3 to score the quality of unseen responses. Let's go to the last step, step 3.

Proximal Policy Optimization (PPO) RL Algorithm (Step 3)

In this step we pass unseen input prompts to a clone of the SFT model we got in step 1. The model generates a response to the input prompt, and we pass that response to the reward model from step 2 to measure how high-quality the response is for that prompt. The resulting reward is then used to fine-tune the parameters of our SFT clone. This is how the SFT model acquires more human-like characteristics and behaviour via reinforcement learning.

Wait, this looks simple. Let us understand it in detail. Step 3: the last ride.

Let us understand with an example at each timestep. Suppose our prompt is "what is for breakfast".

At t=0 we will get "Today"…

…and the process goes on; at t=5 it will generate "Toast".

Once we get the response from the model, it is passed to the trained reward model, which tells us the quality of the response with respect to the input sequence. With this reward we can update the parameters of the clone SFT model. The GPT model is updated via PPO (Proximal Policy Optimization).

PPO

Goal of PPO: maximize the total reward of the responses generated by the model by including the reward in the loss. If a response is very good, the product of the probability ratio r and the advantage function (Â) will be large; if the advantage function is negative, the response was bad.
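A minimal sketch of the standard clipped PPO surrogate objective (the objective reported for ChatGPT/InstructGPT also adds further terms, such as a KL penalty toward the SFT model, which are not shown here).

```python
import torch

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate objective (to be maximized).

    r = exp(logp_new - logp_old) is the probability ratio between the updated
    policy and the policy that generated the response; `advantage` is positive
    when the reward model liked the response and negative when it did not.
    Clipping keeps each update from drifting too far from the old policy.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return torch.min(unclipped, clipped).mean()
```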

We then update the parameters with the help of an optimizer, taking gradient steps on this objective.

And this is how ChatGPT actually works…

ChatGPT APIs

OpenAI introduced an API that’ll allow any business to build ChatGPT tech into their apps, websites, products, and services.

⦿ ChatGPT is powered by `gpt-3.5-turbo` model
⦿ Priced 10x cheaper than `text-davinci-003`
⦿ No fine-tuning of `gpt-3.5-turbo` model for now

ChatGPT API docs: https://platform.openai.com/docs/guides/chat
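A minimal example of calling the ChatGPT API with the `openai` Python package as it existed when the API launched; the API key and prompt below are placeholders.

```python
# Minimal ChatCompletion call with the `openai` Python package (pip install openai).
# Replace the placeholder API key with your own.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is for breakfast?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```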

References

  1. "Attention Is All You Need" by Vaswani et al.
  2. Jay Alammar's Transformer posts
  3. GIFs taken from Michael Phi's guide on Transformers
  4. Stanford online course on attention
  5. Minsuk Heo's post on the Transformer
  6. Code Emporium on ChatGPT
  7. OpenAI blogs
  8. Hugging Face RLHF tutorials

