ChatGPT: A no-jargon explanation

Tanuj Jain
Published in Axel Springer Tech · 12 min read · Mar 7, 2023

Are you familiar with gradient descent, and have you practised some form of machine learning at some point in your life? If yes, congrats! You can easily understand the basic building blocks of ChatGPT.

This article is an attempt at simplifying the explanation of the methodology behind ChatGPT while keeping jargon to a minimum.

The methodology behind ChatGPT is essentially the same as that of one of OpenAI’s earlier models, InstructGPT: https://arxiv.org/pdf/2203.02155.pdf

The training comprises 3 phases:

  1. Supervised Fine-Tuning (SFT) of a pretrained GPT3 model
  2. Train a Reward Model (RM)
  3. Train a Reinforcement Learning Model (RL)
Basic phases of training

Let’s look at each step in detail.

1. Supervised Fine-Tuning (SFT)

The goal of this step is to fine-tune a pretrained GPT3 model using human-generated responses.

OpenAI hired a bunch of humans to write high-quality prompts and responses, threw these prompt-response pairs at a pretrained GPT3 model in order to finetune it, and voilà, the step was completed!

SFT: Inputs and expected outputs are provided at the time of training

Model

A pretrained GPT3 model is used.

Dataset

OpenAI collected prompts not just from humans, but also from their API. The responses to each of these were written exclusively by humans. The prompts were taken from a variety of task types- brainstorming, classification, Q&A, summarization, chat, extraction, etc. Some examples of the prompts:

  1. List five ideas for how to regain enthusiasm for my career
  2. {java code} What language is the code above written in?
  3. Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home.
  4. Who built the statue of liberty?

The paper doesn’t mention the responses to the prompts but rest assured, some human wrote a response to each prompt. (In fact, the examples above merely mimic the actual prompts; the authors didn’t disclose the real data.)

Training regime

If you’re familiar with sequence modelling (specifically ‘Next Token Prediction’), feel free to skip this section. If not, read on.

Since this is a supervised learning setup, one has to have access to the input and its corresponding target at the time of training. This is a GPT model- the task is to predict the next token in the sequence given all the past tokens. Let’s take a simple example:

Prompt- How are you doing?

Response- I am good

The first step is to concatenate the prompt and the response:

How are you doing? I am good

Assuming one word is one token (which is not the case since GPT typically uses Byte Pair Encoding, but it simplifies our explanation without loss of generality), we have the following tokens-

‘How’, ‘are’, ’you’, ‘doing’, ‘?’, ‘I’, ‘am’, ‘good’

- this gives us 8 tokens (’?’ is an individual token).

Now, at each timestep, the output of the model is a probability distribution over the entire vocabulary of the dataset. E.g., if there are a total of 50,000 different tokens the model is allowed to produce, these 50k tokens make up the vocabulary of the model. Our 8 tokens are 8 entries of this vocabulary. So, at each timestep, a probability distribution over 50,000 tokens is produced. These 50k values denote the probability of each of the 50k tokens being the next token in the sequence. Since we already know the whole sequence here, we know exactly what the probability distribution should be at each timestep: 1.0 for the token we know comes next and 0.0 for all other tokens in the vocab. In our example, for each timestep, 49,999 entries would be 0.0 and only one entry would be 1.0; this is the expected training distribution.
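To make this concrete, here is a tiny, made-up illustration (a pretend 8-token vocabulary instead of 50k) of what the expected distribution looks like at one timestep:

```python
# Toy illustration: the expected distribution is 1.0 at the true next token
# and 0.0 everywhere else (here with a pretend 8-token vocabulary).
vocab = ['how', 'are', 'you', 'doing', '?', 'i', 'am', 'good']

# At the timestep where the model has seen 'how', the true next token is 'are'.
expected = [0.0] * len(vocab)
expected[vocab.index('are')] = 1.0
print(expected)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```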

Finetuning Process for the first 4 timesteps

The training process depicted in the above diagram for the first 4 time steps can be broken down as below:

Given our sequence of 8 tokens- How are you doing? I am good

Let’s look at the first timestep-

At t=0,

expected output distribution- P(‘how’) = 1.0 given a beginning-of-sentence (<bos>) token to start generation, all other tokens P(token)=0.

Expected probs at t=0

Since the model is not yet fine-tuned on this data, P(‘how’) would not be 1.0 (it is 0.85 in the diagram). Similarly, the other 49,999 probabilities would not all be 0.0. Hence, we can calculate a loss between the expected and actual output. We use this loss to update the model.

At t=1,

expected output distribution- P(‘are’ | ‘how’) = 1.0, all other tokens P(token | ‘how’)=0.

Expected probs at t=1

We already know the first token in the sequence (= ‘how’) and now want to get the second token (= ‘are’). Again, the model produces probabilities that differ from the expected ones. So, we incur a loss that we backpropagate to update the model.

The above procedure goes on for all tokens in the sequence and through multiple sequences (thousands).
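Here is a minimal sketch of such a training step on a toy scale (the model, vocabulary size and learning rate below are made up for illustration; a real GPT attends to all previous tokens, while this toy model only looks at the current one, which is enough to illustrate the loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 11  # toy vocabulary; GPT models use ~50k tokens
token_ids = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8])  # 'how are you doing ? i am good'

class TinyLM(nn.Module):
    """A toy stand-in for the pretrained GPT3 model being fine-tuned."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 16)
        self.head = nn.Linear(16, VOCAB_SIZE)  # produces logits over the vocabulary

    def forward(self, ids):
        return self.head(self.embed(ids))  # shape: (sequence_length, VOCAB_SIZE)

model = TinyLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs, targets = token_ids[:-1], token_ids[1:]  # predict the token at t+1 from the tokens up to t
logits = model(inputs)                           # one distribution per timestep
loss = F.cross_entropy(logits, targets)          # compares against the one-hot expected distribution
loss.backward()                                  # backpropagate the loss ...
optimizer.step()                                 # ... and update the model
```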

Once the SFT model is trained, one can trigger a generation process by feeding it a prompt and letting the model produce one token per timestep (a token can be selected from the probability distribution produced for that timestep by, for example, picking the highest-probability token).
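Continuing with the toy model from the previous sketch, greedy generation could look like this:

```python
# Greedy decoding with the toy model: at every step, pick the highest-probability
# token and append it to the sequence.
prompt_ids = [1, 2, 3, 4, 5]               # 'how are you doing ?'
for _ in range(3):                          # generate three more tokens
    with torch.no_grad():
        logits = model(torch.tensor(prompt_ids))
    next_id = int(logits[-1].argmax())      # most probable next token
    prompt_ids.append(next_id)
print(prompt_ids)                           # the prompt followed by the generated tokens
```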

We end up with a Supervised Fine Tuned model (SFT) at the end of this procedure.

The SFT model is where the story usually ends for a majority of ML practitioners at the moment. The model trained on use-case-specific data is now ready to start generating. However, the model is still likely to produce responses that are not preferable to humans. As an example, let’s say we have a prompt- How are you doing?

3 different runs of the model produce 3 different responses:

  1. I am good
  2. Doing good. How about you?
  3. Apple is a nice fruit.

It is easy to see that the 3rd response makes no sense for the given prompt, while it could be argued that the 2nd response is marginally better than the first one. So, not all responses are created equal! What if one could train a model that would produce responses that are more aligned with human preference? This is precisely the goal of the next 2 phases.

2. Train a Reward Model (RM)

One could take the responses produced by the SFT model and ask a human to rate those responses on the basis of how close they are to the responses a human would write for that prompt. Once such a rating is available, it could be used to update the model to align the generation with human preference.

Simulating human preference is the goal of this second step- training a so-called Reward Model. Here the term ‘reward’ refers to the rating mentioned above- a higher reward means that the human is more likely to prefer that response; a lower reward indicates the opposite.

Let’s now look at the model used in this phase and the corresponding data used to train it.

Model

  1. The SFT trained in phase 1 is copied.
  2. The last linear layer of the SFT model is cut off (referred to as the ‘unembedding’ layer in the paper).
  3. A randomly initialized linear layer is slapped on top of the model that outputs a scalar value.

So, we basically end up with a regressor!

Reward model architecture
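A sketch of this surgery on a toy scale (the class, sizes and token ids below are made up; the real reward model starts from a full GPT-sized network):

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy stand-in: the SFT body with its vocabulary head replaced by a scalar head."""
    def __init__(self, vocab_size=11, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)  # stands in for the copied SFT body
        self.value_head = nn.Linear(hidden, 1)         # new, randomly initialized layer

    def forward(self, ids):
        hidden_states = self.embed(ids)                # (sequence_length, hidden)
        return self.value_head(hidden_states[-1])      # a single scalar per sequence

reward_model = TinyRewardModel()
prompt_response = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8])  # prompt + response tokens
reward = reward_model(prompt_response)                    # the scalar 'preference' score
```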

Dataset

  1. Take the SFT model from phase 1.

2. Throw a prompt multiple times (4–9 times) at it to generate multiple responses (4–9 responses).

3. Create pairs of responses from a given prompt. C(4,2) to C(9,2) pairs per prompt.

Pairs of responses generated by the same prompt

4. Ask a human to look at each pair of responses and rank the responses within each pair. Essentially, for a given pair of responses to a prompt, indicate which response in the pair is better.

Human goes to each pair and generates a relative ranking for goodness of response

33k prompts were used in this step. Since 4–9 responses were generated for each prompt, the total number of comparison samples was between [33k X C(4, 2)] and [33k X C(9, 2)].
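As a quick sanity check on those pair counts:

```python
# Number of response pairs per prompt: C(K, 2), with K between 4 and 9.
from itertools import combinations

responses = ['response_a', 'response_b', 'response_c', 'response_d']  # K = 4 (hypothetical)
pairs = list(combinations(responses, 2))
print(len(pairs))   # 6, i.e. C(4, 2); with K = 9 there would be 36 pairs
```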

Training regime

  1. Take a response pair- this pair has already been ranked as explained before.
  2. Concatenate the prompt with each of the responses in the pair. As a result, there would be one winning prompt-response and one losing prompt-response.
Ranking prompt-responses for each combination of responses

3. Pass the winning prompt-response to the reward model and generate a winning reward.

4. Pass the losing prompt-response to the same reward model and generate a losing reward.

5. Take the difference between the two rewards and update the model so as to maximize this difference.

The loss for training the reward model is:
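Reconstructed here in the paper’s notation, where $r_\theta$ is the reward model, $\sigma$ is the sigmoid function, $y_w$ is the winning response, $y_l$ is the losing response and $K$ is the number of responses generated per prompt:

$$\text{loss}(\theta) \;=\; -\frac{1}{\binom{K}{2}} \; \mathbb{E}_{(x,\, y_w,\, y_l) \sim D} \Big[ \log \sigma \big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \Big]$$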

The loss is the negative log-sigmoid of the difference between the rewards of the two responses. Intuitively, the reward model is learning to push the reward of the winning response above that of the losing one.
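Using the toy reward model sketched earlier, one such update could look like this (the token ids are made up; the winning and losing sequences share the same prompt prefix):

```python
import torch
import torch.nn.functional as F

winner_ids = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8])   # prompt + winning response
loser_ids = torch.tensor([1, 2, 3, 4, 5, 9, 10, 9])   # prompt + losing response

reward_win = reward_model(winner_ids)    # scalar reward for the winner
reward_lose = reward_model(loser_ids)    # scalar reward for the loser

# The loss is small when the winning reward is well above the losing reward.
loss = -F.logsigmoid(reward_win - reward_lose)
loss.backward()   # compute gradients; an optimizer.step() would then update the model
```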

Once the model is trained, one can simply concatenate the prompt with a response and get a scalar reward that indicates the degree to which a human will prefer that response for the given prompt.

Reward model inference

So, we now have a way to simulate a human’s preference for a given prompt-response combination.

3. Train a Reinforcement Learning Model (RL)

Until now, we have trained 2 models:

  1. An SFT model trained in phase 1 that is capable of generating a response given a prompt- but it still generates responses that are not aligned with human preferences.
  2. A reward model that gives out a scalar value but is incapable of actually generating any text.

So, now we have to come up with a procedure that utilizes the above 2 models to produce a model that is capable of generating text that aligns well with human preferences. This is the goal of the 3rd phase: training a reinforcement learning model.

Aligning generation with human preference boils down to generating responses that yield a high reward for a given prompt. So, while training the RL model, the aim is to maximize reward. Fortunately, one can formulate this aim mathematically as an objective maximization problem:
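Written in simplified notation (a sketch of the idea; the paper actually optimizes a PPO variant of this, whose extra machinery is omitted here):

$$J(\phi) \;=\; \mathbb{E}_{\,\text{prompt} \sim D,\; \text{response} \sim \pi_\phi} \big[ \, r_\theta(\text{prompt}, \text{response}) \, \big]$$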

where $\pi_\phi$ is the RL model being trained (a copy of the SFT model), $D$ is the collection of prompts, and $r_\theta$ is the reward model from phase 2.

One can take the gradient of this objective function and update the parameters of the model like a regular machine learning objective function, the only difference from the usual case being the use of gradient ascent instead of gradient descent since this is a maximization problem:
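That is, a standard gradient-ascent step (written here for completeness, with $\alpha$ being the learning rate):

$$\phi \;\leftarrow\; \phi + \alpha \, \nabla_{\phi} J(\phi)$$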

With some simple mathematical aerobics, the expression for the above gradient reduces to:
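A standard policy-gradient (REINFORCE-style) form of this, sketched here with the reward weighting the log-probabilities of the tokens (in practice only the response tokens are summed over):

$$\nabla_{\phi} J(\phi) \;\approx\; \mathbb{E}_{(x_1, \ldots, x_T) \sim D} \Big[ \, r_\theta(x_1, \ldots, x_T) \; \nabla_{\phi} \sum_{t} \log \pi_{\phi}(x_t \mid x_{<t}) \Big]$$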

where $D$ refers to all the prompt-response pairs generated during training, $x_t$ denotes the token at timestep $t$, and the gradient of the log-probability term has a tractable value.

Let’s look at the model, the dataset and the training regime for this step.

Model

The SFT trained in phase 1 is copied.

That’s it! No modifications, just copy the SFT model.

Dataset

31k prompts- no responses are available; it’s just a collection of 31k prompts.

Training regime

  1. Throw a prompt at the model.
  2. The model generates a response.
  3. Concatenate the prompt-response and pass it to the reward model to get a reward score for the pair.
  4. Plug the values for reward and probabilities into the gradient equation above.
  5. Use this gradient to update the model parameters.

That’s it!
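Putting the pieces together with the toy models from the earlier sketches (a heavily simplified sketch: the paper uses PPO, whereas this does a single reward-weighted log-likelihood update just to show the idea):

```python
import torch
import torch.nn.functional as F

policy = TinyLM()                                   # a copy of the (toy) SFT model
optimizer = torch.optim.SGD(policy.parameters(), lr=0.01)

prompt_ids = torch.tensor([1, 2, 3, 4, 5])          # 'how are you doing ?'

# Steps 1-2: generate a response (greedily here; sampling is also common).
ids = prompt_ids.clone()
for _ in range(3):
    with torch.no_grad():
        next_id = policy(ids)[-1].argmax()
    ids = torch.cat([ids, next_id.view(1)])

# Step 3: score prompt + response with the phase-2 reward model.
with torch.no_grad():
    reward = reward_model(ids).item()

# Steps 4-5: reward-weighted log-probability of the response tokens, then update.
logits = policy(ids[:-1])                           # logits[t] predicts ids[t + 1]
log_probs = F.log_softmax(logits, dim=-1)
response_log_prob = log_probs[len(prompt_ids) - 1:].gather(
    1, ids[len(prompt_ids):].unsqueeze(1)).sum()    # log-prob of the generated tokens
loss = -reward * response_log_prob                  # minimizing this ascends on the reward
loss.backward()
optimizer.step()
```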

BUT, as it turns out, if you use the regime above, the updates are unstable: they can be too large, which makes the model prone to generating gibberish. To keep the updates in check, one can ensure that they are done such that the resulting model parameters do not end up too far away from the SFT model trained in phase 1.

This notion of distance is enforced through the use of KL divergence (in the paper at least), which measures a proxy of the distance between the SFT model and the model under training (the RL model). Too large a divergence indicates that the model parameters have moved too far away from the SFT model; too low a divergence indicates the opposite. KL divergence is essentially a way to compare two distributions. Remember from phase 1 that at each timestep, the generative model emits a probability distribution over the whole vocabulary. These are the distributions that can be compared.

More importantly, having access to this divergence value, one can directly use it as a part of the objective function. This is accomplished as follows:

  1. For a prompt, response is generated by the RL model.
  2. At each time step of generation, a probability distribution over the vocab is generated.
  3. The same prompt-response is fed to the SFT model and a probability distribution over the same vocab for each timestep is obtained.
  4. Distributions from the above 2 steps are used to calculate KL divergence between SFT and RL model.
  5. The divergence value is subtracted from the reward generated for a prompt-response pair.

For a low divergence value, the reward has more impact on the update than for a high divergence. A hyperparameter beta is added to control the effect of the divergence term on the objective.
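One common way of writing the resulting quantity to be maximized (a per-sequence sketch; in practice the penalty is applied per generated token, but the idea is the same) is:

$$r_\theta(x, y) \;-\; \beta \, \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$$

where $x$ is the prompt, $y$ is the generated response, and the log-ratio term plays the role of the KL penalty described above.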

However, when the authors trained the model with the above objective, they found that the model’s performance dropped on some datasets compared to the performance of a pretrained GPT3 on those datasets. So, if the authors hadn’t gone through all this trouble, they would still obtain better performance on some datasets using a naive pretrained GPT3. A bummer! So, they decided to do something about it: add another term to the objective to ensure that the RL model isn’t allowed to drift very far away from the pretrained model either (this is the model that served as the base for finetuning in the first phase to obtain the SFT model). The objective now becomes:
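Reconstructed here from the InstructGPT paper (their “PPO-ptx” objective), where $\pi_\phi^{RL}$ is the model being trained, $\pi^{SFT}$ is the phase-1 model, $D_{\text{pretrain}}$ is the GPT3 pretraining data, and $\gamma$ is explained below:

$$\text{objective}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{RL}}} \Big[ r_\theta(x, y) - \beta \, \log \frac{\pi_\phi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \Big] \;+\; \gamma \, \mathbb{E}_{x \sim D_{\text{pretrain}}} \big[ \log \pi_\phi^{RL}(x) \big]$$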

The intuition behind the term is that the RL model should also maximize the likelihood of observing the sequences seen during the pretraining of GPT3. GPT3 was pretrained using a bunch of pretraining sequences and the objective was to maximize the likelihood of observing these pretraining sequences. If the RL model also maximizes the likelihood of observing the same sequences, one could argue that it is not very different from the pretrained GPT3. A hyperparameter gamma is added to control the effect of the pretraining term on the objective.

In conclusion, the model is optimized to:

  1. Maximize reward obtained using the reward model.
  2. Minimize divergence from SFT model.
  3. Maximize likelihood of observing sequences seen during pretraining.

After updating the RL model for some steps, one could initialize a new reward model from it, ask humans to generate another dataset following the procedure described in Phase 2 and conduct another training session for the reward model. The intuition here is that the updated RL model is now better at capturing human preference than the SFT model and hence, would make for a better initialization point for the reward model.

Is it a guarantee that the model will now generate responses that a human would prefer? No, it is merely more likely to produce human-preferred responses. We have managed to reduce this gap, not close it.

And this concludes our explanation of the InstructGPT paper!

Give yourself a cookie for making it till the end!
