Reinforcement Learning from Human Feedback (RLHF): Empowering ChatGPT with User Guidance

Zain ul Abideen
6 min read · Jun 26, 2023

Transition from GPT-3.5 to ChatGPT

Introduction

In this blog post, I will discuss how OpenAI transitioned from the GPT-3.5 models to ChatGPT. I will explain the concept of Reinforcement Learning from Human Feedback (RLHF) and how it has helped the model produce non-toxic and factual outputs. In the blog post Autoregressive Models for Natural Language Processing, I discussed the autoregressive nature of Generative Pre-trained Transformers and their architectural details. This post builds on that one, so if you haven’t read it yet, go check it out. After the release of GPT-3 in 2020, the OpenAI team worked on a series of models they call the GPT-3.5 series, trained on a mixture of text and code. The series includes four models: code-davinci-002, text-davinci-002, text-davinci-003, and gpt-3.5-turbo-0301. GPT-3.5-turbo improved on text-davinci-003 and was optimized for chat. The major change the OpenAI team made was the use of reinforcement learning, so let me explain the basics of reinforcement learning first.

Reinforcement Learning

Reinforcement learning is a branch of machine learning in which we train an agent to find an optimal policy (a mapping from situations to actions) that maximizes the cumulative reward. Let me explain it with the help of the example shown below.

RL

The agent starts at location (1,1). It can take one step at a time, i.e. one action. If it reaches the peach at (2,3), it gets a reward of +5, but if it reaches the apple at (3,3), it gets a reward of +10. The catch is that for every step the agent takes into an empty cell, it loses one point (-1). The agent has to devise an optimal policy (sequence of actions) that maximizes the cumulative reward. The simulation ends when the agent reaches either fruit, and it is run many times until the agent learns to maximize its reward. Two policies are shown below:

Two Policies in RL

The cumulative reward for the left policy is +3 and for the right policy is +5, so the right policy is preferred. Reinforcement learning is especially suitable for problems with sequential decision-making, where actions influence subsequent states and rewards.
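To make the arithmetic concrete, here is a minimal Python sketch that scores a policy (here just a fixed path of cells) under the rules above. The two example paths are assumptions standing in for the paths drawn in the figure.

```python
# Score a policy (a fixed path of cells) under the rules above:
# -1 for each empty cell entered, +5 for the peach, +10 for the apple.
TERMINAL_REWARDS = {(2, 3): 5,    # peach
                    (3, 3): 10}   # apple
STEP_PENALTY = -1                 # cost of stepping into an empty cell

def cumulative_reward(path):
    """Sum the rewards collected along a path of visited cells (start cell excluded)."""
    total = 0
    for cell in path:
        if cell in TERMINAL_REWARDS:
            total += TERMINAL_REWARDS[cell]
            break                 # the simulation ends when a fruit is reached
        total += STEP_PENALTY
    return total

# Hypothetical paths standing in for the two policies drawn in the figure.
left_policy = [(1, 2), (1, 3), (2, 3)]                            # 2 empty cells + peach
right_policy = [(1, 2), (2, 2), (2, 1), (3, 1), (3, 2), (3, 3)]   # 5 empty cells + apple

print(cumulative_reward(left_policy))   # -1 - 1 + 5 = +3
print(cumulative_reward(right_policy))  # -1*5 + 10  = +5
```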

Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is an approach in which human guidance is incorporated into the reinforcement learning process. RLHF leverages the expertise and knowledge of human labelers to accelerate and improve the training of reinforcement learning agents. I’ll explain the RLHF process used for ChatGPT in three steps.

Step 1: Supervised Fine-tuning of GPT-3.5

In the first step, a prompt dataset is assembled from prompts spanning various domains. Each prompt is given to a labeler, who writes the most desirable output for that prompt. The prompts and these human-written outputs are combined into a new dataset, which is used to fine-tune the pre-trained GPT-3.5 model. This helps the model learn what kind of outputs humans expect and desire.

SFT
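As a rough illustration of this step, here is a minimal sketch using the Hugging Face transformers and datasets libraries to fine-tune a small causal language model on (prompt, response) demonstration pairs. The base model (gpt2), the file name, and the field names are illustrative assumptions; OpenAI’s actual SFT data and base model are not public.

```python
# Sketch of supervised fine-tuning on labeler demonstrations.
# demonstrations.jsonl is assumed to contain {"prompt": ..., "response": ...} records.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # small stand-in for the (non-public) GPT-3.5 base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_features(example):
    # Concatenate the prompt and the labeler-written response into one training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

dataset = load_dataset("json", data_files="demonstrations.jsonl")["train"]
dataset = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-model", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the resulting SFT model is used in steps 2 and 3
```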

Step 2: Training a Reward model

In the second step, we give the language model a prompt and sample several outputs from it. The model can produce different outputs from a single prompt because of different decoding strategies. The greedy method always outputs the token with the highest probability. The top-k method randomly samples a token from the k tokens with the highest probabilities. Nucleus (top-p) sampling randomly samples a token from the smallest pool of tokens whose probabilities sum to at least p. Temperature controls the stochasticity of the model: higher temperatures give more random outputs, while a temperature of 0 reduces to greedy decoding. The OpenAI Playground is a nice place to experiment with these decoding strategies.
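To make these decoding strategies concrete, here is a minimal NumPy sketch of a single next-token sampling step from a vector of logits. It is a simplified illustration, not the exact sampler used by any particular model.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick a next-token id from raw logits using greedy, top-k, or nucleus sampling."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:                       # greedy: always the most probable token
        return int(np.argmax(logits))

    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits / temperature)       # higher temperature -> flatter distribution
    probs /= probs.sum()

    if top_k is not None:                      # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                      # nucleus: smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()                       # renormalize over the surviving tokens
    return int(np.random.choice(len(probs), p=probs))

# Example: logits over a toy 5-token vocabulary.
print(sample_next_token([2.0, 1.0, 0.5, 0.1, -1.0], temperature=0.7, top_k=3))
```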

Reward model training

After the model has produced multiple outputs, a labeler fills out a form (shown below) for each output. The labeler gives the output a rating and answers a few categorical questions that capture what was wrong with it, for example which ethical consideration the output violated. In this way, all the responses for a prompt are ranked from best to worst.

Rating and Likert scale

All these labels and model responses are then used to train a reward model. The reward model takes two responses to the same prompt and assigns a reward r to each. Its loss function is computed from the human labels and the rewards it assigns: if the response the labelers preferred receives the higher reward, the loss is low; if the rejected response receives the higher reward, the loss is high.

Loss calculation of Reward model
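The pairwise loss described above is commonly written as -log σ(r(x, y_chosen) − r(x, y_rejected)). Here is a minimal PyTorch sketch of that loss, assuming the reward model has already produced scalar rewards for the preferred and rejected responses.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: low when the human-preferred response gets the higher reward.
    loss = -log(sigmoid(r_chosen - r_rejected))."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards the reward model might assign to two pairs of responses.
r_chosen = torch.tensor([1.8, 0.3])    # rewards for the responses labelers preferred
r_rejected = torch.tensor([0.2, 1.1])  # rewards for the responses labelers rejected
print(reward_ranking_loss(r_chosen, r_rejected))
# The loss is small when r_chosen > r_rejected and grows when the ordering flips.
```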

Step 3: Updating policy using PPO

In the third step, we feed a new prompt to the fine-tuned GPT-3.5 obtained in the first step, and the model generates a response. We then pass this prompt and response to the reward model trained in the second step, which assigns a reward value to the response. We use this reward to further train the fine-tuned GPT-3.5: the model has to learn to maximize the reward.

Updating policy

The fine-tuned GPT-3.5 is updated using the reward from the reward model, with the help of Proximal Policy Optimization (PPO). The goal of PPO is to maximize the total reward of the responses generated by the model by including the reward in the loss.

PPO loss

The PPO loss above consists of two main components: a surrogate objective function and a clipping mechanism. The surrogate objective measures the policy’s performance and guides the parameter updates, while the clipping mechanism limits the size of each policy update to keep training stable.
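Here is a minimal PyTorch sketch of the clipped surrogate term for a batch of log-probabilities and advantages. The advantage values below are toy stand-ins for the quantities derived from the reward model, and the function names are my own rather than any library’s API.

```python
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (negated, so minimizing it maximizes expected reward)."""
    ratio = torch.exp(logprobs_new - logprobs_old)       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()         # clipping limits the policy update

# Toy numbers standing in for per-token log-probabilities and reward-derived advantages.
new_lp = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.7, -1.5])
adv = torch.tensor([0.8, 1.2, -0.4])

loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()   # gradients flow only through the new policy's log-probabilities
print(loss.item())
```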

Closing Remarks

In conclusion, the incorporation of reinforcement learning from human feedback (RLHF) marks a significant advancement in natural language processing. By harnessing the expertise and guidance of human labelers, RLHF has the potential to transform how intelligent agents are trained and how they behave. Learning from human feedback has led ChatGPT to produce more desirable, non-toxic, and factual outputs. In the next blog post, I will cover parameter-efficient fine-tuning (PEFT) techniques such as LoRA, Prefix Tuning, P-Tuning, and QLoRA in detail.

Thank you for reading!

Follow me on LinkedIn!
