RLHF + Reward Model + PPO on LLMs

Madhur Prashant
12 min read · Sep 10, 2023


Purpose

The purpose of this blog is a deep dive into Reinforcement Learning with Human Feedback (RLHF), the reward model, and the sub-concepts that work under the hood on Large Language Models (LLMs), such as Proximal Policy Optimization (PPO). We will then do a partial code walkthrough of instilling RLHF and your own reward model into a personalized model. Finally, I will briefly touch on model toxicity and hallucinations, and how we can create products and generative AI lifecycles that aim to be helpful, honest, harmless, reliable, and aligned with human feedback and the users in the space.

Generative AI Lifecycle — Human Aligned LLMs

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

In one of my previous blogs, I talked about the main reason we have this generative AI project lifecycle and how we can use it as a reference for end-to-end project/product deployment in the field of generative AI. Please refer to the earlier blogs for more information on this; in this blog, we will focus on fine-tuning and adapting the model to the domain that your product works in. We will then do a deep dive on aligning the model with human feedback, using RLHF or Reinforcement Learning with Human Feedback.

Adapting and Aligning the Model

Here, three components play a major role in making the model everything it really is. At the end of the day, we can focus on choosing the model and optimizing its deployment, but what really makes an LLM a valid part of your solution is that it acts in a way that is accurate, real-time, reliable, and aligned with its users through RLHF, after plenty of prompt engineering (the trial-and-error process of engineering prompts to get the correct response from the model) and fine-tuning methods that are effective and optimal (such as PEFT and LoRA; refer to my earlier blogs if you don't know what these fine-tuning techniques do).

In my opinion, when I think about working on a project or product from scratch that requires a generative AI based solution, I look at fine-tuning, understanding the prompts, better task completion, and more human and natural sounding language as major components of the product. Fine-tuning helps make the model more domain suitable, and aligning it with human feedback keeps it from behaving badly, hallucinating, or providing illegal assistance. Another efficient way to address this is by using RAG and LangChain, which are supported in several ways and can use services like Amazon Bedrock, a fully managed service, to create embeddings and vector stores and then answer from the retrieved context, and the context only, which greatly reduces the opportunity for hallucinations.

Aligning the model with human feedback? Who cares?

Let's take a quick use case: imagine that we are creating an LLM-based conversational AI product that provides therapy to humans going through harsh periods of time. Imagine if we trained the model but did not make it human aligned, and it offered illegal ways for these humans to feel better, such as substance abuse. That would promote harmfulness and a lack of reliability and helpfulness. As the CTO of OpenAI has also said, models in this space are catching up in terms of being more reliable, more aligned, and less prone to hallucination, and the only way that is possible is by using human feedback from a diverse population, along with other approaches such as RAG and LangChain to provide context-based responses. This step in the generative AI lifecycle maximizes helpfulness, minimizes harm, and avoids discussions of and engagement with dangerous topics.

Before we get into the deep dive on RLHF, if you don't know much about reinforcement learning, I would recommend going over it first. Essentially, as shown in the image below, reinforcement learning involves an action being taken, which puts the environment in a new state that is evaluated with a reward 'r'.

Here, we use something called an RL policy (a model) that learns which actions lead to positive rewards. Based on this data, the policy evolves over several epochs or iterations to discover effective strategies. LLMs go through these iterations, called "rollouts". A minimal sketch of this loop follows.
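To make the loop concrete, here is a minimal, purely illustrative sketch in Python (the placeholder policy and environment are made up for this post, not part of any library):

import random

def policy(state):
    # Placeholder policy: pick one of two actions at random.
    return random.choice(["action_a", "action_b"])

def environment_step(state, action):
    # Placeholder environment: returns the next state and a reward for the action.
    reward = 1.0 if action == "action_a" else 0.0
    return state + 1, reward

state = 0
for rollout in range(3):  # each pass is one "rollout"
    action = policy(state)
    state, reward = environment_step(state, action)
    # In a real setup, the reward would be used here to update the policy's weights.
    print(rollout, action, reward)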

Where does RL fit in with LLMs?

Here, we have the agent and the environment. In the context of LLMs, the policy is our instruct LLM, pre-trained or fine-tuned from before, and we now want it to generate text in a given domain. The LLM takes the current context window (the environment state) as input, takes an action by generating a completion, and receives a reward for that completion. The reward is fed back to update the policy, and this is where human feedback comes into the picture.

At scale, we cannot keep humans in the loop for this task all the time, and this is where a reward model comes in:

Reward Model

Based on the human feedback, we train a reward model. This trained model then makes the call during the RLHF process, assigning rewards to different prompt completions without humans in the loop. Each full iteration of this process is called a "rollout".

How to prepare a dataset with human feedback?

This is an essential part of the rollout process and is what makes the model everything it needs to be in terms of reliability and human alignment. Above, you can see how the instruct (initial) LLM takes in the prompt dataset and generates text from those prompts; these generated texts are then scored and ranked by humans, and the reward model is trained on that human feedback. A hypothetical slice of such a dataset is sketched below.
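For illustration, a single record of such a human-feedback dataset might look like this (the field names here are hypothetical, not a specific library's schema):

human_feedback_record = {
    "prompt": "Summarize the following conversation. ...",
    "completions": [
        "Tommy didn't enjoy the movie.",
        "The movie was terrible and so is Tommy.",
    ],
    "human_rank": [1, 2],  # first completion ranked best by the labeler
}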

I have always wondered: what if we wanted to create a model that does several tasks, where different domains of human feedback are needed, and we do not want to mix feedback from different domains together? We would probably create a serial pipeline of reward models (just like a serial pipeline of models on SageMaker) working together, where the relevant reward model is invoked to participate in the rollout or prompt completion and give the response. That would be pretty cool and efficient, right?

Nevertheless, to build this dataset, we define our model alignment criteria based on product and business goals and on the prompts we give our instruct LLM. We then have human labelers rank the generations against the success metrics (such as the 3H's). I have talked about human-aligned models in previous blogs; if you want to dive deeper into this, please refer to those.

One thing to note here is the importance of normalizing the rankings we get from the human labelers. We take the prompts, the completions, and the human ranks, and then generate all pairwise combinations along with the rewards implied by the rankings, as sketched below.
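Here is a small sketch of that step, assuming one labeler's ranking of three completions (lower rank = better); it expands the ranking into ordered (chosen, rejected) pairs that the reward model can be trained on:

from itertools import combinations

completions = ["summary A", "summary B", "summary C"]
ranks = [2, 1, 3]  # rank assigned by the human labeler; 1 is best

pairs = []
for i, j in combinations(range(len(completions)), 2):
    # Order each pair so the preferred completion always comes first.
    chosen, rejected = (i, j) if ranks[i] < ranks[j] else (j, i)
    pairs.append({"chosen": completions[chosen], "rejected": completions[rejected]})

print(pairs)  # 3 pairwise comparisons from one ranking of 3 completions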

If you are creating your own customized model, you need to feed the preferred outputs into your reward model first.

Now that we have the dataset ready with all of the human feedback, let’s go ahead and do something super cool! Train our reward model, which acts as the heart of the rollout process in making the model more human aligned.

Reward Model Creation & Training

The reward model assigns and stores the rewards during the RLHF process. We feed prompt completions into the reward model and train it to predict the preferred response.

  • Feed the preferred completions into the reward model first so it is trained on those responses, as seen in the image below (a quick sketch of the pairwise training loss follows).
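Below is a minimal sketch of the pairwise loss commonly used for this step (the standard formulation, not necessarily the exact code of any particular lab): the model is pushed to score the human-preferred completion higher than the rejected one.

import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for a batch of two comparisons.
reward_chosen = torch.tensor([1.3, 0.4])     # r(prompt, preferred completion)
reward_rejected = torch.tensor([-0.2, 0.1])  # r(prompt, rejected completion)

# loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch;
# minimizing it widens the margin between preferred and rejected scores.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)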

Pre Train Reward Model To Reduce Toxicity

We can then feed new prompt completion pairs into the reward model, which acts as a binary classifier, producing logits for the positive and negative classes; we later use the positive-class logits as the rewards.

Fine Tuning with RLHF (PPO & KL Divergence)

This is my personal favorite part. We can fine-tune the model all we want, but what is fine-tuning without RLHF? Nothing! The future of generative AI lies in reliability, helpfulness, and truthfulness, and fine-tuning the model with RLHF is the way to get there. Let's talk about it:

Here it is, in sum (images from Google search and deeplearning.ai):

  1. We start with a prompt dataset and pass it into our initial LLM.
  2. We feed the prompts into the instruct LLM and collect its completions.
  3. We feed the prompt completions into the reward model we trained earlier. The reward model scores each completion and passes the score to our reinforcement learning algorithm.
  4. The RL algorithm we use here is Proximal Policy Optimization (PPO), which runs several prompt completion experiments, averages their rewards, and uses backpropagation to update the policy toward the higher scoring responses, which flow back into our instruct LLM (see the sketch after this list).
  5. We repeat this rollout for several iterations and end up with a human-aligned LLM, but there is one downside to this.
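To make this flow concrete, here is a condensed, illustrative sketch of the rollout loop using trl's PPOTrainer (the same library installed in the walkthrough below). The objects ppo_model, ref_model, tokenizer, and dataset are the ones built later in this post, and score_with_reward_model is a hypothetical helper that returns the reward model's nothate logit as a scalar tensor:

from trl import PPOConfig, PPOTrainer

config = PPOConfig(model_name=model_name, learning_rate=1.41e-5, batch_size=16)
ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,      # PEFT model with a value head
                         ref_model=ref_model,  # frozen reference copy
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=lambda data: {key: [d[key] for d in data] for key in data[0]})

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    # Steps 1-2: generate completions from the current policy.
    response_tensors = [ppo_trainer.generate(q, max_new_tokens=100).squeeze() for q in query_tensors]
    # Step 3: score each completion with the reward model.
    rewards = [score_with_reward_model(r) for r in response_tensors]
    # Steps 4-5: one PPO update; the KL penalty against ref_model is applied inside step().
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)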

What if our model keeps chasing ever-higher rewards and starts producing outputs that are bizarre, vague, and no longer human aligned?

That is where we use the following solution to this limitation:

We take the original model as a reference model and freeze all of its weights so it acts as a fixed point of comparison for our human-aligned model. Based on how far the updated model shifts away from it, a KL divergence penalty is added to the reward, so that when the model starts drifting into bizarre completions, the penalty pulls it back close to the reference model, keeping responses positive but not absurdly so. We can also use a PEFT adapter (parameter efficient fine-tuning; see my earlier blogs) to train our PPO model, and the model becomes more and more aligned as rollouts occur. A rough sketch of the penalty follows.
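Here is a tiny, purely conceptual sketch of that reward shaping (the numbers are made up; trl applies a per-token version of this inside its PPO step):

import torch

# Hypothetical per-token log-probabilities of one completion under each model.
policy_logprobs = torch.tensor([-1.2, -0.8, -2.5])     # RL-updated (PPO) model
reference_logprobs = torch.tensor([-1.0, -0.9, -2.0])  # frozen reference model

# Simple KL estimate: how far the policy has drifted from the reference.
kl_estimate = (policy_logprobs - reference_logprobs).sum()

reward_from_reward_model = 1.7  # e.g. the "nothate" logit for this completion
kl_coef = 0.2                   # penalty strength (init_kl_coef in trl's PPOConfig)

# The quantity PPO actually optimizes: the reward minus the KL penalty.
shaped_reward = reward_from_reward_model - kl_coef * kl_estimate
print(shaped_reward)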

Code Walkthrough — Reward Model, PPO, Toxicity

UTILIZING PEFT + LoRA + PPO: Evaluating and Reducing Toxicity in the Model, Making It More Human Aligned and Following the 3H's (Helpful, Honest, Harmless)

First, let’s go ahead and set up the kernel and install the dependencies that we need to get this started:

%pip install --upgrade pip
%pip install --disable-pip-version-check \
torch==1.13.1 \
torchdata==0.5.1 --quiet
%pip install \
transformers==4.27.2 \
datasets==2.11.0 \
evaluate==0.4.0 \
rouge_score==0.1.2 \
peft==0.3.0 --quiet
# Installing the Reinforcement Learning library directly from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd

Now we will import the components needed to load our model, prepare our reward model, and get our toxicity evaluator in place:

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

Loading our Model (Let's Experiment): Code Llama (a Llama 2 based model)

Base ctransformers with no GPU acceleration:
!pip install "ctransformers>=0.2.24"
Or with CUDA GPU acceleration:
!pip install "ctransformers[cuda]>=0.2.24"
Or with ROCm GPU acceleration:
!CT_HIPBLAS=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers
Or with Metal GPU acceleration for macOS systems:
!CT_METAL=1 pip install "ctransformers>=0.2.24" --no-binary ctransformers

Load model directly:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")

huggingface_dataset_name = "knkarthick/dialogsum"
dataset_original = load_dataset(huggingface_dataset_name)
dataset_original
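Before preprocessing, it can help to peek at one raw record as an optional sanity check (the dialogsum dataset exposes dialogue and summary fields):

print(dataset_original["train"][0]["dialogue"][:200])
print(dataset_original["train"][0]["summary"])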

Preprocessing a part of the dataset below

def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # Load the dataset (only the "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare the tokenizer. Setting device_map="auto" allows switching between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):

        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

# The lab code below assumes the FLAN-T5 base model that the dialogue-summary adapter was
# trained on (an assumption based on the lab source; the Code Llama experiment above is separate).
model_name = "google/flan-t5-base"

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)

Function source (deeplearning.ai)

Now, let's prepare our function to count the trainable model parameters (source: deeplearning.ai):

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

Now we add the adapter to the base model and pass it to the constructed PEFT model, setting is_trainable=True.

lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       '/kaggle/input/generative-ai-with-llms-lab-3/lab_3/peft-dialogue-summary-checkpoint-from-s3/',
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)
print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)
print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

ref_model = create_reference_model(ppo_model)
print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

We use Meta AI's RoBERTa-based hate speech model as the reward model. This model outputs logits and predicts probabilities across two classes: nothate and hate. The logit for the nothate class is taken as the positive reward, and the model will then be fine-tuned with PPO using those reward values.

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

With the toxicity model loaded, we can take completions from our Code Llama model, score them with the hate speech classifier, and use the nothate logit as the reward, as in the example below:


Hate speech reward model example

non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."
toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids
logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')
# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')
# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

You can also evaluate the toxicity as follows:

toxicity_evaluator = evaluate.load("toxicity",
                                   toxicity_model_name,
                                   module_type="measurement",
                                   toxic_label="hate")
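As a quick sanity check, we can then call the evaluator on a list of texts; it returns one toxicity score per prediction (lower is better), for example on the non-toxic sentence from earlier:

toxicity_score = toxicity_evaluator.compute(predictions=[non_toxic_text])
print(toxicity_score["toxicity"])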

Conclusion

I know we did not cover the end-to-end code walkthrough, but it is essential to realize the importance of evaluating the model's toxicity, fine-tuning with PPO, and keeping the model aligned using the KL divergence penalty. I am planning a deep dive into an end-to-end code example with Code Llama and how we can reduce its hallucinations using RAG and LangChain in the next blog. If you are looking for something else or want to collaborate, reach out!


Madhur Prashant

Learning is my passion, so is the intersection of technology & strategy. I am passionate about product. I work @ AWS but these are my own personal thoughts!