RLHF Reward Model Training

Amit Khandelwal
Towards Generative AI
5 min read · Aug 11, 2023


A popular technique to fine-tune large language models with human feedback is called reinforcement learning from human feedback, or RLHF for short.

In RLHF, the LLM's weight updates are driven by the reward (feedback) that a user gives for the completion generated by the LLM. Determining the reward is a complicated task. One way to do this is to have a human evaluate all the completions of the model against some alignment metric, such as whether the output is helpful or not. This feedback is a scalar quantity. The LLM weights are then updated iteratively to maximize the reward obtained from the human evaluator.

Data Collection

Obtaining human feedback is time-consuming and expensive. As a workaround, we can train another model, called a reward model, as a proxy for human feedback. The goal of a reward model is to evaluate how well a model response aligns with human preferences. Put simply, a reward model takes a (prompt, response) pair as input and outputs a reward/score. This can be formulated as a simple regression or classification task. The real challenge in building such a model is obtaining a good-quality dataset: the perception of good/bad differs from person to person, and mapping it to an absolute scalar quantity is infeasible.

Data Collection for Reward Model

One workaround is to ask labelers to compare two responses and decide which one is better. This kind of dataset is called a comparison dataset; each record comprises a (prompt, chosen response, rejected response) triplet.

Example of comparison dataset
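As an illustration, a single comparison record could look like the sketch below (the prompt and responses are invented; the field names mirror those used in the TRL example later in the post).

comparison_example = {
    "instruction": "How do I reset my password?",  # the prompt
    "chosen_response": "Go to Settings > Account > Reset Password and follow the emailed link.",
    "rejected_response": "Just make a new account.",
}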

Training

To train a reward model, the comparison dataset should be in the format (prompt, chosen response, rejected response), i.e. the better option comes first. The ordering is crucial because it is the base assumption when designing the loss function for the reward model. Any model that can take variable-length text as input and output a scalar value can be used. Typically we take an SFT model that aligns with our task, remove the final de-embedding (language-modeling) layer, and add a single neuron in the last layer for the scalar output.
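With Hugging Face Transformers, one way to get such a scalar-output model (and what the TRL example below does with distilroberta-base) is to load a checkpoint as a sequence classifier with a single label, which swaps the language-modeling head for a one-neuron head. The checkpoint name here is just a placeholder.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

sft_checkpoint = "your-sft-checkpoint"  # placeholder: any SFT model that fits your task
reward_model = AutoModelForSequenceClassification.from_pretrained(sft_checkpoint, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
# num_labels=1 means the model emits a single scalar logit per input sequence,
# which we interpret as the reward for that (prompt, response) pair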

Training of reward model

For every epoch, we do two passes of the model. In the first pass, we feed the prompt and the chosen response to the reward model; the output is R_chosen. In the second pass, we feed the same prompt along with the rejected response; the output in this case is R_rejected. Next, we use the loss function defined below to update the reward model.

The intuition behind the loss function is to maximize the gap between the chosen response's score and the rejected response's score. When the chosen response gets a very high reward and the rejected response a low one, the loss approaches 0.
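Concretely, the standard pairwise loss (and the one TRL's RewardTrainer implements) is the negative log-sigmoid of the score difference; a minimal sketch in code:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # loss = -log(sigmoid(R_chosen - R_rejected))
    # a large positive gap (chosen scored far above rejected) drives the loss toward 0,
    # while a negative gap produces a large loss, pushing the model to widen the gap
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# quick check with made-up scores
print(pairwise_reward_loss(torch.tensor([5.0]), torch.tensor([-5.0])))  # ~0.00005 (near zero)
print(pairwise_reward_loss(torch.tensor([0.5]), torch.tensor([0.4])))   # ~0.64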

TRL Custom Reward Modeling

A reward model is a proxy for human feedback: it takes a (prompt, response) pair as input and returns a score based on human preference. TRL supports custom reward modeling, letting anyone perform reward modeling on their own dataset and model.

Import necessary packages and libraries

import random
import pandas as pd
from operator import itemgetter
import torch
import warnings
warnings.filterwarnings('ignore')
from datasets import Dataset, load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer

Comparison Dataset

In this section, we convert a dataset of (question, answer, feedback) tuples into a comparison dataset, since reward model training requires the data to be in the form of (question, chosen answer, rejected answer) tuples.

The feedback.csv file contains multiple answers for a given question, each rated by a human. These ratings can reflect any human value that we want the model output to embody.

For example, if we want the model to produce helpful answers, we can instruct annotators to give higher feedback scores to helpful answers than to the other answers for a particular question.
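The contents of feedback.csv are not shown in the post; based on the columns the code below expects (question, answer, feedback), a hypothetical file could be built like this:

import pandas as pd

# hypothetical rows: several human-rated answers for the same question, scored on a 1-5 scale
sample = pd.DataFrame([
    {"question": "How do I reset my password?", "answer": "Go to Settings > Account > Reset Password and follow the emailed link.", "feedback": 5},
    {"question": "How do I reset my password?", "answer": "Contact support and wait a few days.", "feedback": 3},
    {"question": "How do I reset my password?", "answer": "Just make a new account.", "feedback": 2},
])
sample.to_csv("feedback.csv", index=False)  # only needed if you want to run the snippets end to end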

df = pd.read_csv('feedback.csv')
df.head()

Once we have collected feedback from users, we can convert this dataset into a comparison dataset for reward model training.

df['tup'] = list(zip(df['answer'], df['feedback']))
# group together all the answers for a given question along with their feedback
df_g = df.groupby('question')['tup'].apply(list).reset_index()
# sort each group by feedback score (ascending)
df_g["sorted_tup"] = df_g["tup"].apply(lambda x: sorted(x, key=itemgetter(1)))
# answer with the highest feedback score is "chosen"
df_g["chosen"] = df_g["sorted_tup"].apply(lambda x: x[-1][0])
df_g["chosen_score"] = df_g["sorted_tup"].apply(lambda x: x[-1][1])
# answer with the lowest feedback score is "rejected"
df_g["rejected"] = df_g["sorted_tup"].apply(lambda x: x[0][0])
df_g["rejected_score"] = df_g["sorted_tup"].apply(lambda x: x[0][1])
df_g = df_g.dropna()
# keep only pairs where the chosen answer is clearly preferred over the rejected one
df_g = df_g[(df_g['chosen_score'] >= 4.0) & (df_g['rejected_score'] < 4.0)]

rows = []
for record in df_g.itertuples(index=True, name='Pandas'):
    rows.append({
        "instruction": record.question,
        "chosen_response": record.chosen,
        "rejected_response": record.rejected
    })
prepared_dataset = Dataset.from_list(rows)
prepared_dataset.to_pandas()

Train the reward model with TRL

In this example, we will fine-tune the “distilroberta-base” model. The formatting_func function combines the instruction with the chosen and rejected responses, creating two new strings. These strings are tokenized and become the input for the reward model, which learns to distinguish between good and bad responses from these examples. The loss function is designed to maximize the difference between the scores of the chosen and rejected responses. We use TRL’s RewardTrainer to fine-tune the base model. It is a subclass of the transformers.Trainer class and inherits all of its attributes and methods.

# Select a base model which we need to train for reward modeling.
model_name = "distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id
def formatting_func(examples):
    kwargs = {"padding": "max_length", "truncation": True, "max_length": 512, "return_tensors": "pt"}
    prompt_plus_chosen_response = examples["instruction"] + "\n" + examples["chosen_response"]
    prompt_plus_rejected_response = examples["instruction"] + "\n" + examples["rejected_response"]
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)
    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0], "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0], "attention_mask_rejected": tokens_rejected["attention_mask"][0]
    }

formatted_dataset = prepared_dataset.map(formatting_func)
formatted_dataset = formatted_dataset.train_test_split()
# Configuring the training arguments
training_args = TrainingArguments(
    output_dir="./reward_model",
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=1,
    num_train_epochs=10,
    report_to=None,
)
# Loading the RewardTrainer from TRL
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
)
trainer.train()
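Once training finishes, the reward model can score new (prompt, response) pairs using the same prompt-plus-response formatting; a minimal sketch with made-up inputs, reusing the model and tokenizer objects from above:

def get_score(prompt, response):
    # format the pair exactly as in formatting_func: prompt, newline, response
    inputs = tokenizer(prompt + "\n" + response, truncation=True, max_length=512, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # the single logit from the num_labels=1 head is the reward score
        return model(**inputs).logits[0][0].item()

prompt = "How do I reset my password?"  # hypothetical prompt
print(get_score(prompt, "Go to Settings > Account > Reset Password."))  # expected to score higher
print(get_score(prompt, "Just make a new account."))                    # expected to score lower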

Conclusion

In this blog, we have seen how the RewardTrainer can be used to train a custom reward model on our own feedback data. The reward model should be trained on a dataset of paired examples, where each example is a tuple of two sequences (a chosen and a rejected response for the same prompt).

Implementation of reward model training using the TRL library can be found on GitHub.

References :

  1. https://huggingface.co/blog/rlhf
  2. https://github.com/lvwerra/trl
  3. https://docs.argilla.io/en/latest/guides/llms/examples/train-reward-model-rlhf.html
  4. https://www.youtube.com/watch?v=lYRWzCPGM2Q&ab_channel=AIology

Follow Towards Generative AI for more on the latest in AI.
