Comparative Study: Training OPT-350M and GPT-2 on Anthropic’s HH-RLHF Dataset Using Reward-Based Training

Drishti Sushma
Sep 11, 2023


Introduction

In the rapidly evolving field of Natural Language Processing, assessing model performance on specific datasets is crucial for gauging their adaptability and efficiency. This study compares two well-known models, Facebook's OPT-350M and GPT-2, both trained and tested as reward models on the Anthropic/hh-rlhf dataset (160,800 rows), offering insights into their respective performance and potential strengths.

Data Preprocessing and Reward Modeling

The foundation of any ML experiment is accurate data processing. The Anthropic/hh-rlhf dataset comprises paired samples of "chosen" and "rejected" sequences. We define a preprocessing function that tokenizes, truncates, and formats the sequences into a uniform structure.

TRL supports custom reward modeling, and its RewardTrainer expects the dataset in a very specific format: since the model is trained to predict which of two sentences is preferred, each example must carry both sequences.

The entries should be named

  • input_ids_chosen
  • attention_mask_chosen
  • input_ids_rejected
  • attention_mask_rejected

In the code below, the j and k suffixes denote the chosen and rejected sentences of each pair.

def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        # Tokenize the chosen (j) and rejected (k) responses
        tokenized_j = tokenizer(chosen, truncation=True)
        tokenized_k = tokenizer(rejected, truncation=True)

        new_examples["input_ids_chosen"].append(tokenized_j["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_j["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_k["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_k["attention_mask"])

    return new_examples

With the processed and formatted dataset, the RewardTrainer works much like Hugging Face's Trainer. The recommended choice of reward model is an AutoModelForSequenceClassification: trained on paired examples, where each example is a tuple of two sequences, it learns to score which sequence in a pair is the more relevant one for the task at hand.
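For reference, the tokenizer and dataset used throughout the snippets can be loaded roughly as follows. This is a minimal sketch: using only the train split and the pad-token workaround for GPT-2 are assumptions, not details reported in the original run.

from datasets import load_dataset
from transformers import AutoTokenizer

# Paired preference data ("chosen"/"rejected") from the Hugging Face Hub
dataset = load_dataset("Anthropic/hh-rlhf")
train_dataset = dataset["train"]

# Tokenizer for the reward model (swap in "facebook/opt-350m" for the OPT run)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 ships without a pad token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token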

To ensure the models only consider sequences within a specific length, we applied the preprocessing map and then filtered out overly long pairs:

train_dataset = train_dataset.map(preprocess_function, batched=True, num_proc=4)
train_dataset = train_dataset.filter(lambda x: len(x["input_ids_chosen"]) <= 512 and len(x["input_ids_rejected"]) <= 512)

Quantizing the Model

Quantization can reduce model size and inference time without a significant decrease in performance. For this experiment, both the OPT-350M and GPT-2 models were quantized to 4-bit precision:

from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_8bit=False,
    load_in_4bit=True,
)

# Reward model: a sequence classifier with a single scalar output (num_labels=1)
model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    quantization_config=quantization_config,
    device_map={"": 0},
    trust_remote_code=True,
    num_labels=1,
)
model.config.use_cache = False
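The OPT-350M reward model was loaded the same way; a sketch with the identical quantization config, assuming the standard facebook/opt-350m checkpoint:

# Same setup, pointed at the OPT-350M checkpoint
opt_model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quantization_config,
    device_map={"": 0},
    trust_remote_code=True,
    num_labels=1,
)
opt_model.config.use_cache = False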

Setting Up Reward-based Training

The aim was to fine-tune the models with a reward-modeling objective, and this is where the trl library comes into play. We used the RewardTrainer class:

from trl import RewardTrainer

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    max_length=512,
)

In the PEFT configuration, parameters such as r and lora_alpha were set to keep the number of trainable parameters small and make training more efficient.
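The exact hyperparameters are not central to the comparison; a minimal sketch of the peft_config and training_args referenced above, with illustrative (assumed) values, could look like this:

from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# Illustrative LoRA settings; the values actually used may differ
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # the reward model is a sequence classifier
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
)

# Illustrative training arguments (10,000 steps matches the loss curves discussed below)
training_args = TrainingArguments(
    output_dir="reward_model",
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    max_steps=10000,
    logging_steps=100,
)

# In the full script these configs are defined before the RewardTrainer above; then:
trainer.train()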

Insights & Results

Training Loss and Training Time Comparison
  1. Training Loss: Both models converge to a similar loss by the 10,000th step, but the trajectories differ. OPT-350M starts from a lower initial loss, while GPT-2 starts higher and brings it down at a more gradual pace. This suggests that OPT-350M may have an initial advantage in adapting to the dataset, but both models eventually reach a similar level of performance.
  2. Training Time: GPT-2 trained noticeably faster, completing in 33:18, while OPT-350M required 1:23:54. Part of the gap is likely plain model size: the base GPT-2 checkpoint has roughly 124M parameters against OPT's 350M, so each training step is cheaper. Beyond that, GPT-2's architecture or its fit with the specific characteristics of the Anthropic/hh-rlhf dataset may also contribute to quicker convergence.

Future Exploration

It would be worthwhile to delve deeper into why such a significant difference in training time exists, considering both models are prominent in the NLP field. This could lead to insights about model architecture, optimization, or even dataset-specific peculiarities.

Conclusion

In our study comparing OPT-350M and GPT-2 on the Anthropic/hh-rlhf dataset, distinct training patterns emerged. While OPT-350M started from a lower loss, GPT-2 showed a steadier descent and trained much faster overall. This difference underscores the distinct efficiencies of each model and how they interact with the dataset. It is worth stressing that these observations depend on various factors, including dataset choice and preprocessing. Reward-based training emerges as a potent technique in language modeling, highlighting the need to match model strengths to specific tasks for optimal outcomes.
