Direct Preference Optimization (DPO) in LLMs

Rishi
4 min read · Jan 18, 2024


Introduction

Language models have revolutionized natural language processing (NLP) by learning broad world knowledge and reasoning skills. However, steering these models so their behavior aligns with specific preferences remains a challenge. Existing methods, such as Reinforcement Learning from Human Feedback (RLHF), require complex procedures and often suffer from unstable training. In this article, we will explore Direct Preference Optimization (DPO), an approach built on a new parameterization of the reward model that offers a stable, performant, and computationally lightweight way to fine-tune language models on preference data.

Understanding the Challenges of RLHF

Reinforcement Learning from Human Feedback (RLHF) has been a popular method for aligning language models with human preferences. It involves training a reward model that reflects human preferences and then fine-tuning the language model with reinforcement learning to maximize this estimated reward. However, RLHF is complex and often unstable: it requires significant hyperparameter tuning and repeated sampling from the language model during training. This complexity hinders its widespread adoption.

What is Direct Preference Optimization (DPO)?

Direct Preference Optimization (DPO) is an approach to fine-tuning language models that eliminates the need to fit a separate reward model or perform extensive hyperparameter tuning. Unlike RLHF, DPO reformulates the constrained reward maximization problem as a simple classification loss over preference data, which allows the corresponding optimal policy to be extracted in closed form and results in a stable and efficient algorithm.
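Concretely, the DPO paper shows that the optimal policy of the usual KL-constrained reward-maximization objective can be written in closed form in terms of the reward; substituting that relationship into a Bradley-Terry preference model gives a simple per-pair loss. In the paper's notation, with y_w the preferred response, y_l the rejected one, π_ref the frozen reference policy, and β the strength of the KL constraint:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]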

The Two Stages of DPO

The DPO pipeline consists of two main stages: Supervised Fine-tuning (SFT) and Preference Learning.

Supervised Fine-tuning (SFT)

In the initial stage of DPO, the language model is fine-tuned on a dataset of interest. This dataset provides a clear mapping between specific inputs and the desired outputs. Supervised fine-tuning ensures that the model’s responses align more closely with specific requirements and preferences. For example, a conversational AI model could be fine-tuned to provide user-friendly, detailed responses that match a company’s tone and branding.

Preference Learning

After the supervised fine-tuning stage, the model undergoes preference learning using preference data. Preference data consists of curated options or alternatives related to a specific prompt. Annotators rank these options based on human preferences, providing insights into fine-tuning the model to produce outputs that align with human expectations.

The Simplicity of DPO

The simplicity of DPO lies in defining the preference loss directly as a function of the policy, eliminating the need to train a reward model first. During the fine-tuning phase, DPO treats the language model itself as an implicit reward model. It optimizes the policy with a binary cross-entropy objective over human preference data, comparing the preferred and rejected responses and adjusting the policy so that preferred responses become more likely relative to the reference model.
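To make this concrete, here is a minimal sketch of the per-pair DPO loss in plain PyTorch. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the policy and the frozen reference model; the function name and arguments are illustrative, not part of any library.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference
    # for the chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # Binary cross-entropy on the reward margin: -log sigmoid(margin)
    # pushes the policy to rank the chosen response above the rejected one.
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean(), chosen_rewards, rejected_rewards

The average of (chosen_rewards > rejected_rewards) is essentially the rewards/accuracies metric that the TRL trainer reports during training.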

Implementing DPO with TRL

To implement Direct Preference Optimization, the TRL (Transformer Reinforcement Learning) library provides a streamlined approach with its DPO Trainer. The following steps outline the process:

Supervised Fine-Tuning (SFT):
Begin by training the SFT model using a dataset that is in-distribution and relevant to the task at hand. This step sets the stage for the DPO algorithm to work effectively.

from transformers import AutoModelForCausalLM
from datasets import load_dataset
from trl import SFTTrainer

# Load an in-distribution training dataset and the base model to fine-tune.
dataset = load_dataset("your-domain-dataset", split="train")
model = AutoModelForCausalLM.from_pretrained("your-foundation-model-of-choice")

# Supervised fine-tuning on the dataset's raw "text" field.
trainer = SFTTrainer(model, train_dataset=dataset,
                     dataset_text_field="text", max_seq_length=512)
trainer.train()

Understanding the Dataset Format:
The DPO trainer requires a specific dataset format that expresses a preference between two responses to the same prompt. Each entry should therefore contain a prompt, a chosen response, and a rejected response.

dpo_dataset_dict = {
    "prompt": ["hello", "how are you", ...],
    "chosen": ["hi, nice to meet you", "I am fine", ...],
    "rejected": ["leave me alone", "I am not fine", ...],
}

Leveraging the DPOTrainer:
Initialize the DPOTrainer, specifying the model to be trained, a reference model (used to compute implicit rewards), the beta hyperparameter for the implicit reward, and the dataset.
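The call below assumes that a tokenizer, a frozen reference model, and the usual Hugging Face TrainingArguments already exist. A minimal sketch of that setup (the model name, output directory, and batch size are placeholders) could look like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# The reference model is a frozen copy of the (SFT) policy; it anchors the
# implicit reward so the fine-tuned policy does not drift too far from it.
tokenizer = AutoTokenizer.from_pretrained("your-foundation-model-of-choice")
model_ref = AutoModelForCausalLM.from_pretrained("your-foundation-model-of-choice")
training_args = TrainingArguments(output_dir="dpo-output", per_device_train_batch_size=4)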

# Initialize the DPO trainer with the policy model, the frozen reference
# model, the implicit-reward temperature beta, and a preference dataset
# in the prompt/chosen/rejected format shown above.
dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

Training and Monitoring:
Begin the training process with dpo_trainer.train(). Monitor the model’s performance using the reward metrics logged during training, such as rewards/chosen, rewards/rejected, rewards/accuracies, and rewards/margins.

dpo_trainer.train()
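Once training finishes, the fine-tuned policy can be saved like any other Hugging Face Trainer run; the output directory below is a placeholder. Among the logged metrics, rewards/accuracies roughly tracks how often the implicit reward of the chosen response exceeds that of the rejected one, and rewards/margins tracks the average gap between them.

dpo_trainer.save_model("dpo-output")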

Public Datasets for Preference Data

To create preference data for fine-tuning language models, several public datasets are available:

  1. OpenAI WebGPT Comparisons: This dataset provides comparisons, each consisting of a question, a pair of model answers, and human-rated preference scores for each answer (a sketch of converting it into the DPO format follows this list).
  2. OpenAI Summarization: This dataset offers text summarization examples, consisting of human-written responses and human-rated model responses.
  3. Reddit ELI5: Sourced from Q&A subreddits, this dataset contains questions, answers, and scores.
  4. Human ChatGPT Comparison Corpus (HC3): This dataset provides human and ChatGPT answers for various questions.
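As an illustration, here is a hedged sketch of mapping the WebGPT comparisons data into the prompt/chosen/rejected format expected by the DPO trainer. The Hub id openai/webgpt_comparisons and the column names (question, answer_0/answer_1, score_0/score_1) are taken from the dataset card and may need adjusting; tied comparisons are simply dropped.

from datasets import load_dataset

def to_dpo_format(example):
    # Treat the higher-scored answer as "chosen" and the other as "rejected".
    first_wins = example["score_0"] > example["score_1"]
    return {
        "prompt": example["question"]["full_text"],
        "chosen": example["answer_0"] if first_wins else example["answer_1"],
        "rejected": example["answer_1"] if first_wins else example["answer_0"],
    }

raw = load_dataset("openai/webgpt_comparisons", split="train")
# Drop ties, then map each comparison onto a DPO-style preference pair,
# keeping only the three columns the DPO trainer expects.
pairs = raw.filter(lambda ex: ex["score_0"] != ex["score_1"]).map(
    to_dpo_format, remove_columns=raw.column_names)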

Advantages of DPO over RLHF

Direct Preference Optimization (DPO) offers several advantages over Reinforcement Learning from Human Feedback (RLHF). DPO is stable, performant, and computationally lightweight, eliminating the need for reward model fitting and extensive hyperparameter tuning. Additionally, DPO has shown promising results in controlling sentiment, improving response quality in summarization and single-turn dialogue, and simplifying the implementation and training process.

Conclusion

Direct Preference Optimization (DPO) presents a new approach to fine-tuning language models, addressing the challenges faced by Reinforcement Learning from Human Feedback (RLHF). By formulating the reward maximization problem as a classification loss, DPO offers a stable and efficient method to align language models with human preferences. With the simplicity and effectiveness of DPO, the journey of training and optimizing language models becomes smoother and more efficient.
