Direct Preference Optimization of your LLM

Sharath S Hebbar
Feb 18, 2024

Direct Preference Optimization.

Introduction

Direct preference optimization (DPO) has emerged as a promising alternative to reinforcement learning from human feedback (RLHF) for aligning Large Language Models (LLMs) to human or AI preferences.

DPO fine-tunes an LLM so that the text it generates aligns with preferences or objectives provided by users. Unlike RLHF-style pipelines, which first fit a separate reward model on labeled preference data and then optimize the policy with reinforcement learning, DPO optimizes the model’s parameters directly on pairs of preferred and rejected responses.
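To build intuition before touching the code, the heart of DPO is a simple classification-style loss over preference pairs. The snippet below is a minimal sketch of that loss, not TRL's exact implementation; it assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected rewards to be large.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The beta term controls how far the fine-tuned model is allowed to drift from the reference model; it is the same beta we will pass to DPOTrainer below.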

Here we will discuss DPO on the open-source LLM SSH_355M, a supervised fine-tuned version of GPT-2 Medium that is freely hosted on the Hugging Face Hub.

The base model is available on Hugging Face as Sharathhebbar24/SSH_355M.

Datasets

Dataset preview

DPO expects the dataset to have two preference columns, chosen and rejected, so make sure your dataset matches the preview above. A sketch of a single example is shown below.
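Concretely, a training example in the format TRL's DPOTrainer consumes looks roughly like this (the prompt/chosen/rejected column names follow the TRL convention; the values are invented purely for illustration):

example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France does not have a capital city.",
}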

Code


!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl

import torch
import gc
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
)
from datasets import load_dataset
from trl import DPOTrainer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

def clean():
    """Free up GPU memory between runs."""
    gc.collect()
    torch.cuda.empty_cache()

clean()
# Select your model
model_name = "Sharathhebbar24/SSH_355M"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Dataset for DPO
dataset_name = "Intel/orca_dpo_pairs"
dataset = load_dataset(dataset_name, split="train")
num_rows = dataset.num_rows
print(dataset.to_pandas().head())
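# Note (assumption about column names): DPOTrainer expects "prompt", "chosen"
# and "rejected" columns. If your copy of the dataset exposes the prompt under
# a different name (e.g. "question"), rename it before training:
# dataset = dataset.rename_column("question", "prompt")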
# Training Arguments

batch_size = 2
max_steps = 100
training_args = TrainingArguments(
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=batch_size,
    gradient_checkpointing=True,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    max_steps=max_steps,
    save_strategy="no",
    logging_steps=50,
    output_dir="./models/dpo/",
    warmup_steps=max_steps // 4,
    fp16=True,
)
# Create DPO trainer
max_prompt_length = 512
max_length = 1024
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    beta=0.1,
    max_prompt_length=max_prompt_length,
    max_length=max_length,
)

# Fine-tune model with DPO
dpo_trainer.train()
new_model = "<Name of the model>"
HF_TOKEN = "<Your HF Token>"

tokenizer.push_to_hub(
    new_model,
    token=HF_TOKEN,
)

model.push_to_hub(
    new_model,
    token=HF_TOKEN,
)
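Once the push succeeds, you can run a quick sanity check by loading the model back through a text-generation pipeline. This assumes new_model has been replaced with the full repository id under your Hugging Face account and that the repo is accessible with your token:

from transformers import pipeline

pipe = pipeline("text-generation", model=new_model, tokenizer=new_model)
out = pipe("What is Direct Preference Optimization?", max_new_tokens=64)
print(out[0]["generated_text"])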

Reference

  1. GitHub: https://github.com/SharathHebbar/dpo_chatgpt2
  2. Paper: https://arxiv.org/pdf/2305.18290.pdf
  3. Dataset which can be directly used for DPO: Sharathhebbar24/orca_dpo_pairs
  4. Medium: https://medium.com/@sharathhebbar24/direct-preference-optimization-of-your-llm-be75f380b59e

