Fine-tune Llama 2 with SFT and DPO

Anchen
4 min read · Aug 13, 2023

In my previous article, we discussed how to fine-tune the LLaMA model using the QLoRA script. However, with the latest release of the Llama 2 model, which is considered a state-of-the-art open-source model, it's a good opportunity to fine-tune Llama 2 and see if we can improve our application's performance.

The previous QLoRA-based fine-tuning had a few drawbacks. The script is written for general use cases, so it is a bit complicated to read and understand. Additionally, it uses a seq2seq-style training dataset, which makes it difficult to fine-tune on custom data from unstructured documents.

So in this article, we are going to fine-tune the Llama 2 model via Hugging Face's TRL library, which offers a simplified API for running the training and aligns more closely with the APIs of the other Transformers libraries. It is a more developer-first approach to fine-tuning.

The open-source community currently fine-tunes LLMs in three main ways:
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning with Human Feedback (RLHF)
- Direct Preference Optimization (DPO)

I won't go through the RLHF method in this article because it involves a lot of details that would need to be explained and it is more focused on the research domain. Personally, I feel it is not required for fine-tuning LLMs for most of us.

Let's start with Supervised Fine-Tuning, which is the most common approach in the open-source community today. The main steps are to load the model in 4-bit and apply a PEFT config to the model for LoRA training. Then we use TRL's SFTTrainer to fine-tune the model. Here is a step-by-step guide on how to write the training script:

  1. Load the custom dataset and prepare the formatting function for processing your training data, which will be passed to your SFTTrainer as shown below.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("json", data_files="conversations.json", split="train")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"### Input: ```{example['prompt'][i]}```\n ### Output: {example['completion'][i]}"
        output_texts.append(text)
    return output_texts

trainer = SFTTrainer(
    ...
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    ...
)

Note: In theory, you could change formatting_prompts_func to train on raw document chunks instead of Q&A-style pairs. This would extend the knowledge of the fine-tuned model rather than teaching it an output pattern, but I haven't run any experiments on this yet.
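Such a formatting function could be as simple as returning the raw chunk text. Below is a minimal sketch, assuming a hypothetical `text` field in your JSON records; I haven't verified how well this works.

# Hypothetical: feed raw document chunks instead of Q&A pairs.
# Assumes each record has a "text" field containing the chunk.
def formatting_document_chunks_func(example):
    output_texts = []
    for i in range(len(example['text'])):
        output_texts.append(example['text'][i])
    return output_texts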

2. Load the model in 4-bit quantization.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "NousResearch/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, quantization_config=bnb_config)
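The `tokenizer` passed to the trainer in step 4 also needs to be loaded. Here is a minimal sketch; the pad-token assignment is a common workaround for Llama 2 (which ships without a pad token), not something from the original script.

from transformers import AutoTokenizer

# Llama 2 has no pad token by default; reusing the EOS token is a common workaround.
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token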

3. Create your PEFT LoRA configuration and get the PEFT model.

from peft import LoraConfig, get_peft_model

# Change the LoRA hyperparameters accordingly to fit your use case
peft_config = LoraConfig(
    r=128,
    lora_alpha=16,
    target_modules=find_all_linear_names(base_model),
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
base_model = get_peft_model(base_model, peft_config)
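The article does not show `find_all_linear_names`; it is a small helper that collects the names of all quantized linear layers so LoRA is applied to each of them. Here is a rough sketch based on the commonly used QLoRA version; treat it as an assumption, not the author's exact code.

import bitsandbytes as bnb

def find_all_linear_names(model):
    # Collect the module names of every 4-bit linear layer in the model.
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split('.')
            lora_module_names.add(names[-1] if len(names) > 1 else names[0])
    # The LM head is usually excluded from LoRA targets.
    lora_module_names.discard('lm_head')
    return list(lora_module_names)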

4. Initialize the training arguments and the trainer.

from transformers import TrainingArguments

# Parameters for training arguments details => https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L158
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    num_train_epochs=30,
    learning_rate=2e-4,
    bf16=True,
    save_total_limit=3,
    logging_steps=10,
    output_dir=output_dir,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)
trainer = SFTTrainer(
    base_model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
    formatting_func=formatting_prompts_func,
    args=training_args,
)
trainer.train()
trainer.save_model(output_dir)
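As an optional sanity check, you can load the saved adapter back and generate a sample completion. This is only a minimal sketch, assuming the adapter was saved to output_dir and the tokenizer from step 2 is still in scope.

from peft import AutoPeftModelForCausalLM

# Load the fine-tuned adapter on top of the base model in 4-bit.
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)
prompt = "### Input: ```for loop in js```\n ### Output: "
inputs = tokenizer(prompt, return_tensors="pt").to(trained_model.device)
outputs = trained_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))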

That's all for Supervised Fine-Tuning. I really appreciate the open-source community's effort to make this technology simple and accessible to most people.

After running the SFT training for a couple of epochs, I could already see that the Llama 2 model almost matched the performance I had with the LLaMA 1 33B model. Llama 2 is truly a significant improvement over LLaMA 1. However, the TRL library recently released the DPO trainer, which enables alignment with human preferences without RL. So why not apply DPO to our fine-tuned Llama 2 model?

The DPO pipeline consists of three main steps:
- A supervised fine-tuned model
- Annotating data with preference labels
- DPO training

Since we already have the supervised fine-tuned model, we just need to prepare the data annotated with preference labels. I used an approach similar to the one I used to generate the conversation data in my previous QLoRA article: I asked GPT-4 to synthetically generate the labeled data :) Here is the prompt I used:

I need to train an LLM to do grammar correction for a non-native English speaker. Could you provide some examples for the training dataset in JSON format? The following is the example conversation.
```
### Input: \`\`\`for loop in js\`\`\`
### Chosen: For loop in Javascript.
### Rejected: for (let i = 0; i < cars.length; i++) {
text += cars[i] + "<br>";
}
```
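For reference, each record in dpo_conversations.json should then carry the `input`, `chosen`, and `rejected` fields that the mapping function below expects; the values here are purely illustrative, taken from the example above.

[
  {
    "input": "for loop in js",
    "chosen": "For loop in JavaScript.",
    "rejected": "for (let i = 0; i < cars.length; i++) { text += cars[i] + \"<br>\"; }"
  }
]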

Once we have the dataset, we just need to process it to match the expected format of the DPO trainer. Here is an example of how we map the dataset to the desired format.

def return_prompt_and_responses(samples):
    return {
        "prompt": [
            f"### Input: ```{input}```\n ### Output: "
            for input in samples["input"]
        ],
        "chosen": samples["chosen"],
        "rejected": samples["rejected"],
    }

dataset = load_dataset("json", data_files="dpo_conversations.json", split="train")
original_columns = dataset.column_names
dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=original_columns,
)

Next, we need to load the model and reference model for the DPO training.

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)
model.config.use_cache = False
model_ref = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)

Finally, initialize the training arguments and pass the PEFT configuration to the DPO trainer.

from trl import DPOTrainer

peft_config = LoraConfig(
    r=128,
    lora_alpha=16,
    target_modules=find_all_linear_names(model),
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    peft_config=peft_config,
    beta=0.1,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=1024,
    max_length=2048,
)
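The snippet above reuses the training_args from the SFT step. To actually run the DPO training and save the resulting adapter, the calls mirror the SFT trainer:

# Run DPO training and save the adapter weights.
dpo_trainer.train()
dpo_trainer.save_model(output_dir)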

That's all you need for DPO training. However, I didn't see any significant performance improvement in my experiments. I hope the code in this article helps you try out DPO on your own dataset and get better results.

Conclusion
Supervised fine-tuning can give strong performance with just a few hundred examples. DPO, however, requires careful tuning of its hyperparameters and the labeled dataset; otherwise, it may end up with worse results than your pre-trained model.

PS: The full source code can be found here.

Credits:
https://huggingface.co/blog/dpo-trl
