Democratizing AI! Fine-Tuning LLaMA 2🦙: A Step-by-Step Instructional Guide
Many might consider ChatGPT the definitive AI story of 2023, but is that an accurate assessment? Credit must be given where it’s due: OpenAI’s ChatGPT has been recognized as the fastest-growing consumer app ever, amassing 100 million monthly users in just two months. It’s ChatGPT that catapulted Generative AI into the mainstream, igniting widespread interest in generative AI technologies. For context, it took Facebook four years to reach the same number of monthly users. While some advocate for a ‘slow and steady’ approach, might a ‘Llama’ 🦙 be the dark horse in this race?
Meta released the first iteration of LLaMA in February of this year, followed by the commercially licensed LLaMA 2 in July. As one of the first major openly available large language models (LLMs), LLaMA has fostered a burgeoning open ecosystem. Meta reports that the open-source community has fine-tuned and released over 7,000 LLaMA derivatives on Hugging Face. This democratization is pivotal, as previously only well-resourced companies and research institutions could afford to fine-tune or train such models.
The LLaMA models are designed to address this gap, offering a commercial license to broaden accessibility. Innovations now enable fine-tuning on consumer GPUs with limited memory, further democratizing AI. By removing barriers to access, even smaller businesses can create bespoke models suited to their unique needs and budget constraints.
In this tutorial, we will delve into LLaMA 2, guiding you through each step to fine-tune it on your custom dataset. But first, let’s clarify some essential terminology to lay the groundwork for what’s to come.
What is Fine-Tuning? PEFT? LoRA? QLoRA?
Fine-tuning every parameter across all layers of a model certainly delivers impressive results, but it’s also a major resource hog. It demands a hefty amount of GPU power and a fair bit of time.
That’s where PEFT, or Parameter Efficient Fine Tuning, enters the picture. This approach is far more economical in terms of resources and costs. Two key methods under PEFT are LoRA (Low Rank Adaptation) and its more efficient variant, QLoRA (Quantized LoRA). With both, the pre-trained model is loaded onto the GPU with its weights quantized (typically to 8-bit for LoRA setups and 4-bit for QLoRA), and only small low-rank adapter matrices are trained on top of the frozen base weights. It’s quite feasible to fine-tune the Llama 2–13B model this way on a standard 24GB consumer GPU. Notably, QLoRA is even more efficient, requiring less GPU memory and shortening fine-tuning time compared to LoRA.
The smart move is usually to start with LoRA or QLoRA (if resources are really tight), and then assess the performance. Full-scale fine-tuning should only come into play if these initial results aren’t up to par.
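To make the distinction concrete, here is a minimal sketch of the two quantization configurations you might pass when loading the base model with Hugging Face transformers and bitsandbytes; the full, working setup we actually use appears later in this tutorial.
import torch
from transformers import BitsAndBytesConfig

# LoRA is commonly paired with an 8-bit base model...
lora_8bit_config = BitsAndBytesConfig(load_in_8bit=True)

# ...while QLoRA loads the base model in 4-bit NF4, cutting memory further.
qlora_4bit_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Either config is passed as quantization_config= to
# AutoModelForCausalLM.from_pretrained(...); the small, trainable LoRA
# adapter matrices are then attached on top of the frozen quantized
# weights via the peft library.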
What is quantization?
Quantization is a method where you convert those hefty 32-bit floating-point model weights into more compact formats such as 16-bit floats, or 8, 4, 3, or even 2-bit integers. This nifty trick shrinks the model’s size and speeds up both fine-tuning and inference. It’s crucial when you’re working in environments with limited resources, like a single GPU, a Mac, or mobile devices (as in [https://github.com/ggerganov/llama.cpp]). Without quantization, tuning the model or running inference in these scenarios would be a tough nut to crack. If you’re looking to dive deeper into quantization, the PyTorch and Hugging Face Optimum quantization guides linked in the Resources section are good starting points.
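As a toy illustration of the idea (not the exact scheme bitsandbytes uses), here is what symmetric 8-bit quantization of a single weight matrix looks like, along with the memory it saves:
import torch

w = torch.randn(4096, 4096)                 # fp32 weights, 4 bytes per value
scale = w.abs().max() / 127                 # map the largest magnitude to 127
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_dequant = w_int8.float() * scale          # approximate reconstruction

print(w.element_size() * w.nelement() / 1e6)            # ~67 MB in fp32
print(w_int8.element_size() * w_int8.nelement() / 1e6)  # ~17 MB in int8
print((w - w_dequant).abs().max())                      # small rounding error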
Fine-Tuning LLaMA 2: A Step-by-Step Instructional Guide
1. Getting Started: Setting Up Your Environment
Let’s begin by setting up our environment. This involves installing the necessary libraries and importing the required modules to ensure a smooth fine-tuning process with LLaMA 2.
For this tutorial, I utilized a V100 High-RAM GPU, which provided robust performance for running the Google Colab notebook alongside the LLaMA 2-7B-chat model. Below are the commands to install the essential libraries:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 --progress-bar off
import json
import re
from pprint import pprint
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
2. Selecting and Cleaning our Dataset
For this tutorial, we’re utilizing the ‘DialogStudio’ collection from Salesforce, available on Hugging Face. This collection is a rich resource for conversational AI research; specifically, we’ll focus on the ‘TweetSumm’ subset, which pairs customer-support conversations from Twitter with human-written summaries.
The ‘TweetSumm’ dataset is structured as follows (for detailed information, please refer to the dataset page linked in the Resources section):
dataset = load_dataset("Salesforce/dialogstudio", "TweetSumm")
dataset
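To get a feel for the raw structure before preprocessing, you can peek at a single training record; this is a quick inspection sketch, and the field names match the columns we clean up below.
from pprint import pprint

sample = dataset["train"][0]
pprint(list(sample.keys()))   # dialog ids, "original dialog info", "log", "prompt", ...
pprint(sample["log"][0])      # first user utterance / agent response pair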
Data Pre-processing
Our first step is to preprocess the data, extracting meaningful conversations. The pre-processing involves creating a training prompt and cleaning the text to enhance the dataset’s usability. Here’s how we approach it:
We define a default prompt to guide the summarization task.
DEFAULT_SYSTEM_PROMPT = """
Below is a conversation between a human and an AI agent. Write a summary of the conversation.
""".strip()
def generate_training_prompt(
    conversation: str, summary: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}
### Input:
{conversation.strip()}
### Response:
{summary}
""".strip()
Cleaning and Formatting Text: We employ functions to clean and format the conversation text, removing unnecessary elements like URLs, user mentions, and extra spaces.
Creating Training Data: The key to our dataset preparation is generating summaries for each conversation. This involves piecing together the dialogues and their corresponding summaries, formatted for effective training.
def clean_text(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@[^\s]+", "", text)
    text = re.sub(r"\s+", " ", text)
    return re.sub(r"\^[^ ]+", "", text)


def create_conversation_text(data_point):
    text = ""
    for item in data_point["log"]:
        user = clean_text(item["user utterance"])
        text += f"user: {user.strip()}\n"
        agent = clean_text(item["system response"])
        text += f"agent: {agent.strip()}\n"
    return text


# Most important part here is getting the summaries
def generate_text(data_point):
    summaries = json.loads(data_point["original dialog info"])["summaries"][
        "abstractive_summaries"
    ]
    summary = summaries[0]
    summary = " ".join(summary)
    conversation_text = create_conversation_text(data_point)
    return {
        "conversation": conversation_text,
        "summary": summary,
        "text": generate_training_prompt(conversation_text, summary),
    }
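As a quick sanity check, here is what the helpers produce on a made-up tweet and conversation (illustrative strings only, not taken from the dataset):
# URLs, @mentions, and agent signatures such as "^JD" are stripped.
raw = "Hi @AppleSupport my phone won't charge https://t.co/abc123 ^JD"
print(clean_text(raw))

# A toy conversation/summary pair rendered into the training prompt format.
toy_conversation = "user: my phone won't charge\nagent: please DM us your device model"
toy_summary = "Customer reports a charging issue; the agent asks for the device model via DM."
print(generate_training_prompt(toy_conversation, toy_summary))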
Processing the Dataset
With our functions in place, we shuffle the original dataset and remove unnecessary columns to streamline it for training:
def process_dataset(data: Dataset):
    return (
        data.shuffle(seed=42)
        .map(generate_text)
        .remove_columns(
            [
                "original dialog id",
                "new dialog id",
                "dialog index",
                "original dialog info",
                "log",
                "prompt",
            ]
        )
    )
dataset["train"] = process_dataset(dataset["train"])
dataset["validation"] = process_dataset(dataset["validation"])
dataset
With our data now meticulously cleaned and prepared, we’re set to load the model for the fine-tuning process.
3. Model Loading and Training
Initializing the Hugging Face Environment. First, log in to Hugging Face to access your account and the gated LLaMA 2 model weights:
notebook_login()
Setting Up the LLaMA 2 Model
We proceed to load the LLaMA 2 model from Hugging Face. In this step, we use the bitsandbytes library for 4-bit quantization and define our tokenizer:
Configuring Model Parameters
We adjust the model’s configuration to disable caching and specify quantization settings:
Preparing the PEFT Configuration
We apply the PEFT (Parameter-efficient Fine-tuning) method with LoRA (Low-Rank Adaptation) configurations, optimizing the model for causal language modeling:
def create_model_and_tokenizer():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        use_safetensors=True,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    return model, tokenizer


model, tokenizer = create_model_and_tokenizer()
model.config.use_cache = False
model.config.quantization_config.to_dict()

lora_r = 16
lora_alpha = 64
lora_dropout = 0.1
lora_target_modules = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    "down_proj",
    "gate_proj",
    "v_proj",
]

peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)
Here we combine the 4-bit quantization settings with the LoRA configuration. The task type CAUSAL_LM means the model is trained as a standard causal language model, i.e., to predict the next token.
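If you want to see just how parameter-efficient this is, you can wrap a fresh copy of the model with the LoRA config and print the trainable-parameter count. This is an optional inspection step, not part of the original pipeline; note that SFTTrainer applies peft_config itself, so don't pass an already-wrapped model to it.
from peft import get_peft_model

# Wrap a separate copy for inspection only, then free it before training.
inspect_model, _ = create_model_and_tokenizer()
get_peft_model(inspect_model, peft_config).print_trainable_parameters()
# On Llama 2-7B with r=16 this reports on the order of tens of millions of
# trainable parameters out of roughly 7 billion total.
del inspect_model
torch.cuda.empty_cache()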
Setting the Training Arguments
Next, we set up our training arguments. These parameters are crucial for optimizing the training process, and since metrics are reported to TensorBoard, we can monitor both the training and evaluation loss to make sure training is proceeding effectively:
OUTPUT_DIR = "experiments"  # checkpoints and logs land here; any writable path works

training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)
Below is a rundown of the hyperparameters used above, plus a few common ones worth knowing about:
- output_dir: The directory where model checkpoints and logs are stored.
- num_train_epochs: Number of training epochs (we train for 2).
- fp16/bf16: Enable mixed-precision training; we use fp16=True (prefer bf16 on GPUs that support it).
- per_device_train_batch_size: Batch size per GPU for training (4 here).
- per_device_eval_batch_size: Batch size per GPU for evaluation.
- gradient_accumulation_steps: Number of forward/backward passes whose gradients are accumulated before each optimizer update; with a per-device batch size of 4 and 4 accumulation steps, the effective batch size is 16.
- gradient_checkpointing: Trades compute for memory by recomputing activations during the backward pass (not enabled here).
- max_grad_norm: Maximum gradient norm for gradient clipping (0.3).
- learning_rate: Initial learning rate (1e-4).
- weight_decay: Weight decay applied to all layers except bias and LayerNorm weights.
- optim: The optimizer; paged_adamw_32bit is a memory-efficient AdamW variant from bitsandbytes.
- lr_scheduler_type: Learning rate schedule (cosine here).
- max_steps: Total number of training steps; overrides num_train_epochs if set.
- warmup_ratio: Fraction of training steps used for a linear learning-rate warmup (0.05).
- group_by_length: Groups samples of similar length into the same batch, reducing padding and speeding up training.
- save_strategy: When to save checkpoints; we save at the end of each epoch.
- logging_steps, evaluation_strategy, eval_steps: How often to log and evaluate; we log every step and evaluate every 20% of the training steps (eval_steps=0.2).
Model Fine-Tuning Process
The fine-tuning process uses the SFTTrainer (Supervised Fine-Tuning Trainer) from the TRL library, which wraps the Hugging Face Trainer and handles the PEFT setup and dataset formatting for us:
Supervised fine-tuning (SFT) is also the first stage of Reinforcement Learning from Human Feedback (RLHF). The TRL library, offered by Hugging Face, features a user-friendly API that simplifies the creation and training of SFT models, so with just a few lines of code you can train on your dataset. The library includes a comprehensive toolkit for training language models via reinforcement learning, beginning with SFT, progressing through reward modeling, and culminating in Proximal Policy Optimization (PPO). In this tutorial we only need the SFT stage.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)

# Launch fine-tuning
trainer.train()
Training Loss Analysis
After training, we analyze the training loss to gauge the model’s performance and convergence:
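Everything the Trainer logged is kept in trainer.state.log_history, so besides the TensorBoard dashboard one simple way to look at the curves is to pull the metrics into a DataFrame (a minimal sketch):
import pandas as pd

history = pd.DataFrame(trainer.state.log_history)
print(history[["step", "loss"]].dropna().tail())        # training loss per step
print(history[["step", "eval_loss"]].dropna().tail())   # evaluation loss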
Saving the Model and Merging Adapters
Finally, we save the trained model and merge the QLoRA adapter for optimal performance:
trainer.save_model()
trainer.model
from peft import AutoPeftModelForCausalLM
trained_model = AutoPeftModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    low_cpu_mem_usage=True,
)
merged_model = trained_model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
4. Inference: Evaluating Model Performance
Preparing Test Examples: to assess our model’s effectiveness, we select five examples from the test dataset that the model has never encountered before:
def generate_prompt(
    conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}
### Input:
{conversation.strip()}
### Response:
""".strip()
examples = []
for data_point in dataset["test"].select(range(5)):
    summaries = json.loads(data_point["original dialog info"])["summaries"][
        "abstractive_summaries"
    ]
    summary = summaries[0]
    summary = " ".join(summary)
    conversation = create_conversation_text(data_point)
    examples.append(
        {
            "summary": summary,
            "conversation": conversation,
            "prompt": generate_prompt(conversation),
        }
    )
test_df = pd.DataFrame(examples)
test_df
Base Model Inference: initially, we perform inference using the base model to establish a benchmark:
model, tokenizer = create_model_and_tokenizer()
def summarize(model, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.0001)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)
The function above tokenizes the prompt, generates up to 256 new tokens, and decodes only the newly generated tokens, so the returned string contains just the summary.
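For example, we can generate a baseline summary for the first held-out conversation and compare it with the human-written reference:
example = test_df.iloc[0]
pprint(summarize(model, example.prompt))  # base-model summary (before fine-tuning)
pprint(example.summary)                   # human-written reference summary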
Observations from Base Model Inference
In the third example of inference with the base model, we notice that the output closely resembles the input, indicating inadequate summarization capabilities.
Enhanced Inference with the Fine-Tuned Model
Next, we switch to our fine-tuned model, which exhibits a marked improvement in performance:
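One way to run it is to load the merged weights we saved earlier and reuse the same summarize() helper; this is a sketch, assuming the merged model was written to "merged_model" as above, that enough GPU memory is free, and that the tokenizer is unchanged (it is the same Llama 2 tokenizer).
from transformers import AutoModelForCausalLM

# If GPU memory is tight, free the 4-bit base model first:
# del model; torch.cuda.empty_cache()
tuned_model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    torch_dtype=torch.float16,
    device_map="auto",
)

example = test_df.iloc[0]
pprint(summarize(tuned_model, example.prompt))  # fine-tuned summary
pprint(example.summary)                         # human-written reference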
Comparison and Results
Upon comparing the outputs, we observe that the fine-tuned model’s summaries are noticeably more accurate and coherent than the base model’s, showcasing the benefits of our training process.
Conclusion: Embracing Open Source AI with LLaMA 2
As we conclude this guide on fine-tuning LLaMA 2, it’s evident that the AI landscape of 2023 isn’t solely defined by high-profile projects like ChatGPT, but also by the groundbreaking advancements in open-source AI. LLaMA’s emergence as an open-source large language model has been a game-changer, democratizing AI by enabling a wide array of developers and smaller organizations to contribute to and benefit from advanced AI technologies. This shift towards open-source AI is not just a trend but a movement towards more collaborative and inclusive AI development, where resources and knowledge are shared, empowering a larger community to innovate and grow.
In this journey, we’ve explored the potential of LLaMA 2, demonstrating that even with limited resources, one can achieve significant advancements. The techniques of PEFT, LoRA, and QLoRA exemplify efficient fine-tuning, making sophisticated AI models more accessible and adaptable. The essence of 2023’s AI story, therefore, lies in the fusion of technological advancement with the spirit of open-source collaboration, paving the way for a future where AI is more accessible, inclusive, and innovative.
Link to notebook:
Resources
- https://ai.meta.com/llama/
- https://ai.meta.com/llama/get-started/
- https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
- https://github.com/ggerganov/llama.cpp
- https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
- https://huggingface.co/datasets/financial_phrasebank
- https://www.datacamp.com/tutorial/fine-tuning-llama-2
- https://huggingface.co/docs/optimum/concept_guides/quantization
- https://huggingface.co/datasets/Salesforce/dialogstudio/viewer/TweetSumm
- https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html
- https://venturebeat.com/ai/forget-chatgpt-why-llama-and-open-source-ai-win-2023/
- https://forums.developer.nvidia.com/t/a100-vs-v100-for-ml-training/171230
- https://huggingface.co/blog/stackllama
- https://huggingface.co/docs/trl/index
- https://github.com/facebookresearch/llama-recipes
- https://cloud.google.com/model-garden
- https://github.com/huggingface/peft
- https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard