Finetuning Large Language Models: Customize Llama 3 8B For Your Needs

Milos Zivic
9 min read · Apr 30, 2024


Images generated by LLMs regarding LLM finetuning (GPT: left and middle, Gemini: right)

Since its release in November 2022, ChatGPT has sparked widespread discussion about the capabilities of Large Language Models (LLMs) and AI in general. It’s now rare to find someone who hasn’t heard of ChatGPT or experimented with it. While tools like GPT, Gemini, or Claude are incredibly powerful, with hundreds (if not thousands) of billions of parameters and pretrained on vast corpora of text, they are not omnipotent. There are specific tasks where these models fall short. However, we are not without solutions for these tasks. We can harness the power of LLMs using smaller open-source models, adapting them to our specific problems.

This blog aims to provide a brief overview of a few smaller open-source LLMs and explain two crucial concepts for LLM finetuning: Quantization and LoRA. Additionally, we’ll introduce a couple of the most popular libraries for finetuning, along with code examples, so you can quickly apply these concepts to your use case. Let’s dive into finetuning.

If you would like to skip the theory and go straight to the code, jump ahead to the Unsloth section.

Table of Contents

  1. “Small” Large Language Models
  2. Quantization
  3. Low-Rank Adaptation (LoRA)
  4. Unsloth
  5. Supervised Finetuning Trainer (SFT)
  6. Odds Ratio Preference Optimization (ORPO)
  7. Conclusion
  8. References

“Small” Large Language Models

Llama 3 comparison to other models

Finetuning LLMs can be prohibitively expensive, especially for models with a high number of parameters. As a rule of thumb, models under 10 billion parameters can usually be finetuned without significant infrastructure challenges. Larger models such as Llama 3 70B, however, demand substantial resources: full finetuning requires approximately 1.5 terabytes of GPU vRAM, the equivalent of a cluster of roughly 20 Nvidia A100s with 80GB of vRAM each. Such a setup costs around $400,000, assuming the hardware is even available.

Alternatively, one can use cloud providers such as AWS, Azure, or GCP, but this approach is also costly. For instance, an instance with 8 Nvidia A100 GPUs on AWS costs around $40 per hour, or $5 per GPU-hour. Finetuning the 70B model on 20 GPUs for 5 days (120 hours at roughly $100 per hour) would therefore cost approximately $12,000.
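
To sanity-check these figures, here is a rough back-of-envelope calculation. The roughly 20 bytes per parameter (weights, gradients, Adam optimizer states, and activation overhead for full finetuning) and the $5 per GPU-hour rate are assumptions used only to illustrate the arithmetic, not exact vendor numbers:

# Rough back-of-envelope estimates for the numbers above.
params = 70e9                # Llama 3 70B
bytes_per_param = 20         # assumed overhead for full finetuning in mixed precision
vram_tb = params * bytes_per_param / 1e12
print(f"Estimated vRAM: {vram_tb:.1f} TB")       # ~1.4 TB, in line with the ~1.5 TB above

gpus = 20                    # A100 80GB cards needed to hold ~1.5 TB
gpu_hour_price = 5           # $40/hour for an 8-GPU instance => $5 per GPU-hour
hours = 5 * 24               # 5 days of training
print(f"Estimated cloud cost: ${gpus * gpu_hour_price * hours:,.0f}")  # $12,000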

Due to these costs, most practitioners primarily use smaller LLMs with fewer than 10 billion parameters. These models can be trained far more affordably, requiring only 16GB to 24GB of vRAM (the higher end allowing larger batch sizes and faster training). For example, I finetuned Mistral 7B to the Serbian language using an Nvidia A10 on AWS, which took less than 10 hours and cost less than $20.

Of course, even a 7B model could not fit and be trained within that much vRAM without quantization, specifically 4-bit quantization.

Quantization

With full 32-bit parameters, training even a 7B model would require a ridiculous (by mortals' standards) amount of vRAM: somewhere around 150GB.

Converting FP32 to INT8

Quantization offers a solution by converting model parameters to low-precision data types, such as 8-bit or 4-bit, significantly reducing memory consumption and improving execution speed. The concept is straightforward: all possible 32-bit values are mapped to a smaller range of finite values (e.g., 256 for 8-bit conversion). This process can be visualized as grouping high-precision values around a few fixed points that represent the values in their vicinity.
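
As a minimal illustration of the idea, here is a simple absmax (symmetric) INT8 quantization in NumPy. This is a toy sketch, not the exact scheme used by bitsandbytes or other quantization libraries:

import numpy as np

def quantize_int8(x: np.ndarray):
    """Absmax symmetric quantization: map FP32 values onto 256 INT8 levels."""
    scale = np.abs(x).max() / 127.0                      # one scale factor per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(q, scale)).max()
print(f"max reconstruction error: {error:.4f}")          # small, but nonzero

# Memory per parameter: FP32 = 4 bytes, INT8 = 1 byte, 4-bit = 0.5 bytes,
# which is why 4-bit quantization lets a 7B model fit in 16-24GB of vRAM.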

Low-Rank Adaptation (LoRA)

LoRA is a technique for updating a model’s weights through matrix dimensionality reduction. It is particularly relevant because transformers, which LLMs are built on, rely heavily on matrix multiplications. A detailed low-level explanation of how transformers work can be found in Jay Alammar’s blog post, The Illustrated Transformer [5].

Regular Finetuning

When updating the weights of a model, it’s necessary to adjust the parameters within these matrices. Conceptually, this adjustment can be seen as adding a weight update matrix to the original matrix: W’ = W + ΔW. LoRA introduces a novel approach by decomposing this update matrix into two smaller matrices that, when multiplied, approximate the update matrix. During finetuning, instead of creating and then decomposing the update matrix, LoRA directly creates these two smaller matrices for multiplication.

An illustrative comparison between regular finetuning and finetuning with LoRA can be seen in the images below, adapted from Sebastian Raschka’s blog post.

An alternative formulation of regular finetuning (left) compared with finetuning using LoRA (right)

The key benefit of LoRA is that while the approximation is slightly less precise, it significantly improves memory and computational efficiency. For instance, consider a matrix with 1000x1000 parameters, totaling 1 million parameters. By using the decomposed (and slightly less precise) version, a 1000x100 matrix multiplied by a 100x1000 matrix, the parameter count drops to 2 × 100,000 = 200,000, an 80% reduction in parameters.
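
A minimal numerical sketch of this idea in plain NumPy (not the actual PEFT implementation; the 1000x1000 matrix and rank 100 mirror the example above):

import numpy as np

d, r = 1000, 100                   # hidden size and LoRA rank from the example above

W = np.random.randn(d, d)          # frozen pretrained weights: 1,000,000 parameters
B = np.zeros((d, r))               # trainable low-rank factors: 2 * 100,000 parameters
A = np.random.randn(r, d) * 0.01

delta_W = B @ A                    # the (approximate) weight update ΔW
W_adapted = W + delta_W            # W' = W + ΔW, applied when the adapter is merged

lora_params = A.size + B.size
print(f"trainable params: {lora_params:,} "
      f"({1 - lora_params / W.size:.0%} fewer than full finetuning)")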

Quantization and LoRA are often used in combination, forming what is known as QLoRA.

Unsloth

If I were to begin LLM finetuning anew, I would opt for the Unsloth Python library. Unsloth offers a variety of optimizations tailored for LLM finetuning and supports a wide array of popular LLMs, including Mistral, Llama 3, Gemma, and others. For example, their free tier encompasses 12 distinct finetuning optimizations for Mistral, providing a notable 2.2x acceleration.

Unsloth finetuning optimizations

Below are snippets of code demonstrating how to finetune Llama 3 8B using the Unsloth library. All these blocks of code are taken from the Unsloth GitHub, and the full notebook for finetuning Llama 3 8B can be found here.

Import the model in 4-bit:

from unsloth import FastLanguageModel

max_seq_length = 2048  # choose any; Unsloth handles RoPE scaling internally
dtype = None           # None = auto-detect (float16 or bfloat16 depending on GPU)
load_in_4bit = True    # 4-bit quantization to reduce memory usage

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Set up LoRA:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Initialize Hugging Face’s Supervised Finetuning Trainer:

import torch
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,  # training dataset prepared earlier in the notebook
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Train the model:

trainer_stats = trainer.train()

Supervised Finetuning Trainer (SFT)

After pretraining an LLM, the next critical step is supervised finetuning. This process is essential for developing a model that can understand and generate coherent responses, rather than simply completing sentences.

Tools like SFT (Supervised Finetuning Trainer) and PEFT (Parameter Efficient Finetuning) from Hugging Face, as well as BitsAndBytes by Tim Dettmers, significantly simplify the process of applying techniques such as LoRA, quantization, and finetuning to models. These libraries streamline the implementation of advanced optimization methods, making them more accessible and efficient for developers and researchers alike.

Below, you’ll notice that the code for Unsloth, SFT, and ORPO is quite similar. This is because the underlying ideas are largely the same; the differences lie mainly in the library APIs and in some of the hyperparameters.

Import the model in 4-bit:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hugging Face model id (the second assignment overrides the first; keep the one you want)
model_id = "meta-llama/Meta-Llama-3-8B"
model_id = "mistralai/Mistral-7B-v0.1"

use_flash_attention2 = True  # set to False if flash-attn is not installed

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if use_flash_attention2 else torch.float16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    token=os.environ["HF_TOKEN"],  # if model is gated like llama or mistral
    attn_implementation="flash_attention_2" if use_flash_attention2 else "sdpa"
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=os.environ["HF_TOKEN"],  # if model is gated like llama or mistral
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Set up LoRA:

from peft import LoraConfig, prepare_model_for_kbit_training

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ]
)

# Prepare model for training
model = prepare_model_for_kbit_training(model)

Initialize Hugging Face’s Supervised Finetuning Trainer:

from peft import get_peft_model
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="mistral-int4-alpaca",
    num_train_epochs=1,
    per_device_train_batch_size=6 if use_flash_attention2 else 2,  # adjust to your hardware
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=use_flash_attention2,
    fp16=not use_flash_attention2,
    tf32=use_flash_attention2,
    max_grad_norm=0.3,
    warmup_steps=5,
    lr_scheduler_type="linear",
    disable_tqdm=False,
    report_to="none"
)

model = get_peft_model(model, peft_config)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,               # dataset prepared earlier in the notebook
    peft_config=peft_config,
    max_seq_length=2048,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,  # prompt-formatting function defined in the notebook
    args=args,
)

Train the model:

trainer.train()

The full notebook can be found here.

Odds Ratio Preference Optimization (ORPO)

So far, this blog post has focused on the supervised finetuning of pretrained LLMs. However, there’s another crucial step that all SOTA LLMs undergo: preference alignment. This step occurs after pretraining and finetuning, where you inform the model which generated outputs are desirable and which are not. Popular methods for preference alignment include Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).

A new method called Odds Ratio Preference Optimization (ORPO) emerged in March 2024, combining supervised finetuning and preference alignment.

Traditional LLM finetuning vs ORPO LLM finetuning

For a detailed explanation of ORPO, including code examples and an overview, refer to Maxime Labonne’s insightful blog post.

Here we have part of the code for finetuning and preference alignment using ORPO. The full code is available here.

Import the model in 4-bit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Use bfloat16 and FlashAttention on Ampere or newer GPUs, otherwise fall back
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"

# Model
base_model = "meta-llama/Meta-Llama-3-8B"
new_model = "OrpoLlama-3-8B"

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

Set up LoRA:

from peft import LoraConfig, prepare_model_for_kbit_training

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

model = prepare_model_for_kbit_training(model)

Initialize Hugging Face’s ORPO Trainer:

from trl import ORPOConfig, ORPOTrainer

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    beta=0.1,
    lr_scheduler_type="linear",
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="wandb",
    output_dir="./results/",
)
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],   # preference dataset prepared earlier in the notebook
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)

Train the model:

trainer.train()

Conclusion

While Large Language Models (LLMs) like GPT, Gemini, or Claude are powerful, their large size and resource requirements make them impractical for many tasks. To address this, smaller open-source LLMs can be finetuned and customized for specific needs using techniques like Quantization and Low-Rank Adaptation (LoRA). These techniques reduce memory consumption and improve computational efficiency, making it more affordable to train models, especially ones with fewer than 10B parameters.

Tools like Unsloth, Supervised Finetuning Trainer (SFT), and Odds Ratio Preference Optimization (ORPO) simplify the finetuning process and make it more accessible. Unsloth, for example, offers optimizations that can significantly accelerate training, while ORPO combines supervised finetuning with preference alignment to improve model performance.

By leveraging these techniques and tools, developers and researchers can tailor LLMs to their specific needs without the prohibitive costs associated with training large models. This approach democratizes access to advanced language models and enables a wide range of applications across different domains.

References

[1] Model Memory Utility: https://huggingface.co/spaces/hf-accelerate/model-memory-usage

[2] AWS EC2 P4d Instances Pricing: https://aws.amazon.com/ec2/instance-types/p4/#:~:text=to%20learn%20more%20%C2%BB-,Product%20details,-Instance%20Size

[3] Detailed Explanation of Quantization: https://huggingface.co/docs/optimum/concept_guides/quantization

[4] Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT: https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/

[5] The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/

[6] LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685

[7] Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA): https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html

[8] QLoRA: Efficient Finetuning of Quantized LLMs: https://arxiv.org/abs/2305.14314

[9] Unsloth: https://unsloth.ai/

[10] Supervised Fine-tuning Trainer: https://huggingface.co/docs/trl/sft_trainer

[11] ORPO: Monolithic Preference Optimization without Reference Model: https://arxiv.org/abs/2403.07691

[12] Fine-tune Llama 3 with ORPO: https://huggingface.co/blog/mlabonne/orpo-llama-3
