How to finetune Llama 3 on consumer GPUs

Sascha Gstir
Published in Pondhouse Data
Jul 2, 2024 · 16 min read

Large Language Models (LLMs) have proven themselves as powerful tools capable of understanding and generating human-like text across a wide range of applications. For quite some time, closed-source models were the way to go and the clear winners in terms of performance. However, with the release of powerful open-source models like the Llama 3 series or many of the Mistral models, this has changed. These models are not only free to use and already very powerful on their own, but they can also be fine-tuned for your specific use case. On such specialized tasks, a well fine-tuned open model can therefore even outperform closed-source models.

Fine-tuning allows you to adapt a pre-trained LLM to your particular applications, improving its performance on targeted tasks without the need for training a model from scratch. This process can lead to significant improvements in accuracy, relevance, and efficiency for specialized applications.

In this blog post we’re going to introduce the fine-tuning process, using a Llama 3 model as our example. We’ll cover everything from the basics of what fine-tuning entails, to setting up the required training environment, creating your datasets for fine-tuning, executing the fine-tuning process, and finally using your newly created model variant for text inference.

What is fine-tuning and why do we want to fine-tune LLMs?

Model fine-tuning is a process in machine learning where a pre-trained model is further trained on a specific dataset or task to improve its performance in a particular domain. In the context of Large Language Models (LLMs), fine-tuning allows us to adapt these powerful, general-purpose models to specialized applications without the need to train a new model from scratch.

This process adjusts the model’s parameters to better fit the new data and task requirements. The key idea is to leverage the knowledge and patterns learned by the model during its initial training on vast amounts of data, and then refine this knowledge for a more focused application.

Why is Fine-Tuning Important?

  • Improved Performance: Fine-tuning can significantly enhance a model’s performance on specific tasks or domains, often surpassing both the original pre-trained model and smaller models trained from scratch on the specific task.
  • Resource Efficiency: Training LLMs from scratch requires enormous computational resources and vast amounts of data. Fine-tuning allows us to achieve excellent results with much less data and computational power.
  • Customization: Fine-tuning enables the adaptation of general-purpose models to niche domains or specialized tasks, making them more relevant and accurate for specific use cases.
  • Faster Development: Instead of spending months or years developing and training a new model, fine-tuning allows researchers and developers to create high-performing, task-specific models in a matter of days or weeks.
  • Overcoming Limitations: Fine-tuning can help address some limitations of pre-trained models, such as biases or outdated information, by introducing new, carefully curated data.
  • Continuous Learning: As new data becomes available or requirements change, models can be periodically fine-tuned to stay up-to-date and relevant.

When to Consider Fine-Tuning

  • When you have a specific task or domain that differs from the general use case of the pre-trained model.
  • When you have a dataset that represents your specific use case or contains information not present in the original training data.
  • When you need to improve the model’s performance on particular types of inputs or outputs.
  • When you want to adapt the model’s style, tone, or domain-specific language.

Now that we understand the importance and benefits of fine-tuning LLMs, let’s dive into the practical aspects of LLM fine-tuning.

Fine-tuning concept

Setting up your environment for fine-tuning LLMs

Before engaging in the fine-tuning process itself, let’s get our system ready. This chapter outlines our hardware as well as software requirements and guides you step by step through the installation process. This ensures you have all the necessary tools and resources to efficiently fine-tune Large Language Models.

As this chapter sounds quite extensive, let me assure you up front that it’s not. We only need the following tools and accounts:

  1. A Hugging Face account to download the base models and (optionally) some training datasets.
  2. A handful of Python packages to run the training process. These are pytorch, transformers, datasets, accelerate, peft, huggingface_hub, bitsandbytes, and trl.
  3. Hardware: While it’s theoretically possible to fine-tune LLMs on CPUs, as of the time of this writing, using a CUDA-enabled NVIDIA GPU is highly recommended, as it speeds up the training process significantly.

It’s possible to use AMD, Intel, or even Apple GPUs for model fine-tuning. However, due to NVIDIA’s dominance in the field of AI, most tools are optimized for NVIDIA hardware, so our guide assumes NVIDIA GPUs.

Hardware Requirements

The main requirement for fine-tuning LLMs is the GPU, and the limiting factor there is the amount of VRAM. The model itself needs to fit into GPU memory, with some headroom left for the adapters and the training process.
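To get a rough feeling for the numbers, you can estimate the memory needed just for the model weights from the parameter count and the precision you load the model in. The following is a minimal back-of-the-envelope sketch; the numbers are approximations and ignore adapters, optimizer states, and activations, which need additional headroom:

def estimate_weight_vram_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Rough VRAM needed just for the model weights, in GB."""
    bytes_per_weight = bits_per_weight / 8
    return num_params_billion * 1e9 * bytes_per_weight / 1024**3

# Llama 3 8B, weights only, in different precisions:
print(estimate_weight_vram_gb(8, 16)) # ~14.9 GB in bfloat16
print(estimate_weight_vram_gb(8, 4))  # ~3.7 GB with 4-bit quantization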

If you have a GPU with the Ampere architecture or later (like the NVIDIA RTX 4090, RTX 3090 or A10), you can also benefit from Flash Attention. Flash Attention is a technique that both reduces the memory footprint of the training process and drastically speeds it up. (In short: it rearranges parts of the attention computation to better utilize the GPU’s architecture.)

If you are not one of the lucky people with your own GPUs, use one of the many GPU cloud providers. Two we use regularly and can recommend are runpod.io and latitude.sh.

Setting up your Hugging Face account

  1. Navigate to the Hugging Face sign up page.
  2. Use the form to create a new account.
  3. After being logged in to your new account, navigate to Settings -> Access Tokens
  4. Create a new access token. Note the token, as we’ll need it in a minute.

Installing the python packages

The final preparation step is to install the Python packages. The following steps assume Python 3.11.

First, make sure you have a virtual environment set up, using either venv or Anaconda. We regularly use the latter and therefore recommend it.

  1. Activate your virtual environment
  2. Install the packages:
pip install \
"torch==2.2.2" \
tensorboard \
"transformers==4.40.0" \
"datasets==2.18.0" \
"accelerate==0.29.3" \
"bitsandbytes==0.43.1" \
"huggingface_hub==0.22.2" \
"trl==0.8.6" \
"peft==0.10.0"

Note: Make sure to pin the library versions, as the Python ecosystem is known for introducing incompatibilities when upgrading minor versions. Better safe than sorry…

As a last step, let’s initialize our Hugging Face CLI:

huggingface-cli login --token "<your-token>"
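Alternatively, if you prefer to stay inside Python (for example in a notebook), the huggingface_hub package exposes the same login programmatically. A minimal sketch, assuming you paste the access token created earlier:

from huggingface_hub import login

# Log in to the Hugging Face Hub with your access token
login(token="<your-token>")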

That’s it, we’re done.

What did we actually install?

Let’s quickly introduce the libraries we installed — to get a little more familiar with the tools we use:

  • torch: PyTorch is an open-source machine learning library that provides a flexible and efficient framework for building and training all sorts of machine learning and deep learning models.
  • tensorboard: TensorBoard is a visualization toolkit for machine learning experimentation. It enables tracking and visualizing metrics such as loss and accuracy, model architecture, and more.
  • transformers: The Hugging Face Transformers library provides a wide range of pre-trained models for natural language processing tasks, as well as tools for fine-tuning and deploying these models.
  • datasets: The Hugging Face Datasets library provides a collection of datasets for natural language processing tasks, as well as tools for loading, processing, and interacting with these datasets.
  • accelerate: The Hugging Face Accelerate library provides tools for distributed training and mixed-precision training of deep learning models.
  • bitsandbytes: The bitsandbytes library provides CUDA-based quantization routines and 8-bit optimizers. We use it to load the model in 4-bit precision.
  • huggingface_hub: The Hugging Face Hub library provides tools for sharing, discovering, and using models and datasets from the Hugging Face hub.
  • trl: The Hugging Face TRL (Transformer Reinforcement Learning) library provides tools for training and aligning transformer models, including supervised fine-tuning, reward modeling, and reinforcement learning. Its SFTTrainer offers a convenient interface for fine-tuning.
  • peft: The Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library provides methods such as LoRA for adapting large pretrained models to various downstream applications without fine-tuning all of a model’s parameters.

How to create a dataset for fine-tuning LLMs?

The main question many beginners in the fine-tuning space have is, what data do I need to fine-tune my model? And how specifically do I need to prepare the dataset?

While it’s hard, if not impossible, to give a general answer on exactly what data you need, and especially how much of it, there are some general guidelines which apply:

  • Domain-specific data: If you want to fine-tune a model for a specific domain, you should gather data that is relevant to that domain. For example, if you are fine-tuning a model for legal text, you should use legal documents. If you are fine-tuning a model for medical text, you should use medical records.
  • Task-specific data: If you are fine-tuning a model for a specific task, you should gather data that is relevant to that task. For example, if you are fine-tuning a model for sentiment analysis, you should use a dataset with examples of the sentiments you want to find.
  • Diverse examples: It’s important to have a diverse set of examples in your dataset. This helps the model learn to generalize to new examples. This last part is especially important. You want the model to learn to generalize to new examples.
  • Quality before quantity: It’s better to have a smaller dataset with high-quality, diverse examples than a large dataset with potentially wrong or repetitive examples. LLMs have been shown to fit to specific examples very quickly. This means: (a) the models can very quickly be “destroyed” if they see wrong examples (as they might “remember” them forever), but also (b) models learn very quickly from good examples. We have created successful fine-tunes with just 100 rows of high-quality data.

Now, how do you get this data?

  1. Create it manually: Still one of the best approaches, leading to the highest quality, but also the most time-consuming.
  2. Use LLMs to generate synthetic fine-tuning data: Many LLMs are nowadays trained on synthetic data. LLMs can generate quite good datasets when instructed well. This can be useful when you want to extend a small, manually created dataset, or when you want to fine-tune your model for a field the other model is already good at (e.g. using GPT-4 to make smaller models better).
  3. Another approach to synthetic data generation is to use your internal documentation, emails, or other text sources as grounding material in the generation prompt. This way, the LLM which generates your dataset grounds its output in your very own language, domain and truth. (See the sketch after this list.)
  4. Use existing, public datasets: There are hundreds of thousands of datasets on the Hugging Face dataset hub. Using one of them might be a good starting point, especially for very general-purpose applications like text-to-SQL. There is not much sense in creating your own when established ones are already available.
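To make options 2 and 3 concrete, here is a minimal sketch of synthetic data generation with the openai Python package. The model name, prompt, and file names are illustrative assumptions only; adapt them to your own documents and provider:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

document = "Paste an excerpt of your internal documentation here."

# Ask a strong model for a question/answer pair grounded in the document
completion = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": "You create fine-tuning data. Given a document, "
         "return JSON with the keys 'question' and 'answer'."},
        {"role": "user", "content": document},
    ],
    response_format={"type": "json_object"},
)
pair = json.loads(completion.choices[0].message.content)

# Append the example to a JSONL training file (conversational format, see below)
with open("dataset_training.json", "a") as f:
    row = {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]}
    f.write(json.dumps(row) + "\n")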

Format of a fine-tuning dataset

With the introduction of the remarkable trl library, formatting fine-tuning datasets got much easier. Before showing the formats, let's consider one last time what we want to do: We want to teach a model to understand a specific domain or task and generate correct answers based on our input question or prompt.

This gives a hint on how to structure the dataset: We need a “question” and we need an “answer”.

Specifically, with trl there are two ways to implement this format:

  1. The conversational format:
{"messages": [{"role": "system", "content": "Your system message"}, {"role": "user", "content": "Your first prompt"}, {"role": "assistant", "content": "Your first answer"}]}
{"messages": [{"role": "system", "content": "Your system message"}, {"role": "user", "content": "Your second prompt"}, {"role": "assistant", "content": "Your second answer"}]}
{....}

2. The instruction format:

{"prompt": "Your first prompt", "completion": "Your first answer"}
{"prompt": "Your second prompt", "completion": "Your second answer"}
{....}

As you may notice, these formats are very close to what the LLM APIs expect during subsequent inference.

In the first format, we assume a chatbot-like application and basically create artificial conversations. In the second format, we assume an instruction -> answer type of system and therefore simply create question/answer pairs.

In summary, decide on one format or the other and make sure each “row” of your dataset is formatted as defined above.
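Before starting a training run, it also pays off to quickly validate the file. A minimal sketch (the file name is just an example) that checks every line of a conversational-format JSONL file:

import json

with open("dataset_training.json") as f:
    for i, line in enumerate(f, start=1):
        row = json.loads(line)  # raises if the line is not valid JSON
        roles = [m["role"] for m in row["messages"]]
        # Every row should contain at least one prompt and one answer
        assert "user" in roles and "assistant" in roles, f"row {i} is incomplete"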

What about test data?

An important question is how to split training and test data. While this again depends on the use case, it’s best to dedicate 10–20% of your data to testing/validation. Either manually assign the rows to training or testing (make sure to distribute the data “uniformly”), or simply use a random split. The latter works well for larger datasets; the smaller the dataset gets, the more you want to create the split manually.

It’s best to save both the test and training dataset as separate files on your file system. Especially for larger datasets, keeping them just in memory might get on your nerves, as you might need to restart your Python environment while setting up the training process.
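A minimal sketch of a random split using the datasets library, writing both parts back to disk (the file names are just examples):

from datasets import load_dataset

# Load the full JSONL dataset and split off 15% for testing
dataset = load_dataset("json", data_files="dataset_full.json", split="train")
split = dataset.train_test_split(test_size=0.15, seed=42)

# Persist both parts as separate JSONL files
split["train"].to_json("dataset_training.json")
split["test"].to_json("dataset_test.json")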

Hands-on: Execute the fine-tuning process

We’re almost there, but we have to cover one theoretical aspect first: Most models are just enormous. ENORMOUS. Fine-tuned naively, they would require hundreds of GBs of VRAM. However, thanks to the enormous efforts of the open-source community, we have some options here. And, as experience shows, these options don’t reduce the quality of the fine-tuned result in a significant way!

  1. QLoRA: QLoRA was first introduced in May 2023 and was a major breakthrough in the field of fine-tuning. Quoting from the paper: QLoRA is an efficient fine-tuning approach that reduces memory usage enough to fine-tune a 65B parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes.
    As we’ll see in a second, while QLoRA provides outstanding results, Hugging Face did a similarly great job in integrating QLoRA into their transformers library. Therefore, we can use it without much hassle.
    For more information on QLoRA, check the Hugging Face intro.
  2. Model quantization with bitsandbytes: While QLoRA reduces the memory needed for fine-tuning, the model weights themselves are still as big as they are. Using bitsandbytes, we can quantize the model to a lower floating-point precision.
    What does this mean? In short: we reduce the model’s size by reducing the number of bits used to represent its weights. This shrinks the model’s memory footprint and speeds up the training process. It can theoretically also reduce the model’s performance; in practice, however, this is surprisingly rare.
Floating Point formats (from https://huggingface.co/blog/4bit-transformers-bitsandbyte)

Read more about both of the optimizations we are using in this phenomenal blog post from Hugging Face.

Now that we have this out of the way, let’s demonstrate the fine-tuning process.

First, import the required modules.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, AutoPeftModelForCausalLM
from trl import setup_chat_format, SFTTrainer

Then, load the training dataset. Make sure it’s in the JSONL format discussed above.

train_data = load_dataset("json", data_files="dataset_training.json", split="train")

Next, let’s load the model with 4-bit quantization, using bitsandbytes:

model_id = "meta-llama/Meta-Llama-3-8B" # The model to fine-tune, Hugging Face id

quantization = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization,
    attn_implementation="flash_attention_2", # Use 'eager', if not NVIDIA ampere or later architecture
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'

If you are using the “conversational format” for your training data (which is most likely the case), you need to add special tokens which the base LLM is usually not yet aware of. These tokens mark the start and end of a conversation, as well as the messages of the different roles (system, user, assistant). trl provides a convenient function for that:

model, tokenizer = setup_chat_format(model, tokenizer)

Next, we define our QLoRA configuration.

Please check the tips in the line comments. The LoRA config is not the same for every model, or even every dataset; it needs to be found experimentally. Try different values as indicated in the comments and compare the model results.

qlora = LoraConfig(
    lora_alpha=8, # LoRA scaling factor. Set to either 8, 16, 32, 64 or 128
    r=16, # Rank, set to 16, 32, 64, 128 or 256. 16 seems best for most applications
    lora_dropout=0.05, # From QLoRA paper. Keep at 0.05
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

The LoRA parameters explained

  • lora_alpha: Defines the scaling of the low-rank matrices — it basically tells how much influence the low-rank matrices have for the fine-tuning process. Theoretically, the higher this value, the more “new” knowledge will influence the model. The lower the value, the more the already existing knowledge will prevail.
  • r: Defines the mathematical rank of the low-rank matrices, which defines the size of the matrices. The higher the number, the bigger the size of the matrices, the longer fine-tuning takes. But, bigger numbers also mean, that the influence of the low-rank adapters to the final model are higher. Therefore, this is again a parameter where you want to balance efficiency with model performance. In general, our experience shows that a rank of 16 produces good results for most current-gen models.
  • lora_dropout: During fine-tuning, a random portion of the adapter parameters is “dropped out”. This mainly serves to prevent overfitting. High values might lead to underfitting, meaning the model will not “remember” what you told it. Low values might reduce the model’s ability to generalize over your data (because it just “clings” to the training data provided). To see how much of the model the adapters actually train for a given rank, check the sketch after this list.
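A minimal sketch for inspecting the number of trainable adapter parameters, using the peft API directly (this is only for inspection; the SFTTrainer below applies the config for you):

from peft import get_peft_model

# Wrap the quantized base model with the LoRA adapters defined above
peft_model = get_peft_model(model, qlora)

# Prints something like: trainable params: ... || all params: ... || trainable%: ...
peft_model.print_trainable_parameters()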

Almost there. Before running the training process, let’s define the training hyperparameters. Refer to the code comments for the ‘easier’ parameters and to the section below the code sample for a more detailed description of what they entail.

training_params = TrainingArguments(
    output_dir="llama3-finetuned", # directory to save to
    num_train_epochs=3,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True, # use gradient checkpointing to save memory
    optim="adamw_torch",
    logging_steps=10, # log every 10 steps
    save_strategy="epoch", # save checkpoint every epoch
    learning_rate=2e-4, # learning rate, based on QLoRA paper, keep it at that
    bf16=True, # use bfloat16 precision
    tf32=True, # use tf32 precision
    max_grad_norm=0.3, # max gradient norm based on QLoRA paper, keep it at that
    warmup_ratio=0.03, # warmup ratio based on QLoRA paper, keep it at that
    lr_scheduler_type="constant",
    push_to_hub=False, # don't push model to hugging face hub
    report_to="tensorboard", # report metrics to tensorboard
)

The Hugging Face Training Arguments explained

  • num_train_epochs: Total number of training epochs to perform. 3 is good for most use-cases. As a rule of thumb, when your training loss stops declining, you can stop training. Check your loss curve in the tensorboard to check, whether the training loss stays more or less constant after 1 or 2 epochs — reduce this parameter in such a case.
  • per_device_train_batch_size: Batch size per GPU for training, i.e. how many examples are processed at once. In general, set it as high as your GPU memory allows, as this makes the training process faster. Combine with gradient_accumulation_steps.
  • gradient_accumulation_steps: Number of update steps to accumulate the gradients for before performing backpropagation. So, after how many batches do you want to backpropagate? While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let’s say the per_device_train_batch_size=4 without gradient accumulation hits the GPU’s limit. If you would like to train with batches of size 64, do not set per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources. (From Hugging Face.) With the settings above, the effective batch size per GPU is per_device_train_batch_size * gradient_accumulation_steps = 3 * 2 = 6.
  • optim: Which optimizer to use. adamw_torch or adamw_torch_fused are best from our experience.
  • lr_scheduler_type: The learning rate scheduler to use. constant applies the same learning rate for the whole training run. cosine gradually decays the learning rate following a cosine curve over the course of training. For QLoRA applications, keep constant.

Okay, we are finally there. We can put it all together, instantiate our trainer class, and run the training process:

trainer = SFTTrainer(
    model=model,
    args=training_params,
    train_dataset=train_data,
    peft_config=qlora, # The LoRA config defined above
    max_seq_length=2048, # Maximum number of tokens
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={ # These settings are default as shown by many Hugging Face tutorials
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

trainer.train()
trainer.save_model()
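If you want to run inference in the same Python session afterwards, it can help to first free the GPU memory held by the training objects. A minimal sketch, assuming training has finished:

import gc

# Release the training objects and clear the CUDA cache
del model, trainer
gc.collect()
torch.cuda.empty_cache()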

The training script summarized

Find the python code for LLM fine-tuning summarized below:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, AutoPeftModelForCausalLM
from trl import setup_chat_format, SFTTrainer

train_data = load_dataset("json", data_files="dataset_training.json", split="train")

model_id = "meta-llama/Meta-Llama-3-8B" # The model to fine-tune, Hugging Face id

quantization = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization,
    attn_implementation="flash_attention_2", # Use 'eager', if not NVIDIA ampere or later architecture
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'
model, tokenizer = setup_chat_format(model, tokenizer)

qlora = LoraConfig(
    lora_alpha=8, # LoRA scaling factor. Set to either 8, 16, 32, 64 or 128
    r=16, # Rank, set to 16, 32, 64, 128 or 256. 16 seems best for most applications
    lora_dropout=0.05, # From QLoRA paper. Keep at 0.05
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_params = TrainingArguments(
    output_dir="llama3-finetuned", # directory to save to
    num_train_epochs=3,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True, # use gradient checkpointing to save memory
    optim="adamw_torch",
    logging_steps=10, # log every 10 steps
    save_strategy="epoch", # save checkpoint every epoch
    learning_rate=2e-4, # learning rate, based on QLoRA paper, keep it at that
    bf16=True, # use bfloat16 precision
    tf32=True, # use tf32 precision
    max_grad_norm=0.3, # max gradient norm based on QLoRA paper, keep it at that
    warmup_ratio=0.03, # warmup ratio based on QLoRA paper, keep it at that
    lr_scheduler_type="constant",
    push_to_hub=False, # don't push model to hugging face hub
    report_to="tensorboard", # report metrics to tensorboard
)

trainer = SFTTrainer(
    model=model,
    args=training_params,
    train_dataset=train_data,
    peft_config=qlora, # The LoRA config defined above
    max_seq_length=2048, # Maximum number of tokens
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={ # These settings are default as shown by many Hugging Face tutorials
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

trainer.train()
trainer.save_model()

Use the fine-tuned Large Language Model

Now that we have trained our model, we want to use it (and, as a matter of fact, evaluate it).

model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetuned", # Directory where to load the model and adapters from
    torch_dtype=torch.float16,
    quantization_config={"load_in_4bit": True}, # Whether to quantize the model during loading
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./llama3-finetuned")

# Create your input messages for the LLM
messages = [{"role": "system", "content": "Your system message"},
            {"role": "user", "content": "Your first prompt"}]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Strip the prompt tokens and decode the generated answer back into text
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

For our example, the results look pretty promising!
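If you plan to ship the model, you may also want to merge the LoRA adapters back into the base weights, so the result can be served like any regular model. A minimal sketch, assuming the adapters were saved to ./llama3-finetuned and that you have enough memory to load the model in half precision for merging:

from peft import AutoPeftModelForCausalLM

# Load the base model plus adapters in half precision (no 4-bit quantization here)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetuned",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge the adapters into the base weights and save a standalone model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama3-finetuned-merged", safe_serialization=True)
tokenizer.save_pretrained("llama3-finetuned-merged")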

If you want to deploy this model to production, read our follow-up piece on How to securely self-host your own LLM with vLLM and Caddy.
