The Ultimate Guide to Fine-Tuning Large Language Models with Hugging Face

Jayesh Suthar
14 min read · May 27, 2024


In the rapidly evolving world of artificial intelligence, fine-tuning large language models (LLMs) has become a crucial skill for developers and data scientists aiming to tailor AI capabilities to specific tasks. Hugging Face, a leader in natural language processing (NLP) tools, offers robust libraries that simplify this process, enabling the customization of pre-trained models with your unique datasets.

In this article, we will delve into the step-by-step process of fine-tuning an LLM using Hugging Face libraries. Whether you’re looking to enhance your model’s performance on niche tasks or personalize it for specific applications, this guide will provide you with the knowledge and tools to achieve remarkable results. From preparing your custom dataset to implementing fine-tuning techniques, we’ll cover everything you need to transform a general-purpose language model into a specialized powerhouse tailored to your needs.

Join us as we unlock the potential of your data and explore the advanced methodologies that make fine-tuning with Hugging Face an essential practice for any AI enthusiast or professional.

Before diving into the technical details of fine-tuning Large Language Models (LLMs) using Hugging Face libraries, it’s crucial to understand why fine-tuning is necessary. You might wonder why we can’t just use the foundation model as it is. To answer this, we need to examine the limitations of foundation models and the benefits of fine-tuning.

What’s the need for fine-tuning LLMs?

Foundation LLMs, in their basic form, can be thought of as sophisticated sentence completers. They are trained primarily to predict the next word in a sequence, given the previous words. While this makes them incredibly powerful for generating coherent text, it also means that their responses are solely based on the data they were originally trained on. When you prompt a foundation model with a query, it attempts to generate the next word or phrase based on its training data. This can result in several issues:

  1. Irrelevant Responses: Since the model is trained to predict the next word without specific context, it might generate content that is irrelevant to your query.
  2. Generic Outputs: The responses can be too general, lacking the specificity required for particular tasks or domains.
  3. Lack of Focus: The model might ask follow-up questions or veer off-topic, as it tries to continue the conversation based on its training.

These limitations arise because foundation models are designed to be versatile and broad, rather than tailored to specific applications. They excel at general language tasks but often fall short when precision and relevance are crucial.

The Role of Fine-Tuning

Fine-tuning addresses these limitations by adapting the foundation model to better suit specific tasks or domains. This process involves training the model further on a custom dataset that contains examples relevant to the desired application. Fine-tuning helps in several ways:

  1. Enhanced Relevance: The model becomes better at generating responses that are directly relevant to the context provided by the custom dataset.
  2. Improved Specificity: It can produce more specific and accurate outputs tailored to particular tasks or industries.
  3. Contextual Understanding: Fine-tuning helps the model understand and maintain context more effectively, reducing the likelihood of irrelevant or off-topic responses.

By fine-tuning an LLM with a custom dataset, you transform a general-purpose tool into a specialized one, capable of delivering high-quality, relevant, and context-aware responses. This is why fine-tuning is an essential step for anyone looking to deploy LLMs in specific applications or industries.

Fine-Tuning Steps

Fine-tuning a Large Language Model (LLM) involves several key steps to ensure that the model adapts effectively to the specific task or domain. Here’s a structured outline of the process:

1. Select a Pre-Trained Model

The first step in LLM fine-tuning is to carefully select a base pre-trained model that aligns with your desired architecture and functionalities.

2. Gather Relevant Dataset

Next, you need to gather a dataset that is relevant to your task. This dataset should be structured in a way that allows the model to learn from it. The more specific and high-quality your dataset, the better the model will perform on the task at hand.

3. Preprocess Dataset

Once your dataset is ready, preprocessing is essential to prepare it for fine-tuning. This involves the following steps (a quick splitting sketch follows the list):

  • Cleaning: Removing any irrelevant or noisy data.
  • Splitting: Dividing the dataset into training, validation, and test sets.
  • Formatting: Ensuring the data is compatible with the model’s requirements.
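To make the splitting step concrete, here is a minimal sketch using the Hugging Face datasets library. The dataset name is the one we will use later in this guide, and the 80/10/10 ratios are only an example:

from datasets import load_dataset

# Illustrative 80/10/10 split of a Hugging Face dataset into train/validation/test.
raw = load_dataset("b-mc2/sql-create-context", split="train")
split_1 = raw.train_test_split(test_size=0.2, seed=42)              # 80% train, 20% held out
split_2 = split_1["test"].train_test_split(test_size=0.5, seed=42)  # half of the held-out rows for validation, half for test
splits = {
    "train": split_1["train"],
    "validation": split_2["train"],
    "test": split_2["test"],
}
print({name: len(ds) for name, ds in splits.items()})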

4. Fine-Tuning

With a pre-trained model selected and a relevant, preprocessed dataset in hand, the next step is to fine-tune the model. This involves training the model on your specific dataset, allowing it to adapt and specialize in the context of your task or domain. The fine-tuning process modifies the model’s parameters to better fit the patterns and nuances of your data, enhancing its performance for the intended application.

5. Evaluating and Comparing

After training is complete, it’s crucial to evaluate the fine-tuned model. This involves:

  • Response Evaluation: Assessing the model’s responses against the correct answers from your dataset.
  • Comparison with Base Model: Comparing the fine-tuned model’s performance with that of the base (before fine-tuning) model to determine the effectiveness of the fine-tuning process.

Setting up the environment

While we will utilize a Kaggle notebook for this demonstration, feel free to use any Jupyter notebook environment. Kaggle offers a generous allowance of 30 hours of free GPU usage per week, which is ample for our experimentation.

To begin, let’s open a new notebook, establish some headings, and then proceed to connect to the runtime. Here, we will select the GPU P100 as the ACCELERATOR. Feel free to try other GPU options available in Kaggle or any other environment.

In this tutorial, we will be using HuggingFace libraries to download and train the model. To download models from HuggingFace, we will need an Access Token. If you’ve already signed up with HuggingFace, you can generate a new Access Token from the settings section or use any existing Access Token.

Installing necessary libraries:
Now, let’s install the necessary libraries for this experiment.

pip install -q -U bitsandbytes transformers peft accelerate datasets trl

Don’t worry, I will explain all these packages as we move forward.

Now, log in to Hugging Face with your access token.

from huggingface_hub import login

login(
    token="",  # ADD YOUR TOKEN HERE
    add_to_git_credential=True
)

import os
# disable Weights and Biases
os.environ['WANDB_DISABLED'] = "true"

import torch

Cool, now that we have the environment set up, we will proceed to load our pre-trained LLM from Hugging Face.

Loading the Pre-Trained Model

You might be wondering about the practical steps involved in downloading model weights, reconstructing the transformers model, setting up the attention blocks, and dense layers, and managing the tokenizer. It can seem daunting, but thankfully, the Hugging Face Transformers library simplifies this process considerably.

The Hugging Face Transformers library handles many of the complex steps involved in fine-tuning a model.

Using Auto-Class for Simplification

The Transformers library offers Auto-Classes, such as AutoModel, AutoTokenizer, and AutoConfig, which automate much of the setup.

from transformers import AutoTokenizer, AutoModelForCausalLM
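Under the hood, the Auto-Classes read the checkpoint’s configuration to decide which concrete classes to instantiate. A quick, purely illustrative way to see this (the printed values are what I would expect for this checkpoint):

from transformers import AutoConfig

# The checkpoint's config.json tells the Auto-classes which architecture to build.
config = AutoConfig.from_pretrained("google/gemma-2b-it")
print(type(config).__name__)   # e.g. GemmaConfig
print(config.architectures)    # e.g. ['GemmaForCausalLM']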

Alright, now we will pick which pre-trained model to use for this use case. For experimental purposes I have used Google’s gemma-2b-it (instruction-tuned) model; this can change based on your particular needs.

But before we download and load the weights, we have to understand an important concept: quantization. It can help us save a huge amount of compute resources and training time.

Quantization

Quantization in Large Language Models (LLMs) is a technique used to reduce the computational and memory requirements of these models by converting the high-precision floating-point weights and activations into lower-precision representations, such as 8-bit or 4-bit integers. This process significantly reduces the model size and speeds up inference, making it feasible to deploy large models on resource-constrained devices such as mobile phones and edge servers.

In our scenario with limited GPU memory, leveraging quantized LLM weights becomes a practical solution. While this approach may result in a reduction in model accuracy due to the lower precision of weight values, it’s important to note that we intend to compensate for this loss through subsequent fine-tuning steps.

To implement this we will be using Bitsandbytes.

from transformers import BitsAndBytesConfig

Finally, our code will look like this:

compute_dtype = getattr(torch, "float16")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

model_name = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device_map = {"": 0}
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device_map,
)
tokenizer.padding_side = 'right'  # to prevent warnings
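If you want a rough sense of what the 4-bit load saves, you can check the model’s reported memory footprint. This is a quick sanity check, not an exact measure of peak GPU usage:

# A 2B-parameter model in float16 needs roughly 2B * 2 bytes ≈ 4 GB for weights alone;
# the 4-bit NF4 load above should report a much smaller footprint.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")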

Inferencing our model

You can run inference with our model in two ways:

Using pipeline from transformers

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "A list of colors: red, blue"  # example prompt; replace with your own
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=False,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    eos_token_id=pipe.tokenizer.eos_token_id,
    pad_token_id=pipe.tokenizer.pad_token_id,
)
print(f"Generated Answer:\n{outputs[0]['generated_text']}")

Using generate() function

model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Cool stuff, right? Generating text from an LLM through code. You can play with both ways of text generation, and I recommend going through the Hugging Face documentation for both as well.

Let’s now move towards Dataset preparation for fine-tuning.

Create and prepare the dataset

While picking the dataset for fine-tuning, we have to make sure it is a diverse set of demonstrations of the task we want to solve.

In our example we will use an existing dataset called b-mc2/sql-create-context, which contains samples of natural language instructions, schema definitions, and the corresponding SQL queries.

Now, let’s look into the format of the data we need. We have two popular formats of dataset which are:

Conversational format

{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Instruction format

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

These two formats are widely supported. If your dataset uses one of them, you can pass it directly to the trainer without additional pre-processing.
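For reference, a dataset in one of these formats is typically just a JSON Lines file, which the datasets library can load directly. A tiny, hypothetical instruction-format example (the file name and records are made up for illustration):

import json
from datasets import load_dataset

# Write two hypothetical instruction-format records to a JSONL file and load them back.
records = [
    {"prompt": "Translate to French: Hello", "completion": "Bonjour"},
    {"prompt": "Translate to French: Goodbye", "completion": "Au revoir"},
]
with open("demo_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

demo_dataset = load_dataset("json", data_files="demo_train.jsonl", split="train")
print(demo_dataset[0])  # {'prompt': 'Translate to French: Hello', 'completion': 'Bonjour'}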

Our dataset is in a different format, so we will convert it to the conversational format.

We will be using another Hugging Face library, datasets, to load the data. We load the dataset and select a subset of 12,500 rows, of which 10,000 will be our training set.

from datasets import load_dataset

system_message = """You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
{schema}"""

def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": system_message.format(schema=sample["context"])},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]},
        ]
    }

dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.shuffle().select(range(12500))

dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
dataset = dataset.train_test_split(test_size=2500/12500)

print(dataset["train"][345]["messages"])

Let’s also convert our model and tokenizer to use the conversational format, with the setup_chat_format helper from the trl library.

from trl import setup_chat_format

model, tokenizer = setup_chat_format(model, tokenizer)
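A quick way to check what the model will actually see during training is to render one example with the chat template. The exact special tokens depend on the template that setup_chat_format installs:

# Render one training example as plain text to see what the model will be trained on.
example_messages = dataset["train"][0]["messages"]
print(tokenizer.apply_chat_template(example_messages, tokenize=False))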

Alright, now that the dataset is also ready, let’s move on to the fine-tuning part.

Supervised Fine-Tuning LLM

Before jumping to the code, let’s first talk a little about Transfer learning.

Transfer learning

Transfer learning is a technique in machine learning where a model developed for one task is reused as a starting point for a model on a related task. By leveraging the knowledge gained from the pre-trained model, transfer learning allows for faster training and often results in improved performance on the new task.

There are various methods of transfer learning; you can read more about them on the internet, as we won’t be delving into further detail here. The one method we will discuss briefly is adapters.

Adapters

The adapter approach adds new modules, called adapters, between the layers of a pre-trained network. The parameters copied over from pre-training remain fixed, and only a few additional task-specific parameters are added for each new task, all without affecting the previously learned ones.

Keep in mind that only a small number of parameters are introduced in the proposed adapter-based tuning architecture, with the intention of keeping the original network unaffected and the training stable.

Read more about it here.

Now, finally we come to the fine-tuning technique that we will be employing and that is LoRA.

LoRA: Low-Rank Adaptation of Large Language Models

LoRA is a technique that improves fine-tuning by representing the update to each large weight matrix of a pre-trained model as the product of two much smaller matrices with rank r. These smaller matrices form the LoRA adapter, and fine-tuning is performed exclusively on them. Once fine-tuned, the adapter is loaded alongside the pre-trained model for inference.

After fine-tuning with LoRA for a specific task or use case, the original large language model (LLM) remains unchanged. Instead, a considerably smaller “LoRA adapter” is created, often representing only a single-digit percentage of the original LLM’s size, measured in megabytes rather than gigabytes. During inference, the LoRA adapter is combined with the original LLM. The primary advantage is that multiple LoRA adapters can reuse the same original LLM, thereby significantly reducing overall memory requirements when handling various tasks and use cases.
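A quick back-of-the-envelope calculation shows why the adapters are so small: for a single d × k weight matrix, a full update has d·k trainable parameters, while a rank-r LoRA adapter only adds r·(d + k). The dimensions below are illustrative, not Gemma’s actual shapes:

# Illustrative parameter count for one weight matrix with LoRA rank r = 32.
d, k, r = 2048, 2048, 32
full_update_params = d * k           # 4,194,304
lora_adapter_params = r * (d + k)    # 131,072
print(f"full: {full_update_params:,}  lora: {lora_adapter_params:,}  "
      f"ratio: {lora_adapter_params / full_update_params:.2%}")   # ~3.13%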

What is Quantized LoRA (QLoRA)?

QLoRA is a more memory-efficient variant of LoRA. It takes LoRA a step further by also quantizing the frozen base-model weights (typically to 4-bit), while the LoRA adapter weights themselves are kept in higher precision for training.

The combination of techniques like these is referred to as PEFT (Parameter-Efficient Fine-Tuning). Now let’s move to some code to demonstrate how this will be incorporated in our use case.

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.05,
    r=32,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)

Here we take the rank as 32, and LoRA adapters will be applied to the Q, K, and V matrices of the attention blocks as well as the dense layers.

Training Preparation

We will be using SFTTrainer from a library called TRL (Transformer Reinforcement Learning) for supervised fine-tuning.

The SFTTrainer is a subclass of the Trainer from the transformers library and supports all the same features, including logging, evaluation, and checkpointing, but adds additional quality of life features, including:

  • Dataset formatting, including conversational and instruction format
  • Training on completions only, ignoring prompts
  • Packing datasets for more efficient training
  • PEFT (parameter-efficient fine-tuning) support including Q-LoRA
  • Preparing the model and tokenizer for conversational fine-tuning (e.g. adding special tokens)

Before we can start our training we need to define the hyperparameters (TrainingArguments) we want to use.

from transformers import TrainingArguments

output_dir = './peft-gemma-2b-sql-SFTT'

args = TrainingArguments(
    output_dir=output_dir,               # output directory
    num_train_epochs=1,                  # number of epochs to train
    per_device_train_batch_size=1,       # per-device batch size loaded on each device
    gradient_accumulation_steps=4,       # gradient accumulation steps for mini-batches
    gradient_checkpointing=True,         # gradient checkpointing to save memory
    optim="adamw_torch_fused",
    logging_steps=25,                    # logging steps
    save_strategy="steps",               # save strategy; can also be "epoch"
    learning_rate=2e-4,
    fp16=True,                           # use fp16; if your GPU supports bf16, use that instead
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    max_steps=1000,                      # max_steps will override the training length
    save_steps=100,                      # save a checkpoint every save_steps
    overwrite_output_dir=True,           # will overwrite the output directory's contents
)

Here we are not evaluating our model while training. If you want to add an evaluation step, pass these arguments too (a sketch follows the list):

  • evaluation_strategy="steps" (or "epoch")
  • eval_steps=100
  • do_eval=True
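A sketch of what the evaluation-enabled arguments could look like. The argument names come from transformers’ TrainingArguments, the values are illustrative, and you would also pass eval_dataset=dataset['test'] to the SFTTrainer:

# Illustrative: the same TrainingArguments as above, with evaluation switched on.
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,    # batch size used during evaluation
    evaluation_strategy="steps",     # or "epoch"
    eval_steps=100,                  # evaluate every 100 steps
    do_eval=True,
    # ...plus the remaining training arguments shown above
)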

Alright, we now have all the building blocks to build our SFTTrainer and start the training.

Let’s begin.

from trl import SFTTrainer

max_seq_length = 1024  # max sequence length for the model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,   # we template with special tokens
        "append_concat_token": False,  # no need to add an additional separator token
    }
)

SFTTrainer supports example packing, where multiple short examples are packed in the same input sequence to increase training efficiency.
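Conceptually, packing concatenates many short tokenized examples (separated by an end-of-sequence token) and slices the stream into fixed-length blocks, so very few tokens are wasted on padding. A toy illustration of the idea, not TRL’s actual implementation:

# Toy packing: concatenate tokenized examples with an EOS separator, then chunk into blocks.
def pack(tokenized_examples, block_size, eos_id):
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids + [eos_id])
    return [stream[i:i + block_size] for i in range(0, len(stream) - block_size + 1, block_size)]

print(pack([[5, 6], [7, 8, 9], [10]], block_size=4, eos_id=0))
# [[5, 6, 0, 7], [8, 9, 0, 10]]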

Train the model and save it using:

# start training; the model will be saved to the output directory (and to the Hub, if you configure push_to_hub)
trainer.train()

# save model
trainer.save_model()

Wait until the training is complete; how long it takes will depend on the GPU you are using and, of course, on the parameters. Do play with the training arguments and check how they affect training time and resource usage. For example, ‘gradient_accumulation_steps’ lets you keep the per-device batch small (saving GPU memory) while still training with a larger effective batch size: gradients from several forward and backward passes are accumulated before the weights are updated, at the cost of a somewhat longer training time.
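The effective batch size the optimizer sees is the product of the per-device batch size, the number of accumulation steps, and the number of GPUs. With the arguments above on a single GPU:

# Effective batch size with the TrainingArguments used above (single GPU assumed).
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 1
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 4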

Once training is complete, free up some memory.

del model
torch.cuda.empty_cache()

Test and evaluate the fine-tuned LLM

Great, we have finally reached the point where the training part is done. That was quite a ride, right?

Now let’s load our fine-tuned model with the help of AutoPeftModelForCausalLM from peft, which loads the model together with its adapters. And yes, the tokenizer too.

from peft import AutoPeftModelForCausalLM

peft_model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    device_map="auto",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
pipe = pipeline("text-generation", model=peft_model, tokenizer=tokenizer)

output_dir is where we saved our fine-tuned model.
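If you would rather ship a single standalone model than base weights plus a separate adapter, peft can merge the LoRA weights into the base model. A sketch, with a made-up output path:

# Optional: fold the LoRA adapter into the base weights and save a standalone model.
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./gemma-2b-sql-merged")   # hypothetical output path
tokenizer.save_pretrained("./gemma-2b-sql-merged")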

Nice, we have our fine-tuned model ready. Let’s evaluate how well it performs on the evaluation dataset and compare it with the base (before fine-tuning) model.

So first, let’s load the base model too.

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map=device_map,
)
base_tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model, base_tokenizer = setup_chat_format(base_model, base_tokenizer)
base_model_pipe = pipeline("text-generation", model=base_model, tokenizer=base_tokenizer)

We will use apply_chat_template to prepare our prompt in conversational format and leave the assistant block empty for generation.

Looping over 5 samples and generating outputs from both models.

from random import randint

eval_dataset = dataset['test']
for i in range(5):
    rand_idx = randint(0, len(eval_dataset) - 1)

    # Build prompts from the system and user messages, leaving the assistant turn empty
    prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
    base_model_prompt = base_model_pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)

    # Generate with the fine-tuned model and with the base model
    outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    base_model_outputs = base_model_pipe(base_model_prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=base_model_pipe.tokenizer.eos_token_id, pad_token_id=base_model_pipe.tokenizer.pad_token_id)

    print(f"Context:\n{eval_dataset[rand_idx]['messages'][0]['content']}")
    print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
    print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}\n")
    print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}\n")
    print(f"Base Model Generated Answer:\n{base_model_outputs[0]['generated_text'][len(base_model_prompt):].strip()}")

    print("\n\n")

And it generates something like this:

Context:
You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
CREATE TABLE table_name_77 (score VARCHAR, losing_team VARCHAR, total VARCHAR)
Query:
Which Score has a Losing Team of sydney roosters, and a Total of 88?
Original Answer:
SELECT score FROM table_name_77 WHERE losing_team = "sydney roosters" AND total = 88

Generated Answer:
SELECT score FROM table_name_77 WHERE losing_team = "sydney roosters" AND total = 88

Base Model Generated Answer:
Which Score has a Losing Team of sydney roosters, and a Total of 88?

As you can see from how our model performs on the eval_dataset compared to the model before fine-tuning, there is a significant improvement in the PEFT model over the original model.
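If you prefer a number over eyeballing outputs, a simple exact-match check over a sample of the test split can serve as a rough score. This is just a sketch; SQL queries can be correct without matching the reference string exactly, so treat it as a lower bound:

# Rough exact-match accuracy of the fine-tuned model on 50 test samples.
def exact_match_accuracy(generation_pipe, samples):
    hits = 0
    for sample in samples:
        prompt = generation_pipe.tokenizer.apply_chat_template(
            sample["messages"][:2], tokenize=False, add_generation_prompt=True
        )
        out = generation_pipe(
            prompt,
            max_new_tokens=256,
            do_sample=False,
            eos_token_id=generation_pipe.tokenizer.eos_token_id,
            pad_token_id=generation_pipe.tokenizer.pad_token_id,
        )
        prediction = out[0]["generated_text"][len(prompt):].strip()
        reference = sample["messages"][2]["content"].strip()
        hits += int(prediction == reference)
    return hits / len(samples)

print(f"Exact match: {exact_match_accuracy(pipe, eval_dataset.select(range(50))):.2%}")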

GREAT SUCCESS!!!

To access the full notebook visit my github: https://github.com/govegito/fine_tuning_LLM/blob/main/finetune-gemma2b-sftt.ipynb

I have also created another notebook where, instead of the conversational format, I use a custom formatting function and pass it to the SFTTrainer object, which automatically applies that formatting function before processing the dataset. In that notebook we also train on completions only rather than on the whole prompt, which results in faster training.

Here is the link to that one too: https://github.com/govegito/fine_tuning_LLM/blob/main/fine-tune-gemma2b-sftt-completion-only.ipynb

Thank you for staying with me till the end. I hope you now have a much deeper understanding of fine-tuning LLMs, the various libraries available, and the different techniques that can be employed. You also know how to train these models for specific use cases. This knowledge can benefit you as a developer, enabling you to come up with new ideas for applications that leverage the power of LLMs. With this expertise, you can build products that cater to niche domains, unlocking new possibilities and innovations.

Thank you for accompanying me on this journey, and I look forward to embarking on another such implementation with you in the future. Until then,

PEACE OUT!!

