Get Insight from your Business Data — Build LLM application with PEFT (with LoRA) using 🤗 Hugging Face

Ashish Kumar Jain
11 min read · Sep 16, 2023


In the second blog of our LLM series (Build LLM application with fine tuning), we discussed how we can build an LLM application using full fine tuning with our own business data. There were two issues with that approach.

  1. Full fine tuning updates all the weights of the model, so it requires as much compute and memory as pre-training. LLMs generally have billions of parameters (GPT-3 has 175B, and even a smaller model like Falcon-7B has 7B). Training also needs memory for more than the weights themselves: optimizer states, gradients, forward activations and temporary memory can add roughly 12 to 20 extra bytes of memory per model parameter. So we need a way to speed up training with less compute and memory.
  2. The other issue is catastrophic forgetting, where full fine tuning can improve the performance of the model on a particular task but reduce its ability on other tasks. Since in PEFT most of the weights of the LLM are not changed, it is less prone to catastrophic forgetting. We can even keep different PEFT weights per task.

PEFT (Parameter-Efficient Fine-Tuning)

PEFT aims to resolve the above problems using different methods, for example by training only a small set of parameters (with most layers frozen), which might be a subset of the existing model parameters or a set of newly added trainable layers or parameters. These methods differ in parameter efficiency, memory efficiency, training speed, final quality of the model, and additional inference costs (if any).

PEFT Methods

There are three main classes of methods: selective, additive, and reparameterization-based.

  1. Selective Method: We fine tune only a small subset of the initial model parameters. One example is attention tuning. Researchers have found that the performance of these selective methods is mixed and that there are significant trade-offs between parameter efficiency and compute efficiency.
  2. Additive Method: The main idea behind additive methods is to augment the existing pre-trained model with extra parameters or layers and train only the newly added parameters. There are two large categories here: adapters and soft prompts. Adapters introduce small fully-connected trainable layers within the Transformer architecture, whereas soft prompts aim to control the behavior of an LLM by modifying the input prompt while keeping the architecture fixed and frozen. We will see prompt tuning (not prompt engineering), which is one soft prompt method, in a future blog. :-)
  3. Reparameterization Method: Reparameterization-based methods leverage low-rank representations to minimize the number of trainable parameters. Low-rank representations aim to capture the underlying low-dimensional structure of high-dimensional data and have attracted much attention in pattern recognition and signal processing. The intuition behind reparameterization is to freeze the original LLM parameters and introduce a small number of new trainable parameters by creating low-rank transformations of the original LLM network. The most well-known and widely used reparameterization-based method is Low-Rank Adaptation (LoRA), which we will see in this blog.

LoRA (Low-Rank Adaptation of LLMs)

LoRA is the most widely used PEFT method; in most cases when someone says PEFT, they mean LoRA. In LoRA, we keep the original weights of the model frozen and inject small new trainable low-dimensional matrices. To keep things simple, I will not go into the details of the Transformer architecture and how model weights are computed in pre-training and fed into the self-attention layers. We can cover the Transformer architecture in a future blog. :-)

The idea is to inject two new low-dimensional matrices (call them A and B) alongside the actual weight matrix. The sizes of these two new matrices are chosen so that their product (call it C) has the same dimensions as the actual model weights. During fine tuning for a task, all pre-trained model parameters are kept frozen and only the A and B matrices are trained. Once fine tuning is complete, we have new weights, the product of the A and B matrices, that are trained for the specific task.

Assume the size of the actual weight matrix W is L1 × L2.

Size of matrix A: L1 × r (where r is the rank of the small matrices)

Size of matrix B: r × L2

So the product of these two matrices, C = A × B, has the same size L1 × L2 as W.

At inference time, the product matrix C can be merged into the original weights by simply adding C to the original matrix W and replacing the model's weights with the updated values. We now have a LoRA fine-tuned model that can carry out our specific task. As this model has the same number of parameters as the original, there is little to no impact on inference latency.
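
A minimal NumPy sketch, using the notation above with illustrative numbers (an assumption for this example, not taken from a real model), makes the shapes and the merge step concrete:

import numpy as np

L1, L2, r = 512, 64, 8           # illustrative sizes only

W = np.random.randn(L1, L2)      # frozen pre-trained weight matrix
A = np.random.randn(L1, r)       # trainable low-rank matrix A (L1 x r)
B = np.random.randn(r, L2)       # trainable low-rank matrix B (r x L2)

C = A @ B                        # product has the same shape as W
assert C.shape == W.shape        # (512, 64)

W_merged = W + C                 # merge step used at inference time
# In practice LoRA also scales the update by lora_alpha / r before adding.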

Researchers have found that applying LoRA only to the self-attention layers of the model is often enough to fine-tune it for a task and achieve performance gains.

Compute Savings

We can check the compute savings from LoRA with a practical example. Let's assume a transformer weight matrix has dimensions W = 512 × 64, which means 32,768 trainable parameters. If we use LoRA with rank 8, matrix A will have 512 × 8 = 4,096 parameters and matrix B will have 8 × 64 = 512 parameters. This reduces the trainable parameters from 32,768 to 4,608, an 86% reduction. This saving is big enough that we can often fine-tune a model using LoRA on a single GPU and avoid the need for a distributed cluster of GPUs. LoRA outperforms other methods (such as the selective BitFit and the additive adapters) and has been evaluated on models of up to 175B parameters.
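
The arithmetic above is easy to verify with a few lines of Python:

L1, L2, r = 512, 64, 8

full_params = L1 * L2                # 32,768 weights in the full matrix W
lora_params = (L1 * r) + (r * L2)    # 4,096 + 512 = 4,608 LoRA weights

reduction = 1 - lora_params / full_params
print(f"full={full_params}, lora={lora_params}, reduction={reduction:.0%}")  # ~86%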

LoRA for each Task

Another benefit of using LoRA is that fine tuning a task requires very little compute and the rank-decomposition LoRA matrices are small, so we can fine tune the same model on different tasks. Each fine tuning run generates its own LoRA matrices for that task, which can be switched in at inference time by adding the LoRA matrix weights to the model weights.

Let's understand this with an example. Assume we have fine-tuned a model using LoRA for a question-answer task. At inference time we take the product of the LoRA decomposition matrices (A × B), giving matrix C, add these weights to the frozen weights (W), and update the model. We can now use this updated model to carry out inference on the question-answer task. For a different task, like summarization, we can fine tune the same model using LoRA again; for inference we add the new LoRA decomposition weights (C) to the frozen weights (W) and update the model. The memory required to store these LoRA matrices is very small, so we can use LoRA to train for many tasks and switch the weights in when we need them, as the sketch below shows.
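
Here is a conceptual sketch of that switching, reusing the NumPy setup from earlier (the task names and random adapters are purely illustrative):

import numpy as np

L1, L2, r = 512, 64, 8
W = np.random.randn(L1, L2)   # shared frozen base weights

# One (A, B) pair per task; in practice each pair comes from a separate
# LoRA fine-tuning run on that task's data.
adapters = {
    "question-answer": (np.random.randn(L1, r), np.random.randn(r, L2)),
    "summarization":   (np.random.randn(L1, r), np.random.randn(r, L2)),
}

def weights_for(task):
    A, B = adapters[task]
    return W + A @ B          # merge the task's adapter into the frozen W

qa_weights = weights_for("question-answer")   # switch tasks by re-merging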

The first benefit we get here is memory savings: we avoid storing multiple full-size versions of the LLM, one per task. The second benefit is that it eliminates the problem of catastrophic forgetting, as we don't change the original weights of the LLM for a task.

The performance of models fine-tuned using LoRA is comparable to, or only slightly below, that of fully fine-tuned models, while the compute savings are significant, so the small trade-off in performance may well be worth it.

Value of Rank r

What should the value of the rank r be for our LoRA decomposition matrices? Researchers have found that ranks in the range of 4 to 32 (4, 8, 16, 32) can provide a good trade-off between reducing trainable parameters and preserving performance; see the sweep below. Optimizing the choice of rank is an ongoing area of research, and best practices may evolve in the near future.
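
Continuing the 512 × 64 example from above, a quick sweep shows how the trainable parameter count grows with r:

L1, L2 = 512, 64
for r in (4, 8, 16, 32):
    lora_params = L1 * r + r * L2
    print(f"r={r:2d}: {lora_params:5d} trainable params "
          f"({lora_params / (L1 * L2):.1%} of full fine tuning)")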

LoRA Fine-Tuning Implementation

In this blog we will use 🤗 Hugging Face, a platform where the machine learning community collaborates on models, datasets and applications. Hugging Face provides the 🤗 PEFT library, which offers the latest parameter-efficient fine-tuning techniques seamlessly integrated with Transformers, including LoRA. We will use the PEFT library for our implementation. We will also use Hugging Face to download one of the open source LLMs, FLAN-T5 from Google, and load it from the local machine. You can easily download this model from Hugging Face by cloning the model repository; this lets you run the code without internet access or in a very constrained environment. Downloading the model can take time depending on your network speed.

git lfs install
git clone https://huggingface.co/google/flan-t5-large

We will use Python code for illustration purposes. You can use this code in your own applications, and you can also refer to the Hugging Face site for more code references. You also need to install the Python libraries below to run the code.

! pip install 'transformers[torch]'
! pip install datasets
! pip install evaluate==0.4.0
! pip install rouge_score==0.1.2
! pip install peft==0.3.0
! pip install loralib==0.1.1

Implementation Steps

1-) We need to prepare the training data for fine tuning the model. Hugging Face provides various datasets which we can use for training. In a real scenario we would create the dataset from our own business data. Here we will create a small general-knowledge dataset for a question-answer task; the questions are taken from a general knowledge dataset on Hugging Face. You can create a similar, bigger dataset for your different tasks, like summarization, classification or question answering, with your business data. Hugging Face provides functionality with which you can easily create a Hugging Face dataset from your business data.

Sample Dataset Format

{
  "version": "0.1.0",
  "data": [
    {
      "Question": "Generate a list of three uses of big data?",
      "Answer": "1. Big data can be used to identify patterns and trends in customer behavior.\n2. Big data can be used to improve customer service and experience.\n3. Big data can be used to develop predictive models for marketing and sales."
    },
    {
      "Question": "Predict the weather tomorrow morning?",
      "Answer": "Tomorrow morning is expected to be sunny with temperatures ranging from 15 to 19 degrees Celsius."
    }
  ]
}

Hugging Face provides the Datasets library for easily accessing and sharing datasets. We can load a dataset in a single line of code from multiple sources (the Hugging Face Hub, local file systems, in-memory data, etc.) and in different formats (CSV, JSON, Parquet, Arrow, SQL for reading from a database, etc.), and use its powerful data processing methods to quickly get our dataset ready for training with the LLM.

from datasets import load_dataset

data_files = {"train": "dataset/GK/train.json",
              "test": "dataset/GK/test.json",
              "validation": "dataset/GK/validation.json"
              }
dataset = load_dataset("json", data_files=data_files, field="data")
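
A quick sanity check confirms what was loaded (the exact columns depend on your JSON files):

print(dataset)             # DatasetDict with train, test and validation splits
print(dataset['train'][0]) # first record, e.g. {'Question': ..., 'Answer': ...}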

2-) Hugging Face provides tokenizer classes which are in charge of preparing the inputs for a model. We will use the open source FLAN-T5-LARGE model from Hugging Face and load it from local storage. It is a good encoder-decoder instruct model and shows good capability on many tasks.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

modelPath = "model/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(modelPath)
base_model = AutoModelForSeq2SeqLM.from_pretrained(modelPath)
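
A quick round trip shows what the tokenizer produces (the sample sentence is just for illustration):

sample = "Generate a list of three uses of big data?"
ids = tokenizer(sample, return_tensors="pt").input_ids
print(ids)                                                # tensor of token ids
print(tokenizer.decode(ids[0], skip_special_tokens=True)) # back to text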

3-) We will create an instructed dataset for training by converting the question-answer pairs into explicit instructions for the LLM.

Prompt instruction format

Assuming you are working as General Knowledge instructor. Can you
please answer the below question?

Question : Generate a list of three uses of big data
Answer : 1. Big data can be used to identify patterns and trends in customer behavior.\n2. Big data can be used to improve customer service and experience.\n3. Big data can be used to develop predictive models for marketing and sales.

Create the instruction dataset in the above format:

def prompt_generator(batchData):
    start = 'Assuming you are working as General Knowledge instructor. Can you please answer the below question?\n\n'
    end = '\n Answer: '
    training_prompt = [start + question + end for question in batchData['Question']]
    # truncation=True keeps overlong examples within the model's max length
    batchData['input_ids'] = tokenizer(training_prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    batchData['labels'] = tokenizer(batchData['Answer'], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return batchData

instructed_datasets = dataset.map(prompt_generator, batched=True)
instructed_datasets = instructed_datasets.remove_columns(['Question', 'Answer'])
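
Decoding one training example back to text is a useful check that the prompt template was applied correctly:

sample_ids = instructed_datasets['train'][0]['input_ids']
print(tokenizer.decode(sample_ids, skip_special_tokens=True))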

4-) Let's create the LoRA configuration, where we specify the rank r and other configuration parameters as hyperparameters. We then create the PEFT version of the base model, through which we can train the new LoRA matrices (the adapter).

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,                            # rank of the LoRA decomposition matrices
    lora_alpha=32,                   # LoRA scaling factor
    target_modules=["q", "v"],       # apply LoRA to the query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5 is a sequence-to-sequence model
)
peft_model = get_peft_model(base_model, lora_config)
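
The PEFT model can report how few of its parameters are actually trainable, which is a good sanity check before training:

peft_model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...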

5-) We will use the PyTorch framework for PEFT/LoRA fine tuning of the model. The Hugging Face Trainer class provides an API for feature-complete training in PyTorch for most standard use cases. Before instantiating our Trainer object, we create a TrainingArguments object to access all the points of customization during training. In the code below I am using only 1 epoch (and a single step) for model training; you can choose the number of epochs and other training parameters based on the compute and memory available and on the final model evaluation results.

from transformers import TrainingArguments, Trainer
import time

output_dir = f'./model/peft-trained-model-output/flan-output-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",  # evaluate on the validation split every epoch
    learning_rate=1e-3,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1                   # overrides num_train_epochs; raise this for real training
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=instructed_datasets['train'],
    eval_dataset=instructed_datasets['validation']
)
trainer.train()

6-) After training you can save the PEFT model for future evaluation and inference. If you inspect the saved model, you will find that only the adapter part is saved, and its size is very small compared to the base model.

saved_dir = f'./model/peft-trained-model/flan-trained-{str(int(time.time()))}'

tokenizer.save_pretrained(saved_dir)
peft_model.save_pretrained(saved_dir)
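
Listing the saved directory shows that only small adapter files are stored (file names here assume the peft 0.3.0 conventions, e.g. adapter_model.bin and adapter_config.json):

import os

for name in os.listdir(saved_dir):
    size_mb = os.path.getsize(os.path.join(saved_dir, name)) / 1e6
    print(f"{name}: {size_mb:.1f} MB")  # the adapter is MBs, the base model is GBs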

7-) We can load the saved PEFT adapter from the local file system along with its base FLAN-T5 model. We pass is_trainable=False as we will use it only for inference, not for further training.

from peft import PeftModel

peft_model = PeftModel.from_pretrained(base_model,'model/peft-trained-model/flan-trained-1693655699', is_trainable=False)

8-) We can evaluate the model with two approaches: qualitatively with a human in the loop, and quantitatively with the ROUGE metric.

For generative AI applications, a qualitative approach where we ask ourselves the question "Is my model behaving the right way?" is usually a good starting point. We can judge this by manually comparing the actual answers with the answers given by the PEFT model, using our test dataset for evaluation.

from transformers import GenerationConfig
import pandas as pd

questions = dataset['test']['Question']
actual_answers = dataset['test']['Answer']
peft_model_answers = []

for question in questions:
    prompt = f"""
Assuming you are working as General Knowledge instructor. Can you please answer the below question?

{question}
Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_answers.append(peft_model_text_output)

answers = list(zip(questions, actual_answers, peft_model_answers))
df = pd.DataFrame(answers, columns=['question', 'actual answer', 'peft model answer'])
df

9-) The other evaluation approach is quantitative. The ROUGE metric helps quantify the validity of the answers produced by the model by comparing them to the actual answers in our test dataset. You can read more about this in the ROUGE metric documentation.

import evaluate

rouge = evaluate.load('rouge')
peft_model_results = rouge.compute(
    predictions=peft_model_answers,
    references=actual_answers,
    use_aggregator=True,
    use_stemmer=True,
)
print(peft_model_results)
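
To see what the adapter actually bought us, we can compute the same ROUGE scores for the untouched base model and compare. This is a sketch reusing the questions, tokenizer and generation settings from step 8; it reloads a fresh copy of the base model so that no adapter weights are involved:

from transformers import AutoModelForSeq2SeqLM, GenerationConfig

original_model = AutoModelForSeq2SeqLM.from_pretrained(modelPath)  # fresh, adapter-free copy

base_model_answers = []
for question in questions:
    prompt = f"""
Assuming you are working as General Knowledge instructor. Can you please answer the below question?

{question}
Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    base_model_answers.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

base_model_results = rouge.compute(
    predictions=base_model_answers,
    references=actual_answers,
    use_aggregator=True,
    use_stemmer=True,
)
print(base_model_results)  # compare against peft_model_results above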

10-) PEFT training is an iterative process; based on the evaluation results, we may need to go through the training process again, tuning the training parameters or changing our datasets.

11-) Once we have finalized the PEFT adapters, we can deploy our base model to the model hub with different adapters trained for each task and serve our clients or applications through an API.

Thanks for reading this blog. You can download the source code for this blog from my Git repository.

References -

1-) Paper on LoRA: The paper outlines the PEFT method using LoRA.

2-) Paper on PEFT: This paper presents a systematic overview and comparison of parameter-efficient fine-tuning methods.

3-) https://www.coursera.org/learn/generative-ai-with-llms/

4-) https://huggingface.co/

5-) https://eugeneyan.com/writing/llm-patterns/

Originally published at https://www.linkedin.com.
