Get Insight from your Business Data — Build LLM application with Fine Tuning using Hugging Face

Ashish Kumar Jain
9 min read · Aug 22, 2023


Full Fine tuning of LLM

In our last blog, we saw that there are two approaches (Fine-Tuning and RAG) with which we can create LLM-based applications using our own data. We already covered RAG in the previous blog, Build LLM application with Langchain and Hugging Face using RAG. Now we can explore Fine-Tuning, which is also called Instruction Fine-Tuning.

Fine Tune (Instruction Fine Tuning)

Instruction fine-tuning is a method to improve the performance of an existing pre-trained model on our specific use case or task (like question answering, summarization, translation, classification, information retrieval, invoking APIs and actions, etc.) using our own data. The intent is to utilize the knowledge of the pre-trained model and apply it to a specific task, usually involving a smaller, task-specific dataset. It is a great choice when we have a large amount of task-specific labeled data and want to get insight from that data or chat with it. For example, suppose we want to summarize agent-customer chats in a call center to get better insight into complex chat data. We can fine-tune the LLM on a chat history database with labeled summaries and then run inference on the model against real-time chat history. Since fine-tuning can be computationally expensive, time consuming, and demanding on infrastructure (GPUs and memory), do we have any other option for doing similar tasks without fine-tuning at all? Fortunately, yes: there is one more option worth exploring before committing to fine-tuning, called "in-context learning".

In-context Learning

When we pass a prompt to an LLM, we sometimes do not get the correct answer, so we tune our prompt again and again to get a better result. This is known as prompt engineering. "In-context learning" extends this idea by providing task examples inside the prompt along with our query. Providing examples and additional data about the task helps the LLM learn the task better and produce better results. For example, to solve a math equation we can include an already-solved example equation within the prompt; it helps the LLM learn the pattern and produce a better result. Based on the number of examples we pass in the prompt along with our query, we name the technique accordingly: with zero examples it is called Zero Shot Inference, with one example it is called One Shot Inference, and with a few examples (mostly 5-6) it is called Few Shot Inference. See below an example of One Shot Inference.
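For illustration, a one shot prompt for a sentiment classification task could look like the one below (the reviews are made up for demonstration):

Classify this review: I loved this movie, the acting was wonderful!
Sentiment: Positive

Classify this review: The plot was dull and predictable.
Sentiment:

The model follows the pattern of the single solved example and completes the prompt with a sentiment for the second review.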

Typically, larger models (like GPT-4 or Llama 3) are very good at zero shot inference. These models can complete many tasks even though they were not trained on those tasks. Smaller models are normally able to perform only the tasks they were trained on.

A limitation of in-context learning is that LLMs are constrained by the context window of the prompt (4,096 tokens in GPT-3), so we cannot pass many examples in the prompt. However, context windows are getting larger and larger with new model releases (32,768 tokens in GPT-4). Another limitation is that in-context learning is not good at solving complex tasks or performing multi-step reasoning. To be practically useful, an LLM should be capable of such complex tasks. We can cover in-context learning in more detail in a future blog. :-)

If in-context learning does not work for our use case (task), then we can go for fine-tuning our model.

Fine-tuning provides greater control over LLM behavior, resulting in a more robust product.

Full Fine Tuning Process

To improve the performance of the model on a particular task, we train the base pre-trained model (like GPT-4) with labeled data on a single task. The base LLM is already pre-trained on a vast amount of unstructured textual data via self-supervised learning, whereas fine-tuning is a supervised learning process in which we use a dataset of labeled examples to update the weights of the existing base LLM for a particular task. The training dataset contains prompt-completion pairs for different examples, and the fine-tuning process trains the model on this dataset to improve its ability to generate good completions for the specific task (an illustrative pair is shown below).

Fine-tuning that updates all the weights of the model is called full fine-tuning. The other approach is called PEFT (Parameter Efficient Fine-Tuning), where we update only a small subset of parameters. Keep in mind that full fine-tuning requires enough compute and memory to train the model, just like pre-training. In this blog we will cover full fine-tuning; we can cover PEFT in a future blog. :-)
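For illustration, a single prompt-completion training pair for the call-center summarization use case mentioned earlier could look like this (the conversation and summary are made up for demonstration):

Prompt: Summarize the following conversation between an agent and a customer.

Agent: Hello, how can I help you today?
Customer: My internet has been down since this morning.
Agent: I am sorry to hear that. Let me restart your connection remotely.

Completion: The customer reported an internet outage and the agent resolved it by remotely restarting the connection.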

Once training is finished, we can evaluate the instructed (trained) model's performance on various metrics and quantify its improvement against the actual answers. We can iterate through the training process until we get the desired result.

By fine-tuning and hosting our own LLMs, we ensure data does not leave our network, and we can scale throughput if needed.

Full Fine tuning Implementation

In this blog we will use Hugging Face, a platform where the machine learning community collaborates on models, datasets, and applications. We will use Hugging Face to download one of the open-source LLM models, FLAN-T5 from Google, and then load it from the local machine. You can easily download the model from Hugging Face by cloning the model repository; this lets you run the code without internet access or in a very constrained environment. Downloading the model can take time depending on your network speed.

git lfs install 
git clone https://huggingface.co/google/flan-t5-large

We will use Python code for illustration purposes. You can use this code in your own applications. You can also refer to the Hugging Face site for more code references. You also need to install the required Python libraries below to run the code.

pip install 'transformers[torch]' 
pip install datasets
pip install evaluate==0.4.0
pip install rouge_score==0.1.2

Implementation Steps

1-) We need to prepare the training data for fine-tuning the model. Hugging Face provides various datasets which we can use for training purposes. In a real scenario we would create the dataset from our own business data. Here we will create a small general-knowledge dataset for training on a question-answering task. We have taken these questions from a general knowledge dataset on Hugging Face. You can create a similar, bigger dataset for your different tasks, like summarization, classification, question answering, etc., with your business data. Hugging Face provides functionality with which you can easily create a Hugging Face dataset from your business data.

Sample Dataset Format

{
  "version": "0.1.0",
  "data": [
    {
      "Question": "Generate a list of three uses of big data?",
      "Answer": "1. Big data can be used to identify patterns and trends in customer behavior.\n2. Big data can be used to improve customer service and experience.\n3. Big data can be used to develop predictive models for marketing and sales."
    },
    {
      "Question": "Predict the weather tomorrow morning?",
      "Answer": "Tomorrow morning is expected to be sunny with temperatures ranging from 15 to 19 degrees Celsius."
    }
  ]
}

Hugging Face provides the Datasets library for easily accessing and sharing datasets. We can load a dataset in a single line of code from multiple sources (the Hugging Face Hub, local file systems, in-memory data, etc.) and in different formats (CSV, JSON, Parquet, Arrow, SQL for reading from a database, etc.), and then use its powerful data-processing methods to quickly get our dataset ready for training an LLM.

from datasets import load_dataset

# Map each split name to its matching file on disk.
data_files = {"train": "dataset/GK/train.json",
              "test": "dataset/GK/test.json",
              "validation": "dataset/GK/validation.json"
              }
dataset = load_dataset("json", data_files=data_files, field="data")
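As a quick sanity check (assuming the JSON files above exist at those paths), we can print the loaded dataset to confirm the splits and columns before training:

# Shows a DatasetDict with train/test/validation splits and their columns.
print(dataset)
# Prints the first Question/Answer pair of the training split.
print(dataset['train'][0])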

2-) Hugging Face provides a tokenizer class which is in charge of preparing the inputs for a model. We will use the open-source FLAN-T5-LARGE model from Hugging Face and load it from the local file system. It is a good encoder-decoder instruct model and shows good capability on many tasks.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

modelPath = "model/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(modelPath)
base_model = AutoModelForSeq2SeqLM.from_pretrained(modelPath)
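As a minimal sketch to verify the tokenizer and model loaded correctly, we can round-trip a sample sentence through the tokenizer:

# Encode a sample question into token ids and decode it back to text.
sample = "Generate a list of three uses of big data?"
token_ids = tokenizer(sample, return_tensors="pt").input_ids
print(token_ids)                                                 # tensor of token ids
print(tokenizer.decode(token_ids[0], skip_special_tokens=True))  # the original text back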

3-) We will create an instructed dataset for the training by converting the question-answer pairs into explicit instructions for the LLM.

Prompt instruction format

Assuming you are working as a General Knowledge instructor. Can you
please answer the below question?

Question : Generate a list of three uses of big data
Answer : 1. Big data can be used to identify patterns and trends in customer behavior.\n2. Big data can be used to improve customer service and experience.\n3. Big data can be used to develop predictive models for marketing and sales.

Create the instruction dataset in the above format:

def prompt_generator(batchData):
    start = 'Assuming you are working as a General Knowledge instructor. Can you please answer the below question?\n\n'
    end = '\n Answer: '
    training_prompt = [start + question + end for question in batchData['Question']]
    # Tokenize prompts and answers; truncation guards against inputs longer than the model's max length.
    batchData['input_ids'] = tokenizer(training_prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    batchData['labels'] = tokenizer(batchData['Answer'], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return batchData

instructed_datasets = dataset.map(prompt_generator, batched=True)
# Drop the raw text columns so only the tokenized model inputs remain.
instructed_datasets = instructed_datasets.remove_columns(['Question', 'Answer'])
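At this point only the tokenized features should remain. An optional check confirms the columns the Trainer will receive:

# After remove_columns, each split should expose only input_ids and labels.
print(instructed_datasets)
print(instructed_datasets['train'][0]['input_ids'][:10])  # first ten token ids of the first example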

4-) We will use the PyTorch framework for fine-tuning the model. The Hugging Face Trainer class provides an API for feature-complete training in PyTorch for most standard use cases. Before instantiating our Trainer object, we will create a TrainingArguments object to access all the points of customization during training. In the code below I am using only one epoch (and a single step) for model training; you can choose the number of epochs and other training parameters based on the compute and memory available and on the final model evaluation result.

from transformers import TrainingArguments, Trainer
import time

output_dir = f'model/trained-model-output/flan-output-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    max_steps=1  # for demonstration only; max_steps overrides num_train_epochs
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=instructed_datasets['train'],
    eval_dataset=instructed_datasets['validation']
)
trainer.train()

5-) After training, we can save the instructed model for future evaluation and inference.

saved_dir = f'./model/trained-model/flan-trained-{str(int(time.time()))}'
tokenizer.save_pretrained(saved_dir)
base_model.save_pretrained(saved_dir)

6-) We can load the saved fine-tuned instructed model from the local file system. The directory name below comes from the timestamped path generated in the previous step.

instruct_model = AutoModelForSeq2SeqLM.from_pretrained("model/trained-model/flan-trained-1692510911")

7-) We can evaluate the model with two approaches: qualitatively, using a human, and quantitatively, with the ROUGE metric.

For generative AI applications, a qualitative approach where we ask ourselves the question "Is my model behaving the right way?" is usually a good starting point. We can answer it by manually comparing the actual answers with the answers given by the instructed model. We can use our test dataset for this evaluation.

from transformers import GenerationConfig
import pandas as pd

questions = dataset['test']['Question']
actual_answers = dataset['test']['Answer']
instruct_model_answers = []

for question in questions:
    prompt = f"""Assuming you are working as a General Knowledge instructor. Can you please answer the below question?

{question}
Answer:"""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_answers.append(instruct_model_text_output)

answers = list(zip(questions, actual_answers, instruct_model_answers))
df = pd.DataFrame(answers, columns=['question', 'actual answer', 'instruct model answer'])
df

8-) The other evaluation approach is quantitative. The ROUGE metric helps quantify the validity of answers produced by models. It compares the model's answers to the actual answers that are part of our test dataset. You can read more about this in the ROUGE metric documentation.

import evaluate

rouge = evaluate.load('rouge')
instruct_model_results = rouge.compute(
    predictions=instruct_model_answers,
    references=actual_answers,
    use_aggregator=True,
    use_stemmer=True
)
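With use_aggregator=True, rouge.compute returns a dictionary of aggregated scores; the exact values depend entirely on your dataset and training:

# Keys are rouge1, rouge2, rougeL, and rougeLsum; each value is a score between 0 and 1.
print(instruct_model_results)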

9-) Training a model is an iterative process, so based on the evaluation result we may need to go through the training process again, tuning the training parameters or changing our datasets.

10-) Once we have finalized the model, we can deploy it to a model hub and serve our clients or applications through an API.
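As a minimal sketch of one deployment option, we can push the trained model and tokenizer to the Hugging Face Hub (assuming you are logged in via huggingface-cli login; the repository name below is hypothetical):

# Uploads the trained model and tokenizer to a Hub repository for serving via API.
instruct_model.push_to_hub("your-username/flan-t5-large-gk-instruct")
tokenizer.push_to_hub("your-username/flan-t5-large-gk-instruct")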

Single-task fine-tuning also makes our architecture modular. We can train multiple smaller models, each specializing in its own task with a different kind of business data. These models can then participate in a bigger context to support multiple use cases.

By fine-tuning and hosting our own models, we can reduce legal concerns about proprietary data being exposed to external APIs.

Catastrophic forgetting

Fine-tuning can increase the performance of the model on a particular task, but on the other hand it may lead to a reduction in ability on other tasks; this is known as catastrophic forgetting. So if we want to use the pre-trained LLM on other tasks and the performance suffers, how do we avoid that?

  1. We can fine-tune the LLM on multiple tasks at once.
  2. We can instead use PEFT (Parameter Efficient Fine-Tuning). We will cover this in a future blog. :-)

Thanks for reading this blog. You can download the source code for this blog from my git repository.

References -

1-) https://www.coursera.org/learn/generative-ai-with-llms/

2-) https://huggingface.co/

3-) https://eugeneyan.com/writing/llm-patterns/
