Fine-Tuning a Large Language Model
Large Language Models (LLMs) are advanced artificial intelligence models designed for natural language understanding and generation. They are built upon deep learning architectures, such as transformers, and are trained on massive datasets to acquire knowledge and language proficiency.
Training Large Language Models (LLMs) like GPT-3 or similar architectures is a complex and resource-intensive process that comes with several challenges and difficulties:
- Massive Computational Resources: Training LLMs requires access to powerful hardware, such as GPUs or TPUs, and often distributed computing clusters. These resources are expensive and may not be readily available to all researchers or organizations.
- Enormous Data: LLMs are pretrained on massive datasets, often comprising terabytes of text from the internet. Collecting and pre-processing this data is a formidable, time-consuming task.
- Time-Consuming Training: Training LLMs can take weeks or even months, depending on the model’s size and complexity. This extended timeline can be a significant hurdle.
- Energy Consumption: Training large models consumes a significant amount of electricity, contributing to environmental concerns.
Only a few companies have had the vast amounts of data and compute required to train a foundational LLM from scratch. Fortunately for the huge community of data scientists, researchers, and students, some of these foundational language models, which serve as a starting point for fine-tuning on specific tasks or applications, have been made available in the open-source community. These models are pretrained on large datasets and released to the public for research and development purposes, e.g. BERT, GPT, T5, FLAN-T5, etc.
In this article we will use a FLAN-T5 model and fine-tune it for the downstream task of producing dialogue summaries.
- FLAN stands for “Fine-tuned Language Net”
- T-5 stands for “Text-To-Text Transfer Transformer”
The FLAN-T5 model is an encoder-decoder model that has been pre-trained on a multi-task mixture of unsupervised and supervised tasks, where each task is converted into a text-to-text format.
During the training phase, FLAN-T5 was fed a large corpus of text data and trained to predict missing words in an input text via a fill-in-the-blank-style objective. This process is repeated many times until the model has learned to generate text that is consistent with the input data.
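To make the fill-in-the-blank idea concrete, here is an illustrative (hypothetical) input/target pair in the style of T5's span-corruption objective, where masked spans are replaced by sentinel tokens; it is not taken from the actual training data.
# Illustrative span-corruption example: the model sees the corrupted input
# and learns to generate the text of the masked spans as the target.
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week."
target_output = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
print(corrupted_input)
print(target_output)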
Read more on FLAN-T5 here : https://exemplary.ai/blog/flan-t5
We will be using the DialogSum dialogue summarization dataset from Hugging Face to fine-tune the FLAN-T5 model.
Read more on Dialogue Sum dataset here: https://huggingface.co/datasets/knkarthick/dialogsum
Loading some important Libraries:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig
Loading the Dataset from Hugging-Face:
data_name = "knkarthick/dialogsum"
dataset = load_dataset(data_name)
print('Dialogue:\n')
print(dataset['train'][0]['dialogue'] + '\n')
print('Summary:\n')
print(dataset['train'][0]['summary'] + '\n')
print('Topic:\n')
print(dataset['train'][0]['topic'] + '\n')
Loading the FLAN-T5 Base Model:
Let's load the flan-t5-base model from Hugging Face along with its tokenizer.
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
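As an optional sanity check (not part of the original walkthrough), you can count the model's parameters; flan-t5-base has a few hundred million of them, which is why full fine-tuning is expensive.
# Optional: count the total number of parameters in the loaded model.
total_params = sum(p.numel() for p in model.parameters())
print(f'Total parameters: {total_params / 1e6:.1f}M')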
text = "Hello, How are you ? How is your Family ?"
encoded_text = tokenizer(text, return_tensors='pt')
decoded_text = tokenizer.decode(
    encoded_text["input_ids"][0],
    skip_special_tokens=True  # if False, the decoded text ends with the </s> end-of-sequence token
)
print('ENCODED SENTENCE:')
print(encoded_text["input_ids"][0])
print()
print('DECODED SENTENCE:')
print(decoded_text)
Inference with the LLM using Zero-Shot and Few-Shot Prompting:
A few inference techniques are widely used with LLMs; they are described below:
Zero-Shot Learning: There is no labelled data present, so there is no need for additional training. It allows a pre-trained LLM to generate responses to tasks that it hasn't been specifically trained for.
Few-Shot Learning: The model is shown only a few examples of the new task, typically directly in the prompt. This is useful when only limited labelled data is available.
Read more on this here: https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/
dialogue = dataset['test'][1]['dialogue']
summary = dataset['test'][1]['summary']
print('Dialogue:\n')
print(dialogue+'\n')
print('Human Summary:\n')
print(summary+'\n')
input_with_prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
inputs = tokenizer(input_with_prompt, return_tensors='pt')
model_output = model.generate(
    inputs["input_ids"],  # move the model and the inputs to the same device (e.g. 'cuda') if a GPU is available
    max_new_tokens=50,
)
output = tokenizer.decode(
    model_output[0],
    skip_special_tokens=True
)
print('Model Summary with Zero-Shot Prompt:\n')
print(output)
def create_custom_prompt(dialogues, summaries, inference_dialogue):
    assert len(dialogues) == len(summaries)
    prompt = ''
    # Add each example dialogue together with its reference summary.
    for index, dialogue in enumerate(dialogues):
        prompt += f"""
Summarize the following conversation.
{dialogue}
Summary:
{summaries[index]}
"""
    # Finally, append the dialogue we actually want the model to summarize.
    prompt += f"""
Summarize the following conversation.
{inference_dialogue}
Summary:
"""
    return prompt
multi_shot_prompt = create_custom_prompt(
    [dataset['train'][0]['dialogue'], dataset['train'][1]['dialogue'], dataset['train'][2]['dialogue']],
    [dataset['train'][0]['summary'], dataset['train'][1]['summary'], dataset['train'][2]['summary']],
    dataset['test'][1]['dialogue']
)
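Optionally, print the constructed prompt to verify that the three worked examples and the final unanswered dialogue appear in the expected order:
print(multi_shot_prompt)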
inputs = tokenizer(multi_shot_prompt, return_tensors='pt')
model_output = model.generate(
    inputs["input_ids"],  # as before, keep the model and the inputs on the same device
    max_new_tokens=50,
)
output = tokenizer.decode(
    model_output[0],
    skip_special_tokens=True
)
print('Model Summary with Few-Shot Prompt:\n')
print(output)
This gives the model a few example dialogue summaries in the prompt before making the inference. Few-shot inference generally improves the quality of the generated text, but beyond 4-5 shots the performance and quality of the generated text may stop improving, and further fine-tuning may be required.
Fine Tuning of LLMs:
Fine-tuning and parameter-efficient fine-tuning are two approaches used in machine learning to improve the performance of pre-trained models on a specific task.
Fine-tuning is taking a pre-trained model and training it further on a new task with new data. The entire pre-trained model is usually trained in fine-tuning, including all its layers and parameters. This process can be computationally expensive and time-consuming, especially for large models.
On the other hand, parameter-efficient fine-tuning is a method of fine-tuning that focuses on training only a subset of the pre-trained model's parameters. This approach involves identifying the most important parameters for the new task and only updating those parameters during training. By doing so, PEFT can significantly reduce the computation required for fine-tuning.
Read more on this here : https://www.leewayhertz.com/parameter-efficient-fine-tuning/#:~:text=Better%20performance%20in%20low%2Ddata,checkpoints%20of%20full%20fine%2Dtuning.
Full Fine-Tuning :
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np
# Tokenize the Dataset:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example
# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])
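As a quick, optional check that the mapping worked, you can inspect the resulting splits; each example should now contain only the tokenized input_ids and labels.
# Quick sanity check on the tokenized dataset.
print(tokenized_datasets)
print(tokenized_datasets['train'][0].keys())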
output_dir = './dialogue-summary-training'
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # kept at 1 purely for demonstration; increase for a real training run
)
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)
trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)
# trainer.train()  # not executed here; see the note below
Running trainer.train() for full fine-tuning would require a huge amount of compute and training time, so instead we load an already fine-tuned checkpoint and refer to it as the instruct model.
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("./flan_tuned_model", torch_dtype=torch.bfloat16)
Perform Parameter-Efficient Fine-Tuning (PEFT):
PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM — on the order of a single-digit % of the original LLM size (MBs vs GBs).
That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request. The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
Read more on this here : https://arxiv.org/pdf/2205.05638.pdf
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32,                        # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q", "v"],   # apply LoRA to the query and value projection layers
    lora_dropout=0.10,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
peft_model = get_peft_model(original_model, lora_config)
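Before calling peft_trainer.train() below, the PEFT trainer itself needs to be constructed. A minimal sketch, reusing the Trainer API in the same way as the full fine-tuning section; the output directory and learning rate here are illustrative choices, not values from the original article:
# Sanity check: only the LoRA adapter weights are trainable (a small fraction of the model).
peft_model.print_trainable_parameters()

peft_training_args = TrainingArguments(
    output_dir='./peft-dialogue-summary-training',  # illustrative path
    learning_rate=1e-3,                             # illustrative; LoRA often tolerates a higher learning rate
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1                                     # kept tiny for demonstration, as in the full fine-tuning run
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train'],
)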
peft_trainer.train()
peft_model_path="./peft-model"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)
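As noted earlier, at inference time the saved LoRA adapter is loaded back on top of the (unchanged) base model. A minimal sketch of how this could be done with the peft library, reusing the paths from above:
from peft import PeftModel

# Load a fresh copy of the base model and attach the saved LoRA adapter to it.
base_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base', torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base_model, peft_model_path, is_trainable=False)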
Evaluate the Performance of Full Fine-Tuning and PEFT:
We can use many metrics to evaluate a Large Language Model. One of these techniques is ROUGE (https://huggingface.co/spaces/evaluate-metric/rouge ).
We will be using the Evaluate library from Hugging Face to compute ROUGE scores: https://huggingface.co/docs/evaluate/index
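As a quick illustration of the Evaluate API before scoring the models, here is a toy example (the sentences are made up, not from the dataset):
rouge = evaluate.load('rouge')
toy_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat was sitting on the mat"],
)
print(toy_scores)  # ROUGE-1/2/L F-measures between 0 and 1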
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.
{dialogue}
Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
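Optionally, inspect the side-by-side summaries before computing the scores:
print(df.head())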
rouge = evaluate.load('rouge')
original_model_results = rouge.compute(
predictions=original_model_summaries,
references=human_baseline_summaries[0:len(original_model_summaries)],
use_aggregator=True,
use_stemmer=True,
)
instruct_model_results = rouge.compute(
predictions=instruct_model_summaries,
references=human_baseline_summaries[0:len(instruct_model_summaries)],
use_aggregator=True,
use_stemmer=True,
)
peft_model_results = rouge.compute(
predictions=peft_model_summaries,
references=human_baseline_summaries[0:len(peft_model_summaries)],
use_aggregator=True,
use_stemmer=True,
)
print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)
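To quantify the difference, you can compare the aggregated scores directly. A small sketch using the result dictionaries above (numpy is already imported):
# Absolute difference (in percentage points) between the PEFT model and the original model.
improvement = (np.array(list(peft_model_results.values()))
               - np.array(list(original_model_results.values())))
for metric, delta in zip(peft_model_results.keys(), improvement):
    print(f'{metric}: {delta * 100:.2f}%')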
You will observe that full fine-tuning performs better than the PEFT technique here. However, PEFT training requires much less compute and memory (often just a single GPU).
So, we have seen how to fine-tune a foundational model for a downstream task such as dialogue summarization, and we learnt about some inference techniques and fine-tuning processes along the way. You can further adhere to Responsible AI by testing your fine-tuned model for bias, toxicity, etc., and by using AI-in-the-loop approaches and reinforcement learning to help your model produce more robust and responsible responses.
Note: This blog takes reference from the Generative AI course by Andrew Ng and team.