LLM — Finetuning

Finetuning doesn’t have to be a mystery anymore! In this article, I’ve created a simple notebook that breaks down the process in an easy-to-understand way.

Pelin Balci
9 min read · Sep 17, 2023

This article explains a small fine-tuning case by walking through the notebook code.

DeepLearning.AI provides many short courses, and they are extremely time-efficient because each one addresses a very specific topic. I’ve completed the course Finetuning Large Language Models. The code in this article is adapted from that short course. You may find the notebook here.

Let’s start! 🤩

What do we need?

  • 🍕Dataset
  • 🍕Model
  • 🍕Required libraries: pandas, numpy, datasets, transformers, torch (see the install cell right after this list)
  • 🍕System requirement: CPU is enough for now.
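If any of these are missing from your environment, one notebook cell like the following should be enough (evaluate is included because we will use it for BLEU scoring at the end; exact versions are up to you):

# !pip install pandas numpy datasets transformers torch evaluate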

What are the steps?

  • 🎃Collect prompt completion pairs and create a JSONL file.
  • 🎃Use the AutoTokenizer class from the transformers library to load the tokenizer.
  • 🎃Tokenize data; apply pad and truncate.
  • 🎃Apply train test split.
  • 🎃Use the AutoModelForCausalLM class from the transformers library to load the model.
  • 🎃Check what the base model outputs for one example from your dataset.
  • 🎃Define TrainingArguments, and Trainer, then train the model.
  • 🎃Make predictions from the fine-tuned model.
  • 🎃Evaluate the results.

Data Preparation

I used ChatGPT and Bard to generate 30 questions and answers about gender equality. I want to fine-tune a model to learn this dataset.

import pandas as pd
import numpy as np

df = pd.read_excel("data_30.xlsx")
df.head()
   prompt                                              completion
0  What is gender equality?                            Gender equality refers to the equal rights, re...
1  Why is gender equality important in the workpl...   Gender equality in the workplace is crucial be...
2  How does gender equality benefit society?           Gender equality benefits society by promoting ...
3  What are some common misconceptions about gend...   Some common misconceptions about gender equali...
4  How can education play a role in promoting gen...   Education is a powerful tool for promoting gen...

Turn it into JSONL format:

# Define the output JSONL file name
filename = 'output.jsonl'

# Iterate through the rows and write each row as a JSON object to the JSONL file
with open(filename, 'w') as jsonl_file:
    for _, row in df.iterrows():
        json_data = row.to_json(orient='columns')
        jsonl_file.write(json_data + '\n')
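Each line of output.jsonl now holds one prompt–completion pair as a single JSON object. For the first row shown above, the line looks roughly like this (completion truncated here):

{"prompt":"What is gender equality?","completion":"Gender equality refers to the equal rights, re..."}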

Tokenization

First, we need to get the tokenizer. We will use the EleutherAI/pythia-70m model from the Hugging Face Hub, loaded via the transformers library.

  • We will concatenate the prompt and completion pairs (or question–answer or input–output pairs) into a single text.
  • We will tokenize the “text” and return NumPy arrays (return_tensors="np"). Padding is enabled, which means shorter sequences are padded at the end with the pad token (here set to the EOS token).
  • We will compute the max_length variable (if the maximum token count in a sequence is more than 2048, we will use 2048). Then we will truncate the tokenized text to max_length tokens.

A “labels” column (a copy of input_ids) is added so that the dataset fits the format the Hugging Face Trainer expects for computing the causal language modeling loss.

import datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

def tokenize_function(examples):
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    elif "prompt" in examples and "completion" in examples:  # our dataset
        text = examples["prompt"][0] + examples["completion"][0]
    else:
        text = examples["text"][0]

    # Pad short sequences with the EOS token
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Find the length after padding, capped at 2048
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )

    # Truncate (from the left) if the sequence is longer than max_length
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

finetuning_dataset_loaded = datasets.load_dataset("json", data_files=filename, split="train")

tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])

Let’s examine the tokenized_dataset:

print(tokenized_dataset["prompt"][0])

# 'What is gender equality?'

print(tokenized_dataset["completion"][0])
# 'Gender equality refers to the equal rights, responsibilities, and opportunities of all individuals, regardless of their gender. It implies that the interests, needs, and priorities of both women and men are taken into consideration, recognizing the diversity of different groups of women and men.'

print(tokenized_dataset["input_ids"][0])

# [1276, 310, 8645, 13919, 32, 40945, 13919, 10770, 281, 253, 4503, 3570, 13, 19715, 13, 285, 9091, 273, 512, 4292, 13, 10159, 273, 616, 8645, 15, 733, 8018, 326, 253, 6284, 13, 3198, 13, 285, 23971, 273, 1097, 2255, 285, 1821, 403, 2668, 715, 8180, 13, 26182, 253, 9991, 273, 1027, 2390, 273, 2255, 285, 1821, 15]
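As a quick sanity check (my own addition, not in the course notebook), you can decode the input IDs back into text and confirm that the prompt and completion were simply concatenated:

print(tokenizer.decode(tokenized_dataset["input_ids"][0]))
# Expected to start with: 'What is gender equality?Gender equality refers to the equal rights, ...'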

Train test split

As I mentioned earlier, tokenized_dataset is a Dataset object, so we can call its train_test_split method directly. Here, the test size is 0.1 and we shuffle the data.

split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(split_dataset)
DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 27
    })
    test: Dataset({
        features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3
    })
})

Before Training

Let’s look at what the base model produces. I loaded the model with AutoModelForCausalLM and then simply moved it to the available device.

import torch
from transformers import AutoModelForCausalLM

model_name = "EleutherAI/pythia-70m"
base_model = AutoModelForCausalLM.from_pretrained(model_name)

device_count = torch.cuda.device_count()
if device_count > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

base_model.to(device)
print(device)

I want to see an answer for one prompt from the test dataset. After preparing the tokenized input, we will get the prediction via base_model.generate().

test_text = test_dataset[0]['prompt']
max_input_tokens = 1000
max_output_tokens = 100

# Tokenize
input_ids = tokenizer.encode(
    test_text,
    return_tensors="pt",
    truncation=True,
    max_length=max_input_tokens
)

# Generate
device = base_model.device
generated_tokens_with_prompt = base_model.generate(input_ids=input_ids.to(device), max_length=max_output_tokens)

This gives us tokenized output, which we need to decode via tokenizer.batch_decode:

# Decode
generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

The prediction includes both the prompt and the completion. We need only the completion part.

# Strip the prompt
generated_text_answer = generated_text_with_prompt[0][len(test_text):]

Here is the actual answer and the prediction:

Question input (test): How does gender equality benefit society?


Correct answer from docs: Gender equality benefits society by promoting social cohesion, economic growth, and sustainable development. When both men and women have equal opportunities to contribute, societies can tap into a broader range of talents, ideas, and perspectives, leading to more comprehensive solutions to complex challenges.


Model's answer:


A:

The only way to get rid of this is to use the "gender equality" option.

A:

The only way to get rid of this is to use the "gender equality" option.

A:

You can use the "gender equality" option.

A:

You can use the "gender equality" option.

A:

You can use the "gender equality

As you see, the model’s answer is not very good. \(〇_o)/

Training

:) Finally!😎

If you have used the Transformers library before, you are probably already used to TrainingArguments and Trainer.

from transformers import TrainingArguments, Trainer

# Maximum number of training steps (each step is one batch)
max_steps = 100

# Save the model to this directory
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name
save_dir = f'{output_dir}/final'

training_args = TrainingArguments(

    # Learning rate
    learning_rate=1.0e-5,

    # Number of training epochs
    num_train_epochs=1,

    # Max steps to train for (each step is a batch of data)
    # Overrides num_train_epochs, if not -1
    max_steps=max_steps,

    # Batch size for training
    per_device_train_batch_size=1,

    # Directory to save model checkpoints
    output_dir=output_dir,

    # Other arguments
    overwrite_output_dir=False,  # Overwrite the content of the output directory
    disable_tqdm=False,  # Disable progress bars
    eval_steps=120,  # Number of update steps between two evaluations
    save_steps=120,  # Number of steps between model saves
    warmup_steps=1,  # Number of warmup steps for the learning rate scheduler
    per_device_eval_batch_size=1,  # Batch size for evaluation
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_strategy="steps",
    logging_steps=1,
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=False,

    # Parameters for early stopping
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)



trainer = Trainer(
    model=base_model,
    # model_flops=model_flops,
    # total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)


training_output = trainer.train()

trainer.save_model(save_dir)
print("Saved model to:", save_dir)

Is that all?

Yes! And Congratulations! You have fine-tuned your first model! 🎉✨🎀

Use Fine-Tuned Model and Get Predictions

device_count = torch.cuda.device_count()
if device_count > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
finetuned_slightly_model.to(device)

You can use finetuned_slightly_model to make predictions.

def generate_output(test_question, model):

    # Tokenize
    input_ids = tokenizer.encode(
        test_question,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens
    )

    # Generate
    device = model.device
    generated_tokens_with_prompt = model.generate(input_ids=input_ids.to(device), max_length=max_output_tokens)

    # Decode
    generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

    # Strip the prompt
    generated_text_answer = generated_text_with_prompt[0][len(test_question):]
    return generated_text_answer

This is the same code we’ve seen before, just wrapped in a function. Let’s look at one prompt from the test dataset, the actual answer, the base model’s (before fine-tuning) answer, and the fine-tuned model’s answer.

test_q = test_dataset[2]['prompt']
completion_q = test_dataset[2]['completion']
predicted_text = generate_output(test_q, finetuned_slightly_model)
base_predicted_text = generate_output(test_q, base_model)

print('Question:')
print(test_q)
print("--------------------------------------")
print('Actual Completion:')
print(completion_q)
print("--------------------------------------")
print('Fine-tuned prediction')
print(predicted_text)
print("--------------------------------------")
print('Base prediction:')
print(base_predicted_text)
print("--------------------------------------")
Question:
What role do tech companies play in promoting gender equality within the industry?
--------------------------------------
Actual Completion:
Tech companies play a crucial role in shaping industry norms. By implementing inclusive hiring practices, offering mentorship programs, and promoting women in leadership roles, they can set a standard for gender equality. Additionally, by addressing workplace cultures that may perpetuate bias, tech companies can foster more inclusive environments.
--------------------------------------
Fine-tuned prediction
Research articles on tech companies in terms of gender equality found that more women are gynaecologists and engineers than ever before. In 2022, tech companies in 2023 reached the top of the female lead in women's tech. And just like most tech companies, tech companies in terms of gender equality are still finding ways to make sure they have the resources they need to fight for gender equality. And just like most tech companies, tech
--------------------------------------
Base prediction:


The research is being conducted in the UK, with the aim of helping to understand the impact of gender equality on the UK’s economy.

The research is being conducted in the UK, with the aim of helping to understand the impact of gender equality on the UK’s economy.

The research is being conducted in the UK, with the aim of helping to understand the impact of gender equality on the UK’
--------------------------------------

Okay, my first impression is that the fine-tuned model’s answer ends in the middle of a sentence. This is likely due to the maximum number of output tokens (max_output_tokens = 100).

How can we improve the performance?

  • 🔎We can prepare a large dataset. (Our dataset has only 30 rows)
  • 🔎We can use larger models. (This one has 70M parameters)
  • 🔎We can change the hyperparameters during training.
  • 🔎We can prepare the dataset with an explicit <EOS> token at the end of each completion (this helped me in another fine-tuning run on GPT-3.5).
  • 🔎We can increase the maximum number of output tokens (see the sketch after this list).
  • 🔎We can train the model for more epochs.
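For the <EOS> and maximum-token points, here is a minimal sketch (my own illustration using the same variable names as above, not from the course notebook): max_new_tokens counts only newly generated tokens, so the prompt no longer eats into the budget the way it does with max_length, and appending the tokenizer's EOS token to each training text gives the model an explicit stop signal.

# Give the model more room to finish its answer:
# max_new_tokens counts only the generated tokens (unlike max_length).
generated_tokens = finetuned_slightly_model.generate(
    input_ids=input_ids.to(finetuned_slightly_model.device),
    max_new_tokens=200,
)

# When building the training text in tokenize_function, mark the end of each completion:
text = examples["prompt"][0] + examples["completion"][0] + tokenizer.eos_token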

Evaluation

For detailed information, please take a look at my previous post here.

First, let’s make predictions for all test data:

tuned_predicted_text_list = []
actual_test_list = []
base_predicted_text_list = []

for i in range(len(test_dataset)):
    # get prompt
    test_q = test_dataset[i]['prompt']
    # get completion
    completion_q = test_dataset[i]['completion']
    # predictions
    predicted_text = generate_output(test_q, finetuned_slightly_model)
    base_predicted_text = generate_output(test_q, base_model)
    # collect
    actual_test_list.append(completion_q)
    tuned_predicted_text_list.append(predicted_text)
    base_predicted_text_list.append(base_predicted_text)
# !pip install evaluate

import evaluate
bleu = evaluate.load("bleu")

results = bleu.compute(predictions=base_predicted_text_list, references=actual_test_list)
print("Base Model Predictions Results")
print(results)

results = bleu.compute(predictions=tuned_predicted_text_list, references=actual_test_list)
print("Fine-tuned Model Predictions Results")
print(results)

Here are the results:

Base Model Predictions Results:

BLEU Score: 0.0
Precisions: [0.1228, 0.0044, 0.0, 0.0]
Brevity Penalty: 1.0
Length Ratio: 1.5724
Translation Length: 228
Reference Length: 145

Fine-Tuned Model Predictions Results:

BLEU Score: 0.0226
Precisions: [0.1934, 0.025, 0.0127, 0.0043]
Brevity Penalty: 1.0
Length Ratio: 1.6759
Translation Length: 243
Reference Length: 145

If you are having a hard time interpreting the results, don’t hesitate to use ChatGPT (like me! (. ❛ ᴗ ❛.)). The interpretations below are from ChatGPT. 🐥

  1. 🥇BLEU Score:
  • Base Model: The base model has a BLEU score of 0.0. Its 3-gram and 4-gram precisions are zero, so the geometric mean (and therefore the BLEU score) collapses to zero, even though a few unigrams and bigrams overlap with the reference text.
  • Fine-Tuned Model: The fine-tuned model has a BLEU score of 0.0226, which is higher than the base model. While still relatively low, it suggests some improvement in matching n-grams between the generated text and the reference text compared to the base model.
  2. 🥇Precisions:
  • Base Model: The base model has low precisions for all n-gram lengths, with the highest precision being 0.1228 for unigrams (single words). This further confirms that the generated text from the base model does not closely match the reference text in terms of content.
  • Fine-Tuned Model: The fine-tuned model has higher precisions for all n-gram lengths compared to the base model, with the highest precision being 0.1934 for unigrams. This indicates an improvement in content similarity between the generated text and the reference text after fine-tuning.
  3. 🥇Brevity Penalty: The brevity penalty penalizes generated text that is shorter than the reference. In both cases the brevity penalty is 1.0, meaning neither model’s output is shorter than the reference, so no penalty is applied.
  4. 🥇Length Ratio: The length ratio is the length of the generated text divided by the length of the reference text (see the sketch after this list).
  • Base Model: The base model has a length ratio of 1.5724, indicating that its generated text is about 57% longer than the reference text.
  • Fine-Tuned Model: The fine-tuned model has a slightly higher length ratio of 1.6759, so it produces slightly longer text than the base model relative to the reference.
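To make the last two statistics concrete, here is a minimal sketch of how the length ratio and brevity penalty are computed under the standard BLEU definition, plugged with the corpus lengths reported above (my own illustration, not part of the course notebook):

import math

# Corpus-level lengths reported by the BLEU metric for the fine-tuned model
translation_length = 243  # total tokens across all predictions
reference_length = 145    # total tokens across all references

# Length ratio: how much longer (or shorter) the predictions are than the references
length_ratio = translation_length / reference_length
print(length_ratio)  # ~1.6759

# Brevity penalty: only applied when the predictions are shorter than the references
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)
print(brevity_penalty)  # 1.0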

I hope that this article and my notebook will help you to understand finetuning better. Please let me know if you have any questions.

Happy learning! 😍
