Fine-tuning Microsoft's phi-2 Small Language Model on the ViGGO Dataset Using LoRA

Nimrita Koul
9 min readJan 13, 2024
Image Source: https://huggingface.co/microsoft/phi-2

phi-2 is a transformer-based small language model from Microsoft. It is available under the MIT License on HuggingFace.

It was trained on 1.4T tokens over 14 days on 96 A100 GPUs. phi-2 is a 2.7-billion-parameter pre-trained Transformer that does not use RLHF or instruction fine-tuning. It does next-token prediction and can be used for text generation in question answering, chat, and code generation.

phi-2 has been demonstrated to outperform many 7B- and 13B-parameter models on multiple benchmarks and tasks such as coding and math.

The reason behind the excellent performance of small language models is the use of distilled, high-quality, “textbook-quality” training data. Small language models rely on knowledge distillation: they are trained on the core, essential knowledge distilled from LLMs, and pruning and quantization techniques are then employed to remove non-essential parts of the model. The training data is often a mixture of synthetic datasets purposefully created to teach the model common-sense reasoning and general knowledge in fields such as science, daily activities, and theory of mind, together with selected web data of high educational value and quality. Small language models also rely on innovative techniques for scaling up.

Next, we will walk through the Python code step by step: first we use phi-2 from HuggingFace for prompting, and then we fine-tune it on the ViGGO dataset. I ran this notebook on the free tier of Google Colab with a T4 GPU.

My code is borrowed from this excellent tutorial by Harper Carrol on GitHub.

1. Install the required libraries
#@title Install required libraries
!pip install accelerate==0.25.0
!pip install bitsandbytes==0.41.1
!pip install datasets==2.14.6
!pip install peft==0.6.2
!pip install transformers==4.36.2
!pip install torch==2.1.0
!pip install einops==0.4.1
!pip install huggingface_hub

2. Required Imports

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline, logging
from datasets import Dataset

3. We will use the CUDA device on the Google Colab free tier (T4 GPU) to run the model

torch.set_default_device("cuda")

4. Create the model and the tokenizer

#create the model object and the corresponding tokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

5. Let us run some prompts and see the model's responses

# https://huggingface.co/microsoft/phi-2
# This prompt is for code completion
# here the prompt is written within the tokenizer()
inputs = tokenizer('''def fibonacci(n):
"""
This function prints the terms in Fibonacci series upto n
"""''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=100)
text = tokenizer.batch_decode(outputs)[0]
print(text)
#https://huggingface.co/microsoft/phi-2
# here a string containing the prompt is defined separately from the tokenizer() and then passed to it
prompt = '''def fibonacci(n):
"""
This function prints the terms in Fibonacci series upto n
"""'''
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=100)
text = tokenizer.batch_decode(outputs)[0]
print(text)
# here we see the output of phi-2 for a question-answering prompt
prompt = 'What is the relevance of mathematics for understanding physics?'
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)

Now we will fine-tune the phi-2 model on the “ViGGO” dataset from HuggingFace.

ViGGO is an English data-to-text generation dataset in the video game domain. The target responses are conversational and are grounded in a structured meaning representation. The dataset has about 5,000 very clean datapoints, so it is well suited to evaluating the transfer-learning, low-resource, or few-shot capabilities of neural models. In this tutorial we fine-tune phi-2 for the reverse direction: producing the meaning representation from a target sentence.

6. Let us set up the accelerator to speed up the training/finetuning

#@title Set up accelerator to speed up the training/finetuning
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

7. Log in to your HuggingFace account using a valid access token. You need a HuggingFace account, after which you can create a free access token.

#@title login to your huggingface account using your access token
# you can find your access token at https://huggingface.co/settings/tokens
from huggingface_hub import notebook_login
notebook_login()
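If you are running this as a plain Python script rather than a notebook, a minimal alternative sketch (assuming you have already created a token at https://huggingface.co/settings/tokens) is to log in programmatically:

from huggingface_hub import login

# the token string below is a placeholder; replace it with your own access token
login(token="hf_your_access_token")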

8. Load the ViGGO dataset

#@title load viggo dataset
from datasets import load_dataset

train_dataset = load_dataset('gem/viggo', split='train')
eval_dataset = load_dataset('gem/viggo', split='validation')
test_dataset = load_dataset('gem/viggo', split='test')
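Each ViGGO datapoint contains a natural-language “target” sentence and its structured “meaning_representation”, the two fields used to build the prompts later. As a quick sanity check (my own addition, not part of the original notebook), you can print one training example:

# Print one training example to see the two fields used for prompting
print(train_dataset[0]["target"])
print(train_dataset[0]["meaning_representation"])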

9. Load base model phi-2

#@title load base model microsoft/phi-2
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling

base_model_id = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

10. In the code cell below, we set up the tokenizer object. The tokenize() function applies the tokenizer to each prompt and creates a “labels” column with the same values as the “input_ids” column. The generate_and_tokenize_prompt() function converts each datapoint into a prompt in a format suitable for phi-2, extracting “target” and “meaning_representation” from the datapoint. Finally, we use the map() function to apply this function to every datapoint in the train and validation datasets.

#@title set up the tokenizer for base model
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False, # needed for now, should be fixed soon
)

#@title setup tokenize function to make labels and input_ids the same for the self-supervised fine-tuning.
def tokenize(prompt):
    result = tokenizer(prompt)
    result["labels"] = result["input_ids"].copy()
    return result

#@title convert each sample into a prompt

def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
{data_point["target"]}

### Meaning representation:
{data_point["meaning_representation"]}
"""
    return tokenize(full_prompt)




#@title Reformat the prompt and tokenize each sample:

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

11. Inputs to a model are often padded to a uniform length controlled by a max_length parameter. To determine a good value, we can look at the distribution of lengths of the tokenized input_ids and set max_length to cover the longest one, as sketched below. In this case the chosen max_length is 320.
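A minimal sketch of that check (my own addition; the variable names follow the code above):

# Look at the token-length distribution of the already-tokenized training prompts
lengths = [len(sample["input_ids"]) for sample in tokenized_train_dataset]
print("longest prompt:", max(lengths))              # around 320 tokens for this dataset
print("average prompt:", sum(lengths) / len(lengths))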

12. Next, we will apply tokenize() again with the max_length parameter set to 320.

max_length = 320 # appropriate max length for this dataset

# redefine the tokenize function and tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False, # needed for now, should be fixed soon
)
tokenizer.pad_token = tokenizer.eos_token


def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result


#@title tokenize train and validation datasets using generate_and_tokenize_prompt function
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

13. Let us use LoRA (Low-Rank Adaptation) to fine-tune phi-2

Low-rank adaptation is a technique for fine-tuning large language models quickly. It freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the transformer architecture, reducing the number of trainable parameters for downstream tasks. It can reduce the number of trainable parameters by up to 10,000 times and GPU memory requirements by about 3 times.
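To see where the savings come from, here is a back-of-the-envelope sketch for a single weight matrix (the dimensions are illustrative, not phi-2's actual layer shapes):

import torch

d, k, r = 2560, 2560, 8          # illustrative layer size and LoRA rank
W = torch.zeros(d, k)            # frozen pre-trained weight
A = torch.zeros(r, k)            # trainable low-rank factor
B = torch.zeros(d, r)            # trainable low-rank factor
# LoRA learns the update delta_W = B @ A instead of updating W itself
print(W.numel())                 # 6,553,600 frozen parameters
print(A.numel() + B.numel())     # 40,960 trainable parameters (about 160x fewer)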

To fine-tune a model using LoRA, you need to:

  1. Instantiate a base model.
  2. Create a configuration (LoraConfig) where you define LoRA-specific parameters.
  3. Wrap the base model with get_peft_model() to get a trainable PeftModel.
  4. Train the PeftModel as you normally would train the base model.

LoraConfig allows you to control how LoRA is applied to the base model through the following parameters:

  • r: the rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.
  • target_modules: The modules (for example, attention blocks) to apply the LoRA update matrices.
  • lora_alpha: the LoRA scaling factor.
  • bias: Specifies if the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.
  • modules_to_save: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include model’s custom head that is randomly initialized for the fine-tuning task.
  • layers_to_transform: List of layers to be transformed by LoRA. If not specified, all layers in target_modules are transformed.
  • layers_pattern: Pattern to match layer names in target_modules, if layers_to_transform is specified. By default PeftModel will look at common layer pattern (layers, h, blocks, etc.), use it for exotic and custom models.
  • rank_pattern: a mapping from layer names or regex expressions to ranks that differ from the default rank specified by r.
  • alpha_pattern: a mapping from layer names or regex expressions to alphas that differ from the default alpha specified by lora_alpha.

We will apply LoRA to the layers Wqkv, fc1, fc2 of the model.
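If you want to confirm that these module names exist in the checkpoint you loaded (they can differ between model revisions), a quick inspection sketch:

# Print the submodule names that match the intended LoRA target modules
for name, _ in model.named_modules():
    if any(key in name for key in ("Wqkv", "fc1", "fc2")):
        print(name)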

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "Wqkv",
        "fc1",
        "fc2",
    ],
    bias="none",
    lora_dropout=0.05, # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)


# Apply the accelerator to the model for faster training.
model = accelerator.prepare_model(model)
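On a single-GPU Colab run the object returned by prepare_model() is still the PEFT-wrapped model, so you can verify how few parameters LoRA actually trains:

# Report trainable (LoRA) parameters versus the total parameter count
model.print_trainable_parameters()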

14. Finetuning/training the model using LoRA

You will need to set the training arguments or configuration parameters, such as the output directory in which your model should be saved. I am pushing my fine-tuned model to my HuggingFace account; you can also save the fine-tuned model to a local or Colab directory.

Other training arguments include warmup_steps, per_device_train_batch_size, gradient_accumulation_steps, max_steps, learning_rate, logging_steps, optim, logging_dir, save_strategy, save_steps, evaluation_strategy, eval_steps, do_eval, push_to_hub, report_to, run_name, etc.

max_steps determines the maximum number of training steps to perform; the larger it is, the more fine-tuned your model becomes and the longer training takes. With max_steps = 1000, training took me about 90 minutes on free-tier Google Colab. The learning rate also influences training time.

#Train the model and push each checkpoint to HuggingFace
import transformers


tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir="./phi2-finetunedonviggodataset",
        warmup_steps=5,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=500,
        learning_rate=2.5e-5,
        logging_steps=50,
        optim="paged_adamw_8bit",
        logging_dir="./logs", # Directory for storing logs
        save_strategy="steps", # Save the model checkpoint every logging step
        save_steps=50, # Save checkpoints every 50 steps
        evaluation_strategy="steps", # Evaluate the model every logging step
        eval_steps=50, # Evaluate and save checkpoints every 50 steps
        do_eval=True, # Perform evaluation at the end of training
        push_to_hub=True,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
trainer.train()

Now you have fine-tuned phi-2 on the ViGGO dataset, and the result is saved in output_dir or in your HuggingFace account.
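If you also want a local copy of the LoRA adapter (an optional step, not in the original notebook; the directory name is just an example), you can save it explicitly:

# Save the trained LoRA adapter weights and the tokenizer to a local directory
trainer.model.save_pretrained("./phi2-viggo-lora-adapter")
tokenizer.save_pretrained("./phi2-viggo-lora-adapter")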

15. Next, we will compare the response to a sample prompt from the base model (with no fine-tuning) and from the fine-tuned model (the one you trained above).

#Load the base model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/phi-2"

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False,
)

#create a sample prompt for evaluation on base model
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

# tokenize the above prompt and generate the response from base model
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to('cuda')
base_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(base_model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

16. Now let us load the fine-tuned model from my HuggingFace account and test the same prompt on it.

from peft import PeftModel
ft_model = PeftModel.from_pretrained(base_model, "nimrita/phi2-finetunedonviggodataset", force_download=True)


eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']

### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?

### Meaning representation:
"""

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to('cuda')
ft_model = ft_model.to('cuda')
ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


Hurray, you have just fine-tuned phi-2.

References:

[1]. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

[2]. https://huggingface.co/microsoft/phi-2

[3]. https://github.com/brevdev/notebooks/blob/main/phi2-finetune.ipynb

[4]. https://www.theaidream.com/post/exploring-phi-2-microsoft-s-latest-small-language-model-slm
