Finetune Small Language Model (SLM) using Azure Machine Learning

Manoranjan Rajguru
9 min read · Mar 9, 2024


Motivations for Small Language Models

· Efficiency: SLMs are computationally more efficient, requiring less memory and storage, and can operate faster due to fewer parameters to process.

· Cost: Training and deploying SLMs is less expensive, making them accessible to a wider range of businesses and suitable for applications in edge computing.

· Customizability: SLMs are more adaptable to specialized applications and can be fine-tuned for specific tasks more readily than larger models.

· Under-Explored Potential: While large models have shown clear benefits, the potential of smaller models trained on larger datasets has been less explored. SLMs aim to show that smaller models can achieve high performance when trained with enough data.

· Inference Efficiency: Smaller models are often more efficient during inference, which is a critical aspect when deploying models in real-world applications with resource constraints. This efficiency includes faster response times and reduced computational and energy costs.

· Accessibility for Research: Because they are often open-source and small, SLMs are accessible to a broader range of researchers who may not have the resources to work with larger models. They provide a platform for experimentation and innovation in language model research without requiring extensive computational resources.

· Advancements in Architecture and Optimization: SLMs often incorporate architectural and speed optimizations to improve computational efficiency. These enhancements allow them to train faster and with less memory, making it feasible to train on commonly available GPUs.

· Open-Source Contribution: Authors of SLMs often make the model checkpoints and code publicly available, contributing to the open-source community and enabling further advancements and applications by others.

· End-User Applications: With strong performance and compact size, SLMs are suitable for end-user applications, potentially even on mobile devices, providing a lightweight platform for a wide range of applications.

· Training Data and Process: SLM training processes are often designed to be effective and reproducible, using a mixture of natural language data and code data, aiming to make pre-training accessible and transparent.

Phi-2 (Microsoft Research)

Phi-2 is the successor of Phi-1.5, the large language model (LLM) created by Microsoft.

To improve over Phi-1.5, in addition to doubling the number of parameters to 2.7 billion, Microsoft also extended the training data. Phi-2 outperforms Phi-1.5 and LLMs that are 25 times larger on several public benchmarks even though it is not aligned/fine-tuned. This is just a pre-trained model for research purposes only (non-commercial, non-revenue generating).

Forget about the exorbitant fees of larger language models. Phi-2 runs efficiently on even modest hardware, democratizing access to cutting-edge AI for startups and smaller businesses. No more sky-high cloud bills, just smart, affordable solutions on your own terms.

In this example, we are going to learn how to fine-tune Phi-2 using QLoRA, an efficient method for fine-tuning quantized LLMs.

QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
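
To get a feel for why quantization matters, here is a rough back-of-the-envelope sketch of the memory needed just to hold the model weights at different precisions. The helper name `weight_memory_gb` is made up for this illustration, and the estimate deliberately ignores activations, optimizer state, adapter weights and quantization constants, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI2_PARAMS = 2.7e9  # parameter count quoted above for Phi-2

for bits in (16, 8, 4):
    # Weights only: no activations, gradients or optimizer state
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(PHI2_PARAMS, bits):.2f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight memory by a factor of four, which is what leaves room for the adapters and the training state on a single GPU.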

Step 1

Let's prepare the dataset. In this case we are going to download the Dolly 15k dataset.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011

Let's take a subset of the dataset to create training and test examples. To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function (formatting_func in the training script) that takes a sample and returns a string in our instruction format.

lst_train = []
lst_eval = []
for i, item in enumerate(dataset):
    if i < 7500:
        # The first 7,500 records go to the training set
        lst_train.append({"instruction": item["instruction"], "response": item["response"]})
    elif i < 15000:
        # The next 7,500 records (indices 7500 to 14999) go to the evaluation set
        lst_eval.append({"instruction": item["instruction"], "response": item["response"]})
    else:
        break
len(lst_eval)
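
To make the instruction format concrete, here is a small sketch of the formatting function applied to one record. The function mirrors the formatting_func used later in the training script; the sample record is made up for illustration. Note that the records built above keep only "instruction" and "response", so the optional "### Context" section is simply skipped.

```python
def formatting_func(sample: dict) -> str:
    """Turn one record into an instruction-style prompt string."""
    instruction = f"### Instruction\n{sample['instruction']}"
    # The context section is optional and only added when present
    context = f"### Context\n{sample['context']}" if sample.get("context") else None
    response = f"### Answer\n{sample['response']}"
    return "\n\n".join(part for part in (instruction, context, response) if part is not None)


# Hypothetical record, shaped like the entries in lst_train
sample = {"instruction": "Name the capital of France.", "response": "Paris."}
prompt = formatting_func(sample)
print(prompt)
```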

Let's save the training and test datasets in JSON Lines format.

import json
import os

# Make sure the output folder exists before writing
os.makedirs("data", exist_ok=True)

# Write one JSON object per line (JSON Lines)
with open("data/train.jsonl", "w") as out:
    for ddict in lst_train:
        out.write(json.dumps(ddict) + "\n")

with open("data/eval.jsonl", "w") as out:
    for ddict in lst_eval:
        out.write(json.dumps(ddict) + "\n")
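
Before uploading anything, it can be worth reading a file back to confirm that every line parses as JSON. The sketch below does a round trip on two made-up records in a temporary folder; `read_jsonl` is a hypothetical helper, not part of any library.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records with the same keys as the training data
records = [
    {"instruction": "Say hi", "response": "Hi!"},
    {"instruction": "Add 1+1", "response": "2"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    loaded = read_jsonl(path)

print(f"read back {len(loaded)} records")
```

The same helper can be pointed at data/train.jsonl and data/eval.jsonl to check the record counts.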

Now let’s load the Azure ML SDK. This will help us create the necessary components.

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output
from azure.ai.ml import command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

Now let's create the workspace client.

credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    # Works when a config.json describing the workspace is available
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Fall back to explicit workspace details
    subscription_id = "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace = "Enter your workspace name"
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)
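
MLClient.from_config reads the workspace details from a config.json file. As a rough sketch, the snippet below writes such a file with placeholder values; the three keys are the ones the SDK looks for, while the values and the temporary folder are just for illustration. In practice you would place the file in your project folder (or download it from the workspace page in the Azure portal).

```python
import json
import os
import tempfile

# Placeholder values: replace with your own workspace details
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder here so the sketch has no side effects
workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Read it back to confirm the structure
with open(config_path) as f:
    loaded = json.load(f)
print(sorted(loaded))
```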

Next, let's create a custom training environment.

from azure.ai.ml.entities import Environment

# Curated GPU image plus the extra packages listed in conda.yml
env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:42",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)

Let’s look at the conda.yml

name: pydata-example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - pip:
      - bitsandbytes
      - transformers
      - peft
      - accelerate
      - einops
      - datasets

Let's look at the training script. We are going to use the method introduced in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is:

  • Quantize the pretrained model to 4 bits and freeze it.
  • Attach small, trainable adapter layers (LoRA).
  • Fine-tune only the adapter layers, while using the frozen quantized model for context.
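
The three steps above can be sketched on a single toy linear layer with NumPy. This is only an illustration under simplifying assumptions: the sizes are tiny, the quantizer is a plain per-tensor absmax scheme (the paper uses a blockwise NF4 data type), and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 4, 8  # toy sizes chosen for illustration

# "Pretrained" weight of one linear layer
W = rng.normal(size=(d_out, d_in)).astype(np.float32)


def quantize_4bit(w):
    """Simplified symmetric absmax quantization to 16 integer levels."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    return q.astype(np.float32) * scale


# Step 1: quantize the pretrained weight and keep it frozen
q, scale = quantize_4bit(W)

# Step 2: attach a low-rank adapter; B starts at zero so the layer is unchanged at first
A = rng.normal(scale=0.01, size=(r, d_in)).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)


def forward(x):
    base = x @ dequantize(q, scale).T        # frozen, quantized path
    update = (x @ A.T) @ B.T * (alpha / r)   # trainable low-rank path
    return base + update


# Step 3: only A and B would receive gradient updates during fine-tuning
x = rng.normal(size=(2, d_in)).astype(np.float32)
y = forward(x)
trainable = A.size + B.size
frozen = W.size
print(f"trainable: {trainable}, frozen: {frozen}, ratio: {trainable / frozen:.2%}")
```

Even on this toy layer the adapter holds far fewer weights than the frozen matrix, and the gap grows with the layer size because the adapter scales with r * (d_in + d_out) rather than d_in * d_out.
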

%%writefile src/train.py

import os
#import mlflow
import argparse
from datetime import datetime

import torch
import transformers
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

def formatting_func(sample):
    # Build the instruction-style prompt; the context section is optional
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if sample.get("context") else None
    response = f"### Answer\n{sample['response']}"
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    eval_dataset = load_dataset('json', data_files=args.eval_file, split='train')

    base_model_id = "microsoft/phi-2"
    # Note: this loads the base model with 8-bit weights; the QLoRA paper itself uses 4-bit quantization.
    model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, load_in_8bit=True)
    tokenizer = AutoTokenizer.from_pretrained(
        base_model_id,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
        use_fast=False, # needed for now, should be fixed soon
    )
    tokenizer.pad_token = tokenizer.eos_token
    max_length = 512 # This was an appropriate max length for my dataset

    def generate_and_tokenize_prompt(prompt):
        # Format the record, then pad/truncate it to a fixed length
        result = tokenizer(
            formatting_func(prompt),
            truncation=True,
            max_length=max_length,
            padding="max_length",
        )
        # Causal LM training: the labels are the input tokens themselves
        result["labels"] = result["input_ids"].copy()
        return result

    tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
    tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

    config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=[
            "Wqkv",
            "fc1",
            "fc2",
        ],
        bias="none",
        lora_dropout=0.05, # Conventional
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    print_trainable_parameters(model)
    model = accelerator.prepare_model(model)
    if torch.cuda.device_count() > 1: # If more than 1 GPU
        model.is_parallelizable = True
        model.model_parallel = True

    project = "journal-finetune"
    base_model_name = "phi2"
    run_name = base_model_name + "-" + project
    output_dir = "./" + run_name

    trainer = transformers.Trainer(
        model=model,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,
        args=transformers.TrainingArguments(
            output_dir=output_dir,
            warmup_steps=1,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=1,
            max_steps=500,
            learning_rate=2.5e-5, # Want a small lr for finetuning
            optim="paged_adamw_8bit",
            logging_steps=25, # Report the loss every 25 steps
            logging_dir="./logs", # Directory for storing logs
            save_strategy="steps", # Save checkpoints on a step schedule
            save_steps=25, # Save a checkpoint every 25 steps
            evaluation_strategy="steps", # Evaluate on a step schedule
            eval_steps=25, # Evaluate every 25 steps
            do_eval=True, # Run evaluation on the eval dataset
            #report_to="wandb", # Uncomment to log to Weights & Biases
            #run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}" # Name of the W&B run (optional)
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    model.config.use_cache = False # silence the warnings. Please re-enable for inference!
    trainer.train()

    #print("Rank %d: Finished Training" % (rank))
    #if not distributed or rank == 0:
    # log model
    #mlflow.pytorch.log_model(model, "model")
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))
    # mlflow.pytorch.save_model(model, f"{args.model_dir}/model")

def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # add arguments
    # Note: main() only reads --train-file, --eval-file and --model-dir;
    # the remaining flags are accepted but not used by the training loop above.
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
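
A bit of arithmetic helps to read the settings in the script. The sketch below estimates how many training examples the run actually sees and how many weights the LoRA adapters add. The step and batch values come from the script; the layer dimensions (hidden size 2560, inner size 10240, 32 layers, fused Wqkv projection) are my assumptions about Phi-2 and should be checked against the model config. print_trainable_parameters in the script reports the real numbers.

```python
def examples_seen(max_steps, per_device_batch, grad_accum, num_devices=1):
    """Number of training examples consumed in a run capped by max_steps."""
    return max_steps * per_device_batch * grad_accum * num_devices


def lora_params(r, shapes):
    """LoRA adds r * (in_features + out_features) weights per adapted linear layer."""
    return sum(r * (n_in + n_out) for n_in, n_out in shapes)


# Values taken from the TrainingArguments and the data split above
seen = examples_seen(max_steps=500, per_device_batch=2, grad_accum=1)
train_size = 7500
fraction = seen / train_size
print(f"examples seen: {seen} ({fraction:.1%} of the training set)")

# ASSUMED Phi-2 dimensions; verify against the model config
hidden, inner, layers = 2560, 10240, 32
per_layer = lora_params(
    32,  # r from LoraConfig
    [
        (hidden, 3 * hidden),  # Wqkv (fused query/key/value projection)
        (hidden, inner),       # fc1
        (inner, hidden),       # fc2
    ],
)
total = per_layer * layers
print(f"adapter weights: {total:,}")
```

With a single GPU, 500 steps at batch size 2 cover only a fraction of the 7,500 training records, so increasing max_steps or the batch size is the first knob to try if the model underfits.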

Let’s create a training compute cluster.

from azure.ai.ml.entities import AmlCompute

# If you have a specific compute size to work with, change it here. By default we use a 1 x V100 size.
compute_cluster_size = "Standard_NC6s_v3"
# If you already have a GPU cluster, mention it here. Otherwise a new one named 'gpu-cluster' is created.
compute_cluster = "gpu-cluster"
try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # For multi node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        # Surface the reason instead of a bare "Error"
        print(f"Error while creating the compute cluster: {e}")

Now let's submit a command job that runs the above training script on the AML compute we just created.

# === Note on path ===
# The path can be a local path or a cloud path. AzureML supports `https://`, `abfss://`, `wasbs://` and `azureml://` URIs.
# Local paths are automatically uploaded to the default datastore in the cloud.
# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),
        epoch=1,
        batchsize=64,
        lr=0.01,
        momentum=0.9,
        prtfreq=200,
        output="./outputs",
    ),
    code="./src",  # local path where the code is stored
    command="python train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    environment="llm-training@latest",
    compute="gpu-cluster",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)
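
The `${{inputs.name}}` placeholders in the command string are filled in with the job inputs before the script starts. The sketch below mimics that substitution locally with a regular expression and then parses the result with an argparse parser shaped like the one in train.py. It is only an illustration of the idea, not how Azure ML implements it, and the mount paths are made up.

```python
import argparse
import re
import shlex

# Shortened version of the command used above
COMMAND = (
    "python train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} "
    "--epochs ${{inputs.epoch}} --model-dir ${{inputs.output}}"
)


def render(template, inputs):
    """Replace each ${{inputs.NAME}} placeholder with its value (illustration only)."""
    return re.sub(r"\$\{\{inputs\.(\w+)\}\}", lambda m: str(inputs[m.group(1)]), template)


# Hypothetical resolved values; file inputs become paths on the compute node
inputs = {
    "train_file": "/mnt/data/train.jsonl",
    "eval_file": "/mnt/data/eval.jsonl",
    "epoch": 1,
    "output": "./outputs",
}
rendered = render(COMMAND, inputs)
print(rendered)

# A parser with the same flags as train.py confirms the names line up
parser = argparse.ArgumentParser()
parser.add_argument("--train-file", type=str)
parser.add_argument("--eval-file", type=str)
parser.add_argument("--epochs", default=10, type=int)
parser.add_argument("--model-dir", type=str, default="./")
args = parser.parse_args(shlex.split(rendered)[2:])  # drop "python train.py"
```

A dry run like this catches mismatches between the input names in the job and the flags the script expects before any GPU time is spent.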

Let's look at the job output.

# inspect the outputs recorded for the training job
job_name = returned_job.name
print("job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)

Once the model is fine-tuned, let's register the model in the workspace so that we can create an endpoint.

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# The path must point to the folder where the training job wrote the model;
# adjust it if your job saves its artifacts somewhere else.
run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-2-dolly-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)

Let's create the endpoint.

from uuid import uuid4

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

endpoint_name = f"phi-2-dolly-finetuned-{str(uuid4())[:8]}"  # Replace with your endpoint name

# Assumption: uai_id holds the resource ID of a user-assigned managed identity;
# leave it empty to create the endpoint without one.
uai_id = ""

# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception:
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

    # Trigger the endpoint creation
    try:
        workspace_ml_client.begin_create_or_update(endpoint).wait()
        print("\n---Endpoint created successfully---\n")
    except Exception as err:
        raise RuntimeError(
            f"Endpoint creation failed. Detailed Response:\n{err}"
        ) from err
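
# Initialize deployment parameters

deployment_name = "phi2-dolly-deploy"
sku_name = "Standard_NC6s_v3"

REQUEST_TIMEOUT_MS = 90000

# Assumption: client ID of the user-assigned identity, if one is used; empty otherwise.
uai_client_id = ""

deployment_env_vars = {
    "SUBSCRIPTION_ID": workspace_ml_client.subscription_id,
    "RESOURCE_GROUP_NAME": workspace_ml_client.resource_group_name,
    "UAI_CLIENT_ID": uai_client_id,
}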

For inferencing we will use a different base image.

from azure.ai.ml.entities import Model, Environment

env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:23',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)

Let's deploy the model.

from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment,
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    #code_configuration=code_configuration,
    environment=env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    # Generous probe settings: loading the model can take several minutes
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
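
When you query the fine-tuned model, the prompt should follow the same template that was used during training, with the answer section left open for the model to complete. The helpers below are a hypothetical sketch of that idea: the names `build_inference_prompt` and `extract_answer` are mine, and the generated text is faked so the snippet runs without an endpoint. How the prompt is wrapped into a request body depends on your scoring setup and is not shown here.

```python
def build_inference_prompt(instruction, context=None):
    """Build a prompt in the training format, ending with an open answer section."""
    parts = [f"### Instruction\n{instruction}"]
    if context:
        parts.append(f"### Context\n{context}")
    parts.append("### Answer\n")
    return "\n\n".join(parts)


def extract_answer(generated_text):
    """Return the text that follows the last '### Answer' marker."""
    return generated_text.split("### Answer")[-1].strip()


prompt = build_inference_prompt("Name the capital of France.")

# Stand-in for the text returned by the model (prompt plus completion)
fake_generation = prompt + "Paris."
print(extract_answer(fake_generation))
```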

If you want to delete the deployment and the endpoint, use the code below.

# Delete the deployment first and wait for it to finish, then delete the endpoint
workspace_ml_client.online_deployments.begin_delete(
    name=deployment_name, endpoint_name=endpoint_name
).wait()
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name)

I hope this tutorial helps you fine-tune and deploy the Phi-2 model in Azure ML Studio.

I hope you liked this post. Please clap and follow me if you would like to read more posts like this.

References:

https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

https://www.philschmid.de/sagemaker-falcon-180b-qlora


Manoranjan Rajguru

Generative AI at Microsoft, Previously at Amazon Web Services