Building a Conversational AI with Memory on AWS Series: Fine-tune (QLoRA) Falcon-7b with Dialogue Data and Deploy in Sagemaker

Yinzhou Wang
8 min read · Dec 1, 2023


#3 in the series: train, merge, and deploy Falcon-7b on Sagemaker

Source: https://huggingface.co/blog/falcon-180b

Introduction

Before fine-tuning, you need to set up Sagemaker Studio, preprocess your data, and upload it to S3. Please refer to my previous article on these topics.
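
For reference, the training script later in this article expects Hugging Face datasets saved with save_to_disk, each containing a "text" column, uploaded to S3. A minimal sketch of that shape (the example dialogue and the key_prefix are placeholders; see the previous article for the full preprocessing):

import sagemaker
from datasets import Dataset

# Toy example: a dataset with a "text" column, which is what SFTTrainer will read later
train_dataset = Dataset.from_dict({"text": ["User: Hi there!\nAssistant: Hello! How can I help?"]})
train_dataset.save_to_disk("data/train")

# Upload the saved dataset to the default session bucket; the returned S3 URI is used later in .fit()
sess = sagemaker.Session()
train_input_path = sess.upload_data("data/train", key_prefix="falcon-dialogue/train")
print(train_input_path)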

QLoRA

QLoRA is an efficient fine-tuning technique that lets you train an LLM with far fewer resources while largely preserving performance. The idea is to load the base model in 4-bit precision, freeze it, and update only a small number of additional parameters (the LoRA adapter weights). For a detailed explanation, please refer to this great article by Lokesh Todwal.
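
To make this concrete, here is a minimal sketch (separate from the training script below, with illustrative LoRA settings and only one target module for brevity) of what QLoRA looks like with transformers and peft: the base model is loaded in 4-bit and frozen, and only the small adapter is trainable.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit (NF4), computing in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach a small trainable LoRA adapter; only these weights are updated during training
base_model = prepare_model_for_kbit_training(base_model)
peft_model = get_peft_model(
    base_model,
    LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
               task_type="CAUSAL_LM", target_modules=["query_key_value"]),
)
peft_model.print_trainable_parameters()  # trainable params are only a small fraction of the 7B total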

One thing to be careful about with QLoRA is that the adapter needs to be merged into the base model before hosting it on a Sagemaker endpoint (I used Text Generation Inference (TGI) as the inference image). There is a convenient method, merge_and_unload() (I believe TGI also relies on it if you upload only the adapter), but according to the GitHub issue and my own experiments, merging this way significantly worsens performance. Here is a detailed explanation with experiments by Benjamin Marie:
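
For reference, the convenient path looks roughly like the sketch below (the adapter directory is a placeholder): the QLoRA adapter is merged directly into the original 16-bit base weights, even though it was optimized against the quantized weights, which is the mismatch behind the quality drop reported above.

import torch
from peft import AutoPeftModelForCausalLM

# Convenient but lossy after QLoRA training: the adapter is merged into the original
# fp16 base weights, not the quantized weights it was trained against.
model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/qlora_adapter",   # placeholder adapter directory
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("path/to/merged_model", safe_serialization=True)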

What I found to work is the code by Chris Hayduk. The performance is good, but the merged model can't be re-quantized, which means we have to host the 16-bit model (I include this code below because it is the best solution I have found so far). If you are looking for an alternative, take a look at Benjamin Marie's great article on QA-LoRA:

Fine-tuning Setup

In Sagemaker Studio, create a new notebook (I used the Base Python 3.0 image and the default ml.t3.medium instance). Set up your Sagemaker session and role:

!pip install -q sagemaker --upgrade

import sagemaker
import boto3

sess = sagemaker.Session()
# the session bucket is used for uploading data, checkpoints/model artifacts, and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

# role and sess are required for setting up the Hugging Face estimator
try:
    role = sagemaker.get_execution_role()
except ValueError:
    # outside of SageMaker (e.g. locally), look the execution role up by name instead
    iam = boto3.client("iam")
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

Define your Hugging Face estimator. Notice that I save checkpoints; you can disable this by changing the “adapter_checkpoints” hyperparameter:

from sagemaker.huggingface import HuggingFace

# Set up the checkpoint folder
bucket = sess.default_bucket()
checkpoint_in_bucket = "checkpoints"
job_name = 'YOUR_JOB_NAME'
checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, job_name, checkpoint_in_bucket)
checkpoint_local_path = "/opt/ml/checkpoints"

model_id = "tiiuae/falcon-7b"

# hyperparameters, change them for your use case
hyperparameters = {
    'model_id': model_id,                                 # pre-trained model
    'train_dataset_path': '/opt/ml/input/data/training',  # path where sagemaker places the training dataset
    'val_dataset_path': '/opt/ml/input/data/val',         # path where sagemaker places the validation dataset
    'max_steps': 1000,                                    # number of training steps
    'per_device_train_batch_size': 4,                     # batch size for training
    'gradient_accumulation_steps': 4,                     # effective batch size = 4 * 4 = 16
    'max_seq_length': 512,
    'merge_weights': True,
    'lr': 3e-4,                                           # learning rate used during training
    'load_in_4bit': True,
    'adapter_checkpoints': True,
    'resume_from_checkpoint': False,
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='falcon.py',              # training script
    source_dir='training_script',         # directory with all the files needed for training
    instance_type='ml.g5.2xlarge',        # instance type used for the training job
    instance_count=1,                     # the number of instances used for training
    base_job_name=job_name,               # the name of the training job
    role=role,                            # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size=300,                      # the size of the EBS volume in GB
    transformers_version='4.28',          # the transformers version used in the training job
    pytorch_version='2.0',                # the pytorch version used in the training job
    py_version='py310',                   # the python version used in the training job
    hyperparameters=hyperparameters,
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

Training Script

Notice the “source_dir” and “entry_point” parameters above. You will need to create a folder containing the training script and a requirements.txt file; the Hugging Face training image installs the listed packages and then runs your training script. The expected layout is sketched below.
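
Assuming the folder name passed to “source_dir” above, the layout looks like this (the file names are the ones used in this article):

training_script/
├── falcon.py          # the entry_point training script shown below
└── requirements.txt   # extra packages the training image installs before running the script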

In the requirements.txt file:

transformers==4.31.0
peft==0.4.0
accelerate==0.21.0
bitsandbytes==0.40.0
safetensors>=0.3.1
tokenizers>=0.13.3
trl==0.5.0
wandb # change it to your favorite tracker; the trainer will use it if it is installed

In your custom training script (falcon.py), import the necessary libraries and define an argument parser so that the hyperparameters set in the estimator can be passed to the script.

import os
import torch
import argparse
import shutil
import wandb
import json
import peft
import gc
import copy
import bitsandbytes as bnb
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
    default_data_collator,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
    IntervalStrategy,
)
from trl import SFTTrainer
from datasets import load_from_disk
from peft.utils import _get_submodules
from bitsandbytes.functional import dequantize_4bit
from transformers.trainer_utils import get_last_checkpoint
from peft import PeftConfig, PeftModel, LoraConfig, AutoPeftModelForCausalLM


def parse_arge():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_id",
        type=str,
        default="tiiuae/falcon-7b",
        help="Model id to use for training.",
    )
    parser.add_argument("--resume_from_checkpoint", type=bool, default=False, help="Resume training from the last checkpoint.")
    parser.add_argument("--train_dataset_path", type=str, default="lm_dataset", help="Path to train dataset.")
    parser.add_argument("--val_dataset_path", type=str, default="lm_dataset", help="Path to validation dataset.")
    parser.add_argument("--max_steps", type=int, default=200, help="Number of steps to train for.")
    parser.add_argument(
        "--per_device_train_batch_size",
        type=int,
        default=1,
        help="Batch size to use for training.",
    )
    parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate to use for training.")
    parser.add_argument("--seed", type=int, default=42, help="Seed to use for training.")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4, help="Batches accumulated before backprop.")
    parser.add_argument("--max_seq_length", type=int, default=512, help="Sequence length for training.")
    parser.add_argument(
        "--merge_weights",
        type=bool,
        default=True,
        help="Whether to merge the LoRA weights with the base model.",
    )
    parser.add_argument(
        "--load_in_4bit",
        type=bool,
        default=True,
        help="Whether to train the model in 4-bit.",
    )
    parser.add_argument(
        "--adapter_checkpoints",
        type=bool,
        default=False,
        help="Whether to save LoRA adapter checkpoints.",
    )
    args = parser.parse_known_args()
    return args

Add the merge functions. They dequantize the 4-bit base model to 16-bit and then merge it with the adapter, whose weights are also stored in 16-bit:

def save_model(model, tokenizer, to):
    print(f"Saving dequantized model to {to}...")
    model.save_pretrained(to, safe_serialization=True)
    tokenizer.save_pretrained(to)
    # remove quantization-related entries so the saved model is treated as a plain fp16 model
    config_data = json.loads(open(os.path.join(to, 'config.json'), 'r').read())
    config_data.pop("quantization_config", None)
    config_data.pop("pretraining_tp", None)
    with open(os.path.join(to, 'config.json'), 'w') as config:
        config.write(json.dumps(config_data, indent=2))


def dequantize_model(model, to='./dequantized_model', dtype=torch.float16, device="cuda"):
    """
    'model': the peft model you loaded with QLoRA.
    'to': directory to save the dequantized model
    'dtype': dtype that the model was trained in
    'device': device to load the model to
    """
    os.makedirs(to, exist_ok=True)

    cls = bnb.nn.Linear4bit

    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, cls):
                quant_state = copy.deepcopy(module.weight.quant_state)
                quant_state[2] = dtype

                weights = dequantize_4bit(module.weight.data, quant_state=quant_state, quant_type="nf4").to(dtype)

                new_module = torch.nn.Linear(module.in_features, module.out_features, bias=None, dtype=dtype)
                new_module.weight = torch.nn.Parameter(weights)
                new_module.to(device=device, dtype=dtype)

                parent, target, target_name = _get_submodules(model, name)
                setattr(parent, target_name, new_module)

        # a hack: set this flag to avoid Hugging Face's saving error, because transformers
        # does not support saving a model that is registered as loaded in 4-bit.
        model.is_loaded_in_4bit = False

        print("Saving dequantized model...")
        model.save_pretrained(to)
        # tokenizer.save_pretrained(to)
        config_data = json.loads(open(os.path.join(to, 'config.json'), 'r').read())
        config_data.pop("quantization_config", None)
        config_data.pop("pretraining_tp", None)
        with open(os.path.join(to, 'config.json'), 'w') as config:
            config.write(json.dumps(config_data, indent=2))

        return model

def merge(adapter, args):
    # precision to dequantize to
    dtype = torch.float16

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=dtype,
        bnb_4bit_quant_type="nf4",
    )

    model_name = args.model_id
    tokenizer = AutoTokenizer.from_pretrained(adapter, trust_remote_code=True)

    try:
        print(f"Starting to load the model {model_name} into memory")

        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            trust_remote_code=True,
        )
        # dequantize the 4-bit base weights to fp16, then merge the adapter into them
        model = dequantize_model(model, to='./dqz_model/', dtype=dtype)
        model = PeftModel.from_pretrained(model, adapter)
        model = model.merge_and_unload()

        print(f"Successfully loaded the model {model_name} into memory")
        # save to /opt/ml/model/ so SageMaker packages the merged model into the model artifact
        save_model(model, tokenizer, "/opt/ml/model/")
        print("Merged model saved")
    except Exception as e:
        print(f"An error occurred: {e}")

Then there is the training function:

def training_function(args):
    # set seed
    set_seed(args.seed)
    wandb.login(key="YOUR_WANDB_KEY")

    # load the train and validation datasets
    train_dataset = load_from_disk(args.train_dataset_path)
    val_dataset = load_from_disk(args.val_dataset_path)

    # quantization configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    # load the model in the requested precision
    if args.load_in_4bit:
        model = AutoModelForCausalLM.from_pretrained(
            args.model_id,
            trust_remote_code=True,
            device_map="auto",
            quantization_config=bnb_config,
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            args.model_id,
            trust_remote_code=True,
            device_map="auto",
        )

    model.config.use_cache = False

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # lora config
    lora_alpha = 16
    lora_dropout = 0.1
    lora_r = 64

    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
        ],
    )

    if args.adapter_checkpoints:
        output_dir = "/opt/ml/checkpoints"
        save_strategy = "steps"
    else:
        output_dir = "/tmp"
        save_strategy = "no"

    # Define training args, change these hyperparameters as needed
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        optim="paged_adamw_32bit",
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        fp16=True,
        learning_rate=args.lr,
        max_grad_norm=0.3,
        max_steps=args.max_steps,
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        save_strategy=save_strategy,
        save_steps=10,
        save_total_limit=30,
        logging_steps=10,
        group_by_length=True,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
        evaluation_strategy=IntervalStrategy.STEPS,
        eval_steps=10,
    )

    # Create Trainer instance
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        dataset_text_field="text",
        peft_config=peft_config,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    # Upcast norm layers to fp32 for stability
    for name, module in trainer.model.named_modules():
        if "norm" in name:
            module = module.to(torch.float32)

    # Start training
    if args.resume_from_checkpoint:
        last_checkpoint = get_last_checkpoint(output_dir)
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        trainer.train()

    # if no checkpoints were written during training, save the final adapter (and tokenizer)
    # explicitly so the merge step below has something to load
    if not args.adapter_checkpoints:
        trainer.model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)

    # final adapter location
    final_adapter = get_last_checkpoint(output_dir) if args.adapter_checkpoints else output_dir

    # release some memory
    del model
    del trainer
    torch.cuda.empty_cache()

    # merge
    if args.merge_weights and not args.load_in_4bit:
        # the model was not trained in 4-bit, so the default merge_and_unload method is fine:
        # load the PEFT model in fp16
        model = AutoPeftModelForCausalLM.from_pretrained(
            final_adapter,
            torch_dtype=torch.float16,
            trust_remote_code=True,
        )
        # merge LoRA and base model and save
        merged_model = model.merge_and_unload()
        merged_model.save_pretrained("/opt/ml/model/", safe_serialization=True)
        tokenizer = AutoTokenizer.from_pretrained(final_adapter, trust_remote_code=True)
        tokenizer.save_pretrained("/opt/ml/model/")
    elif args.merge_weights and args.load_in_4bit:
        # the model was trained in 4-bit, so use the dequantize-then-merge approach
        merge(final_adapter, args)


def main():
    args, _ = parse_arge()
    training_function(args)


# run the script
if __name__ == "__main__":
    main()

Start Training

Then go back to your notebook, point the input channels to the data you previously uploaded to S3, and start the training job.

train_input_path = 'TRAIN_DATA_PATH'  # S3 URI of the training dataset
val_input_path = 'VAL_DATA_PATH'      # S3 URI of the validation dataset

# channel names ('training', 'val') map to /opt/ml/input/data/<channel> inside the container
data = {'training': train_input_path, 'val': val_input_path}
huggingface_estimator.fit(data, wait=True)

# This prints where the model artifact (model.tar.gz) is stored
model_data_uri = huggingface_estimator.model_data
print(model_data_uri)

Deploy Model

After training, we can deploy the model. First, let's define an inference image. Since we are using a Hugging Face model, let's use TGI for efficient inference. Check the available images here:

llm_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04"
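
If you prefer not to hard-code the region-specific URI, the sagemaker SDK can also resolve the TGI image for you (a sketch; the version string is an assumption chosen to match the image above):

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolves the TGI (text-generation-inference) image URI for the current region
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")
print(llm_image)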

Now, let's deploy the model:

import json
from sagemaker.huggingface.model import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1   # this has to be 1 for falcon-7b models since they don't support model sharding
health_check_timeout = 300
trust_remote_code = True

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "/opt/ml/model",            # path where sagemaker mounts the model artifact inside the container
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),      # max length of the input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),      # max length of the generation (including input text)
    'NUM_SHARD': json.dumps(1),
    # 'HF_MODEL_QUANTIZE': "bitsandbytes-nf4",  # DON'T use this, because the merged model can't be quantized
}

# path to the trained model artifact (the model_data_uri printed after training)
MODEL_PATH = "PATH_TO_MODEL_ARTIFACT"

# create the Hugging Face Model class (role, sess, and llm_image were defined earlier)
huggingface_model = HuggingFaceModel(
    model_data=MODEL_PATH,      # path to your trained SageMaker model
    role=role,                  # IAM role with permissions to create an endpoint
    env=config,
    image_uri=llm_image,
    container_log_level=20,
    sagemaker_session=sess,
)

# deploy the model to a SageMaker inference endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    endpoint_name="falcon-7b-v1",
)
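
Once the endpoint is in service, you can send it a quick test request through the returned predictor (the prompt format and generation parameters below are just examples; adjust them to match your dialogue template):

# Quick smoke test of the deployed endpoint
response = predictor.predict({
    "inputs": "User: How are you today?\nAssistant:",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "stop": ["User:"],
    },
})
print(response[0]["generated_text"])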

Conclusion

In this article, I illustrated how to fine-tune, merge, and deploy Falcon-7b on Sagemaker. The next article will cover Llama-2-7b-chat, which is a little different. Any comments, suggestions, or critiques are welcome!

Check out my previous articles in this series for the setup and data-preparation steps.

