Building a Conversational AI with Memory on AWS Series: Fine-tune (QLoRA) Falcon-7b with Dialogue Data and Deploy in Sagemaker

Yinzhou Wang
8 min read · Dec 1, 2023


#3 in the series: train, merge, and deploy Falcon-7b on Sagemaker

Source: https://huggingface.co/blog/falcon-180b

Introduction

Before fine-tuning, you need to set up Sagemaker Studio, preprocess your data, and upload it to S3. Please refer to my previous article on these topics.
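
For reference, the training script later in this article expects Hugging Face datasets saved with save_to_disk, each containing a "text" column, uploaded to S3. A minimal sketch of that shape (the example dialogue and the key_prefix are placeholders; see the previous article for the full preprocessing):

import sagemaker
from datasets import Dataset

# Toy example: a dataset with a "text" column, which is what SFTTrainer will read later
train_dataset = Dataset.from_dict({"text": ["User: Hi there!\nAssistant: Hello! How can I help?"]})
train_dataset.save_to_disk("data/train")

# Upload the saved dataset to the default session bucket; the returned S3 URI is used later in .fit()
sess = sagemaker.Session()
train_input_path = sess.upload_data("data/train", key_prefix="falcon-dialogue/train")
print(train_input_path)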

QLoRA

QLoRA is an efficient fine-tuning technique that lets you train an LLM with far fewer resources while largely preserving performance. The idea is to load the base model in 4-bit precision, freeze it, and update only a small number of additional parameters (the LoRA adapter weights). For a detailed explanation, please refer to this great article by Lokesh Todwal.
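
To make this concrete, here is a minimal sketch (separate from the training script below, with illustrative LoRA settings and only one target module for brevity) of what QLoRA looks like with transformers and peft: the base model is loaded in 4-bit and frozen, and only the small adapter is trainable.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit (NF4), computing in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach a small trainable LoRA adapter; only these weights are updated during training
base_model = prepare_model_for_kbit_training(base_model)
peft_model = get_peft_model(
    base_model,
    LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
               task_type="CAUSAL_LM", target_modules=["query_key_value"]),
)
peft_model.print_trainable_parameters()  # trainable params are only a small fraction of the 7B total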

One thing to be careful about with QLoRA is that the adapter needs to be merged into the base model before hosting it on a Sagemaker endpoint (I used Text Generation Inference (TGI) as the inference image). There is a convenient method, merge_and_unload() (I believe TGI also relies on it if you upload only the adapter), but according to the GitHub issue and my own experiments, merging this way significantly worsens performance. Here is a detailed explanation with experiments by Benjamin Marie:
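
For reference, the convenient path looks roughly like the sketch below (the adapter directory is a placeholder): the QLoRA adapter is merged directly into the original 16-bit base weights, even though it was optimized against the quantized weights, which is the mismatch behind the quality drop reported above.

import torch
from peft import AutoPeftModelForCausalLM

# Convenient but lossy after QLoRA training: the adapter is merged into the original
# fp16 base weights, not the quantized weights it was trained against.
model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/qlora_adapter",   # placeholder adapter directory
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("path/to/merged_model", safe_serialization=True)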

What I found to work is the code by Chris Hayduk. The performance is good, but the merged model can't be re-quantized, which means we have to host the 16-bit model (I include this code below because it is the best solution I have found so far). If you are looking for an alternative, take a look at Benjamin Marie's great article on QA-LoRA:

Fine-tuning Setup

In Sagemaker Studio, create a new notebook (I used the Base Python 3.0 image and the default ml.t3.medium instance). Set up your Sagemaker session and role:

!pip install -q sagemaker --upgrade

import sagemaker
import boto3

sess = sagemaker.Session()
# the session bucket is used for uploading data, checkpoints/model artifacts, and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

# role and sess are required for setting up the Hugging Face estimator
try:
    role = sagemaker.get_execution_role()
except ValueError:
    # outside of SageMaker (e.g. locally), look the execution role up by name instead
    iam = boto3.client("iam")
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

Define your Hugging Face estimator. Notice that I save checkpoints; you can disable this by changing the “adapter_checkpoints” hyperparameter:

from sagemaker.huggingface import HuggingFace

# Set up the checkpoint folder
bucket = sess.default_bucket()
checkpoint_in_bucket = "checkpoints"
job_name = 'YOUR_JOB_NAME'
checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, job_name, checkpoint_in_bucket)
checkpoint_local_path = "/opt/ml/checkpoints"

model_id = "tiiuae/falcon-7b"

# hyperparameters, change them for your use case
hyperparameters = {
    'model_id': model_id,                                 # pre-trained model
    'train_dataset_path': '/opt/ml/input/data/training',  # path where sagemaker places the training dataset
    'val_dataset_path': '/opt/ml/input/data/val',         # path where sagemaker places the validation dataset
    'max_steps': 1000,                                    # number of training steps
    'per_device_train_batch_size': 4,                     # batch size for training
    'gradient_accumulation_steps': 4,                     # effective batch size = 4 * 4 = 16
    'max_seq_length': 512,
    'merge_weights': True,
    'lr': 3e-4,                                           # learning rate used during training
    'load_in_4bit': True,
    'adapter_checkpoints': True,
    'resume_from_checkpoint': False,
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='falcon.py',              # training script
    source_dir='training_script',         # directory with all the files needed for training
    instance_type='ml.g5.2xlarge',        # instance type used for the training job
    instance_count=1,                     # the number of instances used for training
    base_job_name=job_name,               # the name of the training job
    role=role,                            # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size=300,                      # the size of the EBS volume in GB
    transformers_version='4.28',          # the transformers version used in the training job
    pytorch_version='2.0',                # the pytorch version used in the training job
    py_version='py310',                   # the python version used in the training job
    hyperparameters=hyperparameters,
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

Training Script

Notice the “source_dir” and “entry_point” parameters above. You will need to create a folder containing the training script and a requirements.txt file; the Hugging Face training image installs the listed packages and then runs your training script. The expected layout is sketched below.
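
Assuming the folder name passed to “source_dir” above, the layout looks like this (the file names are the ones used in this article):

training_script/
├── falcon.py          # the entry_point training script shown below
└── requirements.txt   # extra packages the training image installs before running the script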

In the requirements.txt file:

transformers==4.31.0
peft==0.4.0
accelerate==0.21.0
bitsandbytes==0.40.0
safetensors>=0.3.1
tokenizers>=0.13.3
trl==0.5.0
wandb # change it to your favorite tracker; the trainer will use it if it is installed

In your custom training script (falcon.py), import the necessary libraries and define an argument parser so that the hyperparameters set in the estimator can be passed to the script.

import os
import torch
import argparse
import shutil
import wandb
import json
import peft
import gc
import copy
import bitsandbytes as bnb
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
    default_data_collator,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
    IntervalStrategy,
)
from trl import SFTTrainer
from datasets import load_from_disk
from peft.utils import _get_submodules
from bitsandbytes.functional import dequantize_4bit
from transformers.trainer_utils import get_last_checkpoint
from peft import PeftConfig, PeftModel, LoraConfig, AutoPeftModelForCausalLM


def parse_arge():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_id",
        type=str,
        default="tiiuae/falcon-7b",
        help="Model id to use for training.",
    )
    parser.add_argument("--resume_from_checkpoint", type=bool, default=False, help="Resume training from the last checkpoint.")
    parser.add_argument("--train_dataset_path", type=str, default="lm_dataset", help="Path to train dataset.")
    parser.add_argument("--val_dataset_path", type=str, default="lm_dataset", help="Path to validation dataset.")
    parser.add_argument("--max_steps", type=int, default=200, help="Number of steps to train for.")
    parser.add_argument(
        "--per_device_train_batch_size",
        type=int,
        default=1,
        help="Batch size to use for training.",
    )
    parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate to use for training.")
    parser.add_argument("--seed", type=int, default=42, help="Seed to use for training.")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4, help="Batches accumulated before backprop.")
    parser.add_argument("--max_seq_length", type=int, default=512, help="Sequence length for training.")
    parser.add_argument(
        "--merge_weights",
        type=bool,
        default=True,
        help="Whether to merge the LoRA weights with the base model.",
    )
    parser.add_argument(
        "--load_in_4bit",
        type=bool,
        default=True,
        help="Whether to train the model in 4-bit.",
    )
    parser.add_argument(
        "--adapter_checkpoints",
        type=bool,
        default=False,
        help="Whether to save LoRA adapter checkpoints.",
    )
    args = parser.parse_known_args()
    return args

Add the merge functions. They dequantize the 4-bit base model to 16-bit and then merge it with the adapter, whose weights are also stored in 16-bit:

def save_model(model, tokenizer, to):
    print(f"Saving dequantized model to {to}...")
    model.save_pretrained(to, safe_serialization=True)
    tokenizer.save_pretrained(to)
    # remove quantization-related entries so the saved model is treated as a plain fp16 model
    config_data = json.loads(open(os.path.join(to, 'config.json'), 'r').read())
    config_data.pop("quantization_config", None)
    config_data.pop("pretraining_tp", None)
    with open(os.path.join(to, 'config.json'), 'w') as config:
        config.write(json.dumps(config_data, indent=2))


def dequantize_model(model, to='./dequantized_model', dtype=torch.float16, device="cuda"):
    """
    'model': the peft model you loaded with QLoRA.
    'to': directory to save the dequantized model
    'dtype': dtype that the model was trained in
    'device': device to load the model to
    """
    os.makedirs(to, exist_ok=True)

    cls = bnb.nn.Linear4bit

    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, cls):
                quant_state = copy.deepcopy(module.weight.quant_state)
                quant_state[2] = dtype

                weights = dequantize_4bit(module.weight.data, quant_state=quant_state, quant_type="nf4").to(dtype)

                new_module = torch.nn.Linear(module.in_features, module.out_features, bias=None, dtype=dtype)
                new_module.weight = torch.nn.Parameter(weights)
                new_module.to(device=device, dtype=dtype)

                parent, target, target_name = _get_submodules(model, name)
                setattr(parent, target_name, new_module)

        # a hack: set this flag to avoid Hugging Face's saving error, because transformers
        # does not support saving a model that is registered as loaded in 4-bit.
        model.is_loaded_in_4bit = False

        print("Saving dequantized model...")
        model.save_pretrained(to)
        # tokenizer.save_pretrained(to)
        config_data = json.loads(open(os.path.join(to, 'config.json'), 'r').read())
        config_data.pop("quantization_config", None)
        config_data.pop("pretraining_tp", None)
        with open(os.path.join(to, 'config.json'), 'w') as config:
            config.write(json.dumps(config_data, indent=2))

        return model

def merge(adapter, args):
    # precision to dequantize to
    dtype = torch.float16

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=dtype,
        bnb_4bit_quant_type="nf4",
    )

    model_name = args.model_id
    tokenizer = AutoTokenizer.from_pretrained(adapter, trust_remote_code=True)

    try:
        print(f"Starting to load the model {model_name} into memory")

        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            trust_remote_code=True,
        )
        # dequantize the 4-bit base weights to fp16, then merge the adapter into them
        model = dequantize_model(model, to='./dqz_model/', dtype=dtype)
        model = PeftModel.from_pretrained(model, adapter)
        model = model.merge_and_unload()

        print(f"Successfully loaded the model {model_name} into memory")
        # save to /opt/ml/model/ so SageMaker packages the merged model into the model artifact
        save_model(model, tokenizer, "/opt/ml/model/")
        print("Merged model saved")
    except Exception as e:
        print(f"An error occurred: {e}")

Then there is the training function:

def training_function(args):
    # set seed
    set_seed(args.seed)
    wandb.login(key="YOUR_WANDB_KEY")

    # load the train and validation datasets
    train_dataset = load_from_disk(args.train_dataset_path)
    val_dataset = load_from_disk(args.val_dataset_path)

    # quantization configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    # load the model in the requested precision
    if args.load_in_4bit:
        model = AutoModelForCausalLM.from_pretrained(
            args.model_id,
            trust_remote_code=True,
            device_map="auto",
            quantization_config=bnb_config,
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            args.model_id,
            trust_remote_code=True,
            device_map="auto",
        )

    model.config.use_cache = False

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    # lora config
    lora_alpha = 16
    lora_dropout = 0.1
    lora_r = 64

    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
        ],
    )

    if args.adapter_checkpoints:
        output_dir = "/opt/ml/checkpoints"
        save_strategy = "steps"
    else:
        output_dir = "/tmp"
        save_strategy = "no"

    # Define training args, change these hyperparameters as needed
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        optim="paged_adamw_32bit",
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        fp16=True,
        learning_rate=args.lr,
        max_grad_norm=0.3,
        max_steps=args.max_steps,
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        save_strategy=save_strategy,
        save_steps=10,
        save_total_limit=30,
        logging_steps=10,
        group_by_length=True,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
        evaluation_strategy=IntervalStrategy.STEPS,
        eval_steps=10,
    )

    # Create Trainer instance
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        dataset_text_field="text",
        peft_config=peft_config,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    # Upcast norm layers to fp32 for stability
    for name, module in trainer.model.named_modules():
        if "norm" in name:
            module = module.to(torch.float32)

    # Start training
    if args.resume_from_checkpoint:
        last_checkpoint = get_last_checkpoint(output_dir)
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        trainer.train()

    # if no checkpoints were written during training, save the final adapter (and tokenizer)
    # explicitly so the merge step below has something to load
    if not args.adapter_checkpoints:
        trainer.model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)

    # final adapter location
    final_adapter = get_last_checkpoint(output_dir) if args.adapter_checkpoints else output_dir

    # release some memory
    del model
    del trainer
    torch.cuda.empty_cache()

    # merge
    if args.merge_weights and not args.load_in_4bit:
        # the model was not trained in 4-bit, so the default merge_and_unload method is fine:
        # load the PEFT model in fp16
        model = AutoPeftModelForCausalLM.from_pretrained(
            final_adapter,
            torch_dtype=torch.float16,
            trust_remote_code=True,
        )
        # merge LoRA and base model and save
        merged_model = model.merge_and_unload()
        merged_model.save_pretrained("/opt/ml/model/", safe_serialization=True)
        tokenizer = AutoTokenizer.from_pretrained(final_adapter, trust_remote_code=True)
        tokenizer.save_pretrained("/opt/ml/model/")
    elif args.merge_weights and args.load_in_4bit:
        # the model was trained in 4-bit, so use the dequantize-then-merge approach
        merge(final_adapter, args)


def main():
    args, _ = parse_arge()
    training_function(args)


# run the script
if __name__ == "__main__":
    main()

Start Training

Then go back to your notebook, point the input channels to the data you previously uploaded to S3, and start the training job.

train_input_path = 'TRAIN_DATA_PATH'  # S3 URI of the training dataset
val_input_path = 'VAL_DATA_PATH'      # S3 URI of the validation dataset

# channel names ('training', 'val') map to /opt/ml/input/data/<channel> inside the container
data = {'training': train_input_path, 'val': val_input_path}
huggingface_estimator.fit(data, wait=True)

# This prints where the model artifact (model.tar.gz) is stored
model_data_uri = huggingface_estimator.model_data
print(model_data_uri)

Deploy Model

After training, we can deploy the model. First, let's define an inference image. Since we are using a Hugging Face model, let's use TGI for efficient inference. Check the available images here:

llm_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04"
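
If you prefer not to hard-code the region-specific URI, the sagemaker SDK can also resolve the TGI image for you (a sketch; the version string is an assumption chosen to match the image above):

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolves the TGI (text-generation-inference) image URI for the current region
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")
print(llm_image)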

Now, let's deploy the model:

import json
from sagemaker.huggingface.model import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1   # this has to be 1 for falcon-7b models since they don't support model sharding
health_check_timeout = 300
trust_remote_code = True

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "/opt/ml/model",            # path where sagemaker mounts the model artifact inside the container
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),      # max length of the input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),      # max length of the generation (including input text)
    'NUM_SHARD': json.dumps(1),
    # 'HF_MODEL_QUANTIZE': "bitsandbytes-nf4",  # DON'T use this, because the merged model can't be quantized
}

# path to the trained model artifact (the model_data_uri printed after training)
MODEL_PATH = "PATH_TO_MODEL_ARTIFACT"

# create the Hugging Face Model class (role, sess, and llm_image were defined earlier)
huggingface_model = HuggingFaceModel(
    model_data=MODEL_PATH,      # path to your trained SageMaker model
    role=role,                  # IAM role with permissions to create an endpoint
    env=config,
    image_uri=llm_image,
    container_log_level=20,
    sagemaker_session=sess,
)

# deploy the model to a SageMaker inference endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    endpoint_name="falcon-7b-v1",
)
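
Once the endpoint is in service, you can send it a quick test request through the returned predictor (the prompt format and generation parameters below are just examples; adjust them to match your dialogue template):

# Quick smoke test of the deployed endpoint
response = predictor.predict({
    "inputs": "User: How are you today?\nAssistant:",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "stop": ["User:"],
    },
})
print(response[0]["generated_text"])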

Conclusion

In this article, I illustrated how to fine-tune, merge, and deploy Falcon-7b on Sagemaker. The next article will cover Llama-2-7b-chat, which is a little different. Any comments, suggestions, or critiques are welcome!

Check out my previous articles in this series for the setup and data-preparation steps.

