Finetune and Deploy Codestral-22B with Amazon Sagemaker
Part 2: Finetune and Evaluate code-generation LLMs
In this blog, we will learn:
a) how to fine-tune open LLMs (Codestral, a 22B parameters model from Mistral) with FSDP
and QLoRA
using Amazon Sagemaker
, and
b) how to deploy finetuned model with vLLM
backend using LMI containers
on Sagemaker endpoint.
Below is in continuation to my previous blog: Finetuning Codestral-22B with QloRA locally, but modified to run on Sagemaker.
FSDP Simplified
There are multiple approaches to train or finetune a large model at scale. Ex: Distributed data parallelism, Pipeline model parallelism, Mixture of Experts etc. FSDP is one of them, developed by Facebook AI Research (FAIR) Engineering (now Meta AI). Refer this blog to learn more!
Let’s understand what is FSDP and how it works!
FSDP is short for Fully Shared Data Parallel. It is a type of data-parallel training approach which shards parameters + optimizer state + gradient across data parallel workers (GPUs within DDP ranks) uniformly and optionally offload part of the training computation to the CPUs. With this, you can use bigger models, larger batch sizes and improve the training speed at scale. One thing to note is, although the parameters are sharded across DDP workers (or different GPUs), the computation (of activations) for each micro-batch of data is still local to each GPU worker.
With data parallelism approach, we require multiple copies of the model on each GPU — which is redundant, and model parallel training requires additional communication to move activations between workers (GPUs) — which is expensive. FSDP combines the best of both worlds: data parallelism and model parallelism. How? It improves memory efficiency by sharding model parameters, gradients, and optimizer states across GPUs, and improves computational efficiency by decomposing the communication and overlapping it with both the forward and backward passes.
Check out the image below for a quick review of PyTorch distributed operations: Scatter
, Gather
, Reduce
, All-Reduce
, Broadcast
, All-Gather
.
One way to view FSDP’s sharding is to decompose the DDP gradient all-reduce
into reduce-scatter
and all-gather
. Specifically, during the backward pass, FSDP reduces
and scatters
gradients, ensuring that each rank possesses a shard of the gradients. Then it updates the corresponding shard of the parameters in the optimizer step. Finally, in the subsequent forward pass, it performs an all-gather
operation to collect and combine the updated parameter shards.
In standard data parallel training methods, a copy of the model is present on each GPU and a sequence of forward and backward passes are evaluated on only a shard of the data. After these local computations, the parameters and optimizers for each local process are shared with the other GPUs in order to calculate the global weight update.
In FSDP, only a shard of the model is present on a GPU. Then, locally, all weights are gathered from the other GPUs — by means of an all-gather step — to calculate the forward pass. This gathering of weights is then performed again before the backward pass. After that backward pass, the local gradients are averaged and sharded across the GPUs by means of a reduce-scatter step, which allows each GPU to update its local weight shard.
The figure below illustrates standard DDP training
(top) and FSDP training
(bottom):
Outline
- Setup development environment
- Create and prepare the dataset
- Finetune model
- Deploy endpoint and run inference
Note: This blog was created and validated on NVIDIA A100 with 8 GPUs each with 40GB of memory. If you have access to different compute you can make changes to the configurations and optimize the GPU usage.
Lets get started !
1. Setup development environment
Let’s install all the required libraries. We will be using sagemaker
, huggingface
, datasets
, plotly
.
%pip install --upgrade --quiet boto3 sagemaker huggingface datasets plotly
Initialize variables
import json, boto3, sagemaker
dataset_id = 'deepmind/code_contests'
model_id = "mistral-community/Codestral-22B-v0.1"
base_job_name = "fsdp-codestral"
workspace_bucket_name = "research-agi"
s3_prefix = "mistral-community-codestral-22b-v0x1"
s3_train_dataset_path = f"s3://{workspace_bucket_name}/{s3_prefix}/train"
s3_test_dataset_path = f"s3://{workspace_bucket_name}/{s3_prefix}/test"
s3_save_model_dir = f"s3://{workspace_bucket_name}/{s3_prefix}/runs/"
role = sagemaker.get_execution_role()
session = sagemaker.session.Session(default_bucket=workspace_bucket_name)
region = session._region_name
2. Create and prepare dataset
Refer this blog, code for detail steps around data processing
from utils import data_utils
# load and save train dataset
train_dataset = data_utils.load_and_process(
dataset_id=dataset_id,
split="train[:60%]"
)
print(f"train_dataset: {train_dataset}")
train_dataset.save_to_disk(s3_train_dataset_path)
print(f"s3_train_dataset_path: {s3_train_dataset_path}")
# load and save test dataset
test_dataset = data_utils.load_and_process(
dataset_id=dataset_id,
split="test"
)
print(f"test_dataset: {test_dataset}")
test_dataset.save_to_disk(s3_test_dataset_path)
print(f"s3_test_dataset_path: {s3_test_dataset_path}")
3. Finetune model
a. Set Arguments. You can adjust the hyperparameters
below and change it according to your needs for performance and optimization. I used common values. Check the exhaustive list of args for TrainingArguments here.
hyperparameters = {
### training related
"dataset_path": "/opt/ml/input/data",
"sm_save_model_dir": "/opt/ml/model",
"output_dir": "/tmp",
"logging_dir": "/tmp/logs",
"model_id": "mistral-community/Codestral-22B-v0.1",
"num_train_epochs": 1,
"max_steps": -1,
"per_device_train_batch_size": 1,
"per_device_eval_batch_size": 1,
"gradient_accumulation_steps": 1,
"gradient_checkpointing": True,
"gradient_checkpointing_kwargs": {
"use_reentrant": False,
},
"bf16": True,
"tf32": True,
"max_grad_norm": 0.3,
"weight_decay": 0.001,
"optim": "adamw_torch",
"learning_rate": 0.0002,
"warmup_ratio": 0.03,
"lr_scheduler_type": "constant",
"save_strategy": "no",
"logging_steps": 25,
"logging_strategy": "steps",
"group_by_length": True,
"max_seq_length": 4096,
"packing": False,
"finetune_with_sm": True,
"merge_weights_and_save": True,
"save_tokenizer": True,
"attn_implementation": "sdpa",
### qlora related
"lora_r": 64,
"lora_alpha": 16,
"lora_dropout": 0.1,
"task_type": "CAUSAL_LM",
### bitsandbytes related
"load_in_4bit": True,
"bnb_4bit_use_double_quant": True,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_compute_dtype": "bfloat16",
"bnb_4bit_quant_storage": "bfloat16",
}
print('Hyperparameters: \n', json.dumps(hyperparameters, indent=2, default=str))
b. Define estimator.
Estimator
is a high level interface for SageMaker Training. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. The Estimator manages the infrastructure use. Learn about Estimators here
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
source_dir = "./scripts",
entry_point = "sft_fsdp_qlora.py",
base_job_name = base_job_name,
role = role,
sagemaker_session = session,
framework_version = "2.3.0",
py_version = "py311",
instance_count = 1,
instance_type = "ml.p4d.24xlarge", # gpus=8
volume_size = 300,
max_run = 1*24*60*60, # days * hours * minutes * seconds
hyperparameters = hyperparameters,
disable_profiler = True,
keep_alive_period_in_seconds = 1800,
debugger_hook_config = False,
distribution = {"torch_distributed": {"enabled": True}}, # enable torchrun
environment = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},
disable_output_compression = True,
output_path = s3_save_model_dir,
)
data = {
'train': s3_train_dataset_path,
'test' : s3_test_dataset_path,
}
print(f"training_image_uri: {estimator.training_image_uri()}")
print(f"data: {json.dumps(data, indent=2, default=str)}")
c. Begin training!
.fit()
method will launch a training job that:
- automatically spins up compute resources,
- takes care of starting and managing all the required ec2 instances for us,
- downloads the image container chose,
- uploads the provided scripts (everything inside
source_dir
: like .py files, requirements.txt etc), - downloads the dependencies mentioned in
requirements.txt
if provided insource_dir
, - downloads the data from s3 bucket (train and test data mentioned in
data
) into the container at/opt/ml/input/data
, - executes the model training steps, and
- shuts-down the resources automatically when the job is complete.
estimator.fit(data, wait=True)
With this finetuning is completed! Now, lets deploy our model to an endpoint 🚀
4. Deploy endpoint and run inference
We will use Sagemaker Python SDK
and LMI containers
for deploying the model into Sagemaker for a fully managed HTTPS endpoint in a single command.
LMI containers
are a set of high-performance Docker Containers purpose built for large language model (LLM) inference. With these containers, you can leverage high performance open-source inference libraries like vLLM
, TensorRT-LLM
, Transformers NeuronX
to deploy LLMs on AWS SageMaker Endpoints
. These containers bundle together a model server with open-source inference libraries to deliver an all-in-one LLM serving solution. LMI containers provide many features to maximize performance. To name a few: quantization (AWQ
, GPTQ
, SmoothQuant
), token streaming, serving LoRA finetuned models. Learn more about LMI here and components of LMI here.
There are 2 ways to deploy model on Sagemaker. Depending on which configuration format you are using (serving.properties
file or environment variables), the steps are slightly different. Find more details here.
We will need the following to deploy your model with LMI on SageMaker:
- Model Artifacts (either
HuggingFace Hub Model Id
, orS3 URI pointing to model artifacts
) - Instance Type
- Container URI
- Configuration File or Environment Variables
Below is the serving.properties file for vLLM
inference backend to start with. Refer advanced vLLM
configurations here for other options.
%%writefile djl_inference/serving.properties
engine=Python
option.model_id={{s3url}}
option.rolling_batch=vllm
option.dtype=bf16
option.tensor_parallel_degree=4
option.max_rolling_batch_size=1
option.model_loading_timeout=1800
Check github for the rest of code.
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(
initial_instance_count=1,
instance_type=instance_type,
endpoint_name=endpoint_name,
container_startup_health_check_timeout=1800,
# volume_size=300, # uncomment if using other than g5
endpoint_logging=True,
)
Run a prediction with inference configuration:
predictor = sagemaker.Predictor(
endpoint_name=endpoint_name,
sagemaker_session=session,
serializer=serializers.JSONSerializer(),
deserializer=deserializers.JSONDeserializer(),
)
prompt = "<write-your-prompt-here>"
res = predictor.predict(
{
"inputs": prompt,
"parameters": {
"max_new_tokens": 2048,
"do_sample":"true",
}
}
)
print(res["generated_text"])
With this we have finetuned Codestral and ran inference! 🤗
Dive deep and ignite your curiosity
- Rethinking PyTorch Fully Sharded Data Parallel (FSDP) from First Principles — link
- Pytorch FSDP — getting_started, advanced
- Pytorch Distributed Training — docs
- Training models on Amazon Sagemaker — docs, huggingface
- Deploy models to Amazon Sagemaker — docs, lmi, huggingface
- sagemaker-distributed-training-workshop
- llmops-workshop