Finetune and Deploy Codestral-22B with Amazon SageMaker

Aastha Varma
7 min read · Jun 18, 2024


Part 2: Finetune and Evaluate code-generation LLMs

Photograph by Clane Gessel, National Geographic

In this blog, we will learn:

a) how to fine-tune open LLMs (Codestral, a 22B-parameter model from Mistral) with FSDP and QLoRA using Amazon SageMaker, and

b) how to deploy the finetuned model with the vLLM backend using LMI containers on a SageMaker endpoint.

This post continues my previous blog, Finetuning Codestral-22B with QLoRA locally, modified to run on SageMaker.

View Code

FSDP Simplified

There are multiple approaches to training or finetuning a large model at scale, e.g. distributed data parallelism, pipeline model parallelism, mixture of experts, etc. FSDP is one of them, developed by Facebook AI Research (FAIR) Engineering (now Meta AI). Refer to this blog to learn more!

Let’s understand what FSDP is and how it works!

FSDP is short for Fully Sharded Data Parallel. It is a data-parallel training approach that shards parameters + optimizer state + gradients uniformly across data-parallel workers (GPUs within DDP ranks) and optionally offloads part of the training computation to the CPUs. With this, you can use bigger models and larger batch sizes, and improve training speed at scale. One thing to note: although the parameters are sharded across DDP workers (i.e. different GPUs), the computation (of activations) for each micro-batch of data is still local to each GPU worker.

With the data-parallelism approach, every GPU holds a full copy of the model, which is redundant, while model-parallel training requires additional communication to move activations between workers (GPUs), which is expensive. FSDP combines the best of both worlds: data parallelism and model parallelism. How? It improves memory efficiency by sharding model parameters, gradients, and optimizer states across GPUs, and improves computational efficiency by decomposing the communication and overlapping it with both the forward and backward passes.
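To see why this matters here, a rough back-of-envelope: with full finetuning in mixed precision, weights, gradients and AdamW optimizer states add up to roughly 12–18 bytes per parameter, i.e. on the order of 270–400 GB for a 22B-parameter model like Codestral, before counting activations. That does not fit on a single 40 GB A100 and is tight even across 8 of them (320 GB in aggregate), which is why this post shards the training state with FSDP and quantizes the base weights with QLoRA.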

Check out the image below for a quick review of PyTorch distributed operations: Scatter, Gather, Reduce, All-Reduce, Broadcast, All-Gather.

Source Credits

One way to view FSDP’s sharding is to decompose the DDP gradient all-reduce into reduce-scatter and all-gather. Specifically, during the backward pass, FSDP reduces and scatters gradients, ensuring that each rank possesses a shard of the gradients. Then it updates the corresponding shard of the parameters in the optimizer step. Finally, in the subsequent forward pass, it performs an all-gather operation to collect and combine the updated parameter shards.

Source Credits
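To make those two collectives concrete, here is a tiny, hedged sketch using torch.distributed directly. It is illustrative only, not code from this post's repo, and assumes the script is launched with torchrun so the process-group environment variables exist.

import os
import torch
import torch.distributed as dist

# e.g. torchrun --nproc_per_node=8 collectives_demo.py
dist.init_process_group(backend="nccl")
world = dist.get_world_size()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

# Backward-pass analogue: reduce-scatter sums the full-size gradient across
# ranks and leaves each rank holding only its 1/world shard of the result.
full_grad = torch.randn(world * 4, device=device)
grad_shard = torch.empty(4, device=device)
dist.reduce_scatter_tensor(grad_shard, full_grad)

# Forward-pass analogue: all-gather rebuilds the full parameter tensor from
# the per-rank shards right before it is needed for computation.
param_shard = torch.randn(4, device=device)
full_param = torch.empty(world * 4, device=device)
dist.all_gather_into_tensor(full_param, param_shard)

dist.destroy_process_group()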

In standard data parallel training methods, a copy of the model is present on each GPU and a sequence of forward and backward passes are evaluated on only a shard of the data. After these local computations, the parameters and optimizers for each local process are shared with the other GPUs in order to calculate the global weight update.

In FSDP, only a shard of the model is present on a GPU. Then, locally, all weights are gathered from the other GPUs — by means of an all-gather step — to calculate the forward pass. This gathering of weights is then performed again before the backward pass. After that backward pass, the local gradients are averaged and sharded across the GPUs by means of a reduce-scatter step, which allows each GPU to update its local weight shard.
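To make those mechanics concrete, here is a minimal, hedged sketch of wrapping a model with PyTorch's FSDP API. It is illustrative only, not the repo's sft_fsdp_qlora.py; the toy model, wrap policy and dtypes are assumptions.

import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Assumes launch via torchrun so the process-group env vars are already set.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in model; in the blog this would be the (quantized) Codestral model.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])

# Shard parameters, gradients and optimizer state across all ranks; modules
# above the size threshold become their own FSDP units.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    device_id=local_rank,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# The loop itself looks like plain DDP; FSDP inserts the all-gather before
# forward/backward and the reduce-scatter after backward for each wrapped unit.
x = torch.randn(4, 1024, device=f"cuda:{local_rank}")
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
dist.destroy_process_group()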

The figure below illustrates standard DDP training (top) and FSDP training (bottom):

Source Credits

Outline

  1. Setup development environment
  2. Create and prepare the dataset
  3. Finetune model
  4. Deploy endpoint and run inference

Note: This blog was created and validated on a single node with 8 NVIDIA A100 GPUs, each with 40 GB of memory. If you have access to different compute, you can adjust the configurations to optimize GPU usage.

Let’s get started!

1. Setup development environment

Let’s install all the required libraries. We will be using boto3, sagemaker, huggingface, datasets, and plotly.

%pip install --upgrade --quiet boto3 sagemaker huggingface datasets plotly

Initialize variables

import json, boto3, sagemaker
dataset_id = 'deepmind/code_contests'
model_id = "mistral-community/Codestral-22B-v0.1"
base_job_name = "fsdp-codestral"

workspace_bucket_name = "research-agi"
s3_prefix = "mistral-community-codestral-22b-v0x1"
s3_train_dataset_path = f"s3://{workspace_bucket_name}/{s3_prefix}/train"
s3_test_dataset_path = f"s3://{workspace_bucket_name}/{s3_prefix}/test"
s3_save_model_dir = f"s3://{workspace_bucket_name}/{s3_prefix}/runs/"

role = sagemaker.get_execution_role()
session = sagemaker.session.Session(default_bucket=workspace_bucket_name)
region = session.boto_region_name  # public accessor for the session's region

2. Create and prepare dataset

Refer to this blog and the code for detailed steps on data processing.

from utils import data_utils

# load and save train dataset
train_dataset = data_utils.load_and_process(
    dataset_id=dataset_id,
    split="train[:60%]",
)
print(f"train_dataset: {train_dataset}")
train_dataset.save_to_disk(s3_train_dataset_path)
print(f"s3_train_dataset_path: {s3_train_dataset_path}")

# load and save test dataset
test_dataset = data_utils.load_and_process(
    dataset_id=dataset_id,
    split="test",
)
print(f"test_dataset: {test_dataset}")
test_dataset.save_to_disk(s3_test_dataset_path)
print(f"s3_test_dataset_path: {s3_test_dataset_path}")

3. Finetune model

a. Set arguments. You can adjust the hyperparameters below according to your needs for performance and optimization; I used common values. Check the exhaustive list of TrainingArguments here.

hyperparameters = {
    ### training related
    "dataset_path": "/opt/ml/input/data",
    "sm_save_model_dir": "/opt/ml/model",
    "output_dir": "/tmp",
    "logging_dir": "/tmp/logs",

    "model_id": "mistral-community/Codestral-22B-v0.1",
    "num_train_epochs": 1,
    "max_steps": -1,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {
        "use_reentrant": False,
    },
    "bf16": True,
    "tf32": True,
    "max_grad_norm": 0.3,
    "weight_decay": 0.001,
    "optim": "adamw_torch",
    "learning_rate": 0.0002,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "constant",
    "save_strategy": "no",
    "logging_steps": 25,
    "logging_strategy": "steps",
    "group_by_length": True,
    "max_seq_length": 4096,
    "packing": False,
    "finetune_with_sm": True,
    "merge_weights_and_save": True,
    "save_tokenizer": True,
    "attn_implementation": "sdpa",

    ### qlora related
    "lora_r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "task_type": "CAUSAL_LM",

    ### bitsandbytes related
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "bfloat16",
}

print('Hyperparameters: \n', json.dumps(hyperparameters, indent=2, default=str))
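For context, inside the training script the qlora and bitsandbytes flags above typically get wired into the usual Hugging Face objects. Below is a hedged sketch of that wiring, not the repo's exact sft_fsdp_qlora.py; target_modules in particular is an assumption.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # keeps 4-bit storage dtype FSDP-friendly
)

model = AutoModelForCausalLM.from_pretrained(
    "mistral-community/Codestral-22B-v0.1",
    quantization_config=bnb_config,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # assumption; the script may target specific projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()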

b. Define estimator.

An Estimator is a high-level interface for SageMaker Training. It handles end-to-end Amazon SageMaker training and deployment tasks and manages the underlying infrastructure for you. Learn about Estimators here.

Simple training workflow in SageMaker. Source Credits

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    source_dir="./scripts",
    entry_point="sft_fsdp_qlora.py",
    base_job_name=base_job_name,
    role=role,
    sagemaker_session=session,
    framework_version="2.3.0",
    py_version="py311",
    instance_count=1,
    instance_type="ml.p4d.24xlarge",  # 8 GPUs
    volume_size=300,
    max_run=1*24*60*60,  # days * hours * minutes * seconds
    hyperparameters=hyperparameters,
    disable_profiler=True,
    keep_alive_period_in_seconds=1800,
    debugger_hook_config=False,
    distribution={"torch_distributed": {"enabled": True}},  # enable torchrun
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},
    disable_output_compression=True,
    output_path=s3_save_model_dir,
)

data = {
    "train": s3_train_dataset_path,
    "test": s3_test_dataset_path,
}

print(f"training_image_uri: {estimator.training_image_uri()}")
print(f"data: {json.dumps(data, indent=2, default=str)}")

c. Begin training!

The .fit() method launches a training job that:

  • automatically spins up compute resources,
  • takes care of starting and managing all the required EC2 instances for us,
  • pulls the container image we chose,
  • uploads the provided scripts (everything inside source_dir: .py files, requirements.txt, etc.),
  • installs the dependencies listed in requirements.txt, if provided in source_dir,
  • downloads the data from the S3 bucket (the train and test paths in data) into the container at /opt/ml/input/data (see the sketch after this list),
  • executes the model training steps, and
  • shuts down the resources automatically when the job is complete.

estimator.fit(data, wait=True)
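Inside the container, each key of the data dict becomes an input channel mounted under /opt/ml/input/data/<name> (matching the dataset_path hyperparameter above) and is also exposed via an SM_CHANNEL_<NAME> environment variable, so the training script can read the datasets back roughly like this (a hedged sketch, not the repo's exact code):

import os
from datasets import load_from_disk

# SageMaker downloads each channel before the entry point runs and exposes its
# local path as SM_CHANNEL_<NAME>; the defaults below match the paths above.
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
test_dir = os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test")

train_dataset = load_from_disk(train_dir)
test_dataset = load_from_disk(test_dir)
print(train_dataset, test_dataset)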

With this, finetuning is complete! Now, let’s deploy our model to an endpoint 🚀

4. Deploy endpoint and run inference

We will use the SageMaker Python SDK and LMI containers to deploy the model to a fully managed HTTPS endpoint on SageMaker in a single command.

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can leverage high-performance open-source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. These containers bundle a model server with open-source inference libraries to deliver an all-in-one LLM serving solution. LMI containers provide many features to maximize performance, to name a few: quantization (AWQ, GPTQ, SmoothQuant), token streaming, and serving LoRA-finetuned models. Learn more about LMI here and the components of LMI here.

There are two ways to configure a deployment with LMI on SageMaker: a serving.properties file or environment variables. The steps differ slightly depending on which format you use; we use serving.properties here. Find more details here.

We will need the following to deploy the model with LMI on SageMaker:

  • Model Artifacts (either HuggingFace Hub Model Id, or S3 URI pointing to model artifacts)
  • Instance Type
  • Container URI
  • Configuration File or Environment Variables

Below is a starter serving.properties file for the vLLM inference backend. Refer to the advanced vLLM configurations here for other options.

%%writefile djl_inference/serving.properties
engine=Python
option.model_id={{s3url}}
option.rolling_batch=vllm
option.dtype=bf16
option.tensor_parallel_degree=4
option.max_rolling_batch_size=1
option.model_loading_timeout=1800
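If you prefer the environment-variable format mentioned earlier, each option.* entry maps to an OPTION_* variable on the container. Below is a hedged sketch of the equivalent configuration; the variable names follow the documented convention, but verify them against the LMI docs for your container version.

# Rough environment-variable equivalent of the serving.properties above.
lmi_env = {
    "HF_MODEL_ID": "<s3 uri of the finetuned model, or a Hub model id>",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_DTYPE": "bf16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "4",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "1",
    "OPTION_MODEL_LOADING_TIMEOUT": "1800",
}
# These would be passed via the `env` argument of sagemaker.model.Model,
# in which case no serving.properties tarball is needed.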

Check GitHub for the rest of the code.
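The pieces elided there (the image_uri, code_artifact, instance_type and endpoint_name used in the next snippet) roughly amount to filling the {{s3url}} placeholder, tarring the djl_inference/ folder, uploading it to S3, and picking an LMI container image. Here is a hedged sketch; the framework/version strings and the instance choice are assumptions, so check the repo and the published LMI image list for the exact values.

import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.s3 import S3Uploader

# Replace the {{s3url}} placeholder in serving.properties with the S3 path of
# the finetuned model (the estimator's output_path) before packaging it.

# LMI (DJL) container image; framework/version are assumptions -- check the
# published LMI container list for the tag available in your region.
image_uri = image_uris.retrieve(framework="djl-lmi", region=region, version="0.28.0")

# Package the folder holding serving.properties; this becomes model_data.
!tar czvf model.tar.gz -C djl_inference .
code_artifact = S3Uploader.upload(
    "model.tar.gz",
    f"s3://{workspace_bucket_name}/{s3_prefix}/inference-code",
    sagemaker_session=session,
)

instance_type = "ml.g5.12xlarge"  # assumption: 4 GPUs, matching tensor_parallel_degree=4
endpoint_name = sagemaker.utils.name_from_base("codestral-vllm")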

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800,
    # volume_size=300,  # uncomment if using an instance family other than g5
    endpoint_logging=True,
)

Run a prediction with inference configuration:

from sagemaker import serializers, deserializers

predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

prompt = "<write-your-prompt-here>"

res = predictor.predict(
    {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 2048,
            "do_sample": True,
        },
    }
)
print(res["generated_text"])

With this, we have finetuned Codestral and run inference! 🤗

Dive deep and ignite your curiosity

  1. Rethinking PyTorch Fully Sharded Data Parallel (FSDP) from First Principles — link
  2. PyTorch FSDP — getting_started, advanced
  3. PyTorch Distributed Training — docs
  4. Training models on Amazon SageMaker — docs, huggingface
  5. Deploy models to Amazon SageMaker — docs, lmi, huggingface
  6. sagemaker-distributed-training-workshop
  7. llmops-workshop
