Finetuning Codestral-22B with QLoRA locally

Aastha Varma
7 min read · Jun 4, 2024


Part 1: Finetune and Evaluate code-generation LLMs


In this blog, we will learn how to fine-tune large language models with 4-bit quantization locally on a multi-GPU instance (check out this blog dedicated to doing the same with Amazon SageMaker). Today we are finetuning Codestral (check out the announcement from Mistral) to improve performance on a code-generation task. We will be leveraging libraries like PyTorch, Hugging Face Transformers, PEFT, and bitsandbytes.

Breaking down the buzzwords

  • Codestral: a 22B-parameter model with a 32k context length. It's the new open-weight, code-generation model from Mistral, trained on 80+ programming languages. It can perform code-generation tasks like fill-in-the-middle, writing tests, and code completion. The Instruct version of the model supports tool use. Their blog suggests Codestral outperforms all other models on RepoBench, a long-range eval for code generation. It's licensed under the new Mistral AI Non-Production License, which means you can use it for research and testing purposes.
Setting the Bar for Code Generation Performance. Source: Mistral Blog
  • Finetuning: a process where a pre-trained model is further trained on a custom dataset to adapt it to a domain or task and increase its performance, for instance a supervised finetuned model specialized in the healthcare domain. Learn more here.
  • LoRA: short for Low-Rank Adaptation of large language models. This is a fine-tuning technique that introduces trainable rank-decomposition matrices A and B, called low-rank adapters, into each layer of the transformer architecture while keeping the pre-trained weights W frozen, thereby reducing the total trainable parameters for the downstream task. Instead of fine-tuning W, all updates happen in these low-rank matrices (see the sketch after this list). Dive deep into the practical aspects here. Read more in the original LoRA paper.
  • Quantization: a model compression technique (noisy, and it can lead to information loss) that converts the parameters or weights of an LLM from a high-precision data representation to a lower-precision one. It's essentially "rounding" from one data type to another, for example from FP32 (32-bit) to INT8 (8-bit) or NF4 (4-bit).
LLM.int8(): memory-efficient matrix multiplication computation steps. Source Credits
  • QLoRA: Quantized LoRA. It's a finetuning technique that uses 4-bit quantization to compress the pretrained LLM's weights. During finetuning, it backpropagates gradients through the frozen 4-bit quantized LLM into the low-rank adapters.
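To make the low-rank update concrete, here is a tiny illustrative sketch (with made-up dimensions, not Codestral's, and not the author's code) of how the adapters modify a frozen linear layer:

import torch

# Illustrative LoRA forward pass: y = x @ W^T + (alpha / r) * x @ A^T @ B^T
# The frozen weight W stays fixed; only the low-rank adapters A and B are trained.
d_in, d_out, r, alpha = 1024, 1024, 8, 16

W = torch.randn(d_out, d_in)                 # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01              # trainable adapter A (r x d_in)
B = torch.zeros(d_out, r)                    # trainable adapter B (d_out x r), initialized to zero

x = torch.randn(4, d_in)                     # a batch of activations
y = x @ W.T + (alpha / r) * (x @ A.T) @ B.T  # base output + scaled low-rank update

# LoRA trains r*(d_in + d_out) parameters instead of d_in*d_out.
print(A.numel() + B.numel(), "trainable vs", W.numel(), "frozen")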

Outline

  1. Setup development environment
  2. Create and prepare the dataset
  3. Finetune LLM
  4. Run inference

Note: This blog was created and validated on an NVIDIA A10G instance with 4 GPUs, each with 24 GB of memory. If you have access to more compute, you can change the configurations and optimize GPU usage.
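To check what your own machine offers before changing any configuration, a quick optional PyTorch snippet prints the visible GPUs and their memory:

import torch

# List visible GPUs and their total memory
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")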


1. Setup development environment

Let's install all the required libraries. We will be using PyTorch and Hugging Face libraries: transformers, accelerate, peft, bitsandbytes, and trl (for supervised finetuning).

%pip install --quiet \
"torch==2.3.0" \
tensorboard

%pip install --upgrade --quiet \
"transformers==4.41.2" \
"accelerate==0.30.1" \
"datasets==2.19.1" \
"peft==0.11.1" \
"bitsandbytes==0.43.1" \
"trl==0.8.6" \
"evaluate==0.4.2" \
huggingface_hub

Next, log in to Hugging Face in order to access the model and dataset. If you don't have an HF account, you can create one here.

from huggingface_hub import login

login(
    token="",  # add your HF token here
    add_to_git_credential=True
)

2. Create and prepare the dataset

The task at hand is to solve tough (rated 2000+) competitive Python coding problems from Codeforces. The problem statement can be reframed as a sequence-to-sequence translation task: given a problem description X in natural language, produce a corresponding solution Y in a programming language. [Source]

The dataset we have selected is deepmind/code_contests. Find further details on its Hugging Face datasets page.

Sample data instance:

Broadly, these are the 4 steps to prepare the dataset for finetuning:

  • Download the dataset from the Hub: The dataset is split into train (13328), test (165), and valid (117) samples. We will downsample to only 1% of the train split to complete training faster.
from datasets import load_dataset
from pprint import pprint

dataset = load_dataset("deepmind/code_contests", split="train[:1%]")
# dataset = load_dataset(dataset_id, split="test")  # uncomment to run eval inference
print(f"len(dataset): {len(dataset)}")
pprint(dataset.features)

features = [
    'name', 'description', 'public_tests', 'private_tests',
    'generated_tests', 'source', 'difficulty',
    'solutions', 'incorrect_solutions', 'cf_contest_id',
    'cf_index', 'cf_points', 'cf_rating', 'cf_tags',
    'is_description_translated', 'untranslated_description',
    'time_limit', 'memory_limit_bytes', 'input_file', 'output_file'
]
  • Apply filters: select instances with a 2000+ Codeforces rating that contain Python-language solutions.
import pandas as pd

def count_python_solutions(sample):
    df = pd.DataFrame(sample["solutions"])
    df_python = df[(df.language == 3) | (df.language == 1)]  # 3 = Python 3, 1 = Python 2
    return df_python.shape[0]

# keep instances with a 2000+ rating that contain Python-language solutions
dataset = dataset.filter(lambda sample: (sample["cf_rating"] >= 2000) & (count_python_solutions(sample) >= 1))
print(f"len(dataset): {len(dataset)}")
  • Augment the dataset: for a given dataset, if every sample consists of {1 problem description, n solutions}, this function will convert it into n instances of {1 problem description, 1 solution}.
import pandas as pd
from datasets import Dataset

def augment_dataset(dataset):
    df = dataset.to_pandas()
    aug_rows = []
    for i, item in df.iterrows():
        for j, soln in enumerate(item["solutions"]["solution"]):
            language = item["solutions"]["language"][j]
            if language == 3 or language == 1:  # Python 3 or Python 2
                item_new = item.copy(deep=True)
                item_new["python_solution"] = soln
                item_new.drop('solutions', inplace=True)
                aug_rows.append(item_new)
    aug_df = pd.DataFrame(aug_rows)
    aug_ds = Dataset.from_pandas(aug_df)
    return aug_ds

# augment dataset: 1 x {1 problem + n solutions} -> n x {1 problem + 1 solution}
dataset = augment_dataset(dataset)
print(f"len(dataset): {len(dataset)}")
  • Apply the instruct prompt template: SFTTrainer allows passing the dataset directly without any pre-processing. The available formats are conversational and instruction; we will go with the instruction format:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
def format_dataset(sample):
    # check the GitHub code for details
    ...
    ...
    sample["prompt"] = prompt
    sample["completion"] = completion
    return sample

dataset = dataset.map(format_dataset, remove_columns=list(dataset.features), batched=False)

3. Finetune LLM

a. Initialize parameters: define all the parameters related to QLoRA, bitsandbytes, and training (an illustrative sketch of these values follows).
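The snippets below reference names like model_id, r, lora_alpha, and output_dir. The exact values live in the accompanying GitHub code; the sketch below is only one plausible, assumed set of values (including hypothetical paths) so the later cells read as self-contained:

from peft import TaskType

# --- model and 4-bit quantization (illustrative values; check the GitHub code for the real ones) ---
model_id = "mistralai/Codestral-22B-v0.1"    # assumed HF model id
load_in_4bit = True
bnb_4bit_quant_type = "nf4"                  # NF4 quantization
bnb_4bit_use_double_quant = True             # nested quantization, saves ~0.4 bits per parameter
bnb_4bit_compute_dtype = "bfloat16"          # 16-bit compute dtype for faster matmuls
gradient_checkpointing = True

# --- LoRA (illustrative) ---
r = 8
lora_alpha = 16
lora_dropout = 0.05
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
task_type = TaskType.CAUSAL_LM

# --- training (hypothetical paths and hyperparameters) ---
output_dir = "/home/ubuntu/sft_cache/adapter/"
save_model_dir = "/home/ubuntu/sft_cache/model/"  # reused as model_local_path at inference time
num_train_epochs = 1
max_steps = -1                     # -1 lets num_train_epochs drive the schedule
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
fp16 = False
bf16 = True
max_grad_norm = 0.3
weight_decay = 0.001
optim = "paged_adamw_32bit"
learning_rate = 2e-4
warmup_ratio = 0.03
lr_scheduler_type = "cosine"
save_strategy = "epoch"
logging_steps = 10
logging_strategy = "steps"
group_by_length = False
max_seq_length = 2048
packing = False                    # completion-only collation requires packing=False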

b. Instantiate tokenizer and model: There are different variants of 4-bit quantization, such as NF4 (normalized float 4, the default) or pure FP4 quantization. Other options include bnb_4bit_use_double_quant, which applies a second quantization after the first one to save an additional 0.4 bits per parameter. While 4-bit bitsandbytes stores weights in 4 bits, the computation still happens in 16 or 32 bits, and any combination can be chosen here (float16, bfloat16, float32, etc.). Matrix multiplication and training will be faster with a 16-bit compute dtype (the default is torch.float32). A rule of thumb: use double quantization if you have memory problems, use NF4 for higher precision, and use a 16-bit compute dtype for faster finetuning. [Source]

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

# define tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

# define 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
    bnb_4bit_compute_dtype=getattr(torch, bnb_4bit_compute_dtype),
)

# define model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_cache=False if gradient_checkpointing else True,
    quantization_config=bnb_config,
    device_map="auto"
)
model.config.use_cache = False
model.config.pretraining_tp = 1  # 1 disables tensor-parallel slicing of linear layers (exact computation)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

c. Define LoraConfig: the configuration where you set LoRA-specific parameters:

  • r: the rank of the update matrices, expressed as an int. A lower rank yields smaller update matrices with fewer trainable parameters.
  • lora_alpha: the LoRA scaling factor.
  • lora_dropout: the dropout probability of the LoRA layers.
  • target_modules: the modules (for example, the attention blocks) to apply the LoRA update matrices to.
  • bias: specifies whether the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.
  • task_type: used by the superclass PeftConfig; for causal language modeling this is TaskType.CAUSAL_LM.
from peft import LoraConfig

lora_config = LoraConfig(
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=target_modules,
    bias="none",
    task_type=task_type,
)
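If you are unsure which names to pass as target_modules, one optional way is to list the linear-layer names of the loaded model; Mistral-family models typically expose q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj and down_proj (treat the list you get back as the source of truth, not this assumption):

# List candidate LoRA target modules by collecting the model's linear-layer names.
# bitsandbytes replaces nn.Linear with Linear4bit, which this check also matches.
linear_names = sorted({name.split(".")[-1] for name, module in model.named_modules()
                       if "Linear" in type(module).__name__})
print(linear_names)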

d. Define Training Args, Collator, Trainer

from transformers import TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# set training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    max_steps=max_steps,  # the total number of training steps to perform
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    weight_decay=weight_decay,
    optim=optim,
    learning_rate=learning_rate,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    save_strategy=save_strategy,
    logging_steps=logging_steps,
    logging_strategy=logging_strategy,
    group_by_length=group_by_length,
)

# for more info, see "Train on completions only": https://huggingface.co/docs/trl/en/sft_trainer
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"{example['prompt'][i]}\n\n ### Answer: {example['completion'][i]}"
        output_texts.append(text)
    return output_texts

# initialize data collator
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:",
    tokenizer=tokenizer
)

# initialize sft trainer
trainer = SFTTrainer(
    args=training_arguments,
    model=model,
    peft_config=lora_config,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=max_seq_length,
    packing=packing
)
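Before calling train(), two optional sanity checks are worth a look: confirm that only the LoRA adapters are trainable (trainer.model is the PEFT-wrapped model once peft_config is passed), and peek at one collated batch to verify that prompt and padding tokens carry the ignore label -100, so the loss is computed on the answers only:

# 1) Only the low-rank adapters should be trainable.
trainer.model.print_trainable_parameters()

# 2) Prompt and padding tokens should be label-masked with -100.
batch = next(iter(trainer.get_train_dataloader()))
masked = (batch["labels"] == -100).sum().item()
print(f"label-masked tokens: {masked} / {batch['labels'].numel()}")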

e. Train and save adapter weights

trainer.train()

# save the trained adapter weights
trainer.model.save_pretrained(output_dir, safe_serialization=False)

# clear memory
del model
del trainer
torch.cuda.empty_cache()

f. Merge adapter weights and base model

merge_and_unload(). Source Credits
from peft import AutoPeftModelForCausalLM

# load PEFT model in fp16
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
)
print(model)

# merge
merged_model = model.merge_and_unload()
print(merged_model)

# save merged model
merged_model.save_pretrained(save_model_dir, safe_serialization=True, max_shard_size="2GB")

# save tokenizer for easy inference
tokenizer.save_pretrained(save_model_dir)

# clear memory
del model
del merged_model
del tokenizer

torch.cuda.empty_cache()

4. Run inference

# uncomment the test split in section "Create and prepare the dataset" and re-run all of its cells
import gc, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.cuda.empty_cache()
gc.collect()

model_local_path = "/home/ubuntu/sft_cache/model/"
print(f"model_local_path: {model_local_path}")

tokenizer = AutoTokenizer.from_pretrained(
    model_local_path, trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token

sft_model = AutoModelForCausalLM.from_pretrained(
    model_local_path,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
eval_sample = dataset[6]
eval_prompt, eval_completion = eval_sample["prompt"], eval_sample["completion"]

model_inputs = tokenizer([eval_prompt], return_tensors="pt").to("cuda")
sft_model.eval()
with torch.no_grad():
    generated_ids = sft_model.generate(
        **model_inputs, max_new_tokens=1000, do_sample=True
    )
results = tokenizer.batch_decode(generated_ids)[0]
print(results)
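The decoded output above contains the prompt as well; to look at only the newly generated solution (and eyeball it against the reference), you can optionally slice off the prompt tokens:

# Decode only the newly generated tokens (everything after the prompt).
prompt_len = model_inputs["input_ids"].shape[1]
generated_only = tokenizer.decode(generated_ids[0][prompt_len:], skip_special_tokens=True)
print(generated_only)

# Reference solution from the dataset, for a quick side-by-side comparison.
print(eval_completion)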

With this, we have finetuned Codestral and run inference!

Github Code

Further explore and spark your curiosity

  1. Powerful idea
  2. AlphaCodium: Paper, Github, Blog, Video
  3. Langgraph Code Assistant Mistral: Youtube, Code
  4. Langgraph for code generation: Blog
  5. LLM.int8(): paper, blogpost, blogpost
