Mistral Mastery: Fine-Tuning & Fast Inference Guide

Parikshit Saikia
8 min read · Oct 29, 2023
I call this “Mistral Pixlipse”, generated on my potato laptop.

Mistral-7B-Inst is a game-changing LLM developed by Mistral AI that outperforms many popular LLMs. It is released under the Apache 2.0 license, which makes it suitable for use in a commercial setting.

In this article, we will explore how to fine-tune Mistral-7B-Inst on a custom dataset, focusing on efficient use of compute resources and fast inference to make it production-ready.

This article has two major segments:

  • Fine-Tuning Mistral-7B-Inst with QLoRA
  • Superfast Inference with vLLM

Fine-Tuning Mistral-7B-Inst with QLoRA

Dataset Preparation

I have always had this question: how many samples do you need in your custom dataset for fine-tuning?

It mostly depends on the model, task, data quality, etc., but roughly 5k samples is a good number.

Prompt selection is a really underrated task in fine-tuning. Essentially, you do prompt engineering and select the prompt that aligns most closely with your dataset outputs.
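For illustration, candidate templates for a summarization task might look something like the following (these are hypothetical examples, not the exact templates used in this article):

# Hypothetical candidate templates for a summarization task (illustration only)
TEMPLATE1 = "[INST]Summarize the following text in simple terms.[/INST]\nText:\n{}\nSummary: "
TEMPLATE2 = "Context:\n{}\n\n[INST]Write a short, plain-language summary.[/INST]\nOutput: "
TEMPLATE3 = "You are a helpful assistant.\n[INST]Summarize this passage.[/INST]\n{}\nSummary: "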

# Assuming TEMPLATE1, TEMPLATE2, TEMPLATE3, test_input and test_output are defined elsewhere
from transformers import AutoModelForCausalLM, AutoTokenizer

test_prompts = [
    TEMPLATE1.format(test_input),
    TEMPLATE2.format(test_input),
    TEMPLATE3.format(test_input),
]

# Load your baseline model and tokenizer
base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Loop through each prompt to generate output and compare with the expected output
for prompt in test_prompts:
    encoded_prompt = tokenizer(prompt, return_tensors="pt").to("cuda")
    generated_ids = baseline.generate(
        **encoded_prompt,
        max_new_tokens=300,  # set according to the length of your test_output
        do_sample=False,
    )

    decoded_output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

    # Output results for comparison
    print(f"Generated Output: {decoded_output}\n")
    print(f"Expected Output: {test_output}\n")
    print("-" * 75)

In baseline.generate(), make sure to set do_sample=False; it makes the generated output deterministic, which makes comparing prompts easier.

Now that we have finalized our prompt template, we will combine our dataset with it.

# Basic prompt structure for the downstream task
def add_prompt(i, df):
    context = df.iloc[i]['context']
    output = df.iloc[i]['output']
    prompt = f"""
Context:
{context}

[INST]Add instruction for your task here[/INST]
Output: {output} </s>"""
    return prompt

Important tip:

  • By default, the Mistral tokenizer only adds <s> (the BOS token) to the prompt but not </s> (the EOS token), so make sure to add </s> at the end of your training prompt (a quick check is sketched below).
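A quick way to verify this behavior (a minimal sketch using the stock Mistral tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokens = tokenizer.convert_ids_to_tokens(tokenizer("Output: some text")["input_ids"])
print(tokens[0])             # '<s>' -- BOS is added automatically
print(tokens[-1] == "</s>")  # False -- EOS is NOT added, so append it yourself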

Here is an example that demonstrates how to convert your raw dataset into the HuggingFace dataset format.

import pandas as pd
from datasets import Dataset, DatasetDict

# Load custom dataset from an excel file
raw_dataset = pd.read_excel('./dataset/training_dataset.xlsx')

# Selecting only the necessary features
raw_dataset = raw_dataset[['context', 'output']]

# Shuffle the dataset
seed = 123
raw_dataset = raw_dataset.sample(frac=1, random_state=seed).reset_index(drop=True)

# Define sample size for training data
sample_size = int(len(raw_dataset) * 0.99)

# Splitting the dataset into training and test data
train_data = raw_dataset.iloc[:sample_size].copy()
test_data = raw_dataset.iloc[sample_size:].copy()

# Add prompt template to the dataset
train_prompts = [add_prompt(i, train_data) for i in range(len(train_data))]
test_prompts = [add_prompt(i, test_data) for i in range(len(test_data))]

# Convert these lists of prompts into Datasets
train_dataset = Dataset.from_dict({"text": train_prompts})
test_dataset = Dataset.from_dict({"text": test_prompts})

# Combine these Datasets into a DatasetDict
dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

After all the above steps, your dataset structure should look like the one below.

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4011
    })
    test: Dataset({
        features: ['text'],
        num_rows: 41
    })
})

Now your dataset is ready for finetuning.

Setting Up Tokenizer

Loading the tokenizer is pretty straightforward.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1")

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"

In the above code snippet, you might have noticed tokenizer.pad_token and tokenizer.padding_side.

For the padding side, the default tokenizer setting, i.e., tokenizer.padding_side = "left", was giving me weird fine-tuning results.

I have seen some notebooks setting up the pad_token like: tokenizer.pad_token = tokenizer.eos_token

But I don’t recommend it, because eos_token (i.e., </s> for Mistral tokenizer) signals the LLM to stop generating tokens.

If you set the eos_token as the padding token, the tokenizer will set the eos_token's attention mask to “0”. The model will then tend to ignore the eos_token and might over-generate tokens, which is not ideal for a downstream task.

A more suitable option is the unk_token: because of its rarity, it will barely degrade the model's performance even if we set its attention mask to “0” (i.e., use it as the padding token).
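Here is a small illustration of what that means in practice (a sketch using the tokenizer configured above):

# With pad_token = unk_token and padding_side = "right", the shorter sequence
# gets trailing <unk> padding, and exactly those positions are masked out.
batch = tokenizer(
    ["short prompt", "a noticeably longer prompt that forces some padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"][0])       # ends with the <unk> (pad) token id
print(batch["attention_mask"][0])  # the padded positions have attention_mask 0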

Loading the Model in 4-bit Quantization

import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig
)

compute_dtype = getattr(torch, "float16")
use_4bit = True

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

device_map = "auto"
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # load the model in 4-bit
    device_map=device_map,  # use max GPU resources if available
)

# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

Loading in 4-bit saves memory but increases training time. Generally speaking, saving memory takes precedence over the longer training time.

One thing to notice is that Mistral-7B-Inst doesn't have a padding token, which is why model.config.pad_token_id is set to None by default.

We need to set it to the tokenizer's pad_token_id, i.e., <unk>.
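A quick sanity check for both points could look like this (the exact memory figure will vary with your setup):

print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")  # a few GB in 4-bit
print(model.config.pad_token_id == tokenizer.pad_token_id)               # True after the assignment above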

LoRA config

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=64,
    lora_dropout=0.1,
    r=32,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
)

r and lora_alpha are the most important parameters in the LoRA configuration.

r is the rank of the LoRA matrices:

  • A higher r value means more trainable parameters, allowing for more expressivity. On the downside, there is a compute tradeoff, and it may also lead to overfitting.
  • A lower r value means fewer trainable parameters; it can reduce overfitting, at the cost of expressiveness.

lora_alpha is a scaling factor for the LoRA weights:

  • A higher alpha puts more emphasis on the LoRA weights.
  • A lower alpha puts less emphasis on the LoRA weights, so the model relies more on its original weights.

Important tips:

  • Golden rule: lora_alpha = 2*r, i.e., if r=128, then lora_alpha should be 256.
  • Both r and lora_alpha should be powers of two; a good range for selection is [8, 16, 32, 64, 128, 256, 512].
  • If your fine-tuning data is very different from your model's pre-training data, I recommend selecting r and lora_alpha from the higher end of the above range, and vice versa. (A quick way to see how r affects the trainable parameter count is sketched below.)
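If you want to see how a given r translates into trainable parameters before launching a run, here is a minimal sketch. Note that the SFT Trainer applies peft_config itself, so this wrapping is for inspection only; reload the base model afterwards so the adapter isn't applied twice.

from peft import get_peft_model

inspect_model = get_peft_model(model, peft_config)
inspect_model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...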

Training Arguments

from transformers import TrainingArguments

run_name = "enter run name"
training_arguments = TrainingArguments(
    output_dir="./models/" + run_name,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # increase if you are still getting OOM errors
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    save_steps=500,
    logging_steps=200,
    learning_rate=3e-4,
    fp16=True,  # enable fp16; use bf16 only if your GPU supports it
    evaluation_strategy="steps",
    max_grad_norm=0.3,
    num_train_epochs=1.0,
    weight_decay=0.001,
    warmup_steps=50,
    lr_scheduler_type="linear",
    run_name=run_name,
    report_to="wandb",
)

The above setting is recommended if you have a low VRAM GPU.

If you have enough VRAM, here are some recommendations (a sketch follows this list):

  • Set per_device_eval_batch_size to enable batched evaluation.
  • Increase the training and evaluation batch sizes.
  • Use the AdamW (full precision) optimizer.
  • Disable gradient_accumulation_steps and gradient_checkpointing.
  • Set fp16 and bf16 to False.
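As a rough sketch, the higher-VRAM variant of the arguments above could look like this (the batch size of 8 is purely illustrative; fp16/bf16 and gradient accumulation default to off when omitted):

training_arguments = TrainingArguments(
    output_dir="./models/" + run_name,
    per_device_train_batch_size=8,  # illustrative value, tune to your GPU
    per_device_eval_batch_size=8,   # illustrative value
    optim="adamw_torch",            # full-precision AdamW
    gradient_checkpointing=False,   # no need to trade compute for memory
    save_steps=500,
    logging_steps=200,
    learning_rate=3e-4,
    evaluation_strategy="steps",
    max_grad_norm=0.3,
    num_train_epochs=1.0,
    weight_decay=0.001,
    warmup_steps=50,
    lr_scheduler_type="linear",
    run_name=run_name,
    report_to="wandb",
)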

SFT Trainer

import os, wandb
from trl import SFTTrainer

os.environ["WANDB_API_KEY"] = "Enter your key"

run = wandb.init(project="Mistral-Inst-7b", name= run_name)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # remove if you have low VRAM and are getting OOM errors
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,  # depends on your dataset
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

trainer.train()

This is where all the above steps come together.

Here you pass your customized model, tokenizer, dataset, peft_config, and training arguments.

Now you are all set to train.

LoRA Merge

At the time of writing this article, the vLLM library doesn't have LoRA support, so the only way to load your fine-tuned model is to merge the LoRA adapter into the base model.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model with default precision
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
adapter = "Enter the path of your LoRA adapter"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load and activate the adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter)

# Merge the adapter into the base model
model = model.merge_and_unload()

# Save the merged model in a directory in the safetensors format
model_dir = "./models/merged_model/"
model.save_pretrained(model_dir, safe_serialization=True)

# Save the custom tokenizer in the same directory
tokenizer.save_pretrained(model_dir)

Also, save the tokenizer you configured above in the same directory, so that you can load the model and tokenizer in one go.
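Later, both can be reloaded from that single directory, for example:

from transformers import AutoModelForCausalLM, AutoTokenizer

merged_model = AutoModelForCausalLM.from_pretrained("./models/merged_model/")
merged_tokenizer = AutoTokenizer.from_pretrained("./models/merged_model/")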

Superfast Inference with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. It efficiently manages attention key and value memory with PagedAttention.

Note: I had some installation issues with this library, for which I had to update my Linux distribution, Nvidia drivers, and CUDA toolkit to the latest versions.

Load model

from vllm import LLM, SamplingParams

llm = LLM(model="./models/merged_model/")

Use llm.get_tokenizer() to check whether your tokenizer has been loaded with the correct configuration.
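For example, a minimal check (assuming the tokenizer was saved with the configuration from the fine-tuning section):

vllm_tokenizer = llm.get_tokenizer()
print(vllm_tokenizer.pad_token)     # expect '<unk>'
print(vllm_tokenizer.padding_side)  # expect 'right'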

Setting sampling parameters

sampling_params = SamplingParams(
    max_tokens=4096,  # set it the same as max_seq_length in the SFT Trainer
    temperature=0.1,
    skip_special_tokens=True,
)

In SamplingParams, set max_tokens to the same value as max_seq_length in the SFT Trainer and set temperature to 0.1.

This will help generate output that closely matches what your QLoRA fine-tuned model produces, with minimal degradation.

Generate!

input_data = [input1, input2, input3, ...]  # example
prompts = []

# The prompt is the same as the training one, just without the output part
TEMPLATE = """
Context:
{}

[INST]Instruction for your task[/INST]
Output: """

def add_prompt(sample):
    prompt = TEMPLATE.format(sample)
    return prompt

for sample in input_data:
    text = add_prompt(sample)
    prompts.append(text)

outputs = llm.generate(prompts, sampling_params)  # Batch inference

for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text}")

Now I will show you the outputs of a version of Mistral-7B-Inst fine-tuned for text summarization, and compare its inference speed with and without vLLM:

Inference speed without using vLLM

Summary: 
Protective case for electronic devices like mobile phones that has a dispensing unit for cosmetics. The case has a movable container inside that dispenses product when pushed. The container is mounted on a support structure that moves with the case. The container has a push button that presses on the outlet duct when pushed. This allows the container to dispense product without needing to press the button directly.</s>
-----------------------------------------------------------------------

execution time: 6.119
new tokens: 87
input tokens: 1066
generation rate: 14.216 tokens/seconds

Inference speed using vLLM

Summary:
A protective case for an electronic device like a phone that has a built-in dispensing unit for a product like a cosmetic. The case has a shell that surrounds the device and a movable container inside the shell that holds the product. The container is attached to a support structure that moves with the shell. When the container is pushed, it moves with the support structure and presses a button to dispense the product.</s>
--------------------------------------------------------------------------------------------------

execution time: 2.417
new tokens: 90
input tokens: 1066
generation rate: 37.223 tokens/seconds

Cool, right? The inference time has been reduced by more than half using vLLM. But batch inference is where this library really shines.

Execution time for batch size 10:  6.770 seconds
Avg execution time for one sample: 0.677 seconds
Avg Generation rate: 149.904 tokens/seconds
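For reference, numbers like these can be measured with a simple timing wrapper around llm.generate() (a sketch, not the exact benchmark script used above):

import time

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)  # batch inference
elapsed = time.perf_counter() - start

new_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Execution time: {elapsed:.3f} seconds")
print(f"Avg execution time for one sample: {elapsed / len(prompts):.3f} seconds")
print(f"Avg generation rate: {new_tokens / elapsed:.3f} tokens/second")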

Finally, your model is fine-tuned, fast, and ready to serve in production.

Scope for further improvement

These are the features I wish vLLM had for more efficient serving:

  • LoRA support: Merging LoRA into the base model every time for a single downstream task is very memory-expensive.
  • Multi-GPU inference: They have this feature, but it's broken in v0.2.0; I hope they fix it soon.
    LLM(model_name, tensor_parallel_size=max_gpus)
  • BitsAndBytes quantization support: To load models in 4-bit and 8-bit.

Check out their development roadmap.
