Big Brain Time

Fine Tuning Mistral (or ANY LLM) using LoRA

Prakhar Saxena
6 min read · Jan 14, 2024


Edit (1/2/2024): Included the keyword ‘Mistral’ in the title for better SEO.

LLMs come in various shapes and architectures, with dozens of base models released in 2023. This trend is expected to continue in 2024, introducing new models with innovative features like sliding window attention, Mixture of Experts (MoE), etc. With these developments, it can be challenging to keep up with the updates, let alone learn how to use them. However, thanks to the brilliant minds in the open-source community, it usually takes only a few days to understand the strengths and weaknesses of new LLMs.

In this article, I will introduce a general template for fine-tuning any LLM hosted on Hugging Face and supported by the Transformers and PEFT libraries. It is aimed at those who want to tinker with the technology and understand how it works: it is not about hardcore CUDA coding, nor about fine-tuning through a UI with a few clicks. Instead, it walks through the basic concepts of LLM training without engaging in complex coding. The notebook containing the code is provided here.

Let’s start by importing the libraries and initializing the model and tokenizer. For this demonstration, we will use Mistral-7B-Instruct-v0.2, although you can use any LLM. We will load the model in 8-bit precision as it requires half the VRAM compared to 16-bit.

import torch
import transformers
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)

Now, if you print the model, you can see its architecture.
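Here is an abridged, hand-written sketch of what print(model) shows (shapes written from memory for Mistral-7B; your exact printout may differ slightly depending on your transformers version):

print(model)
# MistralForCausalLM(
#   (model): MistralModel(
#     (embed_tokens): Embedding(32000, 4096)
#     (layers): ModuleList(
#       (0-31): 32 x MistralDecoderLayer(
#         (self_attn): MistralAttention(
#           (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
#           (k_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
#           (v_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
#           (o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
#         )
#         (mlp): MistralMLP(
#           (gate_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
#           (up_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
#           (down_proj): Linear8bitLt(in_features=14336, out_features=4096, bias=False)
#         )
#         (input_layernorm): MistralRMSNorm()
#         (post_attention_layernorm): MistralRMSNorm()
#       )
#     )
#     (norm): MistralRMSNorm()
#   )
#   (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
# )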

We can see that it has many different layers; we will need their names later on. Next, we prepare the model for training and set the pad token. Note that if you use the EOS (end-of-sequence) token as the PAD token, the model will not learn when to stop generating text (unless you change the data collator, which will be discussed more in the future). Hence, I use "!" as the PAD token (monke brain), since it is not commonly found in most datasets (at least the ones I use). You can use any other token too; it does not matter much, because PAD tokens are excluded from the loss calculation during training.

model = prepare_model_for_kbit_training(model)  # prepares the quantized model for training (upcasts norms, enables gradient checkpointing)
tokenizer.pad_token = "!"                       # our stand-in PAD token

Then we declare the cutoff length (context length) and the LoRA hyperparameters. These are defaults I have found to work pretty well with most datasets. The cutoff length is the number of tokens used from each training sample, so it changes with the dataset: 768 is a good value for most chat datasets, but you should increase it if you plan to fine-tune on tasks that need longer inputs, such as summarization of long articles.

We then wrap the model with these LoRA parameters.

CUTOFF_LEN = 768
LORA_R = 8
LORA_ALPHA = 2 * LORA_R
LORA_DROPOUT = 0.1

config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],  # the layer names we saw in the printout
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)

Now, if we print the model again, we will find something interesting.
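Here is an abridged, hand-written sketch of what one wrapped projection looks like in the printout (module names may vary slightly across peft versions), plus peft's built-in parameter count:

print(model)
# (q_proj): lora.Linear8bitLt(
#   (base_layer): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
#   (lora_dropout): ModuleDict( (default): Dropout(p=0.1, inplace=False) )
#   (lora_A): ModuleDict( (default): Linear(in_features=4096, out_features=8, bias=False) )
#   (lora_B): ModuleDict( (default): Linear(in_features=8, out_features=4096, bias=False) )
# )

model.print_trainable_parameters()
# e.g. trainable params: ~21M || all params: ~7.2B || trainable%: ~0.3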

We can see that for each of the target_modules we pass to LoraConfig, there are two extra named modules, lora_A and lora_B. If we check the LoRA paper, we find that these A and B matrices are the low-rank update matrices, which are the only parameters learned during fine-tuning while everything else stays frozen. This is the genius behind the idea (thank you, Edward Hu et al.).

P.S.: be sure not to include the layernorm layers in target_modules; LoRA as implemented in peft targets linear-style layers (such as Linear and Conv1D), not normalization layers.

Figure: the update matrix ΔW expressed as the low-rank product of B and A.
Figure: q_proj extended with base_layer, lora_A, and lora_B.
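To make the figures concrete, here is the whole idea as a few lines of plain PyTorch. This is an illustration of the concept only; peft handles the real wiring, including the alpha/r scaling:

import torch

d, r = 4096, 8                  # hidden size and LoRA rank
W = torch.randn(d, d)           # frozen pretrained weight, e.g. q_proj
A = torch.randn(r, d) * 0.01    # lora_A: trainable, small random init
B = torch.zeros(d, r)           # lora_B: trainable, initialized to zero
alpha = 2 * r                   # matches LORA_ALPHA = 2 * LORA_R above

delta_W = B @ A                            # rank-r update: a d x d matrix built from only 2*d*r numbers
W_effective = W + (alpha / r) * delta_W    # what the wrapped layer effectively computes

# 2 * d * r = 65,536 trainable values for this layer instead of d * d = 16,777,216

Because B starts at zero, the wrapped model behaves exactly like the base model before training, and only these small matrices have to move.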

Then we will load the dataset. For this tutorial, I found an interesting dataset of toxic chat (which can make the model reply in a toxic way). It is hosted on Hugging Face, but you can just as easily use your own data from CSV or JSON files. The important point is to use a dataset that matches the base model: since we are using an instruction-tuned model here, we use an instruction/chat dataset; if we were using a base model, we would use a completion-style dataset (code completion, novel completion, etc.). Note that since LoRA is not designed for alignment, you cannot effectively train/align a completion model to perform instruction/chat tasks just by training with LoRA, even if you have a huge dataset.

dataset = load_dataset("lmsys/toxic-chat") #Interesting dataset with toxic chat.
print("dataset", dataset)
train_data = dataset["train"]

Now, if we check the dataset, we find that it looks like the following.
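You can also inspect it programmatically; the columns below are the same ones we drop after tokenization, and 'user_input'/'model_output' are the two we care about:

print(train_data.column_names)
# ['conv_id', 'user_input', 'model_output', 'human_annotation',
#  'toxicity', 'jailbreaking', 'openai_moderation']
print(train_data.num_rows)   # a few thousand rows (the exact count depends on the dataset version)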

It has many fields, but we only want to use 'user_input' and 'model_output' to train our model, so we will write our prompt accordingly. The training prompt depends on the fields we use and on the official prompt format Mistral used to train the model, which can be found on Mistral's Hugging Face page.

def generate_prompt(user_query):
    # The prompt format is taken from the official Mistral Hugging Face page
    if user_query["model_output"] is not None and user_query["user_input"] is not None:
        p = "<s> [INST]" + user_query["user_input"] + "[/INST]" + user_query["model_output"] + "</s>"
        return p
    else:
        p = "<s> [INST]" + "Hello" + "[/INST]" + "Hello" + "</s>"
        return p
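To see the format concretely, here is what the function produces for a made-up row (the row is hypothetical; the template is the one defined above):

row = {"user_input": "How do I bake bread?", "model_output": "Mix flour, water and yeast, then..."}
print(generate_prompt(row))
# <s> [INST]How do I bake bread?[/INST]Mix flour, water and yeast, then...</s>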

While preparing the prompts, I found that some rows of the data were None, which caused errors during data preparation; the if-condition checks for that. Next, we tokenize the prompts to prepare the data for training.

def tokenize(prompt):
    return tokenizer(
        prompt + tokenizer.eos_token,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length"
    )

Here we are using the CUTOFF_LEN parameter defined earlier (768): each sample is truncated or padded to that length.
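A quick sanity check of what one tokenized sample looks like (token ids will differ, but the keys and the fixed length are the point):

sample = tokenize(generate_prompt(train_data[0]))
print(list(sample.keys()))       # typically ['input_ids', 'attention_mask']
print(len(sample["input_ids"]))  # 768 -- padded/truncated to CUTOFF_LEN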

Now, we will use these defined functions to generate the prompted and tokenized dataset.

train_data = train_data.shuffle().map(
    lambda x: tokenize(generate_prompt(x)),
    remove_columns=['conv_id', 'user_input', 'model_output', 'human_annotation',
                    'toxicity', 'jailbreaking', 'openai_moderation']
)

Yess! We are almost done; we just need to pass the arguments to the Hugging Face Trainer.

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=6,  # 3 or 6 is good
        learning_rate=1e-4,
        logging_steps=2,
        optim="adamw_torch",
        save_strategy="epoch",
        output_dir="mistral-lora-instruct-toxic"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # disable the KV cache during training (re-enable it for inference)
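Before launching, here is a quick optional check of what this collator does to one of our padded samples (the -100s are the point: those positions are ignored by the loss):

batch = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)([train_data[0]])
print(batch["labels"].shape)            # torch.Size([1, 768])
print((batch["labels"] == -100).sum())  # the pad ("!") positions, which contribute nothing to the loss

Keep in mind that any genuine "!" token in the text gets masked the same way; that is the trade-off of the pad-token choice made earlier.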

This masking is why we use DataCollatorForLanguageModeling here. That said, with the newer SFT trainer you could also use a Seq2Seq collator or a custom one; I have not found any reliable paper stating that one is definitively better than the other. More on collators and the SFT trainer another time! Finally, we start the training!

trainer.train()

Yes, the training starts and the loss decreases. I did not wait for training to complete, as it takes quite some time. Note that this experiment used 23 GB of VRAM; however, you can run it in under 10 GB by keeping the context length small and using an 8-bit Adam optimizer.
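If you want to try the lower-VRAM route, the two knobs are the context length and the optimizer. Here is a sketch of the relevant changes, assuming bitsandbytes is installed ("adamw_bnb_8bit" is the Transformers name for its 8-bit Adam):

CUTOFF_LEN = 512   # shorter context -> less memory (re-run the tokenization step with this value)

args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=1e-4,
    optim="adamw_bnb_8bit",   # 8-bit Adam keeps optimizer states in 8-bit
    save_strategy="epoch",
    output_dir="mistral-lora-instruct-toxic"
)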

Conclusion

We saw how to fine-tune virtually ANY LLM using the recipe above: just look at the names of the model's layers and adjust target_modules and the generate_prompt() function to match your LLM and dataset. If you found this guide useful, feel free to follow my updates.

Thank you for reading!
