Mixtral fine-tuning, as imagined by DALL·E 3

Fine-tuning Mixtral 8x7B

Prakhar Saxena
6 min read · Jan 3, 2024


This is a simple tutorial for fine-tuning Mixtral 8x7B on a single A100 (40GB) GPU. We will use QLoRA to fine-tune the model on a Shakespeare dataset from Hugging Face, which pairs modern English sentences with their Shakespearean equivalents. To evaluate the model, we will take a set of 5 modern English sentences, ask both the original and the fine-tuned model to translate them into Shakespearean English, and manually compare the results. So, let's start with fine-tuning! We begin by importing the required libraries and loading the model and tokenizer. Please install the libraries (torch, transformers, datasets, peft, and bitsandbytes) beforehand if you haven't. The complete code is also available on GitHub.

import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel


tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

Next, we will prepare our model for training with LoRA in 4-bit (QLoRA).

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

tokenizer.pad_token = "!"  # Not EOS, will explain another time.

CUTOFF_LEN = 256  # Our dataset has short text
LORA_R = 8
LORA_ALPHA = 2 * LORA_R
LORA_DROPOUT = 0.1

config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["w1", "w2", "w3"],  # just targeting the MoE (expert) layers
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
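
Optionally, we can check how few parameters the LoRA adapters actually add; print_trainable_parameters is a helper that PEFT attaches to the wrapped model:

# Optional sanity check: only the LoRA adapter weights should be trainable.
model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...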

Now, we will load our dataset and define functions for the prompt template (as suggested by the official repository) and for tokenizing the dataset. You can easily use CSV or JSON files as the dataset by changing a single line:

# dataset = load_dataset('csv', data_files='path_to_your_file.csv')  # e.g., to use your own CSV instead
dataset = load_dataset("harpreetsahota/modern-to-shakesperean-translation") # Found a good small dataset for a quick test run! Thanks to the uploader!
print("dataset", dataset)
train_data = dataset["train"]  # Not using evaluation data

def generate_prompt(user_query):
    sys_msg = "Translate the given text to Shakespearean style."
    p = "<s> [INST]" + sys_msg + "\n" + user_query["modern"] + "[/INST]" + user_query["shakespearean"] + "</s>"
    return p

def tokenize(prompt):
    return tokenizer(
        prompt + tokenizer.eos_token,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length"
    )

train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x)), remove_columns=["modern", "shakespearean"])

Finally, we will pass everything to the Hugging Face Trainer and start the training!

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=6,
        learning_rate=1e-4,
        logging_steps=2,
        optim="adamw_torch",
        save_strategy="epoch",
        output_dir="mixtral-moe-lora-instruct-shapeskeare",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False

trainer.train()

And like magic, our model starts to train!

GPU usage sits at around 31GB, so you should also be able to fine-tune it on a GPU with 32GB of VRAM. If someone succeeds with that, let me know in the comments!
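
The Trainer already saves a checkpoint every epoch (save_strategy="epoch"), but it can be handy to also save the final adapter and tokenizer explicitly; the directory name below is just an illustrative placeholder:

# Save only the LoRA adapter weights and the tokenizer (not the full 8x7B model).
model.save_pretrained("mixtral-shakespeare-lora")  # illustrative output directory
tokenizer.save_pretrained("mixtral-shakespeare-lora")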

Next is evaluation.

First, we prepare a list of 5 statements similar to those in the dataset; we employ GPT-4 for this job. Here is the list of generated statements:

phrases = [
    "He dipped out on me after a couple of hangouts.",
    "I'm all about that artist, their vibe is straight fire!",
    "Linking up with the famjam this weekend.",
    "The expression 'Spill the beans' means to reveal secret information.",
    "Saying 'Hear it through the grapevine' indicates learning news through informal sources.",
]

Now we will feed these into the original Mixtral, with the instruction to convert them into Shakespearean English.
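
The generation loop itself isn't shown in detail in this post; here is a minimal sketch, assuming base_model holds a freshly loaded copy of mistralai/Mixtral-8x7B-Instruct-v0.1 (loaded the same way as at the start) and using illustrative generation settings:

# base_model: a freshly loaded original Mixtral, without the LoRA adapter (hypothetical variable name).
for phrase in phrases:
    # Same [INST] format as the training prompt; the tokenizer adds the <s> (BOS) token itself.
    prompt = "[INST]Translate the given text to Shakespearean style.\n" + phrase + "[/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    output = base_model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))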

Here is the generated result.

 [INST]Translate the given text to Shakespearean style.
He dipped out on me after a couple of hangouts.[/INST] Mine acquaintance, so familiar in our meetings but a few,
Has forsook me, left me in the lurch, with nary a goodbye or adieu.
Our gatherings, once filled with mirth and good company,
He hath abandoned, with naught but silence and absence in their stead.
Alas, the cause I know not, nor can I fathom the reason,
But know this, dear friend, his departure hath left a bitter taste.

[INST]Translate the given text to Shakespearean style.
I'm all about that artist, their vibe is straight fire![/INST] I do dote upon
that artist, their spirit doth flame brightly!

[INST]Translate the given text to Shakespearean style.
Linking up with the famjam this weekend.[/INST] Prepare to make merry, for
this weekend we shall meet
With our kin, our blood, our family so sweet.
A reunion, a gathering, a joyous event
With laughter and cheer, our spirits shall vent.

In the company of our loved ones we'll partake
In food and drink, and for them our hearts will ache.
For though we may be separated by time and place
In the warmth of our kin, we shall find our embrace.

So let us don our finest attire and make our way
To the place of our gathering, without delay.
For this weekend we shall be reunited once more
And our hearts will sing and our souls will soar.

[INST]Translate the given text to Shakespearean style.
The expression 'Spill the beans' means to reveal secret information.[/INST] In
the tongue of the Bard, I shall render this phrase, 'Spill the beans',
As one who doth disclose hidden intelligence or deep-seated means.

A coinage of modernity it is not,
Yet in the court of mine own thoughts,
I have pondered this phrase with care,
To fashion its essence in antiquated air.

Wherefore, let us employ the term,
'Disclose the secret germ',
Or, if thou doth prefer a culinary theme,
'Reveal the hidden yeast in the loaf of a dream'.

Forsooth, the essence remains unchanged,
To spill the beans is to break the seal,
The chain that doth bind secret thoughts,
In this, I entreat thee, thou mayst feel,
A power to unclasp and set free,
The truth that hath been hidden in decree.

[INST]Translate the given text to Shakespearean style.
Saying 'Hear it through the grapevine' indicates learning news through informal
sources.[/INST] To declare, " 'Tis heard through the vine's tendrils," is to
speak of learning news from sources unofficial and informal.

Not bad! The original model is quite powerful and understands the task; however, its responses are quite long.

Next is the fine-tuned model.
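
To evaluate it, the trained LoRA adapter is attached to the base model with PeftModel (which is why we imported it earlier). A minimal sketch, treating the checkpoint path as a placeholder for whatever the Trainer actually saved under the output_dir above:

# Attach the trained LoRA adapter to a freshly loaded 4-bit base model for inference.
# The checkpoint path is a placeholder; use the checkpoint the Trainer actually wrote.
ft_model = PeftModel.from_pretrained(base_model, "mixtral-moe-lora-instruct-shapeskeare/checkpoint-XXX")
ft_model.eval()
# If you are still in the same session right after trainer.train(), the `model` object already
# has the adapter applied; just re-enable the cache first: model.config.use_cache = True

Generation is then identical to the loop above, just with ft_model in place of base_model. Here is the generated result: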

  [INST]Translate the given text to Shakespearean style.
He dipped out on me after a couple of hangouts.[/INST]With much delight, we oft did meet,
Two souls engaged in friendly treat,
But sudden, like the wind, he vanished,
Our camaraderie, by him, abandoned.

A few brief gatherings, was the sum
Of time we shared, before he fled,
Like fleeting shadow on the wall,
He slipped away, and naught was said.

His absence, like a chilling breeze,
Leaves me alone, as though on a lea,
Thus ends the tale, of one who'd leave,
Me standing, with naught but memories, in sleeves.

[INST]Translate the given text to Shakespearean style.
I'm all about that artist, their vibe is straight fire![/INST] I do dote
upon that artist, their manner is as a blazing flame!

[INST]Translate the given text to Shakespearean style.
Linking up with the famjam this weekend.[/INST]Joining in union with the
kin-kinship this day of the weekend.

[INST]Translate the given text to Shakespearean style.
The expression 'Spill the beans' means to reveal secret information.[/INST] In
the tongue of the Bard, I shall render this phrase:
'To disclose the hidden matter', or 'To let flow the beans' doth signify the
unmasking of secret knowledge.

[INST]Translate the given text to Shakespearean style.
Saying 'Hear it through the grapevine' indicates learning news through informal
sources.[/INST] To utter 'Hear it through the grapevine' is akin to saying that
thou hast learned news from sources unofficial.

Our fine-tuned model generates shorter texts that are more in line with the dataset, showing that our fine-tuning worked! This approach also works well for more complicated tasks that the base model is not capable of. For example, I used the same fine-tuning setup to train a Japanese chat model, and fine-tuning on a Japanese dataset produced much more fluent Japanese. (I cannot share the model here as I did it for a client, but you get the idea 😉).

Conclusion

Today we saw how to fine-tune the Mixtral MoE model with QLoRA on a single GPU.

Thank you for reading!
