Finetuning Mixtral-8x7B-Instruct-v0.1 on 1xA100 (80GB)

Oyetunji Abioye
3 min read · Jan 18, 2024


Image by author using DALL·E. All images by the author unless otherwise specified.

Introduction

The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts (SMoE). In this architecture the model consists of multiple sub-models called "experts" (Mixtral-8x7B has 8 of them), each specialized for different kinds of inputs. It is called sparse because not every expert is used for every input: a router selectively activates a small subset of the experts (two per token in Mixtral's case), which makes the model efficient to scale. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks tested. In this article I will show you how to finetune the Mixtral-8x7B-Instruct-v0.1 model using QLoRA on a single A100 (80GB) GPU. The dataset used, 2nji/littlehermes, is a subset of teknium/openhermes, which is composed of 242,000 entries of primarily GPT-4 generated data. The code for this article can be found in this notebook.
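
To make the "sparse routing" idea concrete, here is a minimal, purely illustrative sketch of top-2 routing in PyTorch. The dimensions, module names, and looping style are my own assumptions for clarity; this is not Mixtral's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    """Toy sketch: route each token to its top-2 of 8 experts."""
    def __init__(self, dim=16, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # gating network scores every expert
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                           # x: (num_tokens, dim)
        scores = self.router(x)                     # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for i, token in enumerate(x):               # only the selected experts ever run
            for k in range(self.top_k):
                expert = self.experts[int(chosen[i, k])]
                out[i] += weights[i, k] * expert(token)
        return out

print(ToySparseMoE()(torch.randn(4, 16)).shape)     # torch.Size([4, 16])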

%pip install -q -U bitsandbytes transformers peft accelerate datasets scipy

First we install our dependencies above. Then we import them and download the Mixtral-8x7B-Instruct-v0.1 model and tokenizer below.

import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel



model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
new_model = "2nji/mistral8x7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,         # quantize the weights to 4-bit on load
    torch_dtype=torch.float16,
    device_map="auto",         # place the model across available devices automatically
)
tokenizer.pad_token = "!"      # the tokenizer has no pad token by default, so we assign one
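
If you prefer to spell out the quantization settings (NF4 with double quantization, as described in the QLoRA paper) rather than rely on the load_in_4bit shortcut, transformers accepts an explicit BitsAndBytesConfig. This is an equivalent alternative to the loading call above, not something the original code requires.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)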

We will use QLoRA for finetuning. QLoRA works by taking a pretrained language model that has been quantized to 4-bit and frozen, and backpropagating gradients through it into small trainable modules (Low-Rank Adapters, or LoRA) inserted into the model. In our experiment the LoRA attention dimension (the rank of the update matrices) is set to 8, the LoRA alpha (the scaling parameter) is set to 16, and the LoRA dropout (the dropout probability for the LoRA layers) is set to 0.1. The fine-tuning process uses the AdamW optimizer with a learning rate of 1e-4 and runs for 6 epochs.

LORA_R = 8
LORA_ALPHA = 2 * LORA_R
LORA_DROPOUT = 0.1

config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["w1", "w2", "w3"],  # only train the "expert" (feed-forward) layers
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

def print_trainable_parameters(m):
    trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
    all_params = sum(p.numel() for p in m.parameters())
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params / all_params}")

print_trainable_parameters(model)
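
PEFT's wrapped model also exposes an equivalent built-in helper, so the manual function above is mainly there for transparency:

model.print_trainable_parameters()  # built-in PEFT equivalent of the helper above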

train_data = load_dataset("2nji/littlehermes", split="train")
print("Dataset", train_data)

def generate_prompt(user_query):  # the prompt format is taken from the official Mixtral Hugging Face page
    sys_msg = "Take a look at the following instructions and try to follow them."
    p = "<s> [INST]" + sys_msg + "\n" + user_query["instruction"] + "[/INST]" + user_query["output"] + "</s>"
    return p
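
As a quick sanity check, here is what the template produces for a hypothetical record (the instruction and output values below are made up for illustration):

sample = {"instruction": "Name the largest planet in our solar system.",
          "output": "Jupiter is the largest planet in our solar system."}
print(generate_prompt(sample))
# <s> [INST]Take a look at the following instructions and try to follow them.
# Name the largest planet in our solar system.[/INST]Jupiter is the largest planet in our solar system.</s>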

max_len = 1024

def tokenize(prompt):
    return tokenizer(
        prompt + tokenizer.eos_token,
        truncation=True,
        max_length=max_len,
        padding="max_length",
    )

train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x)), remove_columns=["instruction", "output"])
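
Before kicking off training, it can help to eyeball one mapped example to confirm the template and padding look the way you expect. This inspection step is optional and not part of the original pipeline:

example = train_data[0]
print(len(example["input_ids"]))                 # should equal max_len (1024) because of padding
print(tokenizer.decode(example["input_ids"])[:300])  # spot-check the templated text and padding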

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    args=TrainingArguments(
        output_dir="mixtral-finetune",  # where checkpoints are written; pick any directory
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=6,
        learning_rate=1e-4,
        logging_steps=2,
        optim="adamw_torch",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
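
Disabling use_cache avoids the generation-time KV cache, which is not needed during training. If memory is tight, gradient checkpointing is a common optional addition (not used in the original run) that trades extra compute for lower activation memory:

model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory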

# Train model
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Push them to the HF Hub (paste your Hugging Face access token into the token argument)
trainer.model.push_to_hub(new_model, use_temp_dir=False, token="")
tokenizer.push_to_hub(new_model, use_temp_dir=False, token="")

Finally, we save our finetuned model and push it to our Hugging Face repository. You can run inference with the new model using the code below.

# Format prompts
messages = [
    "Write a Python function that calculates the factorial of a given number using recursion.",
    "What is the probability of rolling a 4 on a single standard six-sided die?",
]
tokenizer = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer(messages, return_tensors="pt", padding=True).to(trainer.model.device)
# Generate output
output = trainer.model.generate(
    input_ids=prompt.input_ids,
    attention_mask=prompt.attention_mask,
    max_length=128,
    do_sample=True,
    top_p=0.95,
    top_k=60,
    num_return_sequences=1,
)
# Print output
print(tokenizer.batch_decode(output, skip_special_tokens=True))
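
If you want to run inference in a fresh session instead of reusing trainer.model, one common pattern is to reload the quantized base model and attach the saved adapter with PeftModel (imported earlier but otherwise unused). The snippet below is a sketch of that approach.

base = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
finetuned = PeftModel.from_pretrained(base, new_model)  # attach the saved LoRA adapter
finetuned.eval()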

Stay Curious!
