MLX: Building & Fine-tuning an LLM on Apple M3 using a custom dataset

deepak kumar
9 min read · Apr 11, 2024


MLX

I recently got an Apple MacBook Pro M3 64GB laptop. Until now, I had done fine-tuning of LLMs using PEFT, bitsandbytes and SFTTrainer, backed by an NVIDIA graphics card.

The bitsandbytes library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantisation functions.

But Apple MacBooks use MPS as the GPU backend, which means I was not able to use this library for its quantization functions. Yes, even without quantization I can use PEFT and do supervised fine-tuning, but it is more resource-consuming, and that would be a LoRA process, not QLoRA.

MLX is a library built by Apple, similar to TensorFlow and PyTorch, to run GPU-backed workloads on Apple silicon.
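
As a quick taste, its array API is intentionally NumPy-like (a minimal sketch; MLX computation is lazy and runs on Apple silicon's unified memory):

import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = a * 2 + 1      # builds a lazy computation graph
mx.eval(b)         # forces evaluation on the device
print(b)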

Apple is still building the MLX library, so I faced a few issues while working with it:

  1. Lack of proper documentation and clear guidance on the end-to-end process.
  2. The majority of examples are shown as terminal commands, but in real work we may need to use them inside a .py file (a small workaround is shown after the training command below).
  3. The library still supports only a few models.
  4. Another issue with its GitHub repo is that it has redundant code in multiple folders.

Installation : https://pypi.org/project/mlx-lm/

Github repo : https://github.com/ml-explore/mlx-examples/tree/main

Before we jump to fine-tuning, let's first try to understand its GitHub repo, which is itself a bit confusing.

Mlx github main page

If we look here, we will see two folders relevant to LLMs:
1. llms
2. lora

Naturally, I also focused on the lora folder, as it shows a quick example to fine-tune an LLM, but to my surprise this is just a demo folder, not the one that is actively updated, and hence it lacks a detailed description and support.
Update: they have now mentioned this on that page.

Let's dig into the main package, which is what the mlx-lm pip package above uses.
1. The path is llms -> mlx_lm. Yes, this is the main package, and further down that page they point to the LoRA documentation, which is nothing but the LORA.md file: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md

This is the treasure file that describes, in detail, the method to fine-tune an LLM.

Now let's go step by step:
1. Dataset creation
2. Model load and quantization
3. Model data setup, defining LoRA parameters, and training
4. Loading the trained model and validating the results
5. Comparison of the base LLM vs the fine-tuned LLM based on BERTScore and ROUGE score

Dataset creation

The MLX library provides three formats for creating a dataset, which is saved in three files: train.jsonl, test.jsonl and valid.jsonl.

Formats :

There are three formats :
1. chat
2. completions
3. text

My finding: completions and chat are effectively the same, meaning that if we build our dataset in the completions format, the backend code converts it into the chat format. The code that handles the completions format lives in the repo at llms/mlx_lm/tuner/datasets.py.

From my personal experience, use the chat format only: completions is the same as chat, and the text format is not as good, because no chat template of the base model is applied to it and it did not yield good results for me.
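
For reference, one line of the .jsonl file looks roughly like this in each format (field names follow my reading of LORA.md; the example text itself is purely illustrative):

# chat
{"messages": [{"role": "user", "content": "What is the capital of France?"},
              {"role": "assistant", "content": "Paris."}]}

# completions
{"prompt": "What is the capital of France?", "completion": "Paris."}

# text
{"text": "What is the capital of France? Paris."}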

Let's now create the dataset for our fine-tuning.
Dataset used: https://huggingface.co/datasets/neil-code/dialogsum-test?row=0
Here we use:
1. input prompt : dialogue
2. output : summary and topic

Let's dig into the Python code.

I downloaded all the CSV files from the dataset page.

import json
import pandas as pd

# Load the training split
data = pd.read_csv("train.csv")
data.head(4)

dialogue = list(data['dialogue'])
summary = list(data['summary'])
topic = list(data['topic'])

def data_create(topic: str, question: str, answer: str):
    # Put the required data into the chat format expected by MLX
    chat = {"messages": [
        {"role": "user", "content": f"Instruct: Summarize the following conversation and and also give topic to it\n.{question}"},
        {"role": "assistant", "content": f"{answer}.\n Topic is : {topic}"},
    ]}
    return chat

train_data = [data_create(topic, question, answer) for topic, question, answer in zip(topic, dialogue, summary)]

After running the above code, this is how one sample of the train data looks:

Train data one sample
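
Since the screenshot does not survive here, this is roughly the structure of one entry produced by data_create (dialogue and summary text abbreviated):

{"messages": [
    {"role": "user", "content": "Instruct: Summarize the following conversation and and also give topic to it\n.#Person1#: ... #Person2#: ..."},
    {"role": "assistant", "content": "<summary>.\n Topic is : <topic>"}
]}
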
## Save the train data in JSONL format
with open(dataset_path + "train.jsonl", "w") as fid:   # dataset_path: folder for the .jsonl files, e.g. "data/"
    for t in train_data:
        json.dump(t, fid)
        fid.write("\n")


#### In the same way we create the test data
data = pd.read_csv("test.csv")
dialogue = list(data['dialogue'])
summary = list(data['summary'])
topic = list(data['topic'])

test_data = [data_create(topic, question, answer) for topic, question, answer in zip(topic, dialogue, summary)]
with open(dataset_path + "test.jsonl", "w") as fid:
    for t in test_data:
        json.dump(t, fid)
        fid.write("\n")

#### Validation data
data = pd.read_csv("validation.csv")
dialogue = list(data['dialogue'])
summary = list(data['summary'])
topic = list(data['topic'])

valid_data = [data_create(topic, question, answer) for topic, question, answer in zip(topic, dialogue, summary)]
with open(dataset_path + "valid.jsonl", "w") as fid:
    for t in valid_data:
        json.dump(t, fid)
        fid.write("\n")
Train Jsonl file

Let's look at how the JSONL file looks. With this, we are done with data preparation.
Note: there is no need to apply the base model's chat template yourself; this is handled by the MLX code.
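
If you are curious what that templating produces, here is a rough way to inspect it with the Hugging Face tokenizer (purely for inspection; the exact call is my assumption of what MLX does internally for the chat format):

from transformers import AutoTokenizer

# Look at how the base model's chat template renders one training example
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(tok.apply_chat_template(train_data[0]["messages"], tokenize=False))
# For Mistral-Instruct the user turn ends up wrapped in [INST] ... [/INST] markers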

Model load and quantization

For this example, we are using Mistral-7B-Instruct-v0.2 as our base model.
1. We will load this model and then save its quantized weights.

from mlx_lm.utils import *

convert(hf_path="mistralai/Mistral-7B-Instruct-v0.2",
        mlx_path="mlx_path",    # path where you want to save the quantized model
        quantize=True)          # if False, the model is converted without quantization

Let's check the model: load it and generate a sample output.

sample model test
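
The screenshot is not reproduced here, but the quick check looks roughly like this (load and generate are the same mlx_lm helpers used later in this post; the prompt is just an illustration):

from mlx_lm import load, generate

model, tokenizer = load("mlx_path")   # the quantized model saved by convert() above
print(generate(model, tokenizer,
               prompt="Instruct: Summarize the following conversation ...",
               max_tokens=200, verbose=True))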

Model parameters setup: MLX gives us the option to write a lora_config.yaml file where we can define all the parameters, rather than passing them on the command line as python run.py --args…, which is cumbersome when there are so many of them.

Let's look at a sample lora_config.yaml file:

# The path to the local model directory or Hugging Face repo.
model: "mlx_path"
# Whether or not to train (boolean)
train: true

# Directory with {train, valid, test}.jsonl files
data: 'data'

# The PRNG seed
seed: 0

# Number of layers to fine-tune
lora_layers: 19

# Minibatch size.
batch_size: 5

# Iterations to train for.
iters: 1800

# Number of validation batches, -1 uses the entire validation set.
val_batches: 25

# Adam learning rate.
learning_rate: 1e-5

# Number of training steps between loss reporting.
steps_per_report: 10

# Number of training steps between validations.
steps_per_eval: 20

# Load path to resume training with the given adapter weights.
resume_adapter_file: null

# Save/load path for the trained adapter weights.
adapter_file: null

# Save the model every N iterations.
save_every: 100

# Evaluate on the test set after training
test: false

# Number of test set batches, -1 uses the entire test set.
test_batches: 100

# Maximum sequence length.
max_seq_length: 32768

# Use gradient checkpointing to reduce memory use.
grad_checkpoint: false

# LoRA parameters can only be specified in a config file
## these are important parameters to tune
lora_parameters:
  # The layer keys to apply LoRA to.
  # These will be applied for the last lora_layers
  keys: ["self_attn.q_proj", "self_attn.v_proj"]
  rank: 16
  alpha: 16.0
  scale: 10.0
  dropout: 0.05

# Schedule can only be specified in a config file, uncomment to use.
lr_schedule:
  name: cosine_decay
  warmup: 100 # 0 for no warmup
  warmup_init: 1e-7 # 0 if not specified
  arguments: [1e-5, 1000, 1e-7] # passed to scheduler

Once these parameters are set, it's time to start training. Here we will run it from the command line.
Go to your working directory (the one containing lora_config.yaml and the data folder) and run this command:
~ python3 -m mlx_lm.lora --config lora_config.yaml
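
As noted in issue 2 at the top of this post, most MLX examples are terminal commands; if you prefer to kick this off from inside a .py file, a simple workaround is to shell out to the same CLI (a minimal sketch, nothing MLX-specific about it):

import subprocess

# Launch LoRA training with the YAML config from a Python script
subprocess.run(
    ["python3", "-m", "mlx_lm.lora", "--config", "lora_config.yaml"],
    check=True,
)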

Training ran for 1800 steps.

Training output

Let's run the LLM on one sample example taken from the dataset selected above:

 Input : "Summarize the following conversation and and also give topic to it .#Person1#: Happy Birthday, this is for you, Brian. #Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time. #Person1#: Brian, may I have a pleasure to have a dance with you? #Person2#: Ok. #Person1#: This is really wonderful party. #Person2#: Yes, you are always popular with everyone. and you look very pretty today. #Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel. #Person2#: You look great, you are absolutely glowing. #Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday."
Original Output : "#Person1# and Brian are at the birthday party of Brian. 
Brian thinks #Person1# looks great and is popular..
Topic is : birthday party"

----------------------------------------------------------------------------------
# Without the trained adapter
model, tokenizer = load(path_or_hf_repo="mlx_path")
# text holds the input prompt shown above
response = generate(model, tokenizer, prompt=text, verbose=True, temp=0.01, max_tokens=1300)

### Output
"""Summary: Brian was greeted at the entrance of his birthday party by another
guest, Person 1. Person 1 wished him a happy birthday and expressed her
desire to dance with him. Brian agreed, and they both enjoyed the party.
Person 1 complimented Brian on his popularity and his appearance, and
they both agreed that it would be nice to have a drink together to celebrate."""
==========
Prompt: 628.578 tokens-per-sec
Generation: 64.615 tokens-per-sec


----------------------------------------------------------------------------------
#### With the trained adapter
model, tokenizer = load(path_or_hf_repo="mlx_path",
                        adapter_path=adapt)   # adapt: path to the newly trained adapter weights
response = generate(model, tokenizer, prompt=text, verbose=True, temp=0.01, max_tokens=1300)

###### Output
"""Brian's having a birthday party. #Person1# dances with him and compliments
him. They have a drink together to celebrate his birthday..
Topic is : birthday party"""
==========
Prompt: 632.099 tokens-per-sec
Generation: 52.530 tokens-per-sec

Verdict: we can clearly see that without fine-tuning the LLM gives a generalized summary, but after fine-tuning it has understood the expected output pattern and its style.

Comparison: base LLM vs fine-tuned LLM based on BERTScore and ROUGE score

from mlx_lm import load, generate

# mixtral and adapt_path are assumed to hold the quantized model path
# (e.g. "mlx_path") and the trained adapter path respectively
model, tokenizer = load(path_or_hf_repo=mixtral,
                        adapter_path=adapt_path,
                        tokenizer_config={"trust_remote_code": True})

# Base model without the adapter, for comparison
raw_model, tokenizer = load(path_or_hf_repo=mixtral,
                            tokenizer_config={"trust_remote_code": True})
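
The loop below calls a model_pred helper that is not shown in the article; presumably it wraps the raw input text in the same chat template used during training. A minimal sketch of such a helper (the name and exact implementation are assumptions):

def model_pred(input_text: str) -> str:
    # Wrap the raw dialogue in a user turn and let the tokenizer
    # apply the base model's chat template, ending with the assistant cue
    messages = [{"role": "user", "content": input_text}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )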

## Loading the test data
import json
from tqdm import tqdm as tq   # progress bar (import not shown in the original)

data = []
with open('data/test.jsonl') as f:
    for line in f:
        data.append(json.loads(line))

trueans = []
fine_tune_prediction = []
normal_prediction = []
for i in tq(range(len(data))):
    inputtxt = data[i]['messages'][0]['content']    # user prompt (the dialogue)
    actual_txt = data[i]['messages'][1]['content']  # reference summary + topic
    prompt = model_pred(inputtxt)

    # Prediction from the fine-tuned model (base model + adapter)
    response = generate(model, tokenizer, prompt=prompt, verbose=False, temp=0.01, max_tokens=300)
    trueans.append(actual_txt)
    fine_tune_prediction.append(response.replace("\n", ""))

    # Prediction from the base model without the adapter
    response = generate(raw_model, tokenizer, prompt=prompt, verbose=False, temp=0.01, max_tokens=300)
    normal_prediction.append(response.replace("\n", ""))
#### BERTScore
from evaluate import load as load_metric   # alias so we don't shadow mlx_lm's load

bertscore = load_metric("bertscore")

# Fine-tuned model predictions vs. the reference answers
results = bertscore.compute(predictions=fine_tune_prediction,
                            references=trueans, lang="en")

# Base model predictions vs. the reference answers
not_fine_tune_results = bertscore.compute(predictions=normal_prediction,
                                          references=trueans, lang="en")
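
Note that bertscore.compute returns per-example lists of scores rather than single numbers; the figures reported at the end of this post are presumably averages over the test set, which you can get along these lines:

import numpy as np

# Average the per-example precision/recall/F1 lists into single scores
for name, res in [("Fine-tuned LLM", results), ("Base LLM", not_fine_tune_results)]:
    print(name, {k: round(float(np.mean(res[k])), 3) for k in ("precision", "recall", "f1")})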


#### ROUGE score
import evaluate
import numpy as np

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=normal_prediction,
    references=trueans,
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=fine_tune_prediction,
    references=trueans,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

I ran the base LLM and the fine-tuned LLM over the complete test set. Here are the findings.

# BERTScore
Base LLM
precision : 0.846
recall : 0.86
F1 : 0.85

Fine-tuned LLM
precision : 0.911
recall : 0.923
F1 : 0.914

Here is the comparison of ROUGE scores (here the PEFT model is the fine-tuned LLM and the original model is the base model).

We can see the ROUGE metrics get a good jump after fine-tuning.

Thanks for reading up to here…
If you liked it, please upvote it, and if you have any doubts, please write them in the comments. I will try to answer as soon as possible.
