Fine-tune LLMs on Laptop With QLoRA & MLX

Privacy-Preserving LLMs Without a GPU

Deltaaruna
Effectz.AI
18 min read · Mar 26, 2024


1. Introduction

There are situations where you have to fine-tune LLMs, because for certain custom tasks fine-tuning gives better results. The biggest issue with fine-tuning an LLM is cost: it is prohibitively expensive to fine-tune a model for a particular task. And the problem does not end there. Assume you have completely different datasets for completely different tasks. If each task requires its own dataset, you may end up fine-tuning a separate LLM for each one. Imagine you are a company that uses customized on-prem LLMs: you might need models trained on your sales data, on internal workflows, and so on. Since the datasets are quite different, each task would need its own fine-tuned model, which becomes extremely costly, and you might be tempted to go back to cloud-based AI despite the privacy concerns. Can we work around this situation?

In this post, I explore a solution to this problem by walking through the process of fine-tuning Large Language Models (LLMs) on consumer-grade computers, such as personal laptops, using Low-Rank Adaptation (LoRA) and QLoRA. I provide an example of fine-tuning an LLM with LoRA and QLoRA, implemented with Apple’s MLX framework. All source code for this post is available on GitHub. To follow along, please clone the repository and continue with the post.

2. LoRA

2.1. What is LoRA

LoRA (Low-Rank Adaptation) tries to solve the above-mentioned problem by introducing a tiny adapter for each training dataset. The idea behind LoRA is that a matrix can be approximated by a low-rank decomposition when it contains redundant structure. Assume your LLM has 50 billion parameters. LoRA does not try to compress those 50 billion parameters into, say, 1 billion. Instead, it represents the weight update matrix produced during fine-tuning with a low-rank matrix, and despite this low-rank representation you get almost the same performance.

We do not actually decompose an existing matrix; instead, we learn the decomposed matrices directly via backpropagation. Put another way: normally, fine-tuning updates the full weight update matrix. With LoRA, that update is expressed as the product of two much smaller matrices, so the number of trainable parameters drops dramatically without compromising accuracy. This is the basic idea behind LoRA. Let’s discuss it further. A typical fine-tuning pass looks like the following three steps.

  1. Forward Pass with Original Model: Here the input data, represented by x, is fed into the model. The pretrained weights, denoted as W, are used to process the inputs and generate the embeddings h.
  2. Obtain Weight Update via Backpropagation: In this step, backpropagation is used to calculate the change in weights, represented by ΔW. This is the adjustment that needs to be made to the original weights of the model based on the loss gradient.
  3. Forward Pass with Updated Model: The updated weights, W′, which are the result of applying the weight update ΔW to the original weights W, are now used in another forward pass with the inputs x to produce a new set of embeddings h.

Here W′ = W + ΔW, so the outputs can be computed as h = W′x = Wx + ΔWx. In other words, the fine-tuned output is the frozen pretrained term Wx plus the update term ΔWx.

If we freeze W, only ΔW needs to be learned, and ΔW is what we represent with a low-rank decomposition. In other words, for the weight update you do not need to touch the billions of parameters in the original model; the update can be captured by a small pair of matrices. For simplicity, let’s assume that LLMs consist of fully connected layers.

2.2. Fully Connected Layers

A fully connected layer in a neural network connects every input to every output within that layer through a set of weights and biases. Mathematically, the operation is

Y = XW + B

where:
X is the input matrix
W is the weight matrix
B is the bias vector
Y is the output matrix
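
As a quick illustration (a toy sketch, not code from the repo), this is exactly what an MLX nn.Linear layer computes:

import mlx.core as mx
import mlx.nn as nn

# Toy fully connected layer: 4 inputs -> 3 outputs
layer = nn.Linear(input_dims=4, output_dims=3)
X = mx.random.normal((2, 4))          # a batch of two input vectors
Y = X @ layer.weight.T + layer.bias   # Y = XW + B (MLX stores W transposed)
print(mx.allclose(Y, layer(X)))       # the layer's own forward pass matches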

These fully connected layers are considered full-rank. In linear algebra, the rank of a matrix is the dimension of the vector space generated (or spanned) by its columns or rows. A matrix has full rank when its rank is as high as its dimensions allow, which means its rows (or columns) are linearly independent.

Since fully connected layers are full rank, they cannot be losslessly decomposed into low-rank matrices; their rows and columns carry independent information. So… are we at a dead end?

When these LLMs are fine-tuned for a new task, however, something interesting can be observed: although the models are very large, with billions of parameters, only a small subspace of the parameter space is relevant for any given task.

This observation is very interesting, so let’s think about it for a moment. These LLMs are trained on large amounts of data and have billions of parameters, with weights adjusted to capture as much of the structure of natural language as possible. The result is full-rank weight matrices with little room to convert them into low-rank ones. But when we adapt such a model to a new task (fine-tuning), it can be done by adjusting a relatively small portion of the weights while keeping the rest frozen. Although the weight matrix is full rank, the number of parameters that actually needs to change during fine-tuning is relatively small: the number of directions in parameter space that must move is low compared to the total number of parameters. This is because many of the pretrained patterns are general enough to apply across tasks, so only a specific subset needs to change. Put another way, the weight updates obtained during fine-tuning do not span the entire space; they occupy a lower-dimensional subspace, so the update matrix can be decomposed into a low-rank matrix. This is the trick behind LoRA.

Here the original weight block is W, and alongside it sit two new weight matrices, labeled Wₖ and Wₗ. Together they represent a decomposition of the weight update ΔW into two smaller matrices, where:

  1. Wₖ is a matrix of dimension K×r
  2. Wₗ is a matrix of dimension r×L

Here, r is the rank, which is typically much smaller than K or L, so we can call the decomposition low-rank. The product WₖWₗ still has the full K×L shape of ΔW, but only (K + L)·r values need to be stored and trained.
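
To make this concrete, here is a minimal LoRA-style linear layer sketched with MLX. It is an illustration of the idea, not the exact adapter implementation used in the repo: the pretrained layer stays frozen, and only the two small factors, whose product plays the role of ΔW, are trained.

import math
import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    """y = Wx + (x A) B, where the small factors A and B together stand in for ΔW."""

    def __init__(self, linear: nn.Linear, r: int = 8, scale: float = 1.0):
        super().__init__()
        out_dims, in_dims = linear.weight.shape
        self.linear = linear  # frozen pretrained weights W (and bias)
        self.scale = scale
        bound = 1.0 / math.sqrt(in_dims)
        self.lora_a = mx.random.uniform(low=-bound, high=bound, shape=(in_dims, r))
        self.lora_b = mx.zeros((r, out_dims))  # zero init, so ΔW = 0 at the start

    def __call__(self, x):
        y = self.linear(x)                   # W x (full rank, frozen)
        z = (x @ self.lora_a) @ self.lora_b  # ΔW x via the low-rank factors
        return y + self.scale * z

With r much smaller than the layer’s input and output dimensions, lora_a and lora_b together add only a tiny fraction of the original parameter count.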

2.3. Similarity with auto-encoder

In the context of dimensionality reduction, it helps to think of how an auto-encoder works. An auto-encoder reduces data to a lower-dimensional space (encoding) and then reconstructs it back to the original space (decoding). In LoRA, Wₖ and Wₗ can be seen as performing a similar reduction and expansion, but applied to the weight update rather than to the data.

2.4. Use cases

LoRA makes fine-tuning and serving LLMs truly scalable. With traditional fine-tuning you need a separate model for each use case: ten use cases means fine-tuning and running ten different LLMs. With LoRA, you can keep one base LLM and train and run a small adapter per use case.

2.5. Limitations

LoRA makes LLM fine-tuning very efficient, because the adapters are only kilobytes or megabytes in size. But the full complexity of the original model is still there: to train the LoRA adapters you still have to run the original LLM, which takes a lot of GPU memory. QLoRA, while still using LoRA, reduces the memory requirements of the original model. The basic idea is quantization.

3. QLoRA

3.1 What is QLoRA?

QLoRA adds quantization to LoRA. Quantization basically means accepting that the numbers do not need to be that precise. The idea is well known from analog-to-digital conversion (ADC). The first step in ADC is sampling, which involves measuring the amplitude of the analog signal at regular intervals; the rate at which the signal is sampled is the sampling frequency. After sampling, each sampled value is assigned a digital value, and quantization is the process that does this: it rounds the amplitude of each sample to the nearest value the digital system can represent. This step introduces quantization error, or noise, because the digital representation is an approximation of the analog signal.

  1. Discrete Levels: An ADC has a specific number of levels it can output, which is determined by its resolution, typically denoted in bits. For example, an 8-bit ADC can represent the input signal with any of 2⁸ (256) different levels.
  2. Rounding: Each continuous sample value is rounded to the nearest of these discrete levels. This is the actual quantization step. Since the number of discrete levels is finite, this rounding process introduces an error (called quantization error or quantization noise), because not every possible amplitude value can be represented exactly.
  3. Quantization Error: The error is the difference between the actual sampled value and the quantized value. The finer the quantization levels (more bits), the smaller the error. Conversely, a coarser quantization (fewer bits) leads to a larger quantization error. The quantization error can be thought of as a form of distortion or noise that is added to the signal during the quantization process.
  4. Encoding: Finally, each quantized value is encoded into a binary format. The number of bits used in this binary representation determines the resolution of the ADC. More bits allow for more discrete levels and finer resolution, which means a more accurate representation of the analog signal.
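
A tiny numeric sketch of these four steps, assuming a hypothetical 8-bit ADC over a 0-1 V range:

# Quantizing one sampled value with a hypothetical 8-bit ADC (256 levels over 0-1 V)
v = 0.6037                      # sampled analog amplitude in volts
level = round(v * 255)          # nearest discrete level -> 154
quantized = level / 255         # ~0.6039 V is what the digital side "sees"
error = quantized - v           # quantization error, ~0.0002 V
encoded = format(level, "08b")  # binary encoding of the level: '10011010'
print(level, quantized, error, encoded)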

The same quantization idea can be applied to LLMs, because the in-memory representation of the weights can be made much coarser without sacrificing much accuracy.

Here we can keep the LoRA layers in full precision (because they are so tiny), but the base LLM does not need 16-bit precision; a 4-bit representation is enough. So the LLM is converted to 4-bit precision and its weights are frozen.

QLoRA introduces a new quantization method built around a new data type called 4-bit NormalFloat (NF4). It also uses double quantization, which reduces the average memory footprint by quantizing the quantization constants themselves.

This is similar to what happens in mixed-precision training with loss scaling: when you quantize something, you keep the quantization constant (the scale) so that you can later dequantize, accepting that a small error is introduced along the way.
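
To see what that means for weights, here is a toy symmetric 4-bit quantizer with one scale (the quantization constant) per group of weights. It is only a stand-in for QLoRA’s NF4 scheme, not its actual implementation:

import numpy as np

def quantize_4bit(w, group_size=64):
    # One scale ("quantization constant") per group of 64 weights
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # values fit in 4 bits
    return q, scales

def dequantize_4bit(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(8, 64).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
print(np.abs(w - w_hat).max())  # small, but non-zero: the quantization error

Double quantization then applies the same trick to the scales themselves, which is how QLoRA shrinks the overhead of storing one constant per group.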

3.2 Loss Scaling in Mixed Precision Training

  1. Scale Up (Squashing): Before performing backpropagation, the computed loss is multiplied by a scaling factor. This scaling factor is a large number chosen to ensure that gradient values don’t become too small when using half-precision (16-bit floats). The small gradients could otherwise underflow and turn to zero due to the limited precision.
  2. Backpropagation: After scaling up the loss, the backpropagation algorithm calculates the gradients. Because the loss was scaled up, the gradients are also proportionally larger, avoiding the underflow problem.
  3. Scale Down (Dequantization): After the gradients are calculated, they are scaled back down by the same factor before they are applied to update the weights. This step is like “dequantization,” where the “squashed/Scaled up” values are brought back to their intended scale.

The QLoRA paper describes a “quantization constant” that plays a role similar to the scaling factor used in loss scaling. Such a constant is crucial in mixed-precision training because it prevents the loss of information during gradient computation and ensures the model continues to learn effectively even with half-precision arithmetic.

3.3. Multiplication After Update

To be clear, the multiplication by the scaling factor happens before the gradients are computed, not after the weight update. The process is as follows:

  1. Multiply (scale up) the loss by the constant before backpropagation.
  2. Compute the gradients in half-precision.
  3. Divide (scale down) the gradients by the constant.
  4. Apply the scaled-down gradients to update the full-precision weights.
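
A toy numeric illustration of these four steps (not code from the repo):

import numpy as np

grad_fp32 = np.float32(2e-8)               # a very small gradient
print(np.float16(grad_fp32))               # 0.0 -> it underflows in half precision
scale = np.float32(1024.0)                 # 1. scale up (applied to the loss,
                                           #    so gradients are scaled too)
grad_fp16 = np.float16(grad_fp32 * scale)  # 2. "backprop" in fp16: ~2.05e-05, representable
grad_unscaled = np.float32(grad_fp16) / scale  # 3. scale back down: ~2e-08 recovered
# 4. apply grad_unscaled to the full-precision weights as usual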

3.4. Why Use Loss Scaling?

Without loss scaling, the gradients of many operations would be too small to represent with 16-bit floats and would underflow to zero, meaning no learning occurs for those weights. Loss scaling artificially increases the magnitude of the gradients to keep them within a range that half precision can represent, so the network can continue to learn. The QLoRA authors also use paged optimizers to handle memory spikes during training. The original QLoRA paper includes a figure that summarizes all of this in a single image; the three setups it compares are described below.

Full Fine-tuning (No Adapters): This requires maintaining a large optimizer state in 32-bit precision to keep track of statistics such as gradients, weight updates, and running averages. Because the entire model is being updated, this approach is memory-intensive.

LoRA Fine-tuning (Adapters): LoRA applies low-rank updates to specific layers of the transformer, so the “LoRA blocks” are more focused and require a much smaller optimizer state. This decreases the memory requirements compared to full fine-tuning, since only a limited set of parameters is trained.

QLoRA Fine-tuning (Quantization + Adapters): QLoRA additionally stores the frozen base model in 4-bit precision. This aggressive quantization greatly reduces the memory required for the model weights, and paged optimizers manage the remaining memory spikes.
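
Some rough back-of-the-envelope arithmetic (my own numbers, not from the paper) shows why this matters. For a 7-billion-parameter model, the weights alone take roughly:

params = 7e9
print(params * 4 / 1e9)    # ~28 GB as 32-bit floats
print(params * 2 / 1e9)    # ~14 GB as 16-bit floats
print(params * 0.5 / 1e9)  # ~3.5 GB as 4-bit values, before the (tiny) adapters

Optimizer state and activations come on top of that, which is exactly what the small adapters and paged optimizers help keep under control.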

4. Implementation with Apple MLX

The full implementation of LLM fine-tuning with Apple’s MLX framework is outlined below. The entire source code for the example is available for access and review on GitHub.

4.1 Download LLM

First, I download an existing model from Hugging Face. The fetch_from_hub and save_model functions defined in utils.py handle downloading the model and saving it locally. A variety of LLMs are supported, including llama, mistral, phi and mixtral.



import glob
import json
import logging
from pathlib import Path
from typing import Generator

import mlx.core as mx
import mlx.nn as nn
import models.llama as llama
import models.mixtral as mixtral
import models.phi2 as phi2
import transformers
from huggingface_hub import snapshot_download

# Constants
MODEL_MAPPING = {
    "llama": llama,
    "mistral": llama,  # mistral is compatible with llama
    "phi": phi2,
    "mixtral": mixtral,
}


def _get_classes(config: dict):
    """
    Retrieve the model and model args classes based on the configuration.

    Args:
        config (dict): The model configuration.

    Returns:
        A tuple containing the Model class and the ModelArgs class.
    """
    model_type = config["model_type"]
    if model_type not in MODEL_MAPPING:
        msg = f"Model type {model_type} not supported."
        logging.error(msg)
        raise ValueError(msg)

    arch = MODEL_MAPPING[model_type]
    return arch.Model, arch.ModelArgs


def fetch_from_hub(hf_path: str):
    model_path = snapshot_download(
        repo_id=hf_path,
        allow_patterns=["*.json", "*.safetensors", "tokenizer.model"],
    )
    weight_files = glob.glob(f"{model_path}/*.safetensors")
    if len(weight_files) == 0:
        raise FileNotFoundError("No safetensors found in {}".format(model_path))

    weights = {}
    for wf in weight_files:
        weights.update(mx.load(wf).items())

    config = transformers.AutoConfig.from_pretrained(hf_path)
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        hf_path,
    )
    return weights, config.to_dict(), tokenizer


def make_shards(weights: dict, max_file_size_gibibyte: int = 15):
    max_file_size_bytes = max_file_size_gibibyte << 30
    shards = []
    shard, shard_size = {}, 0
    for k, v in weights.items():
        if shard_size + v.nbytes > max_file_size_bytes:
            shards.append(shard)
            shard, shard_size = {}, 0
        shard[k] = v
        shard_size += v.nbytes
    shards.append(shard)
    return shards


def save_model(save_dir: str, weights, tokenizer, config):
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    shards = make_shards(weights, max_file_size_gibibyte=5)
    shards_count = len(shards)
    shard_file_format = (
        "model-{:05d}-of-{:05d}.safetensors"
        if shards_count > 1
        else "model.safetensors"
    )

    for i, shard in enumerate(shards):
        shard_name = shard_file_format.format(i + 1, shards_count)
        mx.save_safetensors(str(save_dir / shard_name), shard)

    tokenizer.save_pretrained(save_dir)

    with open(save_dir / "config.json", "w") as fid:
        json.dump(config, fid, indent=4)


def load(path_or_hf_repo: str):
    # If the path exists, try to load the model from it,
    # otherwise download it from the hf_repo and cache it.
    model_path = Path(path_or_hf_repo)
    if not model_path.exists():
        model_path = Path(
            snapshot_download(
                repo_id=path_or_hf_repo,
                allow_patterns=["*.json", "*.safetensors", "tokenizer.model"],
            )
        )

    with open(model_path / "config.json", "r") as f:
        config = json.loads(f.read())
        quantization = config.get("quantization", None)

    weight_files = glob.glob(str(model_path / "*.safetensors"))
    if len(weight_files) == 0:
        raise FileNotFoundError("No safetensors found in {}".format(model_path))

    weights = {}
    for wf in weight_files:
        weights.update(mx.load(wf).items())

    model_class, model_args_class = _get_classes(config=config)
    model_args = model_args_class.from_dict(config)
    model = model_class(model_args)
    if quantization is not None:
        nn.QuantizedLinear.quantize_module(
            model,
            **quantization,
            linear_class_predicate=lambda m: isinstance(m, nn.Linear)
            and m.weight.shape[0] != 8,
        )

    model.load_weights(list(weights.items()))

    mx.eval(model.parameters())
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer, config


def generate(
    prompt: mx.array, model: nn.Module, temp: float = 0.0
) -> Generator[mx.array, None, None]:
    """
    Generate text based on the given prompt and model.

    Args:
        prompt (mx.array): The input prompt.
        model (nn.Module): The model to use for generation.
        temp (float): The temperature for sampling. If temp is 0, use max sampling.

    Yields:
        mx.array: The generated text.
    """

    def sample(logits: mx.array) -> mx.array:
        return (
            mx.argmax(logits, axis=-1)
            if temp == 0
            else mx.random.categorical(logits * (1 / temp))
        )

    y = prompt
    cache = None
    while True:
        logits, cache = model(y[None], cache=cache)
        logits = logits[:, -1, :]
        y = sample(logits)
        yield y
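
A minimal usage sketch of these helpers (the repo id, local path and prompt are just examples):

import mlx.core as mx
from utils import fetch_from_hub, save_model, load, generate

# Download from Hugging Face and save locally in MLX format
weights, config, tokenizer = fetch_from_hub("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
save_model("./TinyLlama-1.1B-Chat-v1.0", weights, tokenizer, config)

# Load it back and stream a few tokens from a prompt
model, tokenizer, config = load("./TinyLlama-1.1B-Chat-v1.0")
prompt = mx.array(tokenizer.encode("Hello, "))
tokens = []
for token, _ in zip(generate(prompt, model), range(20)):
    tokens.append(token.item())
print(tokenizer.decode(tokens))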

4.2 Quantize LLM

The next step is quantization of the LLM. The quantize function defined in save.py handles the LLM quantization. It takes three parameters: weights, config, and args.

import argparse
import copy

import mlx.core as mx
import mlx.nn as nn
import utils
from mlx.utils import tree_flatten


def quantize(weights, config, args):
    quantized_config = copy.deepcopy(config)

    # Get model classes
    model_class, model_args_class = utils._get_classes(config=config)

    # Load the model:
    model = model_class(model_args_class.from_dict(config))
    model.load_weights(list(weights.items()))

    # Quantize the model:
    nn.QuantizedLinear.quantize_module(
        model,
        args.q_group_size,
        args.q_bits,
        linear_class_predicate=lambda m: isinstance(m, nn.Linear)
        and m.weight.shape[0] != 8,
    )

    # Update the config:
    quantized_config["quantization"] = {
        "group_size": args.q_group_size,
        "bits": args.q_bits,
    }
    quantized_weights = dict(tree_flatten(model.parameters()))

    return quantized_weights, quantized_config


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Convert Hugging Face model to MLX format"
    )
    parser.add_argument(
        "--hf-path",
        type=str,
        help="Path to the Hugging Face model.",
    )
    parser.add_argument(
        "--mlx-path",
        type=str,
        default="mlx_model",
        help="Path to save the MLX model.",
    )
    parser.add_argument(
        "-q",
        "--quantize",
        help="Generate a quantized model.",
        action="store_true",
    )
    parser.add_argument(
        "--q-group-size",
        help="Group size for quantization.",
        type=int,
        default=64,
    )
    parser.add_argument(
        "--q-bits",
        help="Bits per weight for quantization.",
        type=int,
        default=4,
    )
    parser.add_argument(
        "--dtype",
        help="Type to save the parameters, ignored if -q is given.",
        type=str,
        choices=["float16", "bfloat16", "float32"],
        default="float16",
    )

    args = parser.parse_args()

    print("[INFO] Loading")
    weights, config, tokenizer = utils.fetch_from_hub(args.hf_path)

    dtype = mx.float16 if args.quantize else getattr(mx, args.dtype)
    weights = {k: v.astype(dtype) for k, v in weights.items()}
    if args.quantize:
        print("[INFO] Quantizing")
        weights, config = quantize(weights, config, args)

    utils.save_model(args.mlx_path, weights, tokenizer, config)

Inside the function, it creates a deep copy of the config object, ensuring that the original config is not modified during the quantization process.

quantized_config = copy.deepcopy(config)

Then the code retrieves the model class and its arguments class from the config using a utility function (_get_classes). The model’s class and its initialization arguments are configurable and stored in the config dictionary.

model_class, model_args_class = utils._get_classes(config=config)

Then the script initializes the model using the classes retrieved earlier. It converts the config dictionary into an arguments object and loads the pre-trained weights into the model.

model = model_class(model_args_class.from_dict(config))
model.load_weights(list(weights.items()))

Then we do the actual quantization. The nn.QuantizedLinear.quantize_module method is called, which quantizes the model in place; args.q_group_size and args.q_bits specify the details of the quantization. The linear_class_predicate is a lambda that returns True for layers that should be quantized: it checks that the module is an instance of nn.Linear and that its weight’s first dimension is not equal to 8, which excludes certain small layers (such as the Mixtral mixture-of-experts gating layer, whose weight has one row per expert) from quantization.

nn.QuantizedLinear.quantize_module(
    model,
    args.q_group_size,
    args.q_bits,
    linear_class_predicate=lambda m: isinstance(m, nn.Linear)
    and m.weight.shape[0] != 8,
)

Finally we update the config and obtain the quantized weights of the model.

# Update the config:
quantized_config["quantization"] = {
    "group_size": args.q_group_size,
    "bits": args.q_bits,
}
quantized_weights = dict(tree_flatten(model.parameters()))

4.3 Fine-tune LLM

Now the quantization (Q) part of QLoRA is done. Next is the LoRA part: creating low-rank adapters from the fine-tuning data. The train function in train.py handles training and saving the adapters, as shown below.

from pathlib import Path
import json
import time
import numpy as np

import mlx.nn as nn
import mlx.core as mx
from mlx.utils import tree_flatten, tree_unflatten


class Dataset:
    """
    Light-weight wrapper to hold lines from a jsonl file
    """

    def __init__(self, path: Path, key: str = "text"):
        if not path.exists():
            self._data = None
        else:
            with open(path, "r") as fid:
                self._data = [json.loads(l) for l in fid]
        self._key = key

    def __getitem__(self, idx: int):
        return self._data[idx][self._key]

    def __len__(self):
        return len(self._data)


def load(args):
    def load_and_check(name):
        dataset_path = Path(args.data) / f"{name}.jsonl"
        try:
            return Dataset(dataset_path)
        except Exception as e:
            print(f"Unable to build dataset {dataset_path} ({e})")
            raise

    names = ("train", "valid", "test")
    train, valid, test = (load_and_check(n) for n in names)

    if args.train and len(train) == 0:
        raise ValueError(
            "Training set not found or empty. Must provide training set for fine-tuning."
        )
    if args.train and len(valid) == 0:
        raise ValueError(
            "Validation set not found or empty. Must provide validation set for fine-tuning."
        )
    if args.test and len(test) == 0:
        raise ValueError(
            "Test set not found or empty. Must provide test set for evaluation."
        )
    return train, valid, test


def loss(model, inputs, targets, lengths):
    # Run model on inputs
    logits, _ = model(inputs)
    logits = logits.astype(mx.float32)

    # Mask padding tokens
    length_mask = mx.arange(inputs.shape[1])[None, :] < lengths[:, None]

    # Calculate the loss
    ce = nn.losses.cross_entropy(logits, targets) * length_mask
    ntoks = length_mask.sum()
    ce = ce.sum() / ntoks
    return ce, ntoks


def evaluate(model, dataset, loss, tokenizer, batch_size, num_batches):
    all_losses = []
    ntokens = 0
    for it, batch in zip(
        range(num_batches),
        iterate_batches(dataset, tokenizer, batch_size),
    ):
        losses, toks = loss(model, *batch)
        all_losses.append((losses * toks).item())
        ntokens += toks.item()

    return np.sum(all_losses) / ntokens


def iterate_batches(dataset, tokenizer, batch_size, train=False):
    # Shuffle indices
    while True:
        indices = np.arange(len(dataset))
        if train:
            indices = np.random.permutation(indices)

        # Collect batches from dataset
        for i in range(0, len(indices) - batch_size + 1, batch_size):
            # Encode batch
            batch = [
                tokenizer.encode(dataset[indices[i + j]]) for j in range(batch_size)
            ]
            lengths = [len(x) for x in batch]

            # Check if any sequence is longer than 2048 tokens
            if max(lengths) > 2048:
                print(
                    "[WARNING] Some sequences are longer than 2048 tokens. "
                    "You can pre-split your data to save memory."
                )

            # Pad to the max length
            batch_arr = np.zeros((batch_size, max(lengths)), np.int32)

            for j in range(batch_size):
                batch_arr[j, : lengths[j]] = batch[j]
            batch = mx.array(batch_arr)
            yield batch[:, :-1], batch[:, 1:], mx.array(lengths)

        if not train:
            break


def train(model, training_set, validation_set, optimizer, loss, tokenizer, args):
    # Create value and gradient function for loss
    loss_value_and_gradient = nn.value_and_grad(model, loss)

    losses = []
    n_tokens = 0

    # Main training loop
    start = time.perf_counter()
    for it, batch in zip(
        range(args.iters),
        iterate_batches(training_set, tokenizer, args.batch_size, train=True),
    ):
        # Forward and backward pass
        (lvalue, toks), grad = loss_value_and_gradient(model, *batch)

        # Model update
        optimizer.update(model, grad)
        mx.eval(model.parameters(), optimizer.state, lvalue)

        # Record loss
        losses.append(lvalue.item())
        n_tokens += toks.item()

        # Report training loss
        if (it + 1) % args.steps_per_report == 0:
            training_loss = np.mean(losses)

            stop = time.perf_counter()
            print(
                f"Iteration {it + 1}: Training loss {training_loss:.3f}, "
                f"Iterations/sec {args.steps_per_report / (stop - start):.3f}, "
                f"Tokens/sec {float(n_tokens) / (stop - start):.3f}"
            )
            losses = []
            n_tokens = 0
            start = time.perf_counter()

        # Report validation loss
        if it == 0 or (it + 1) % args.steps_per_eval == 0:
            stop = time.perf_counter()
            validation_loss = evaluate(
                model, validation_set, loss, tokenizer, args.batch_size, args.val_batches
            )
            print(
                f"Iteration {it + 1}: "
                f"Validation loss {validation_loss:.3f}, "
                f"Validation took {(time.perf_counter() - stop):.3f}s"
            )

            start = time.perf_counter()

        # Save adapter weights
        if (it + 1) % args.save_every == 0:
            mx.savez(
                args.adapter_file, **dict(tree_flatten(model.trainable_parameters()))
            )
            print(f"Iteration {it + 1}: Saved adapter weights to {args.adapter_file}.")

First, nn.value_and_grad builds a function that computes both the loss value and its gradients with respect to the model’s trainable parameters.

loss_value_and_gradient = nn.value_and_grad(model, loss)

The main training loop then iterates over batches of data. The iterate_batches function generates batches from the training set, tokenizes the input, and pads it for training. The loop runs for the number of iterations specified by args.iters.

for it, batch in zip(
    range(args.iters),
    iterate_batches(training_set, tokenizer, args.batch_size, train=True),
):

For each batch, it computes the loss value and the number of tokens (toks), as well as the gradients (grad) with respect to the model's trainable parameters.

(lvalue, toks), grad = loss_value_and_gradient(model, *batch)

It then updates the model’s weights using the gradients and the optimizer, and calls mx.eval to force evaluation of the lazily computed parameters, optimizer state and loss value.

optimizer.update(model, grad)
mx.eval(model.parameters(), optimizer.state, lvalue)

It records the loss value for the current batch and updates the running count of processed tokens.

losses.append(lvalue.item())
n_tokens += toks.item()

Finally, it periodically saves the adapter weights.

mx.savez(
    args.adapter_file, **dict(tree_flatten(model.trainable_parameters()))
)
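
For completeness: the repo’s run.py (not reproduced here) is what wires these pieces together. Conceptually it does something like the following sketch, which reuses the LoRALinear idea from Section 2; the layer attribute names and hyperparameters are assumptions, and a real implementation also needs to handle nn.QuantizedLinear layers:

import mlx.optimizers as optim
import utils

model, tokenizer, config = utils.load("./TinyLlama-1.1B-Chat-v1.0")
model.freeze()  # the quantized base weights stay fixed

# Swap low-rank adapters into the attention projections of the last few blocks
# (attribute paths below are hypothetical and depend on the model definition)
lora_layers = 4
for layer in model.model.layers[-lora_layers:]:
    layer.self_attn.q_proj = LoRALinear(layer.self_attn.q_proj, r=8)
    layer.self_attn.v_proj = LoRALinear(layer.self_attn.v_proj, r=8)

# training_set, validation_set and args come from load(args) and the CLI parsing
optimizer = optim.Adam(learning_rate=1e-5)
train(model, training_set, validation_set, optimizer, loss, tokenizer, args)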

5. Run application

Here are the essential steps to execute the application: check out the source code, download the model, generate training data, train the model, and finally, run it.

# clone the source
git clone https://github.com/deltaaruna/qlora-mlx.git


---


# install dependencies
pip install -r requirements.txt


---


# download model from huggingface and save it as 4-bit quantized model
# command
python save.py --hf-path <hf_repo> -q --mlx-path <location>

# example
# download `TinyLlama-1.1B-Chat-v1.0` and save the quantized model in your current directory
python save.py --hf-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 -q --mlx-path ./TinyLlama-1.1B-Chat-v1.0


---


# generate training data
brew install chenhunghan/homebrew-formulae/mlx-training-rs
export OPENAI_API_KEY=[Your openai key]
mlxt --topic="the topic you are interested" -n=50

# example use case to generate data set for cyber security case laws
mlxt --topic="cyber security case laws of USA" -n=100


---


# train
# command
python run.py --train --model <model_location> --data <data_location> --batch-size <batch_size> --lora-layers <layers>

# example train TinyLlama-1.1B-Chat-v1.0 model
python run.py --train --model ./TinyLlama-1.1B-Chat-v1.0 --data ./data --batch-size 1 --lora-layers 4


---


# run and ask question
# command
python run.py --model <model_location> \
--adapter-file <adapter_file_location> \
--max-tokens <max_tokens> \
--prompt <prompt>

# example question asked from llm
python run.py --model ./TinyLlama-1.1B-Chat-v1.0 \
--adapter-file ./adapters.npz \
--max-tokens 50 \
--prompt "
Q: What are the legal implications of a cyber attack under the cyber security case laws of the USA?
A: "

# answer provided by llm
Key criteria used in determining penalties for cyber security violations under US case law include:
1. Injury to interstate commerce
2. Certain financial losses
3. Imminent dange

⭐️ Follow me on LinkedIn or Twitter for updates on AI ⭐️

I’m currently the Co-Founder & CEO @ Effectz.AI. We specialize in Privacy Preserving AI Solutions & AI Consulting.

References

  1. https://arxiv.org/abs/2106.09685
  2. https://arxiv.org/abs/2305.14314
  3. https://lightning.ai/pages/community/tutorial/lora-llm
  4. https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
  5. https://huggingface.co/blog/4bit-transformers-bitsandbytes
  6. https://ml-explore.github.io/mlx/build/html/index.html
