Efficient LLM Fine-Tuning: LoRA, DoRA, and Apple’s Innovative Approach

Yugen.ai Technology Blog · Jun 23, 2024 · 7 min read

Understanding the impact of LoRA adapters and breaking down the model-adaptation optimization techniques used to enhance the capabilities of foundation models.

By Soumanta Das

Introduction

Apple recently introduced its On-Device and Server Foundation Models, which are highly capable and specialized for users' daily tasks and interactions. Alongside the announcement, Apple Machine Learning Research released technical details about these models, covering the pre-training and post-training phases and some innovative techniques to optimize for speed, efficiency, and ease of deployment.

This blog focuses on understanding one such aspect of their design: using LoRA/DoRA adapters on the fly for specific tasks, thereby enriching the capabilities of the foundation models.

Some of the content below consists of best guesses and analytical explorations, which may not precisely reflect Apple's actual implementation. The intention is to use Apple's blog as inspiration to understand the technology better.

Context

At the 2024 Worldwide Developers Conference, Apple introduced Apple Intelligence, which comprises multiple generative models specialized for the everyday tasks that users perform. These models are also capable of adapting on the fly based on the user's current activity context. Fine-tuned models that are part of Apple Intelligence power user experiences such as writing text, prioritizing and summarizing notifications, and taking in-app actions to simplify interactions across different apps.

There are two models that Apple mentions in the aforementioned blog:

1. A ~3B on-device language model for tasks that can be done on-device.

2. A larger language model running on Apple Silicon Servers, for tasks that cannot be completed on the device.

Apple has implemented numerous innovative optimizations to enhance the speed and efficiency of their generative models, such as grouped-query attention, shared embedding tables, and low-bit palettization. Additionally, the use of adapters (each fine-tuned for a specific feature) helps scale the capabilities of the foundation model effectively.

The on-device model on the iPhone 15 Pro achieves a time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second.

In this blog, we’ll understand how Adapters work and why they are so effective.

LLM Fine-Tuning & Challenges

Large language models learn a lot about semantic concepts and language relationships from extensive training on large corpora of data, enabling them to understand and generate coherent text across various topics and contexts. This makes such models good at a broad range of tasks such as summarization, conversational interactions, translation, and on-the-fly reasoning without any task-specific fine-tuning.

However, LLMs do not always perform well on all kinds of tasks, especially those that are domain-specific or that were under-represented in the training data. Fine-tuning is one approach to improve the accuracy of LLMs in such cases: new task-specific data is fed to the model, and the model's weights are adjusted based on the calculated loss. This process uses backpropagation, where the gradient of the loss function is computed with respect to the weights and the weights are updated until the loss reaches a minimum. Fine-tuning aims to update an LLM's parametric memory/knowledge, as opposed to alternative approaches such as RAG (retrieval-augmented generation), which improves the LLM's performance using non-parametric memory/knowledge.

Since full fine-tuning involves updating all the weights in the model, it is time-consuming and memory-intensive.
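To make this concrete, here is a minimal sketch of a single full fine-tuning step. PyTorch is used purely for illustration, and the model, batch, and optimizer are placeholders rather than Apple's or MLX's actual training code:

import torch
from torch.optim import AdamW

def full_fine_tune_step(model, batch, optimizer):
    # Forward pass on the new task-specific data; Hugging Face-style causal LMs
    # return the cross-entropy loss when labels are supplied.
    outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
    loss = outputs.loss
    # Backpropagation: gradients of the loss w.r.t. every weight in the model.
    loss.backward()
    optimizer.step()        # every parameter gets updated
    optimizer.zero_grad()
    return loss.item()

# optimizer = AdamW(model.parameters(), lr=2e-5)  # all (billions of) parameters are trainable

Note that an optimizer like AdamW keeps two additional tensors per parameter, so the memory needed for gradients and optimizer state alone can be several times the size of the weights, which is what makes full fine-tuning of a multi-billion-parameter model so expensive.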

Adapters for Efficient Fine-Tuning

Adapters enable an LLM to adapt to new scenarios without changing its original parameters. Think of them as plug-and-play neural network modules that are inserted into the pre-trained model's layers. These adapter layers act as components that learn from the new task-specific data fed to the model during fine-tuning.

This design has the following benefits:

  • Existing capabilities/knowledge of the LLM are not lost, since the original parameters are unchanged.
  • The adapter layers are small neural networks, so the number of parameters that need to be updated is significantly lower. Fine-tuning can therefore be accomplished faster with fewer compute resources.
  • No added inference latency compared to fully fine-tuned models (this applies to adapter patterns, such as LoRA, whose update can be merged into the base weights).

A common and effective technique to train adapters is Low-Rank Adaptation (LoRA). There are similar variations of LoRA (DoRA, AdaLoRA, QLoRA, etc.); we'll go over the details of these approaches in a separate blog. A very quick, highly simplified illustration of LoRA adaptation is shown below:

Extremely simplified illustration of LoRA (for a 10,000-foot understanding only)
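To make the illustration concrete, here is a minimal sketch of a LoRA-wrapped linear layer. PyTorch is used purely for illustration, and the class and argument names are my own rather than taken from any particular library:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Computes base(x) + (alpha / r) * B(A(x)), with the base weights frozen
    # and only the small low-rank matrices A and B trained.
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # original knowledge stays intact
        self.lora_a = nn.Linear(base.in_features, r, bias=False)     # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)    # B: r -> d_out
        nn.init.zeros_(self.lora_b.weight)                           # start as a no-op: B A = 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

Because the learned update is just the low-rank product B·A, it can be merged back into the base weight matrix after training (W' = W + (alpha/r)·B·A), which is why LoRA-style adapters need not add any inference latency.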

Now we'll dive deeper into how adapters help scale the capabilities of foundation models in the context of the Apple Intelligence blog.

Adapters in On-device models

The Business Value of using Adapters

Memory is a key constraint when it comes to deploying LLMs on devices such as phones, tablets, and laptops. On one hand, a single extremely large LLM with higher base performance still won't be enough to support the diverse needs of users at scale, especially given the distribution Apple has. On the other hand, shipping multiple large LLMs, each fine-tuned for a specific task, is not plausible either, since these models would consume far more memory on the device than the increased performance across diverse tasks is worth.

Adapters are usually in the range of megabytes, so for Apple this approach makes a lot of sense. For specific tasks that power particular user, product, or app experiences, Apple can ship new or updated versions of adapters. Adapters are modular: they can be added to or removed from the base model on the fly as needed. Additionally, multiple adapters can be stacked or merged together. For example, summarizing unread mail threads and drafting a mail reply can be powered by two distinct adapters.

For the ~3 billion parameter on-device model, the parameters for a rank-16 adapter typically require tens of megabytes.
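A back-of-the-envelope calculation shows why. The numbers below (hidden size, number of adapted layers and matrices) are illustrative assumptions, not Apple's published configuration:

# Rough estimate of LoRA adapter size for a ~3B model (illustrative assumptions only).
hidden = 3072          # assumed hidden size
rank = 16              # adapter rank mentioned in Apple's blog
layers = 28            # assumed number of adapted decoder layers
mats_per_layer = 6     # e.g. attention projections + feed-forward matrices (assumption),
                       # each treated as roughly hidden x hidden for simplicity

params = layers * mats_per_layer * 2 * hidden * rank   # each adapted matrix adds A (d x r) and B (r x d)
size_mb = params * 2 / 1e6                              # ~2 bytes per parameter at 16-bit precision
print(f"~{params / 1e6:.0f}M adapter parameters = ~{size_mb:.0f} MB")   # lands in the tens of MB

Compared to the roughly 1.5 GB that ~3B base weights would occupy even at around 4 bits per parameter, tens of megabytes per feature is a very manageable cost.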

How Adaptation works

The blog mentions the following about which layers are adapted:

For our models we adapt the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feedforward networks for a suitable set of the decoding layers of the transformer architecture.

Now, what could be the rationale behind choosing these layers?

By adapting the Q (query) and V (value) projections, LoRA can influence how the model attends to different parts of the input, since it fine-tunes how the model forms the query and value representations. For example, consider the two sentences below:

1. “Apple’s iPhone launch was a great success in New York”

2. “Apple pie is a great dessert popular in New York”

If the specific task at hand is sentiment analysis, then LoRA would be fine-tuned to pay more attention to “great success” and “great dessert” in sentences 1 and 2 respectively. This means LoRA would adapt Q and V to emphasize sentiment-related words over named entities such as New York and iPhone.

On the other hand, if the specific task is NER (Named Entity Recognition), LoRA would adapt the attention mechanism to weigh entities such as New York and Apple more heavily than sentiment-related words such as “dessert” and “success”.

The MLX library provides good insight into which layers are adapted for LoRA fine-tuning. Here's a relevant snippet from the MLX codebase:

elif model.model_type in [
    "mistral",
    "llama",
    "phi",
    "mixtral",
    "stablelm",
    "qwen2",
    "qwen2_moe",
    "gemma",
    "starcoder2",
    "cohere",
    "minicpm",
]:
    keys = set(["self_attn.q_proj", "self_attn.v_proj"])
    if model.model_type == "mixtral":
        keys.add("block_sparse_moe.gate")
    if model.model_type == "qwen2_moe":
        keys.add("mlp.gate")
        keys.add("mlp.shared_expert_gate")

elif model.model_type == "gpt_bigcode":
    keys = set(["attn.c_attn"])
elif model.model_type == "gpt2":
    keys = set(["attn.c_attn"])
elif model.model_type == "olmo":
    keys = set(["att_proj"])
elif model.model_type == "openelm":
    keys = set(["attn.qkv_proj"])
elif model.model_type == "phi3":
    keys = set(["self_attn.qkv_proj"])
elif model.model_type == "phi-msft":
    keys = set(["mixer.Wqkv", "moe.gate"])
elif model.model_type == "dbrx":
    keys = set(["norm_attn_norm.attn.Wqkv", "ffn.router.layer"])
elif model.model_type == "internlm2":
    keys = set(["attention.wqkv", "attention.wo"])
Here's a quick summary of what this code does: for most decoder-only architectures (Mistral, Llama, Phi, Mixtral, Qwen2, Gemma, StarCoder2, Cohere, MiniCPM, etc.), the attention query and value projections (self_attn.q_proj, self_attn.v_proj) are selected as the layers to adapt; mixture-of-experts variants additionally adapt their router/gate layers; and architectures with a fused QKV projection (GPT-2, GPT-BigCode, OpenELM, Phi-3, InternLM2, etc.) use their combined attention projection instead.

After the specific layers to be adapted are determined, they are converted to LoRA/DoRA layers in the following manner.

# Only the last num_lora_layers decoder layers of the model are adapted.
for l in model.layers[num_layers - num_lora_layers :]:
    # Replace each matching module (e.g. q_proj, v_proj) with its LoRA/DoRA equivalent.
    lora_layers = [(k, to_lora(m)) for k, m in l.named_modules() if k in keys]
    l.update_modules(tree_unflatten(lora_layers))
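For readers who want to experiment with the same targeting pattern outside MLX, Hugging Face PEFT expresses the idea very similarly. This is only a sketch, and the model name is just an example:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # example model
config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # same layers the MLX snippet selects for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()          # only a tiny fraction of parameters are trainable

Recent versions of PEFT also expose a DoRA option through the same config, so switching adapter types is largely a configuration change.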

Apple has multiple such adapter models. Some of those revealed in their blog are shown below:

Sample Adapters (source: Apple Machine Learning Research)

These adapter models can be dynamically loaded, temporarily cached in memory, and swapped. This gives the foundation model the ability to specialize itself on the fly for the task at hand while efficiently managing memory.
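Purely as an illustration of the idea (this is not Apple's actual mechanism, and the function and file names are made up), hot-swapping adapters can be as simple as loading a small set of LoRA tensors over the frozen base model:

import torch

adapter_cache = {}   # task name -> dict of LoRA tensors (only the A/B matrices, tens of MB per task)

def activate_adapter(model, task: str, path: str):
    # Load (or reuse from cache) the adapter weights for this task.
    if task not in adapter_cache:
        adapter_cache[task] = torch.load(path, map_location="cpu")
    # strict=False: only the small LoRA tensors are overwritten; the frozen base weights stay in place.
    model.load_state_dict(adapter_cache[task], strict=False)

# activate_adapter(model, "summarization", "adapters/summarization.pt")
# ... handle a notification-summarization request ...
# activate_adapter(model, "mail_reply", "adapters/mail_reply.pt")

Because only the adapter tensors change, switching tasks is cheap relative to reloading a multi-gigabyte base model.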

Adapter Training/LLMOps

Apple also created an efficient infrastructure to retrain, test, and deploy adapters when the training data is updated or the base model changes. This reinforces the importance of ML/LLM platforms: they abstract away critical concerns, making it easier for data scientists and ML engineers to iterate quickly and deploy without having to manage the serving infrastructure. In our experience, the deployment patterns and complexities for adapters and LLMs differ somewhat from those handled by traditional MLOps tooling, but once built properly, such platforms can deliver significant ROI.
