LoRA: The Efficient Fine-Tuning Technique for LLMs You Should Know

Chase Roberts
Vertex Ventures US
Dec 6, 2023 · 3 min read

Fine-tuning remains essential for reaching the performance that AI-powered applications require. But fine-tuning is expensive and difficult to scale. In a production application, you might maintain a separate fine-tuned version of a pre-trained model for each of your users or scenarios. Each version requires a checkpoint, essentially a snapshot of the model’s parameters (weights and biases). A typical checkpoint for a large language model (LLM) with billions of parameters can run from hundreds of gigabytes to several terabytes.

Source: Midjourney

Imagine your application has 10,000 users, each with 100 tasks, and assume each checkpoint is one terabyte. That's one million checkpoints, an exabyte of storage in total! Storing these checkpoints and loading them at deployment time every time a task switches would be a nightmare. LoRA was born from the need for a practical way to get the performance benefits of fine-tuning without the operational burden. We interviewed Edward Hu, one of LoRA's creators and a co-author of LoRA: Low-Rank Adaptation of Large Language Models, on episode 7 of Neural Notes, which you can view here.
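To put the numbers above together (a quick back-of-the-envelope calculation, nothing more):

```python
users = 10_000
tasks_per_user = 100
checkpoint_size_tb = 1  # assume one terabyte per fully fine-tuned checkpoint

total_checkpoints = users * tasks_per_user                # 1,000,000 checkpoints
total_storage_tb = total_checkpoints * checkpoint_size_tb
print(total_checkpoints, total_storage_tb)                # 1000000 checkpoints, 1000000 TB (an exabyte)
```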

Checkpoints are large because they track all the changes to parameters made during fine-tuning. What if you could avoid changing all of the parameters and still achieve approximately the same performance? This is the question Ed and his co-creators sought to answer. They drew inspiration from two papers (linked here and here), which led them to represent the fine-tuning update in a constrained, computationally efficient way. From the paper:

We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

They use a technique called matrix rank decomposition. In this context, "rank" refers to the size of the small matrices used to represent the weight update during fine-tuning: intuitively, how many independent directions of change the update is allowed to have. Rank decomposition is a mathematical trick: instead of directly tweaking every knob in a dense layer (d times d of them, where d is the hidden size), you approximate the change you want using far fewer parameters. Imagine wanting to adjust 1,000 knobs but finding a way to get a similar result by adjusting just 20 super-knobs. Remarkably, even that analogy understates the savings: LoRA demonstrates you can achieve performance comparable to fully fine-tuning a 175-billion-parameter model by adjusting roughly 0.1% of the total parameters. At that ratio, our 1,000-knob example needs only one super-knob.
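To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer (my own illustration of the idea, not the authors' code): the pre-trained weight stays frozen, and the update is factored into two small matrices B and A whose product has rank at most r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the pre-trained weights
        d_out, d_in = base.weight.shape
        # delta_W = B @ A has rank at most r, so we train r * (d_in + d_out)
        # parameters instead of d_in * d_out.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: training starts from the base model
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

With a hidden size of d = 1,024 and r = 8, the trainable parameters in that layer drop from about 1.05 million (d times d) to about 16,000 (r times 2d), and at inference time the product B @ A can be merged back into the frozen weight, so there is no extra latency.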

Source: Ed Hu

The results are astounding! Compared to a fully fine-tuned version of GPT-3 175B, the checkpoint shrinks by a factor of roughly 10,000 (from 350GB to 35MB), and training is about 25% faster. The performance degradation is effectively zero! 🤯

As Sandeep stated during the interview:

This is one of the surprising results of the paper: a small ‘R’ works pretty darn well.

In a production context, the checkpoints are much smaller, which means you can keep thousands of these 35MB checkpoints in RAM, move one into VRAM, and quickly switch the model to a new task.
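Here is a rough sketch of what that task switch could look like at serving time, assuming each per-task checkpoint stores only the small A and B factors (the file names, dictionary keys, and the LoRALinear class from the earlier sketch are all hypothetical):

```python
import torch

# Hypothetical registry of tiny per-task LoRA checkpoints kept in CPU RAM.
adapters = {
    "summarize": torch.load("adapters/summarize.pt"),   # tens of MBs, not hundreds of GBs
    "translate": torch.load("adapters/translate.pt"),
}

def switch_task(model: torch.nn.Module, task: str, device: str = "cuda"):
    """Copy one task's low-rank factors into the model's LoRA layers (sketch)."""
    state = adapters[task]
    for name, module in model.named_modules():
        if isinstance(module, LoRALinear):               # the sketch class from above
            module.A.data.copy_(state[f"{name}.A"].to(device))
            module.B.data.copy_(state[f"{name}.B"].to(device))
```

Because only A and B change between tasks, the base model's weights never move, which is what makes the switch fast.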

The lightweight checkpoints introduce new possibilities. Imagine a world where you've used LoRA to fine-tune a model for an entire domain like medicine. The rank would be slightly higher for that domain-level model, meaning you've updated more parameters. Below it, you could have a tree of LoRA modules specialized for sub-domains, tasks, users, and so on, analogous to inheritance in object-oriented programming: for example, a module that diagnoses specific categories of diseases or modules trained for individual patients, each with a lower rank. You're never doing full fine-tuning, everything is stored efficiently, and the base model with its generic knowledge is stored only once. The future is exciting!

You can and should follow Ed on X/Twitter at @edwardjhu.
