What is Low-Rank-Adaptation (LoRa)
LLMs are viewed as a crucial technology for developing innovative products and services, leading companies are making significant investments in them to leverage this new tech in their products/services and provide better value to their users. However, there is a huge problem with training and serving these large language models, it's expensive. According to some articles [1] [2] the cost of training these large models can range from millions to billions of dollars. Expensive isn’t it?!
But why not use already available models? After all, they have been trained on trillions of tokens, which ought to cover all use cases right? Not really.
For niche tasks, we may need to fine-tune these models on a specific dataset. Fine-tuning a model for niche tasks is similar to the difference between driving a car on regular roads versus driving on a race track. It is like you’re an experienced driver who knows how to navigate through city streets and highways with ease.
However, when it comes to driving on a professional race track, the dynamics change. You would need additional training, specific techniques, and knowledge of the track’s layout and conditions to optimize your performance and compete against skilled racers. A couple of practice laps and you’re all set!
Given a large number of parameters, it might not be feasible to retrain all parameters on a custom dataset. Not only that, imagine an organization that wants to use an LLM for law as well as medical subjects.
To have full domain expertise we might want to train two different models, one fully fine-tuned on legal data and another on medical data. Now, we have two models to serve; if we were to expand the use cases we have n models to serve. Serving millions or billions of parameters isn’t cheap.
Efforts have been made to improve the efficiency of training and serving LLMs. There has been some research on reducing the cost while maintaining accuracy. One method is to add additional layers to the model and train only these layers. This way the pre-trained model can be kept in memory and only the additional layers for each task needs to be kept in memory, this greatly boosts the operational efficiency of serving these models. Unfortunately, these techniques typically result in a trade-off between efficiency and model quality.
There is another technique that aims to solve this problem —
Low Rank Adaptation (LoRA)
Published by Microsoft, it proposes to freeze the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
What does ‘rank of a matrix’ mean?
Definition: The rank of matrix refers to the number of linearly independent rows or columns in matrix.
Or
It can also be defined as order of the highest ordered non-zero minor.
While reading the paper I went through this article to refresh my memory on how we can calculate the rank of a matrix. I also suggest you read about SVD and similar decomposition techniques.
The paper mentions an idea defined by Aghajanyan et al.
It states
pre-trained language models have a low “intrinsic dimension” and can still learn efficiently despite a random projection to a smaller subspace
What do we mean by this? It means that if there are 4 factors on which the solution of a problem depends there’d be 2 factors that contribute majorly towards the solution. Here is a simple explanation generated by LLM:
Suppose we have an objective function for a manufacturing process that aims to minimize production costs. The objective could be to find the optimal values for various parameters such as production quantity, labor hours, and raw material usage, in order to minimize the overall cost.
Now, the intrinsic dimension of this objective function represents the minimum number of parameters needed to obtain satisfactory or low-cost solutions. In this case, let’s assume that after analyzing the problem, it is found that the production quantity and labor hours are the major factors contributing to the cost.
Applying this to a neural network, when we fine-tune a model we update the entire matrix (in the case of a dense layer) during a forward operation. The computing cost depends directly upon the dimensionality of the matrix. LoRa suggested constraining the updates to this matrix with a low-rank decomposition matrix.
Which weight matrix should LoRa be applied to?
The paper only focuses on applying LoRa to the attention weights; here we still have query, key, value & outputs matrices.
Let's look at some numbers from their experiments.
The authors first set a budget, and proposed a budget of modifying around 18M parameters ( for GPT-3 175B); for adapting a single matrix the budget allows a rank of 8, whereas for two matrices,it allows a rank of 4 and a rank of 2 for all four matrices. Here are the results for the same:
But how does it compare to other techniques?
The authors suggest tuning r based on different datasets for optimal results.
If we consider GPT-3 175B, 18M should take up to 35 Mb (FP16). Practically, if we were to use LoRa for 100 tasks, it would need 350GB (base) + 35MB * 100 ≈ 354GB. However, regular FT would need 100 * 350GB ≈ 35TB (Which is HUGE).
Let me know if you have used this technique and benchmarked against regular fine-tuning on real-world use cases!
I gotta say, this paper was a real page-turner for me, and I hope you find my write-up just as exciting. I’d love to hear your thoughts on it, or if there are any other interesting points from the paper that caught your attention, feel free to share them with me. Let’s dive into this discussion!