Rank-stabilized LoRA (rsLoRA): Helping us improve Typhoon fine-tuning
In January 2024, our research circle came across an interesting paper by Damjan Kalajdzievski from one of our portfolio companies, Tenyx. We were fascinated by the idea it presented. Our team has conducted an experiment inspired by the findings in this paper, and we are eager to share the insights with a wider audience. The paper in question is “A Rank Stabilization Factor for Fine-Tuning with LoRA” (https://doi.org/10.48550/arXiv.2312.03732).
The key idea of this paper concerns the scaling factor (𝛾ᵣ) of LoRA adapters. Briefly, in the LoRA method of Hu et al., 2021 (https://doi.org/10.48550/arXiv.2106.09685), the weight matrices are rewritten as 𝑊 + Δ𝑊, where 𝑊 denotes the frozen original weights and Δ𝑊 is called the adapter. Δ𝑊 is decomposed into a trainable matrix product (𝐵𝐴) of at most rank 𝑟, with a scaling factor (𝛾ᵣ) to account for the rank effect. The weight matrices can thus be written as 𝑊 + 𝛾ᵣ𝐵𝐴.
Originally, 𝛾ᵣ is set as 𝛾ᵣ = 𝛼/𝑟, where 𝛼 is some hyperparameter.
Damjan suggested that by modifying it to 𝛾ᵣ = 𝛼/√𝑟, we can improve the performance of LoRA fine-tuning.
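To make the notation concrete, here is a minimal PyTorch sketch of a linear layer with a LoRA adapter. This is our own illustration (not the paper’s or PEFT’s implementation), with a `rank_stabilized` flag switching 𝛾ᵣ between the two factors:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer: y = (W + gamma_r * B A) x."""

    def __init__(self, base: nn.Linear, r: int, alpha: float, rank_stabilized: bool = False):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # W is frozen; only A and B are trained
            p.requires_grad_(False)
        # A random, B zero, so the adapter starts at zero (as in Hu et al., 2021)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        # LoRA: gamma_r = alpha / r;  rsLoRA: gamma_r = alpha / sqrt(r)
        self.gamma_r = alpha / r ** 0.5 if rank_stabilized else alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.gamma_r * (x @ self.A.T @ self.B.T)
```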
We encourage all readers to study the detailed proof in Appendix A of Damjan’s paper to appreciate the conditions under which the gradients do not collapse (become unstable) when 𝑟 is large (i.e., at higher ranks).
The author also provided in the main paper a series of experimental validations covering different pre-trained models, datasets, optimizers, learning rates, choices of layers to which adapters are applied, alternative scaling factors, etc. The results agree with the mathematical proof.
Dividing LoRA adapters by the square root of their rank allows the gradient norm to maintain a relatively similar magnitude throughout training for every selected rank.
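A quick back-of-the-envelope computation (with an arbitrary 𝛼 = 16; the numbers are only illustrative) shows how aggressively the original 𝛼/𝑟 factor suppresses the adapter’s contribution as the rank grows, compared with 𝛼/√𝑟:

```python
# How the two scaling factors shrink as the rank grows (alpha = 16, chosen arbitrarily)
alpha = 16
for r in (8, 64, 1024):
    print(f"r={r:5d}  LoRA: {alpha / r:.4f}  rsLoRA: {alpha / r ** 0.5:.4f}")
# r=    8  LoRA: 2.0000  rsLoRA: 5.6569
# r=   64  LoRA: 0.2500  rsLoRA: 2.0000
# r= 1024  LoRA: 0.0156  rsLoRA: 0.5000
```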
At SCB 10X, we have implemented this idea to fine-tune our Typhoon: Thai Large Language Models (https://doi.org/10.48550/arXiv.2312.13951). In the experiment (Figure 1), we used 𝑟 = 64; with normal LoRA (in red), the training loss did not decrease further after around 400 global steps. When we kept all hyperparameters the same but changed LoRA to rsLoRA (in green), the training loss continued to go down. On top of the change from LoRA to rsLoRA, when we increased the rank from 𝑟 = 64 to 𝑟 = 1,024 (in purple), the training loss went down even further, indicating that we did not lose the gradient, as was the case with normal LoRA.
Lastly, there is good news: on 20 February 2024, Damjan published a community blog post on this topic (https://huggingface.co/blog/damjan-k/rslora). The rank-stabilized LoRA (rsLoRA) method is now available in Hugging Face’s PEFT package, and example code is given in the post. Please make sure to check it out. Theoretically, this method is more effective than LoRA for higher-rank adapters because it is more stable. We are curious to see it used more in real-world cases and would love to know whether it performs better in other situations. We encourage more people to try rsLoRA for fine-tuning, and please let us know your feedback.
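For readers who want to try it right away, below is a short sketch of enabling rsLoRA in PEFT through the `use_rslora` flag in `LoraConfig`; the model name, target modules, and hyperparameter values here are placeholders to adapt to your own setup:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("scb10x/typhoon-7b")  # placeholder model

config = LoraConfig(
    r=1024,                               # high ranks are where rsLoRA shines
    lora_alpha=16,                        # example value; tune for your task
    target_modules=["q_proj", "v_proj"],  # placeholder set of adapted layers
    use_rslora=True,                      # gamma_r = alpha / sqrt(r) instead of alpha / r
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
```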
Happy fine-tuning, everyone!
Acknowledgements
I would like to acknowledge Kunat Pipatanakul, Lead Research Scientist of SCB 10X’s AI Open Innovation team, for providing me with the experimental results of the Typhoon fine-tuning. Special thanks go to Megan Khunakridatikarn (SCB 10X Lab) for editing and proofreading.