Don’t Merge Your LoRA Adapter Into a 4-bit LLM

16-bit and quantization-unaware adapters

Benjamin Marie
7 min read · Nov 20, 2023

LoRA is a method for parameter-efficient fine-tuning (PEFT). It adds a small number of trainable parameters on top of a frozen large language model (LLM). Since only these added parameters are trained, LoRA saves a lot of memory.
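
As a minimal sketch of what this looks like in practice with Hugging Face PEFT (the model name and LoRA hyperparameters below are illustrative assumptions, not values from this article):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model; its weights stay frozen.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # modules that receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen model with trainable LoRA adapters.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```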

QLoRA is an even more memory-efficient method as it quantizes the base LLM on top of which the trainable parameters are added.
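
A rough QLoRA-style setup looks like the following: the base model is loaded in 4-bit (NF4) with bitsandbytes and LoRA adapters are added on top. Again, the model name and configuration values are only placeholders for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the base model (NF4 with double quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training, then add the LoRA adapters.
base_model = prepare_model_for_kbit_training(base_model)
model = get_peft_model(
    base_model,
    LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM"),
)
```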

Typically, during QLoRA training, only the adapter’s parameters are saved. Then, we have two different methods to use the adapter for inference:

  • Loading it on top of the base LLM
  • Merging it into the base LLM

Preserving the base model while loading the adapter is convenient as we can easily replace the adapter with another one, almost seamlessly. Also, since adapters are small, they are easy to store and distribute.
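
Loading a saved adapter on top of the unmodified base model is a one-liner with PEFT. The adapter directory below is a placeholder for wherever the adapter was saved:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# The base model stays untouched; the adapter weights sit on top of it.
model = PeftModel.from_pretrained(base_model, "path/to/adapter")

# Another adapter can later be attached alongside it, e.g.:
# model.load_adapter("path/to/another_adapter", adapter_name="other")
```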

We can also merge the LoRA adapter into the base LLM, for instance, to make the model easier to use, to hide the adapter, or to facilitate the training of yet another adapter on top of the merged model. The authors of LoRA demonstrated that merging a LoRA adapter into the base model can be done exactly, i.e., without any performance loss.
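
With PEFT, the merge is done with merge_and_unload, which folds the adapter weights into the base model and returns a plain transformers model. The paths below are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "path/to/adapter")

# Fold the LoRA weights into the base weights; the adapter disappears.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("path/to/merged_model")
```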
