What rank (r) and alpha to use in LoRA for LLM fine-tuning?

Farty Pants
4 min read · Nov 7, 2023


Rank determines the number of trainable parameters LoRA gets to work with during training.

LoRA rank 256 adds about 3% of trainable parameters on a 4-bit quantized LLaMA 13B model. (Note: quantization essentially halves the model’s effective number of parameters, from 13B to 6.6B.)
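As a concrete illustration, here is a minimal sketch of checking what fraction of parameters a given rank makes trainable, assuming the Hugging Face transformers and peft libraries; the model id and target modules are illustrative assumptions, so adjust them to your actual setup:

```python
# Sketch: attach a rank-256 LoRA and print the trainable fraction.
# The model id and target_modules below are assumptions, not a recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

config = LoraConfig(
    r=256,           # the rank discussed above
    lora_alpha=256,  # alpha = rank, i.e. 1.0 scaling (see below)
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...
```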

Generally, ranks below 32 are more like smudges left on glass by a dirty rag. Very low values like 8 or 4 basically just tell the model the style in which you want the output formatted, without caring about the details, as there aren’t enough parameters to distinguish much more.

For example, instruct fine-tuning on a base model (such as Guanaco, which used QLoRA) would be low rank, because you are not trying to teach the model any new information or higher concepts; you are just teaching it the size and type of text completion to produce when you start playing the old “I am the user, and you are the helpful assistant!” game.

If you ask a question, you want the model to answer it, not continue by asking another question, which is what would happen without fine-tuning.

(Something many people still have a hard time understanding: fine-tuning a base model is about utilizing the model’s vast but very unorganized and messy knowledge, not about teaching it new concepts.)

Similarly, if you want your fine-tuned model to give you longer responses, you can fine-tune a LoRA on top of it with the type of long responses you want, at a very low rank (so the model doesn’t necessarily learn the details from the text). The same goes if you want your model to respond in bullet points. You get the idea.
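A “style only” adapter along those lines might be configured like this (a hedged sketch using the peft library; the target modules are, again, an assumption):

```python
from peft import LoraConfig

# Very low rank: enough capacity to carry formatting and length habits,
# not enough to absorb new facts from the training text.
style_config = LoraConfig(
    r=8,            # low rank for style/format only
    lora_alpha=8,   # alpha = rank, i.e. 1.0 scaling
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```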

However, if you are trying to teach your model some new material, concepts and associations beyond what it already knows, you need to use a higher rank.

In practice, with a “normal Joe” dataset you can put together at home, you will probably hit a ceiling around rank 256 (in my own experiments). If you set the rank too high without varied enough data, or without enough data in general, the overall quality will in fact go down. Not to mention that rank directly affects GPU VRAM use, so in practice a higher rank will come at the expense of lowering some other parameter (such as batch size or maximum context length).
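The VRAM point follows from simple arithmetic: LoRA adds two matrices, A (r × d_in) and B (d_out × r), per targeted weight, so the added parameter count grows linearly with rank. A back-of-the-envelope sketch, assuming LLaMA 13B’s hidden size of 5120, 40 layers, and adapters on two projection modules per layer:

```python
def lora_added_params(r, d_in=5120, d_out=5120, layers=40, modules_per_layer=2):
    # Each adapted weight gets A (r x d_in) and B (d_out x r).
    return layers * modules_per_layer * r * (d_in + d_out)

for r in (8, 32, 64, 256):
    print(f"rank {r:>3}: ~{lora_added_params(r) / 1e6:.0f}M added params")
# rank   8: ~7M added params
# rank  32: ~26M added params
# rank  64: ~52M added params
# rank 256: ~210M added params (about 3% of the 6.6B effective count above)
```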

The reason we call it low-rank approximation, with the emphasis on approximation, is also why LoRA fine-tuning will add knowledge, but only so-so and in a conceptualized form: it works mostly when it supplements the model’s own pre-trained data.

There is a slight misconception about what “new knowledge” is.

“But I can teach my model that it is named Doris and has red hair and lives in Kingston! That’s new knowledge!”

Well, it isn’t necessarily a super-new revelation, as the model already knows all these words. Red or hair is no mystery, nor is the fact that things and places have names. You are just putting them together, and LoRA is a good vehicle for that kind of concept. That also covers new names and words, if you repeat them a few times. The model will just as easily accept that its name is ZmXY32, because as a “name” concept that is not too different from Doris or Kirk.

But if you try to use LoRA to teach the model a new made-up language, it will be making stuff up too. You didn’t teach it a new language; you taught it to “make up” gibberish that sounds like the gibberish you fine-tuned it on, but it isn’t really following any rules, because there isn’t enough data (or enough parameters) to derive the rules from.

That is true for everything else you try to teach a pre-trained model. The closer your concept is to knowledge the model already has, the better and more accurate the result will be. The further away it is, the more readily the model will make stuff up.

It is a fine line between telling the model “this is new knowledge that contradicts everything you know” and “hey, I love when you tell me lies”, but that is a theme for another time.

Alpha is a scaling parameter.

alpha = rank means the LoRA weights are scaled at 1.0 (the effective scaling factor is alpha / rank).

What you train in the LoRA weights will then be merged with the model’s main weights at × 1.0.
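In other words, the merged weight is W' = W + (alpha / rank) × (B·A). A toy numpy sketch of that bookkeeping, with made-up sizes:

```python
import numpy as np

d, r = 64, 8                       # toy dimensions, not real model sizes
W = np.random.randn(d, d)          # frozen base weight
A = np.random.randn(r, d) * 0.01   # stand-ins for trained LoRA matrices
B = np.random.randn(d, r) * 0.01

for alpha in (r, 2 * r):           # alpha = rank vs. alpha = 2 x rank
    scaling = alpha / r            # 1.0 vs. 2.0
    W_merged = W + scaling * (B @ A)
    print(f"alpha={alpha}: scaling factor {scaling}")
```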

People previously suggested alpha = 2 × rank, which is like yelling at your model really loudly, all in order to make the newly learned weights “louder” than the model’s own. That requires a really good and large dataset; otherwise you are just amplifying nonsense.

The model already knows how to speak well, while your dataset is too small to teach (or scream) any language fundamentals into it. Increasing alpha amplifies everything, not just the stuff you want the model to learn.

I would suggest alpha = rank as your base most of the time, because it is very easy to attenuate the LoRA data after training is done if it turns out to be too “loud” and overtakes the entire model.

It of course depends on the data and the effect you want to achieve.

Still, alpha is one of the very few parameters (probably the only one) that can be lowered after training without much downside.
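The attenuation itself is conceptually trivial: since alpha only scales the merged delta, turning a trained LoRA down is just multiplying that delta by a factor below 1, with no retraining. A self-contained toy sketch of the idea:

```python
import numpy as np

d, r, alpha = 64, 8, 8             # toy sizes; alpha = rank at training time
W = np.random.randn(d, d)          # base weight
A = np.random.randn(r, d) * 0.01   # stand-ins for trained LoRA matrices
B = np.random.randn(d, r) * 0.01

delta = (alpha / r) * (B @ A)      # what training produced at 1.0 scaling
W_quieter = W + 0.5 * delta        # same LoRA, half the "volume"
```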

For more details, look at my WebUI extension, which lets you scale down LoRA alpha in a few different ways:

https://github.com/FartyPants/Playground
