Memory-efficient LLM Training with GaLore
A novel approach for full-parameter finetuning. How to use it and what to expect.
Training Large Language Models (LLMs), even those with “only” 7 billion parameters, is a computationally intensive task. Traditionally, this level of training required resources beyond the reach of most individual enthusiasts. To bridge this gap, parameter-efficient methods such as low-rank adaptation (LoRA) have emerged, enabling fine-tuning of substantial models on consumer GPUs.
GaLore
Enter GaLore, a novel approach detailed in the publication GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
GaLore reduces VRAM requirements not by decreasing the number of parameters directly but by optimizing how these parameters are trained.
GaLore focuses on two primary strategies:
- Gradient Low-Rank Projection: GaLore shifts away from handling the full, high-dimensional gradients of weight matrices. Instead, it projects these gradients onto a low-rank space, significantly reducing the computational load while retaining essential information for training.
- Per-Layer Weight Updates: Unlike the conventional method where an optimizer updates all layers simultaneously after backpropagation, GaLore implements updates on a per-layer basis during backpropagation. This approach further reduces the memory footprint throughout the training process.
Just like LoRA, GaLore allows us to finetune a 7B model on a consumer GPU with 24 GB VRAM. The performance of the resulting model is comparable to a full parameter finetune and appears to outperform LoRA.
This guide shows you how to use GaLore with the 🤗 Trainer
and what to expect.
Prerequisites
Install the GaLore optimizer, either with pip install
or download from GaLore GitHub repo.
pip install galore-torch
Make sure to use the latest Hugging Face packages. The following versions have been used for this guide.
accelerate==0.28.0
bitsandbytes==0.43.0
datasets==2.18.0
transformers==4.39.1
trl==0.8.1
torch==2.2.1
Load model and dataset
If you are familiar with finetuning models with the Hugging Face suite, this will look familiar to you and you will probably have your own code and routines setup that might look different.
The following is a simple example using TRL’s SFTTrainer
(a subclass of Trainer
) to 7B base model of the llama2 family, set it up for ChatML and load the Open Assistant 2 dataset, a collection of human curated and rated conversations. The code below runs on a 24 GB VRAM GPU like a RTX 3090/4090.
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import setup_chat_format
from datasets import load_dataset
import torch
modelpath = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
modelpath,
torch_dtype = torch.bfloat16,
attn_implementation = "flash_attention_2",
device_map = "auto",
use_cache = False,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast = False)
model, tokenizer = setup_chat_format(model, tokenizer)
if tokenizer.pad_token in [None, tokenizer.eos_token]:
tokenizer.pad_token = tokenizer.unk_token
dataset = load_dataset("g-ronimo/oasst2_top4k_en")
Setup 🤗 Trainer for GaLore
The GaLore optimizer comes with a few hyperparameters we need to pass on to the 🤗 Trainer.
from transformers import TrainingArguments
# GaLore hyperparameters
rank = 1024
update_proj_gap = 200
scale = 2
training_arguments = TrainingArguments(
output_dir = f"out",
evaluation_strategy = "steps",
label_names = ["labels"],
per_device_train_batch_size = 16,
save_steps = 250,
eval_steps = 250,
logging_steps = 1,
learning_rate = 1e-5,
num_train_epochs = 3,
lr_scheduler_type = "constant",
gradient_checkpointing = True,
optim = "galore_adamw_8bit_layerwise",
optim_target_modules = ["attn", "mlp"],
optim_args = f"rank={rank}, update_proj_gap={update_proj_gap}, scale={scale}",
)
General training hyperparameters
per_device_train_batch_size
We train with a batch size of 16, this is the maximum number samples that will fit into 24 GB of VRAM.save_steps
,eval_steps
The dataset contains 4000 samples, with a batch size of 16 one epoch is completes every 250 steps. We therefore set the trainer up to evaluate and save after each epoch.learning_rate
Something around 10e–5 is what is typically used to finetune a 7B model.gradient_checkpointing
Saves memory and allows for a bigger batch sizes but slows down training.
GaLore hyperparameters
optim
Tells the Trainer to use GaLore as optimizer._layerwise
additionally specifies that the updates are made per layer which saves VRAM.optim_target_modules
Specifies the layers targeted by GaLore, primarily the linear layers identified withattn
ormlp
in their names.rank
The rank of the projection matrices. Similar to LoRA, the higher the rank the more closely the finetuning will resemble a full parameter finetune. The GaLore authors recommend1024
for a 7B model.update_proj_gap
The number of steps after which the projections are updated. The update is an expensive step and takes around 15 minutes for a 7B model. Defines the interval for updating projections, with a suggested range between 50 and 1000 steps.scale
A scale factor akin to LoRA’s alpha, adjusting the update strength. After trying a few values I foundscale=2
to most closely resemble a classic full-parameter finetune.
Start training
With the optimizer set up we can start training with the Hugging Face Trainer.
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset["train"],
eval_dataset = dataset['test'],
data_collator = DataCollatorForCompletionOnlyLM(
instruction_template = "<|im_start|>user",
response_template = "<|im_start|>assistant",
tokenizer = tokenizer,
mlm = False),
max_seq_length = 256,
dataset_kwargs = dict(add_special_tokens = False),
args = training_arguments,
)
trainer.train()
All the code above can be found in a notebook here.
GaLore versus full parameter finetune
The training loss with the given hyperparameters closely mimics the trajectory of a full parameter tuning, indicating that the GaLore layerwise approach is indeed equivalent.
Also when it comes to the numbers, the model trained with GaLore scores very similar to a full parameter finetune.
Memory and Speed
While GaLore introduces a VRAM savings of approximately 15 GB, enabling its operation on a single RTX 3090/4090 GPU, it does takes longer to train due to periodic projection updates.
GaLore vs. LoRA
GaLore stands as another parameter-efficient training method, alongside LoRA. Both approaches yield similar outcomes in terms of loss, performance benchmarks, and VRAM efficiency, with GaLore slightly edging out in the benchmarks.
Wrap up
- GaLore is a novel approach to approximate full parameter training
- Saves VRAM, allowing to train a 7B model on a consumer GPU
- Performance comparable to full parameter training and slightly better than LoRA
- Slower, takes twice as long as a full parameter training and LoRA
I hope you enjoyed this guide! Share your GaLore experiences or questions in the comments below or reach out on Twitter.