Memory-efficient LLM Training with GaLore

A novel approach for full-parameter finetuning. How to use it and what to expect.

Geronimo
5 min read · Mar 24, 2024

Training Large Language Models (LLMs), even those with “only” 7 billion parameters, is a computationally intensive task. Traditionally, this level of training required resources beyond the reach of most individual enthusiasts. To bridge this gap, parameter-efficient methods such as low-rank adaptation (LoRA) have emerged, enabling fine-tuning of substantial models on consumer GPUs.

GaLore

Enter GaLore, a novel approach detailed in the publication GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
GaLore reduces VRAM requirements not by decreasing the number of parameters directly but by optimizing how these parameters are trained.

GaLore focuses on two primary strategies:

  1. Gradient Low-Rank Projection: GaLore shifts away from handling the full, high-dimensional gradients of weight matrices. Instead, it projects these gradients onto a low-rank space, significantly reducing the memory needed for gradients and optimizer states while retaining the information essential for training (see the sketch after this list).
  2. Per-Layer Weight Updates: Unlike the conventional method where an optimizer updates all layers simultaneously after backpropagation, GaLore implements updates on a per-layer basis during backpropagation. This approach further reduces the memory footprint throughout the training process.
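
To make the first idea concrete, here is a minimal, conceptual sketch of gradient low-rank projection in PyTorch. This is not the galore-torch implementation: plain SGD stands in for the real optimizer (GaLore keeps Adam-style statistics in the small low-rank space), and all names are made up for illustration.

import torch

def compute_projection(grad, rank):
    # Recompute the projection matrix from the SVD of the current gradient;
    # in GaLore this only happens every `update_proj_gap` steps.
    U, _, _ = torch.linalg.svd(grad, full_matrices = False)
    return U[:, :rank]                        # shape (m, rank)

@torch.no_grad()
def galore_style_step(weight, grad, proj, lr = 1e-5, scale = 2.0):
    low_rank_grad = proj.T @ grad             # (rank, n): optimizer state lives in this small space
    update = proj @ low_rank_grad             # project the update back to the full (m, n) shape
    weight -= lr * scale * update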

Just like LoRA, GaLore allows us to finetune a 7B model on a consumer GPU with 24 GB VRAM. The performance of the resulting model is comparable to a full parameter finetune and appears to outperform LoRA.

This guide shows you how to use GaLore with the 🤗 Trainer and what to expect.

Prerequisites

Install the GaLore optimizer, either with pip or by downloading it from the GaLore GitHub repo.

pip install galore-torch

Make sure to use recent versions of the Hugging Face packages; the following versions were used for this guide.

accelerate==0.28.0
bitsandbytes==0.43.0
datasets==2.18.0
transformers==4.39.1
trl==0.8.1
torch==2.2.1
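
If you want to pin exactly these versions, a single pip command does it (adjust the torch build for your CUDA setup):

pip install accelerate==0.28.0 bitsandbytes==0.43.0 datasets==2.18.0 transformers==4.39.1 trl==0.8.1 torch==2.2.1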

Load model and dataset

If you have finetuned models with the Hugging Face suite before, the following will look familiar; you probably have your own code and routines set up that may look slightly different.

The following is a simple example that uses TRL’s SFTTrainer (a subclass of Trainer) to load the 7B base model of the Llama 2 family, set it up for ChatML, and load the Open Assistant 2 dataset, a collection of human-curated and rated conversations. The code below runs on a GPU with 24 GB of VRAM, such as an RTX 3090/4090.

from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import setup_chat_format
from datasets import load_dataset
import torch

modelpath = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    torch_dtype = torch.bfloat16,
    attn_implementation = "flash_attention_2",
    device_map = "auto",
    use_cache = False,
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast = False)

model, tokenizer = setup_chat_format(model, tokenizer)
if tokenizer.pad_token in [None, tokenizer.eos_token]:
    tokenizer.pad_token = tokenizer.unk_token

dataset = load_dataset("g-ronimo/oasst2_top4k_en")
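
To check that the ChatML setup worked, you can render a toy conversation with the tokenizer’s chat template; the message below is just an example, and the exact string returned depends on the template installed by setup_chat_format.

print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize = False,
))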

Setup 🤗 Trainer for GaLore

The GaLore optimizer comes with a few hyperparameters we need to pass on to the 🤗 Trainer.

from transformers import TrainingArguments

# GaLore hyperparameters
rank = 1024
update_proj_gap = 200
scale = 2

training_arguments = TrainingArguments(
    output_dir = "out",
    evaluation_strategy = "steps",
    label_names = ["labels"],
    per_device_train_batch_size = 16,
    save_steps = 250,
    eval_steps = 250,
    logging_steps = 1,
    learning_rate = 1e-5,
    num_train_epochs = 3,
    lr_scheduler_type = "constant",
    gradient_checkpointing = True,
    optim = "galore_adamw_8bit_layerwise",
    optim_target_modules = ["attn", "mlp"],
    optim_args = f"rank={rank}, update_proj_gap={update_proj_gap}, scale={scale}",
)

General training hyperparameters

  • per_device_train_batch_size We train with a batch size of 16; this is the maximum number of samples that fits into 24 GB of VRAM.
  • save_steps, eval_steps The dataset contains 4,000 samples; with a batch size of 16, one epoch completes every 250 steps. We therefore set the Trainer up to evaluate and save after each epoch.
  • learning_rate A learning rate around 1e-5 is typical for finetuning a 7B model.
  • gradient_checkpointing Saves memory and allows for a bigger batch size but slows down training.

GaLore hyperparameters

  • optim Tells the Trainer to use GaLore as the optimizer. The _layerwise suffix additionally specifies that updates are applied per layer, which saves VRAM.
  • optim_target_modules Specifies the layers targeted by GaLore, primarily the linear layers whose names contain attn or mlp.
  • rank The rank of the projection matrices. As with LoRA, the higher the rank, the more closely the finetune resembles a full-parameter finetune. The GaLore authors recommend 1024 for a 7B model.
  • update_proj_gap The number of steps after which the projection matrices are recomputed; the authors suggest a range between 50 and 1000 steps. Recomputing the projections is expensive and takes around 15 minutes for a 7B model.
  • scale A scale factor akin to LoRA’s alpha that adjusts the update strength. After trying a few values I found scale=2 to most closely resemble a classic full-parameter finetune.
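
For reference, the same hyperparameters appear when you build the optimizer by hand with galore-torch, outside the 🤗 Trainer. The snippet below follows the usage shown in the GaLore repository; treat the exact parameter-group keys as an assumption in case the package API has changed.

from galore_torch import GaLoreAdamW

# GaLore is applied to the 2D weight matrices of the attention and MLP
# layers; all other parameters are trained with regular AdamW statistics.
galore_params = [p for n, p in model.named_parameters()
                 if p.ndim == 2 and any(t in n for t in ["attn", "mlp"])]
regular_params = [p for n, p in model.named_parameters()
                  if not (p.ndim == 2 and any(t in n for t in ["attn", "mlp"]))]

optimizer = GaLoreAdamW(
    [
        {"params": regular_params},
        {"params": galore_params, "rank": 1024,
         "update_proj_gap": 200, "scale": 2, "proj_type": "std"},
    ],
    lr = 1e-5,
)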

Start training

With the optimizer set up we can start training with the Hugging Face Trainer.

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["test"],
    data_collator = DataCollatorForCompletionOnlyLM(
        instruction_template = "<|im_start|>user",
        response_template = "<|im_start|>assistant",
        tokenizer = tokenizer,
        mlm = False,
    ),
    max_seq_length = 256,
    dataset_kwargs = dict(add_special_tokens = False),
    args = training_arguments,
)

trainer.train()
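
Not shown above: once training finishes, you will probably want to persist the final checkpoint with the standard Trainer methods (the output path is just an example).

trainer.save_model("out/final")
tokenizer.save_pretrained("out/final")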

All the code above can be found in a notebook here.

GaLore versus full parameter finetune

The training loss with the given hyperparameters closely mimics the trajectory of a full-parameter finetune, indicating that GaLore’s layerwise approach is indeed a close approximation.

Training loss curves: learning rate 1e-5, batch size 16; GaLore with rank 1024, scale 2, update_proj_gap 200.

The benchmark numbers tell the same story: the model trained with GaLore scores very similarly to a full-parameter finetune.

Benchmark scores, measured with the (EleutherAI) eval harness.

Memory and Speed

While GaLore saves approximately 15 GB of VRAM, enabling training on a single RTX 3090/4090, it does take longer to train due to the periodic projection updates.

Memory usage, measured on 2x RTX 3090 GPUs.
Training time: full-parameter training ~58 minutes, GaLore ~130 minutes.

GaLore vs. LoRA

GaLore joins LoRA as a memory-efficient way to finetune LLMs on consumer hardware. Both approaches yield similar outcomes in terms of loss, benchmark performance, and VRAM usage, with GaLore slightly ahead in the benchmarks.

LoRA baseline: all linear modules targeted, rank 64, alpha 16.
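
For reference, a LoRA setup matching the settings above corresponds to a PEFT configuration roughly like the following; the module names are the Llama 2 linear layers, and this is a sketch of the comparison setup rather than the exact script used.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r = 64,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)
model = get_peft_model(model, lora_config)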

Wrap up

  • GaLore is a novel approach to approximate full-parameter training
  • Saves VRAM, allowing a 7B model to be trained on a consumer GPU
  • Performance comparable to full-parameter training and slightly better than LoRA
  • Slower: takes roughly twice as long as full-parameter training or LoRA

I hope you enjoyed this guide! Share your GaLore experiences or questions in the comments below or reach out on Twitter.
