QLoRA with Llama 2

A step-by-step guide to bitsandbytes on a GPU machine

fishingalone
Oct 10, 2023


Llama 2 has been out for months. Still haven't tried it because of limited GPU resources? This guide will walk you through running inference and fine-tuning Llama 2 on an old GPU.

Run inference with quantized Llama 2

The 1st step is to gain access to the model. Visit the Meta website and accept the license and use policy. Then visit meta-llama (Meta Llama 2) and request access to the model weights on Hugging Face. Copy your Hugging Face Hub token as explained here.
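If you prefer to authenticate in code rather than through the environment variable used further below, a minimal sketch looks like this (the token string is a placeholder for your own token):

# Authenticate with the Hugging Face Hub so the gated Llama 2 weights can be downloaded.
from huggingface_hub import login

login(token="hf_xxx")  # placeholder token; alternatively run `huggingface-cli login`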

The 2nd step is to set up the Python environment as detailed in the appendix.

The 3rd step is to download the model and tokenizer, and load the model onto the GPU in 4-, 8-, or 16-bit format.

# Code from taprosoft's github: https://github.com/taprosoft/llm_finetuning/blob/efa6df245fee4faf27206d84802d8f58d4b6e77d/inference.py#L20
from transformers import (AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          LlamaTokenizer)
import torch
import os

os.environ["HUGGING_FACE_HUB_TOKEN"] = "{{your_huggingface_hub_token}}"

def load_hf_model(
    base_model,
    mode=8,
    gradient_checkpointing=False,
    device_map="auto",
):
    kwargs = {"device_map": device_map}
    if mode == 8:
        # 8-bit quantization (LLM.int8())
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=0.0,
        )
    elif mode == 4:
        # 4-bit NF4 quantization with double quantization (QLoRA-style)
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    elif mode == 16:
        # no quantization, plain fp16 weights
        kwargs["torch_dtype"] = torch.float16

    model = AutoModelForCausalLM.from_pretrained(base_model, **kwargs)

    # setup tokenizer
    tokenizer = LlamaTokenizer.from_pretrained(base_model)

    tokenizer.pad_token_id = 0  # unk. we want this to be different from the eos token
    tokenizer.padding_side = "left"  # Allow batched inference
    return model, tokenizer

model, tokenizer = load_hf_model(
    "meta-llama/Llama-2-7b-chat-hf",
    mode=4,
    gradient_checkpointing=False,
    device_map='auto')
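To sanity-check how much GPU memory the quantized model actually occupies, you can query the model directly; get_memory_footprint() is a standard transformers helper that reports the size in bytes:

# Rough check of the quantized model's memory usage (in GB).
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

For the 4-bit 7B chat model this should come out to roughly 4 GB, which is what makes an older GPU viable here.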

The last step is to try inference.

from transformers import GenerationConfig

sequences = ["<s>[INST] <<SYS>> You are a helpful assistant. <</SYS>>\
Extract the place names from the given sentence. [/INST]\n\
The capital of the United States is Washington D.C."]

inputs = tokenizer(sequences, padding=True, return_tensors="pt").to('cuda')

outputs = model.generate(
    **inputs,
    generation_config=GenerationConfig(
        do_sample=True,
        max_new_tokens=512,
        top_p=0.99,
        temperature=1e-8,
    )
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
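The prompt above hard-codes Llama 2's chat template. If you plan to send more than one query, a small helper of your own (this one is my sketch, not part of transformers) keeps the format in one place; note that in the canonical template the user message sits before the closing [/INST]:

def build_llama2_prompt(system, user):
    # Llama 2 chat format: <s>[INST] <<SYS>> system <</SYS>> user [/INST]
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = build_llama2_prompt(
    "You are a helpful assistant. Extract the place names from the given sentence.",
    "The capital of the United States is Washington D.C.",
)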

Fine-tune quantized Llama 2

Bitsandbytes falls into a general category of quantization schemes called post-training quantization (PTQ). In principle, there is no way to keep training such a quantized model directly. LoRA offers a rescue by adding a low-rank matrix (called an "adapter") to each linear layer in Llama 2. During fine-tuning, we tune only the adapter weights while keeping the original weights frozen.

LoRA (credit: the LoRA paper)
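To make the idea concrete, here is a minimal, illustrative sketch of a LoRA adapter wrapped around a frozen linear layer. This is for intuition only; peft handles all of this for us later:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Adds a trainable low-rank update to a frozen linear layer: y = Wx + (alpha/r) * B(A(x))
    def __init__(self, base_linear, r=8, alpha=16):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)  # original weights stay frozen
        self.lora_A = nn.Linear(base_linear.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)       # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

Only lora_A and lora_B receive gradients, which is why the memory cost of fine-tuning stays small even for a 7B model.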

The 1st step is to load the dataset. The dataset has only two columns: input and output.

from datasets import load_dataset
dataset = load_dataset('csv', data_files="./attributes.csv")
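My attributes.csv is not shown here, but any CSV with the same two columns works. For illustration only, here is how a made-up file of the same shape could be generated (the rows are invented examples, not my actual data):

# Purely illustrative: build a tiny two-column CSV in the same shape as attributes.csv.
import csv

rows = [
    {"input": "Extract the place names: Paris is the capital of France.", "output": "Paris, France"},
    {"input": "Extract the place names: Mount Fuji is in Japan.", "output": "Mount Fuji, Japan"},
]
with open("./attributes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(rows)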

The 2nd step is to set up the data collator, which formats, tokenizes, and groups the fine-tuning dataset into batches.

# Code from taprosoft's github
from dataclasses import dataclass, field
import transformers
import torch
import copy
from typing import Dict, Sequence
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100

@dataclass
class DataCollatorForCausalLM(object):
    tokenizer: transformers.PreTrainedTokenizer
    source_max_len: int
    target_max_len: int
    train_on_source: bool
    predict_with_generate: bool

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        # Extract elements
        sources = [f"{self.tokenizer.bos_token}{example['input']}" for example in instances]
        targets = [f"{example['output']}{self.tokenizer.eos_token}" for example in instances]
        # Tokenize
        tokenized_sources_with_prompt = self.tokenizer(
            sources,
            max_length=self.source_max_len,
            truncation=True,
            add_special_tokens=False,
        )
        tokenized_targets = self.tokenizer(
            targets,
            max_length=self.target_max_len,
            truncation=True,
            add_special_tokens=False,
        )
        # Build the input and labels for causal LM
        input_ids = []
        labels = []
        for tokenized_source, tokenized_target in zip(
            tokenized_sources_with_prompt['input_ids'],
            tokenized_targets['input_ids']
        ):
            if not self.predict_with_generate:
                input_ids.append(torch.tensor(tokenized_source + tokenized_target))
                if not self.train_on_source:
                    labels.append(
                        torch.tensor([IGNORE_INDEX for _ in range(len(tokenized_source))] + copy.deepcopy(tokenized_target))
                    )
                else:
                    labels.append(torch.tensor(copy.deepcopy(tokenized_source + tokenized_target)))
            else:
                input_ids.append(torch.tensor(tokenized_source))
        # Apply padding
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX) if not self.predict_with_generate else None
        data_dict = {
            'input_ids': input_ids,
            'attention_mask': input_ids.ne(self.tokenizer.pad_token_id),
        }
        if labels is not None:
            data_dict['labels'] = labels
        return data_dict

data_collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    source_max_len=280,
    target_max_len=512,
    train_on_source=False,
    predict_with_generate=False,
)
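Before training, it can be worth feeding the collator a couple of hand-made instances to verify the shapes and the label masking (the two records below are invented for illustration):

# Quick smoke test of the collator on two made-up records.
batch = data_collator([
    {"input": "Extract the place names: Berlin is in Germany.", "output": "Berlin, Germany"},
    {"input": "Extract the place names: The Nile flows through Egypt.", "output": "Nile, Egypt"},
])
print(batch["input_ids"].shape, batch["labels"].shape)
# Source tokens are masked with -100 in labels, so only the target tokens contribute to the loss.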

The 3rd step is to write several helper methods.

import bitsandbytes as bnb
import torch
import peft

# COPIED FROM https://github.com/artidoro/qlora/blob/main/qlora.py
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        # keep the count an integer so the ',d' format below works
        trainable_params = int(trainable_params / 2)
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )


# COPIED FROM https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


def create_peft_model(model, gradient_checkpointing=True, bf16=True):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )
    from peft.tuners.lora import LoraLayer

    # prepare the quantized model for training
    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=gradient_checkpointing
    )
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} linear modules to attach LoRA adapters to: {modules}")

    peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    model = get_peft_model(model, peft_config)

    # pre-process the model by upcasting the layer norms to float32 for stability
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer):
            if bf16:
                module = module.to(torch.bfloat16)
        if "norm" in name:
            module = module.to(torch.float32)
        if "lm_head" in name or "embed_tokens" in name:
            if hasattr(module, "weight"):
                if bf16 and module.weight.dtype == torch.float32:
                    module = module.to(torch.bfloat16)

    model.print_trainable_parameters()
    return model
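If you want to peek at which layers will receive adapters before creating the PEFT model, find_all_linear_names can be called directly on the quantized base model. For Llama 2 this typically lists the attention and MLP projections (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj):

# Inspect which 4-bit linear layers will get LoRA adapters attached.
print(find_all_linear_names(model))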

The last step is to start fine-tuning.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

# wrap the quantized model with LoRA adapters
model = create_peft_model(
    model, gradient_checkpointing=False, bf16=False
)

# Define training args
output_dir = "."
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    bf16=False,  # Use BF16 if available
    learning_rate=2e-4,
    num_train_epochs=3,
    optim="paged_adamw_8bit",  # use "adamw_torch" if mode is not 4 or 8
    gradient_checkpointing=False,
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    remove_unused_columns=False,
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the model (only the adapter weights are written)
trainer.save_model('./pretrained_model')

The parameter per_device_train_batch_size is critical since it determines memory consumption. The bigger per_device_train_batch_size is, the faster training finishes; the smaller it is, the lower the risk of an out-of-memory (OOM) error.
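If memory forces a tiny per-device batch size, gradient accumulation is the usual compromise: only one micro-batch sits in memory at a time, while the effective batch size stays larger. A sketch of how the arguments above could be adjusted (the accumulation value is illustrative; tune it for your GPU):

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps (per GPU).
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,      # keeps peak memory low
    gradient_accumulation_steps=8,      # illustrative value
    learning_rate=2e-4,
    num_train_epochs=3,
    optim="paged_adamw_8bit",
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    remove_unused_columns=False,
)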

The Hugging Face Trainer saves a checkpoint in a directory named "checkpoint-xxx" every 500 steps (the default). We can run

trainer.train(resume_from_checkpoint='./checkpoint-500')

to resume an interrupted training process.

Notice that the trainer here saves only the LoRA adapter's weights, not the original Llama 2 weights. How do we run inference with the fine-tuned model?

from peft import PeftModel

# model is the quantized model loaded using load_hf_model in a previous step
model = PeftModel.from_pretrained(model, './pretrained_model')

Don't get distracted by peft's merge_and_unload() method during this process. It would be great if we could load the model and adapter in one stroke, but it simply does not work as of the writing of this article.

If you like this post…

Support me with a coffee!

Appendix. Install bitsandbytes on an old GPU machine

It took me a while to figure out how to make bitsandbytes work on my machine; the instructions in the Hugging Face blog are too sketchy.

Before we begin, note that bitsandbytes has two prerequisites: (i) the operating system has to be Linux, and (ii) the GPU has to have been released after December 2017 (a V100 in my case).

The 1st step is to install the GPU build of PyTorch with the right CUDA version. Find the CUDA version supported by your NVIDIA driver by running

nvidia-smi

in the command line. In my case, it's 11.4. Install the PyTorch build for the corresponding CUDA 11.x (if your nvidia-smi reports 12.1, use 12.x):

pip install torch==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu117
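After installing, it is worth confirming that PyTorch actually sees the GPU and was built against the expected CUDA release:

# Quick check that the GPU build of PyTorch is installed and the CUDA runtime matches.
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU visible")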

The 2nd step is to install the Hugging Face libraries.

pip install transformers
pip install peft
pip install accelerate

The 3rd step is to install CUDA 11.7 using the install script provided by the library author.

wget https://raw.githubusercontent.com/TimDettmers/bitsandbytes/main/cuda_install.sh
# Syntax cuda_install CUDA_VERSION INSTALL_PREFIX EXPORT_TO_BASH
# CUDA_VERSION in {110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 121}
# EXPORT_TO_BASH in {0, 1} with 0=False and 1=True

# For example, the following installs CUDA 11.7 to ~/local/cuda-11.7 and exports the path to your .bashrc
bash cuda_install.sh 117 ~/local 1

The last step is to compile and install bitsandbytes.

git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
# pytorch 2.0 requires cuda 11.7
CUDA_HOME=~/local/cuda-11.7 CUDA_VERSION=117 make cuda11x_nomatmul
python setup.py install

We can check for installation errors by running

python -m bitsandbytes
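If that diagnostic passes, a minimal end-to-end check (my own sketch, not part of the official instructions) is to run one step of bitsandbytes' 8-bit Adam optimizer on a dummy parameter:

# Minimal smoke test: one optimizer step with bitsandbytes' 8-bit Adam on the GPU.
import torch
import bitsandbytes as bnb

p = torch.nn.Parameter(torch.randn(64, 64, device="cuda"))
opt = bnb.optim.Adam8bit([p], lr=1e-3)
loss = (p ** 2).sum()
loss.backward()
opt.step()  # fails loudly if the CUDA kernels were not built correctly
print("bitsandbytes OK")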
