Mastering Precision: A Guide to Fine-tuning Google’s Gemma Model

Fine-tuning an LLM using quantization, LoRA, and QLoRA.

Before delving into the complexity of fine-tuning with Google’s Gemma Model, let’s first familiarize ourselves with some fundamental terminology. Understanding these basics paves the way for a smoother journey as we explore the world of model optimization and customization.

Steps involved are:

  1. Quantization.
  2. Parameter Efficient Fine Tuning.

What is Quantization?

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage and consumes less energy, and operations like matrix multiplication can be performed much faster with integer arithmetic. It also makes it possible to run models on embedded devices, which sometimes only support integer data types.

The two most common quantization cases are float32 -> float16 and float32 -> int8.
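
As a rough, library-agnostic sketch of the idea (real quantization schemes differ in detail), the snippet below quantizes a small float32 array to int8 with a single scale factor and compares the memory footprint; the weight values are invented for illustration.

import numpy as np

# Toy float32 "weights" (values invented for illustration)
weights_fp32 = np.array([0.12, -0.48, 0.91, -0.33, 0.05], dtype=np.float32)

# Map the float range onto the int8 range [-127, 127] with a single scale
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

# Dequantize to recover an approximation of the original values
weights_dequant = weights_int8.astype(np.float32) * scale

print(weights_fp32.nbytes)  # 20 bytes (4 bytes per value)
print(weights_int8.nbytes)  # 5 bytes (1 byte per value)
print(np.abs(weights_fp32 - weights_dequant).max())  # small rounding error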

What is 4-bit Quantization?

Quantization in the context of deep learning is the process of constraining the number of bits that represent the weights and biases of the model.

In 4-bit quantization, each weight or bias is represented using only 4 bits as opposed to the typical 32 bits used in single-precision floating-point format (float32).

  • Single Precision: Single Precision is a format proposed by IEEE for representing floating-point numbers. It occupies 32 bits in computer memory.
  • Double Precision: Double Precision is another IEEE format for representing floating-point numbers. It occupies 64 bits in computer memory.

Why does it use less GPU Memory?

The primary advantage of using 4-bit quantization is the reduction in model size and memory usage. Here’s a simple explanation:

  • A float32 number takes up 32 bits of memory.
  • A 4-bit quantized number takes up only 4 bits of memory.

So, theoretically, you can fit 8 times more 4-bit quantized numbers into the same memory space as float32 numbers. This allows you to load larger models into the GPU memory or use smaller GPUs that might not have been able to handle the model otherwise.

The amount of memory used by an integer in a computer system is directly related to the number of bits used to represent that integer.
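
To make the saving concrete, the short calculation below estimates the weight-storage footprint of a model with roughly 2 billion parameters (about the size of Gemma 2B; the exact count is an assumption for illustration) at different precisions.

num_params = 2_000_000_000  # assumed parameter count, roughly Gemma 2B

# Bits used to store each weight at different precisions
for dtype, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = num_params * bits / 8 / 1024**3
    print(f"{dtype:>8}: {gib:5.2f} GiB")

# float32:  7.45 GiB
# float16:  3.73 GiB
#    int8:  1.86 GiB
#   4-bit:  0.93 GiB

In practice the real footprint is higher, since activations, the KV cache, and (during training) optimizer states also consume memory; the numbers above cover the weights alone.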


Types of Quantization Schemes

  1. Symmetric quantization: In this case, the zero-point is zero — i.e. 0.0 of the floating point range is the same as 0 in the quantized range. Typically, this is more efficient to compute at runtime but may result in lower accuracy if the floating point range is unequally distributed around the floating point 0.0.

Example: Let’s say we wish to map the floating point range [0.0 .. 1000.0] to the quantized range [0 .. 255]. The range [0 .. 255] is the set of values that can fit in an unsigned 8-bit integer.

To perform this transformation, we want to rescale the floating point range so that the following is true:

Floating point 0.0 = Quantized 0

Floating point 1000.0 = Quantized 255

This is called symmetric quantization because the floating point 0.0 maps to the quantized 0.

Hence, we define a scale:

scale = (x_max - x_min) / (q_max - q_min)

where x_max and x_min are the bounds of the floating point range, and q_max and q_min are the bounds of the quantized range.

In this case, scale = (1000.0 - 0.0) / (255 - 0) ≈ 3.9216.

To convert a floating point value to a quantized value, we simply divide it by the scale. For example, the floating point value 500.0 corresponds to the quantized value 500.0 / 3.9216 ≈ 127.5, which rounds to 128.

2. Affine (or asymmetric) quantization: In this case, the zero-point is non-zero in value.

Example: Let’s say we wish to map the floating point range [-20.0 .. 1000.0] to the quantized range [0 .. 255].

In this case, we have a different scaling factor, since our x_min is different: scale = (1000.0 - (-20.0)) / (255 - 0) = 1020 / 255 = 4.0.

Let’s see what the floating point number 0.0 is represented by in the quantized range if we apply only the scaling factor: 0.0 / 4.0 = 0.

This doesn’t quite seem right: we would expect the minimum floating point value, -20.0, to map to the quantized value 0, yet the scale alone sends -20.0 to -20.0 / 4.0 = -5, which falls outside the quantized range.

The zero-point acts as a bias that shifts the scaled floating point value, and it corresponds to the value in the quantized range that represents the floating point value 0.0. In our case, the zero-point is the negative of the scaled representation of -20.0, i.e. -(-5) = 5, so a value is quantized as round(x / scale) + 5. The zero-point is always the negative of the representation of the minimum floating point value, since the minimum is always negative or zero.
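
The sketch below reproduces both worked examples in plain NumPy: the symmetric case maps [0.0 .. 1000.0] onto [0 .. 255] with the scale alone, while the affine case for [-20.0 .. 1000.0] also needs the zero-point of 5 derived above. It is a simplified illustration, not the exact scheme of any particular library.

import numpy as np

def quantize(x, x_min, x_max, q_min=0, q_max=255):
    # Affine quantization to an unsigned 8-bit range; symmetric when x_min == 0
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = q_min - round(x_min / scale)  # 0 in the symmetric case
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.uint8)
    return q, scale, zero_point

# Symmetric example: [0.0 .. 1000.0] -> [0 .. 255]
q, scale, zp = quantize(np.array([0.0, 500.0, 1000.0]), 0.0, 1000.0)
print(scale, zp, q)  # ~3.92, 0, [  0 128 255]

# Affine example: [-20.0 .. 1000.0] -> [0 .. 255]
q, scale, zp = quantize(np.array([-20.0, 0.0, 1000.0]), -20.0, 1000.0)
print(scale, zp, q)  # 4.0, 5, [  0   5 255]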

Modes of Quantization

  • Post-training Quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy.
  • Quantization Aware Training emulates inference-time quantization during training, creating a model that downstream tools can use to produce actually quantized models. The quantized models use lower precision (e.g. 8-bit integers instead of 32-bit floats), leading to benefits during deployment.

Note: Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained. Quantization-aware training (QAT) is a fine-tuning of the PTQ model, where the model is further trained with quantization in mind.
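
As a minimal illustration of post-training quantization (separate from the 4-bit bitsandbytes flow used later in this article), PyTorch’s dynamic quantization converts the linear layers of an already trained model to int8 in a single call; the toy model below is invented purely for the example.

import torch
import torch.nn as nn

# A toy "already trained" model (architecture chosen only for illustration)
model_fp32 = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization: Linear weights are stored as int8
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_int8(x).shape)  # torch.Size([1, 10])

Quantization-aware training, by contrast, inserts simulated quantization into the training loop so the weights learn to compensate for the rounding error before the model is actually converted.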

What is PEFT Finetuning?

PEFT Finetuning stands for Parameter Efficient Fine Tuning, a suite of techniques designed to fine-tune and train models more efficiently than traditional methods. By reducing the number of trainable parameters in a neural network, PEFT techniques, including Prefix Tuning, P-tuning, LoRA, and others, enhance training efficiency. LoRA, in particular, has gained prominence for its effectiveness and has spawned various adaptations like QLoRA and LongLoRA, each tailored for specific applications.

Low Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA)

LoRA, a cornerstone of PEFT, works by introducing a small set of new, trainable low-rank matrices (adapters) while keeping the pretrained weights frozen. Because these adapters can be merged back into the base weights after training, the deployed model size remains unchanged while still benefiting from parameter-efficient fine-tuning (a minimal sketch of this idea follows the QLoRA description below).

  • It is based on the hypothesis that the intrinsic rank of the weight matrices in a large language model is low. Researchers have shown that low-rank approximations of the weight matrices in large language models can achieve comparable performance to the original weight matrices on a variety of tasks. This suggests that most of the information in the weight matrices can be captured by a small number of parameters.

QLoRA is an advanced technique designed for parameter-efficient fine-tuning of large pre-trained language models (LLMs). It builds upon the principles of Low-Rank Adaptation (LoRA) but introduces additional quantization to enhance parameter efficiency further.

  • QLoRA leverages a frozen, 4-bit quantized pretrained language model and backpropagates gradients into Low-Rank Adapters (LoRA). This combination optimizes both computation (through low-bit quantization) and the number of trainable parameters (through the low-rank structure). The QLoRA paper compares different finetuning methods and their memory requirements: QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.
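
To see why this is parameter-efficient, the sketch below wraps a frozen linear layer with a trainable low-rank update (delta_W = B @ A of rank r). It is a simplified illustration of the idea rather than the PEFT library’s implementation, and the layer dimensions and rank are assumptions chosen to roughly match the configuration used later in this article.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight W plus a trainable low-rank update B @ A
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # delta_W starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(2048, 2048, r=8)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(frozen, trainable)  # 4194304 frozen vs 32768 trainable (~0.8%)

In the training code below, the same idea is expressed through LoraConfig with r=8 and a list of target_modules, and QLoRA additionally loads the frozen base model in 4-bit precision via BitsAndBytesConfig.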

Installing Required packages

!pip3 install -q -U bitsandbytes peft trl accelerate datasets transformers

  • bitsandbytes: a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8-bit & 4-bit quantization functions.
  • peft: Parameter-Efficient Fine-Tuning (PEFT) is a methodology used in transfer learning to efficiently fine-tune large pre-trained models without modifying most of their original parameters.
  • trl: TRL is a full stack library where we provide a set of tools to train transformer language models with Reinforcement Learning.
  • accelerate: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code, making training and inference at scale simple, efficient, and adaptable.
  • datasets: Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
  • transformers: Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

Importing required libraries

import os

import torch
import transformers
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GemmaTokenizer,
)
from trl import SFTTrainer

Getting API key

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

Configuring 4-bit quantization

model_id = "google/gemma-2b"

# 4-bit NF4 quantization with bfloat16 compute (the QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Loading the tokenizer and model

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])

# Load Gemma 2B in 4-bit precision, placed entirely on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
    token=os.environ["HF_TOKEN"],
)

Configuring LoRA

os.environ["WANDB_DISABLED"] = "false"

# Attach rank-8 LoRA adapters to the attention and MLP projection layers
lora_config = LoraConfig(
    r=8,
    target_modules=[
        "q_proj",
        "o_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    task_type="CAUSAL_LM",
)

Loading data for training

# Load the english_quotes dataset and tokenize the "quote" field in batches
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

Training Model

def formatting_func(example):
    # Combine a quote with its author into a single training prompt
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # effective batch size of 4
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",  # paged optimizer to absorb memory spikes
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

trainer.train()

Inferencing

text = "Quote: A woman is like a tea bag;"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
