Initiating Gemma Fine-Tuning on Google Colab: A Comprehensive Guide

Shital Nandre
6 min read · Feb 25, 2024


Unlock the potential of Gemma, Google's cutting-edge open language model, with this comprehensive tutorial on fine-tuning. Discover how to harness QLoRA and supervised fine-tuning to adapt Gemma to specific tasks, whether code generation or natural language understanding. Follow step-by-step instructions to load the model, prepare the dataset, apply LoRA for efficient training, and integrate seamlessly with existing tools like Hugging Face's ecosystem. Elevate your AI capabilities with Gemma and bring state-of-the-art language processing to your projects.

Gemma: Empowering Responsible AI Innovation Worldwide

Gemma, a new family of lightweight, state-of-the-art open models, draws inspiration from Google’s Gemini models and is developed by Google DeepMind and other teams. Available worldwide, Gemma comes in two sizes: Gemma 2B and Gemma 7B, each with pre-trained and instruction-tuned variants. Alongside model weights, a Responsible Generative AI Toolkit aids in creating safer AI applications. Toolchains for inference and supervised fine-tuning are provided for JAX, PyTorch, and TensorFlow, with integration into popular platforms like Hugging Face and NVIDIA NeMo. Gemma models offer easy deployment on various platforms, including Google Cloud, with optimization for industry-leading performance across hardware platforms like NVIDIA GPUs and Google Cloud TPUs.

Prerequisites

Ensure access to suitable GPU resources: Gemma-2B can be fine-tuned on a T4 GPU (available on free Google Colab), while Gemma-7B requires an A100 GPU.
Install necessary Python packages by executing the provided commands.
Start by verifying GPU detection to proceed with the fine-tuning process.

Getting Started

!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0

Checking GPU

!nvidia-smi
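If you prefer to check for the GPU from Python rather than the shell, here is a quick optional sanity check (not part of the original post):

import torch

if torch.cuda.is_available():
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected. In Colab: Runtime > Change runtime type > select a GPU.")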

Essential Libraries and Tools

import json
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer

Logging into Hugging Face Hub

Log in to the Hugging Face Model Hub using your credentials:

notebook_login()
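If you are running outside a notebook or want to avoid the interactive widget, the hub client also accepts a token directly; the token below is a placeholder you must replace with your own access token:

from huggingface_hub import login

login(token="hf_your_token_here")  # placeholder; use your own Hugging Face access token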

Load Gemma model for causal language modeling

We use the 'google/gemma-7b-it' model for this task. Keep in mind that the 7B variant needs an A100 GPU for fine-tuning; if you only have the free Colab T4, switch to 'google/gemma-2b-it' (one of the commented-out alternatives below).
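The loading snippet below passes a `bnb_config` object that is never defined in the original post. A typical 4-bit QLoRA quantization configuration looks like the following; the specific values (NF4, compute dtype, double quantization) are common defaults rather than settings taken from the article:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the base model weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization, the usual choice for QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support (e.g. T4)
    bnb_4bit_use_double_quant=True,         # nested quantization for a little extra memory savings
)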

model_id = "google/gemma-7b-it"
# model_id = "google/gemma-7b"
# model_id = "google/gemma-2b-it"
# model_id = "google/gemma-2b"

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

Let's Load the Dataset

In this tutorial, we'll fine-tune Gemma 7B Instruct for code generation. We'll use a dataset curated by TokenBender, known for its clear instructions. This dataset is structured in a straightforward way, making it a good fit for our task. The dataset structure should resemble the following:

{
  "instruction": "Create a function to calculate the sum of a sequence of integers.",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}

Now, let's load the dataset using Hugging Face's datasets library:

# Load your dataset (replace 'your_dataset_name' and 'split_name' with your actual dataset information)
# dataset = load_dataset("your_dataset_name", split="split_name")
dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split="train")
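To confirm the dataset loaded correctly, you can inspect its columns and a sample record before formatting; this is an optional sanity check, not part of the original walkthrough:

print(dataset)                    # column names and number of rows
print(dataset[0]["instruction"])  # first instruction
print(dataset[0]["output"])       # corresponding expected output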

Dataset Formatting

Now, let’s format the dataset according to the specified Gemma instruction format.

<start_of_turn>user Can you provide instructions for making a classic spaghetti carbonara? <end_of_turn>
<start_of_turn>model Certainly! Here's a simple recipe: Cook spaghetti until al dente, fry pancetta until crispy, whisk eggs and Parmesan cheese, then mix with cooked spaghetti and pancetta. Serve immediately with freshly ground black pepper!<end_of_turn>

Use the provided code snippet to prepare your dataset and generate a JSONL file formatted correctly.

def generate_prompt(data_point):
    """Generate the training prompt from a task instruction, optional input context, and answer.

    :param data_point: dict: a single dataset record
    :return: str: formatted prompt text
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
                  'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} <end_of_turn>\n<start_of_turn>model{data_point["output"]} <end_of_turn>"""
    # Samples without additional context.
    else:
        text = f"""<start_of_turn>user {prefix_text} {data_point["instruction"]} <end_of_turn>\n<start_of_turn>model{data_point["output"]} <end_of_turn>"""
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)

We must tokenize our data to ensure compatibility with the model’s understanding.

dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

Split the dataset into 80% for training and 20% for testing:

dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]

After formatting, the dataset should resemble the following structure:

{
  "text": "<start_of_turn>user Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] <end_of_turn>\n<start_of_turn>model # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum <end_of_turn>",
  "instruction": "Create a function to calculate the sum of a sequence of integers",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
  "prompt": "<start_of_turn>user Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] <end_of_turn>\n<start_of_turn>model # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum <end_of_turn>"
}

When fine-tuning with the supervised fine-tuning trainer (SFTTrainer), only the column named by `dataset_text_field` (here, the 'prompt' column) is used for training.

Configuring Model Parameters and Implementing LoRA for Fine-Tuning

We’re configuring various parameters for our fine-tuning process, including QLoRA and bitsandbytes settings, along with specifying training arguments. Additionally, we’re leveraging the power of PeftModel to apply low-rank adapters (LoRA) using the `get_peft_model` utility function and `prepare_model_for_kbit_training` method from PEFT.

from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=['o_proj', 'q_proj', 'up_proj', 'v_proj', 'k_proj', 'down_proj', 'gate_proj'],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
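The paragraph above mentions `prepare_model_for_kbit_training`, but the snippet never calls it. If you want to include it, it is typically applied to the quantized base model before attaching the LoRA adapters; the ordering below is a sketch based on common PEFT usage, not code from the original post:

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()           # optional: trade compute for memory during training
model = prepare_model_for_kbit_training(model)  # upcasts a few layers and enables input gradients for stable k-bit training
model = get_peft_model(model, lora_config)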

Calculating the number of trainable parameters

trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

Fine-Tuning with QLoRA Using SFTTrainer from the trl Library

We're ready to fine-tune our model with QLoRA. This tutorial uses the SFTTrainer from the trl library for supervised fine-tuning. Make sure you have installed the trl library as outlined in the prerequisites.

#new code using SFTTrainer
import transformers

from trl import SFTTrainer

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    dataset_text_field="prompt",
    peft_config=lora_config,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,  # warmup_steps expects an integer; a fractional warmup belongs in warmup_ratio
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="epoch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

Initiating the Training Process

# Start the training process
trainer.train()

new_model = "gemma-Code-Instruct-Finetune-test"  # Name of the model you will push to the Hugging Face Model Hub
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)

Merging and Sharing the Fine-Tuned Model

After completing the fine-tuning process, you can merge the LoRA weights back into the base model and, optionally, share the result on the Hugging Face Model Hub. This step is optional and depends on your use case.

# Merge the model with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

Testing the Merged Model

def get_completion(query: str, model, tokenizer) -> str:
    device = "cuda:0"
    prompt_template = """
<start_of_turn>user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
{query}
<end_of_turn>\n<start_of_turn>model

"""
    prompt = prompt_template.format(query=query)
    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encodeds.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    # decoded = tokenizer.batch_decode(generated_ids)
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return decoded

result = get_completion(query="implement a function to check if a number is prime in Python", model=merged_model, tokenizer=tokenizer)
print(result)

Congratulations! You’ve completed the fine-tuning process for Gemma Instruct, enabling it for code generation tasks. This process can be adapted for a wide range of natural language understanding and generation tasks. Continue exploring and experimenting with Gemma to unleash its full potential for your projects. Happy Fine-Tuning!
