A Step-by-Step Guide to Fine-Tuning Gemini for Question Answering
Imagine a foundation model that doesn’t just generate text, but excels at your specific use case, like question answering across your enterprise documents and data. This is the power of fine-tuned language models.
In this blog, we show how to enhance the capabilities of Gemini 1.5 Flash by building a robust Q&A system fine-tuned on the Stanford Question Answering Dataset (SQuAD 1.1). We’ll walk you through the entire process, from data preparation to evaluation, revealing the techniques and best practices for fine-tuning Gemini for your own GenAI applications.
Let’s get started! (You can find all the code in this notebook.)
What is fine-tuning?
To improve the performance of a foundation model on a specific task, like question answering, we can leverage a technique called “fine-tuning.” This involves training the model on labeled examples, i.e., examples with known inputs and outputs; we call this supervised fine-tuning (SFT). There are two main approaches:
- Full fine-tuning: Updates all of the model’s parameters. This method requires significant computational resources, and is not often used in practice.
- Parameter-Efficient Fine-Tuning (PEFT): Freezes the original model and only updates a small set of new parameters. Updating fewer parameters means it is more efficient and faster, making it ideal for working with large models and limited resources.
In this post, when we talk about fine-tuning, we’re referring to PEFT, and specifically to LoRA (Low-Rank Adaptation), a technique that fine-tunes large language models by training only a small set of additional low-rank parameters, making the process more efficient and less memory-intensive.
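Conceptually, LoRA freezes the pretrained weight matrix and learns a low-rank update on top of it. Here’s a minimal NumPy sketch of the idea (illustrative only; it says nothing about how Vertex AI implements tuning internally):

import numpy as np

d, k, r = 1024, 1024, 8            # original weight dims and a small LoRA rank r
W0 = np.random.randn(d, k)         # frozen pretrained weights (never updated)
A = np.random.randn(r, k) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # zero-initialized so training starts from W0

def forward(x):
    # Effective weights are W0 + B @ A, but only A and B are trained:
    # d*r + r*k parameters instead of d*k.
    return W0 @ x + B @ (A @ x)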
When to use fine-tuning?
PEFT is a powerful technique for enhancing foundation models when you have a specific task and annotated data. It excels in areas like:
- Domain expertise: Making your model more specialized in areas like law or medicine.
- Format customization: Tailoring outputs to specific structures, like JSON output.
- Task optimization: Improving performance on tasks like summarization.
- Behavior control: Guiding the model’s response style (e.g., concise vs. detailed).
PEFT is efficient, often requiring less data than other methods, and leads to easier model interaction with shorter prompts. However, because it modifies the model’s underlying weights, it’s not ideal for tasks with dynamic or evolving information — think real-time weather updates. In these cases, you might want to use something like retrieval augmented generation (RAG) or function calling, which can access and process real-time data.
With these considerations in mind, how do you choose the best approach? It’s important to understand that the optimal path depends on your unique needs, resources, and objectives for your use case. These techniques are not mutually exclusive and can often be combined for even greater performance. Let’s explore a framework that can help guide your decision.
Want to learn more about fine-tuning and when to use it? Read this blog post.
Data and use case
As discussed before, for SFT you need a specific task and annotated data that represent your use case. In this example, we’re using the Stanford Question Answering Dataset (SQuAD 1.1). This popular reading comprehension dataset consists of questions posed about Wikipedia articles (the context), where each answer is a segment of text (a span) from the corresponding passage. (Unanswerable questions, which add an extra layer of challenge, were only introduced in SQuAD 2.0; in SQuAD 1.1 every question has an answer in the passage.)
Fine-tuning isn’t only about boosting performance on specific tasks; it’s also a powerful tool for controlling output behavior as discussed before. For example, consider the following:
Context: The Panthers beat the Seattle Seahawks in the divisional round, running up a 31–0 halftime lead and then holding off a furious second half comeback attempt to win 31–24, avenging their elimination from a year earlier. The Panthers then blew out the Arizona Cardinals in the NFC Championship Game, 49–15, racking up 487 yards and forcing seven turnovers.
Question: Who did Carolina beat in the divisional round?
Gemini 1.5 Flash response: The Panthers beat the Seattle Seahawks in the divisional round.
SQuAD dataset answer: seattle seahawks
See the difference? Both answers are correct, but the answer from the SQuAD dataset is a more concise and focused response. Fine-tuning can help you shape your model’s output to meet your specific needs.
Data preparation
Pre-processing is a crucial step when fine-tuning, and it’s more than just a quick cleanup. Research has shown that one of the most important pre-processing steps is deduplication: identifying and removing duplicate data points so your model learns from diverse examples rather than redundant ones.
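As a minimal illustration, a dedup pass over the training examples could look like this (assuming a hypothetical list of question/context/answer dicts):

def deduplicate(examples):
    """Keep only the first occurrence of each (question, answer) pair."""
    seen, unique = set(), []
    for ex in examples:
        key = (ex["question"].strip().lower(), ex["answer"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique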
But that’s not all! When dealing with text data, we also need to handle inconsistencies like capitalization, extra whitespace, and punctuation. Inconsistent data can confuse your model or skew your evaluation, so we need to standardize it. For this example, we’ll remove extra whitespace and lowercase all answers to ensure consistency.
def normalize_answer(s):
    """Lowercase text and collapse all whitespace (including newlines) into single spaces."""
    def white_space_fix(text):
        return ' '.join(text.split())  # split() breaks on any whitespace, including \n

    def lower(text):
        return text.lower()

    return white_space_fix(lower(s))
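For example:

print(normalize_answer("  The Seattle\nSeahawks "))  # -> "the seattle seahawks"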
Model selection
When fine-tuning Gemini with Vertex AI, you have a choice of powerful models, each optimized for different needs:
- Gemini 1.5 Pro: Google’s top-performing model for general-purpose use cases. If you need the best possible accuracy across a wide range of tasks, this is your go-to.
- Gemini 1.5 Flash: Designed for speed and efficiency. Choose this model when you need fast responses and cost-effective performance.
When choosing a Gemini model, consider:
- Functionality: Start with the model that best fits your needs. If you need high accuracy and complex reasoning, use Gemini Pro. If latency and cost are more important, try Gemini Flash.
- Efficient Fine-tuning: Before fine-tuning a larger model, test your data on a smaller one like Gemini Flash. This helps ensure your data improves accuracy before investing in fine-tuning a larger model.
In this example, we’ll be using:
base_model = 'gemini-1.5-flash-002'
Establishing a baseline
Before fine-tuning your language model, establish a performance baseline. This means evaluating the foundation model on your data to understand its initial capabilities. In this example, we use computation-based metrics that compare the model’s output to a reference. Specifically, we’ll use:
- Exact Match (EM): Measures the percentage of perfect matches.
- F1 Score: Considers both precision and recall for a more nuanced view.
It’s best if you use a combination of metrics for a holistic understanding of your model’s strengths and weaknesses. You can also have a look at the evaluation capabilities that are offered on the Vertex AI platform to help guide your fine-tuning strategy.
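To make these metrics concrete, here’s a minimal sketch of how EM and token-level F1 are typically computed for SQuAD-style answers, reusing the normalize_answer helper from the data preparation step:

from collections import Counter

def exact_match(prediction, reference):
    """1 if the normalized prediction equals the normalized reference, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)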
Dataset format
When fine-tuning Gemini, your training data needs to be in a specific format: a JSON lines file where each line is a separate example. Make sure you store your file in a Google Cloud Storage (GCS) bucket. Each line in the JSONL file must adhere to the following schema:
{
"contents":[
{
"role":"user", # This indicates input content
"parts":[
{
"text":"Here goes the question and context"
}
]
},
{
"role":"model", # This indicates target content
"parts":[ # text only
{
"text":"Here goes the model response"
}
]
}
# ... repeat "user", "model" for multi turns.
]
}
In this structure:
- “contents” holds the conversation turns.
- Each object within “contents” has a specified role: “user” for the user’s input and “model” for the desired model output.
- “parts” contains the actual data: the model input (question, context, and instruction) or the model response (answer).
Tip: For multi-turn conversations, simply repeat the “user” and “model” structure within “contents”.
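For illustration, here’s a sketch that converts examples into this schema and writes them to a JSONL file (the squad_train list and the prompt template are hypothetical; adapt them to your data):

import json

def to_tuning_example(question, context, answer):
    prompt = f"Answer the question based on the context.\n\nContext: {context}\n\nQuestion: {question}"
    return {
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]},
            {"role": "model", "parts": [{"text": normalize_answer(answer)}]},
        ]
    }

with open("squad_train.jsonl", "w") as f:
    for ex in squad_train:  # hypothetical list of {"question", "context", "answer"} dicts
        f.write(json.dumps(to_tuning_example(ex["question"], ex["context"], ex["answer"])) + "\n")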
You can let the system create the train/validation split for you, in which case you only need to provide a train_dataset (see the code below), or you can split the data into train and validation sets yourself.
Start the fine-tuning job
Next you can start the fine-tuning job using the Vertex AI SDK for Python. Here’s how you launch a fine-tuning job with the sft.train() method:
from vertexai.preview.tuning import sft

tuned_model_display_name = "fine-tuning-gemini-flash-qa-v01"

sft_tuning_job = sft.train(
    source_model=base_model,
    train_dataset=f"{BUCKET_URI}/squad_train.jsonl",
    # Optional:
    validation_dataset=f"{BUCKET_URI}/squad_validation.jsonl",
    tuned_model_display_name=tuned_model_display_name,
)
Key Parameters:
- source_model: The starting point for your fine-tuning journey. Specify the pre-trained Gemini model version you’ll build upon.
- train_dataset: The fuel for your model’s learning. Provide the path to your training data in JSONL format.
- validation_dataset (Optional): A valuable checkpoint. This dataset allows you to evaluate the model’s performance during training.
- tuned_model_display_name: Give your creation a memorable name! This sets the display name for your fine-tuned model.
We are leaving the hyperparameters as their defaults. You can experiment with optional parameters like rank and learning rate to optimize performance.
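For instance, here’s a sketch with those optional hyperparameters set explicitly (adapter_size corresponds to the LoRA rank; the values below are illustrative, so check the Vertex AI documentation for current defaults):

sft_tuning_job = sft.train(
    source_model=base_model,
    train_dataset=f"{BUCKET_URI}/squad_train.jsonl",
    validation_dataset=f"{BUCKET_URI}/squad_validation.jsonl",
    tuned_model_display_name=tuned_model_display_name,
    epochs=3,                      # passes over the training data
    learning_rate_multiplier=1.0,  # scales the default learning rate
    adapter_size=4,                # LoRA rank
)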
Evaluating the Fine-tuned Model
After training, we evaluated the fine-tuned model on the same test dataset used for the baseline. This allowed us to directly compare performance and quantify the improvements gained through fine-tuning.
Tuning progress
During the tuning process, keep an eye on the training and validation metrics, which you can find in the Google Cloud Console after starting your training job.
- Total loss measures the difference between predicted and actual values. A decreasing training loss indicates the model is learning. Critically, observe the validation loss as well. A significantly higher validation loss than training loss suggests overfitting.
- Fraction of correct next step predictions measures the model’s accuracy in predicting the next item in a sequence. This metric should increase over time, reflecting the model’s growing accuracy in sequential prediction.
This example takes approximately 20 minutes to complete; however, the actual completion time will vary depending on your specific use case and dataset size.
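Since the tuning job runs asynchronously, you can poll it from the SDK until it finishes; here’s a minimal sketch using the job object returned by sft.train():

import time

while not sft_tuning_job.has_ended:
    time.sleep(60)            # check once a minute
    sft_tuning_job.refresh()  # pull the latest job state
print("Tuning job finished.")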
Evaluation
Evaluating a fine-tuned language model is crucial for understanding its performance, selecting checkpoints, and optimizing hyperparameters. Evaluation can be challenging for generative models, as their outputs are often open-ended and creative. To gain a holistic understanding of performance, it’s best to combine different evaluation approaches: primarily a blend of auto-metrics and model-based evaluation, potentially calibrated with human evaluation.
As mentioned above, in this example we use EM and F1. Fine-tuning Gemini 1.5 Flash for this specific Q&A use case gives us a significant performance increase on both F1 and EM. Of course, you can also leverage prompt engineering and RAG to improve the performance of the baseline model.
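For reference, here’s a sketch of such an evaluation loop, reusing the exact_match and f1_score helpers from above (the test_examples structure is hypothetical):

def evaluate(model, test_examples):
    """Average EM and F1 over a list of {"prompt", "answer"} dicts (hypothetical format)."""
    em_total, f1_total = 0.0, 0.0
    for ex in test_examples:
        prediction = model.generate_content(contents=ex["prompt"]).text
        em_total += exact_match(prediction, ex["answer"])
        f1_total += f1_score(prediction, ex["answer"])
    n = len(test_examples)
    return {"exact_match": 100 * em_total / n, "f1": 100 * f1_total / n}

Running this with both the baseline and the tuned model lets you quantify the improvement directly.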
Using the fine-tuned model
When the fine-tuning job is finished, your model is ready for use! You only need to load the fine-tuned model from its endpoint:
from vertexai.generative_models import GenerativeModel

# Load the tuned model from its deployed endpoint
tuned_model_endpoint_name = sft_tuning_job.tuned_model_endpoint_name
tuned_genai_model = GenerativeModel(tuned_model_endpoint_name)

# Test with the loaded model; `prompt` is a context + question string
# formatted the same way as the training examples
print("***Testing***")
print(tuned_genai_model.generate_content(contents=prompt).text)
Conclusion
You’ve now got the know-how to fine-tune Gemini, turning this powerful foundation model into a specialized Q&A machine for your information. From prepping your data to evaluating your model’s performance, we’ve covered the essential steps for harnessing the power of Vertex AI. You can get started by using this notebook. In the same repo you will also find some other great notebooks on fine-tuning.
With a bit of practice (and maybe a few extra shots of espresso ☕), you’ll be fine-tuning Gemini for all sorts of exciting applications.