Level Up your RAG: Tuning Embeddings on Vertex AI
Premise: This blog post is for demonstration purposes only. The context, code, and image samples are provided for educational purposes and should not be used in a commercial context.
Despite recent advancements, Large Language Models (LLMs) still face several challenges when used to enhance information retrieval and search applications. One significant challenge is hallucination, where an LLM generates an answer based on its interpretation of a query rather than on verified facts.
Retrieval-Augmented Generation (RAG) is a solution to this issue. RAG leverages a retrieval component to identify relevant information in a knowledge base before passing it to the LLM for generation. The retrieval component ensures the relevance of the retrieved information by using a text embedding model and similarity-based ranking between queries and content. Consequently, the generated output is grounded in verified data.
In this scenario, a meaningful representation of both queries and content is essential for effective retrieval. However, pre-trained models do not guarantee this representation for retrieval tasks on domain-specific data. Fine-tuning the embedding model with retrieval-specific domain data is expected to significantly improve its information retrieval capabilities.
On Google Cloud, Vertex AI provides a text-embeddings API to create text embeddings with the pretrained textembedding-gecko and textembedding-gecko-multilingual models. In this article, you will learn how to tune the textembedding-gecko model to adapt it to your domain-specific retrieval data.
By the end of this article, you will have a better understanding of the process involved in tuning a text embedding model, covering key steps such as preparing the embeddings dataset, running a tuning job, evaluating model performance, and deploying the tuned model to retrieve relevant content.
Introduction to text embeddings on Vertex AI
On Google Cloud, Vertex AI’s text-embeddings API enables users to generate text embeddings. The API offers various versions of Gecko, a well-performing, compact, and versatile text embedding model. Gecko’s retrieval performance stems from a key idea: distilling knowledge from large language models (LLMs) into a retriever. To learn more about Gecko, check out the original paper.
On Vertex AI, you can choose between the textembedding-gecko and textembedding-gecko-multilingual models depending on the languages of the documents you want to represent. You can also condition how the embeddings are generated on the downstream application and get better-quality embeddings. For example, you can set the RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT task types to specify that the query and the document are consumed for search and retrieval purposes, respectively. Another parameter controls the dimensionality of the generated embeddings, which helps optimize storage.
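As an illustration, here is a minimal sketch of calling the text-embeddings API with the Vertex AI Python SDK; the project, region, and sample texts are placeholders, and the task types mirror the retrieval settings described above.
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# Placeholder project and region
vertexai.init(project="your-project-id", location="us-central1")

# Load the pretrained Gecko embedding model
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

# Embed a document chunk and a query with retrieval-specific task types
inputs = [
    TextEmbeddingInput(text="Alphabet reported revenue growth in Q3 2023...", task_type="RETRIEVAL_DOCUMENT"),
    TextEmbeddingInput(text="What were Alphabet's Q3 2023 revenues?", task_type="RETRIEVAL_QUERY"),
]
embeddings = model.get_embeddings(inputs)
print(len(embeddings[0].values))  # embedding dimensionality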
Both the textembedding-gecko and textembedding-gecko-multilingual foundation models have been trained on a large set of general text data. If your use case involves domain-specific data (such as financial documents), you may need to fine-tune the text embedding model to enhance the performance of the RAG application you are developing.
In research, there are several approaches to fine-tune a text embedding model. For example, Cloud AI research recently introduced a novel framework, Search-Adaptor, for customizing LLMs for information retrieval in an efficient and robust way. According to the paper, Search-Adaptor does not require access to the LLM weights and modifies the pre-trained LLM embeddings by learning from paired query-document data, yielding significantly improved retrieval performance on target tasks.
In the following section, you explore the process of tuning the textembedding-gecko text embedding model on Vertex AI.
Tune text embeddings on financial documents with Vertex AI
Imagine that you are building a RAG-based LLM application for question answering on financial documents. Below is a high-level picture of the RAG application.
For simplicity, in this scenario you only have the 2023 Q3 Alphabet Earnings Release PDF document, a detailed report of Alphabet’s financial performance during the third quarter of 2023.
The document is preprocessed using Document AI, a managed AI service for processing unstructured data, together with a LangChain splitter. Each resulting chunk is converted into a text embedding using the textembedding-gecko model on Vertex AI.
Each embedding is then indexed in Vertex AI Vector Search, the managed vector similarity-matching service on Vertex AI. Each time a new query is submitted, it is converted into the corresponding text embedding using the textembedding-gecko model, the most similar indexes are returned by Vertex AI Vector Search, and they are used to retrieve the most similar content from Memorystore, a managed in-memory Redis service that stores the chunk content.
Finally, the relevant content is passed to Gemini-Pro to generate the grounded response.
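To make this flow concrete, here is a hedged sketch of the query-time path, assuming a Vector Search index has already been deployed and that chunk texts are stored in Memorystore (Redis) keyed by datapoint id; the index endpoint resource name, deployed index id, and Redis host are placeholders, not values from this article.
import redis
import vertexai
from google.cloud import aiplatform
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")

# 1. Embed the incoming query with the retrieval-specific task type
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
query = "What were Alphabet's Q3 2023 revenues?"
query_vector = embedding_model.get_embeddings(
    [TextEmbeddingInput(text=query, task_type="RETRIEVAL_QUERY")]
)[0].values

# 2. Look up the most similar chunk ids in Vertex AI Vector Search (placeholder names)
index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/.../locations/.../indexEndpoints/..."
)
neighbors = index_endpoint.find_neighbors(
    deployed_index_id="earnings_chunks_deployed",
    queries=[query_vector],
    num_neighbors=5,
)

# 3. Fetch the corresponding chunk texts from Memorystore (Redis), assuming they
#    were written under the same ids at indexing time
redis_client = redis.Redis(host="10.0.0.3", port=6379)
context = "\n".join(
    redis_client.get(neighbor.id).decode("utf-8") for neighbor in neighbors[0]
)

# 4. Ask Gemini Pro to answer the query grounded in the retrieved context
llm = GenerativeModel("gemini-pro")
response = llm.generate_content(
    f"Answer using only this context:\n{context}\n\nQuestion: {query}"
)
print(response.text)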
Given this RAG system configuration, below are the resulting test and validation nDCG@10 metrics for the base textembedding-gecko model.
Here, nDCG@10 measures how close the ranking of the top 10 retrieved documents is to the best possible ranking. Its value ranges between 0 and 1: the higher the score, the better the model is at returning content in the correct order of relevance. In this case, an nDCG@10 score of 0.63 means that the embedding model performs reasonably well, but there is still room for improvement, and you may want to consider fine-tuning the model.
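To make the metric concrete, here is a small, self-contained sketch of how nDCG@k can be computed, using made-up relevance labels; it is only illustrative and not the implementation used by the tuning pipeline.
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the top 10 retrieved chunks for one query (1 = relevant, 0 = not)
retrieved_relevances = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(round(ndcg_at_k(retrieved_relevances, k=10), 2))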
To tune the textembedding-gecko model on Vertex AI, you need to cover the following steps:
First of all, you have to prepare the tuning dataset. According to the Vertex AI documentation, the tuning dataset consists of three parts, the corpus, query, and labels files, each with certain data requirements.
The corpus file is a JSONL file where each line has the fields _id, title (optional), and text for each relevant chunk. The query file is a JSONL file where each line has the fields _id and text for each relevant query. The labels files are TSV train, test (and optionally val) files with query-id, corpus-id, and score columns. The query-id references the query id in the query file, the corpus-id references the corpus id in the corpus file, and score indicates relevance, with higher scores meaning greater relevance. A default score of 1 is used if none is specified.
To extract text chunks from the Alphabet Earnings Release PDF for the corpus file, you can use the DocAIParser and RecursiveCharacterTextSplitter in LangChain. First, you upload your document to a Cloud Storage bucket. Then you create a Document AI OCR processor to identify and extract text in PDF documents. Next, you pass the processor to the DocAIParser to run a batch processing job. The job returns LangChain Documents containing the extracted text and metadata. Finally, you split the Documents’ text into smaller chunks of a chosen size, based on a set of separator characters, using the RecursiveCharacterTextSplitter.
Here is an example of the pseudocode you might have for extracting the document content.
# Imports assume the langchain-google-community package; adjust to your LangChain version
from langchain_core.documents.base import Blob
from langchain_google_community import DocAIParser

# Create the Document AI OCR processor (create_processor is a helper that wraps the Document AI client)
processor = create_processor(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Initialize the LangChain DocAI parser
parser = DocAIParser(
    processor_name=processor.name,
    location=LOCATION,
    gcs_output_path=PROCESSED_DATA_OCR_URI,
)

# Point a blob at the PDF on Cloud Storage and run a batch processing job
blob = Blob(path="gs://.../goog-10-k-2023.pdf")
operations = parser.docai_parse([blob])

# Wait for and collect the batch processing results
results = parser.get_results(operations)

# Parse the results into LangChain Documents
docs = list(parser.parse_from_results(results))

# Show the first result
print(docs[0])
# Document(page_content='...', metadata={'page': 1, 'source': 'gs://.../goog-10-k-2023.pdf'})
And here is a pseudocode example of how to get the text chunks.
# Import path may vary with your LangChain version
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500,
    chunk_overlap=250,
    length_function=len,
    is_separator_regex=False,
)

# Create text chunks, keeping the source page number in the metadata
document_content = [doc.page_content for doc in docs]
document_metadata = [{"page": idx} for idx, doc in enumerate(docs, 1)]
chunks = text_splitter.create_documents(document_content, metadatas=document_metadata)
In a real-world scenario, gathering relevant queries for each chunk of the query file requires logging user requests from your LLM-based application. For demonstration purposes, you can use a large language model (LLM) to generate hypothetical questions that are relevant to each chunk. This approach enables the generation of synthetic, relevant query-chunk pairs in a scalable manner. Below is an example of the code for generating relevant queries per chunk.
generated_queries = [
    query
    for chunk in chunks
    for query in generate_queries(chunk=chunk, num_questions=3)
]
where num_questions indicates the number of queries you want the LLM to generate for each chunk.
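Here, generate_queries is a hypothetical helper rather than an SDK function. One possible sketch, assuming Gemini drafts the questions and each question is wrapped in a LangChain Document that keeps the source page in its metadata (the prompt and response parsing are illustrative):
from langchain_core.documents import Document
from vertexai.generative_models import GenerativeModel

llm = GenerativeModel("gemini-pro")

def generate_queries(chunk, num_questions=3):
    """Ask Gemini for questions answerable from the chunk; return one Document per question."""
    prompt = (
        f"Generate {num_questions} questions that can be answered using only the text below. "
        f"Return one question per line.\n\nTEXT:\n{chunk.page_content}"
    )
    response = llm.generate_content(prompt)
    questions = [line.strip() for line in response.text.splitlines() if line.strip()]
    return [
        Document(page_content=question, metadata={"page": chunk.metadata["page"]})
        for question in questions[:num_questions]
    ]
Because each call returns a list of query Documents, the list comprehension shown above flattens the per-chunk results into a single list of queries.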
After you collect both your chunks and your queries, you can create the training and test labels by mapping relevant queries for each chunk as you can see below.
# Create the corpus file
corpus_df = pd.DataFrame(
    {
        "_id": ["text_" + str(idx) for idx in range(len(chunks))],
        "text": [chunk.page_content for chunk in chunks],
        "doc_id": [chunk.metadata["page"] for chunk in chunks],
    }
)

# Create the query file
query_df = pd.DataFrame(
    {
        "_id": ["query_" + str(idx) for idx in range(len(generated_queries))],
        "text": [query.page_content for query in generated_queries],
        "doc_id": [query.metadata["page"] for query in generated_queries],
    }
)

# Create the labels by pairing each query with the chunk it was generated from
score_df = corpus_df.merge(query_df, on="doc_id", suffixes=("_corpus", "_query"))
score_df = score_df.rename(columns={"_id_corpus": "corpus-id", "_id_query": "query-id"})
score_df = score_df[["query-id", "corpus-id"]]
score_df["score"] = 1

# Split the labels into train and test sets
train_df = score_df.sample(frac=0.8)
test_df = score_df.drop(train_df.index)
where the score column is always equal to 1 because each synthesized query is, by construction, relevant to the chunk it was generated from.
Now that you have the corpus, query, and labels data, you store the corpus and query files as JSONL and the labels files as TSV in Cloud Storage.
# Save the corpus and query files as JSONL
corpus_df.to_json("gs://.../corpus.jsonl", orient="records", lines=True)
query_df.to_json("gs://.../query.jsonl", orient="records", lines=True)

# Save the train and test labels files as TSV
train_df.to_csv("gs://.../train.tsv", sep="\t", header=True, index=False)
test_df.to_csv("gs://.../test.tsv", sep="\t", header=True, index=False)
Here is an example of how the resulting corpus, query, and labels files would look.
# corpus
{"_id": "text_1", "text": "Table of Contents\n• the expected timing, amount...", "doc_id": "0"}
{"_id": "text_2", "text": "companies, or any relationship with any of the...", "doc_id": "1"}
{"_id": "text_3", "text": "in research and development in the last five y...", "doc_id": "2"}
# query
{"_id": "query_0", "text": "What are the risks associated with forward-loo...", "doc_id": "0"}
{"_id": "query_1", "text": "What are the three main business segments repo...", "doc_id": "1"}
{"_id": "query_2", "text": "When did Google first incorporate machine lear...", "doc_id": "2"}
# (train) labels
corpus-id query-id score
text_1 query_0 1
text_1 query_1 1
text_1 query_2 1
Once you have your tuning dataset, you are ready to run an embedding model tuning pipeline job on Vertex AI using Vertex AI Pipelines. To run a tuning pipeline job, you initiate a pipeline job with the required parameters, including the Cloud Storage paths of the train and test datasets, the training batch size, the number of steps to perform model tuning, and the embedding tuning pipeline template. Then you submit a pipeline run using the Vertex AI Python SDK, as shown in the following example.
from google.cloud import aiplatform

# Derive the number of tuning iterations from the training set size
ITERATIONS = len(train_df) // BATCH_SIZE
params = {
    "batch_size": BATCH_SIZE,
    "iterations": ITERATIONS,
    "accelerator_type": TRAINING_ACCELERATOR_TYPE,
    "machine_type": TRAINING_MACHINE_TYPE,
    "base_model_version_id": "textembedding-gecko@003",
    "queries_path": "gs://.../query.jsonl",
    "corpus_path": "gs://.../corpus.jsonl",
    "train_label_path": "gs://.../train.tsv",
    "test_label_path": "gs://.../test.tsv",
    "project": PROJECT_ID,
    "location": REGION,
}
template_uri = "https://us-kfp.pkg.dev/ml-pipeline/llm-text-embedding/tune-text-embedding-model/v1.1.1"
# Define and submit the tuning pipeline job
job = aiplatform.PipelineJob(
    display_name="tune-text-embedding",
    parameter_values=params,
    template_path=template_uri,
    pipeline_root=PIPELINE_ROOT,
    project=PROJECT_ID,
    location=REGION,
)
job.run()
After the tuning pipeline job runs successfully, you can review it in the Vertex AI Pipelines UI and inspect the artifacts generated by the pipeline.
Apart from the tuned model, the pipeline job also produces evaluation metrics as artifacts: it automatically computes nDCG@10 for both the test and validation datasets. Below are the test and validation nDCG@10 metrics of the tuned textembedding-gecko model compared with the base model.
Tuning the model results in an nDCG@10 improvement over the base textembedding-gecko model (see above), which means that the top 10 retrieved chunks are now more likely to be exactly the ones relevant to answering the input query. In other words, the most relevant information is easier to find with the tuned embedding model, which makes it a good candidate for deployment using Vertex AI Prediction.
To deploy the tuned embedding model, you can create a Vertex AI Endpoint using the Vertex AI Python SDK, as illustrated in the code below.
endpoint = aiplatform.Endpoint.create(
    display_name="tuned_custom_embedding_endpoint",
    description="Endpoint for tuned model embeddings.",
    project=PROJECT_ID,
    location=REGION,
)
After you create the Vertex AI Endpoint, you can get the tuned model from the Vertex AI Model Registry using the pipeline job and deploy it to the endpoint, as shown below.
# Get the tuned model from the Vertex AI Model Registry
model = get_uploaded_model(job)

# Deploy the tuned model to the endpoint
endpoint.deploy(
    model,
    accelerator_type=PREDICTION_ACCELERATOR_TYPE,
    accelerator_count=PREDICTION_ACCELERATOR_COUNT,
    machine_type=PREDICTION_MACHINE_TYPE,
)
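Note that get_uploaded_model is a helper rather than an SDK method. A minimal sketch follows, assuming the tuning pipeline uploads the tuned model to the Vertex AI Model Registry and exposes its resource name in the metadata of one of its output artifacts; the "resourceName" metadata key and the traversal of the pipeline task details are assumptions you may need to adapt to your pipeline version.
from google.cloud import aiplatform

def get_uploaded_model(job: aiplatform.PipelineJob) -> aiplatform.Model:
    """Scan the pipeline task outputs for the tuned model's Model Registry resource name (assumed layout)."""
    for task in job.gca_resource.job_detail.task_details:
        for artifact_list in task.outputs.values():
            for artifact in artifact_list.artifacts:
                resource_name = dict(artifact.metadata).get("resourceName", "")
                if "/models/" in resource_name:
                    return aiplatform.Model(model_name=resource_name)
    raise ValueError("No uploaded model found in the pipeline outputs.")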
Deploying the tuned embedding model takes some time, but once the deployment completes, you will have an endpoint ready to receive prediction requests for the registered tuned model.
Given a query, the deployed embedding model only produces the associated embedding. To retrieve similar items with the tuned embedding model, you also need the corpus text and its precomputed embeddings. Both are tuning pipeline job artifacts stored in the Cloud Storage bucket, and you can read them with a helper function. Once you have them, you can use a similarity function to find the most relevant documents with respect to a query. Below is an example of finding similar items for a set of queries.
queries = [
    "What about the revenues?",
    "Who is Alphabet?",
    "What about the costs?",
]

output = get_top_k_documents(queries, corpus_text, corpus_embeddings, k=10)
where corpus_text and corpus_embeddings are the collection of original chunks and their associated embeddings, respectively, while k is the number of relevant chunks you want to retrieve for each query.
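Similarly, get_top_k_documents is a helper rather than an SDK call. Here is a minimal sketch based on cosine similarity, assuming corpus_embeddings is a NumPy array aligned with corpus_text and that the deployed endpoint returns one embedding vector per instance; the instance and response schema of the tuned model endpoint are assumptions you may need to adapt.
import numpy as np
import pandas as pd

def get_top_k_documents(queries, corpus_text, corpus_embeddings, k=10):
    """Rank corpus chunks by cosine similarity between query and chunk embeddings."""
    # Embed the queries with the tuned model deployed on the Vertex AI Endpoint
    # (the instance/response schema below is an assumption; adjust to your deployed model)
    response = endpoint.predict(instances=[{"content": query} for query in queries])
    query_embeddings = np.array(response.predictions)

    # Normalize both sides so the dot product equals cosine similarity
    corpus_norm = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    query_norm = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    similarities = query_norm @ corpus_norm.T

    # Keep the k most similar chunks per query
    rows = []
    for query, scores in zip(queries, similarities):
        for rank, idx in enumerate(np.argsort(scores)[::-1][:k], start=1):
            rows.append({"query": query, "rank": rank, "chunk": corpus_text[idx], "score": scores[idx]})
    return pd.DataFrame(rows)
Below is an example view of the resulting query-chunks table.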
Conclusions
Large Language Models (LLMs) often struggle when applied to information retrieval and search applications. Retrieval-Augmented Generation (RAG) seeks to overcome this challenge. But RAG requires a meaningful representation of both queries and content for effective retrieval with specific domain data.
This article introduces you to Vertex AI’s text-embedding foundation models and guides you through the process of tuning them to enhance their retrieval capabilities for a specific dataset (financial data) by leveraging Vertex AI on Google Cloud.
In conclusion, even though the domain-specific dataset was synthetically generated using Gemini, fine-tuning effectively improves the information retrieval capabilities of the embedding model.
What’s Next
Do you want to know more about Vertex AI Embeddings APIs and how to tune Vertex AI Embeddings models? Check out the following resources!
Documentation
Github samples
Thanks for reading
I hope you enjoyed the article. If so, please clap or leave your comments. Also let’s connect on LinkedIn or X to share feedback and questions.
Special thanks to Sarah Dugan for feedback and contribution!