Architectural Blueprints for RAG Automation: Advanced Document Understanding using Vertex AI Search

Arun Shankar
Google Cloud - Community
36 min read · May 20, 2024

This post deep dives into using Vertex AI Search to streamline the creation and evaluation of retrieval-augmented generation (RAG) pipelines for advanced document question answering. We’ll illustrate this process with a practical example using financial documents like quarterly reports from major tech companies. You’ll learn how to seamlessly ingest, index, query, and thoroughly evaluate your RAG pipeline.

Critically, we’ll explore Vertex AI’s customizable API workflows, allowing you to tailor your RAG system for peak performance. We’ll also demonstrate how to harness Gemini Pro to enhance search results and generate higher-quality answers. Additionally, we’ll dive into evaluating retrieval performance using search metrics such as Precision, Recall, Mean Reciprocal Rank (MRR), and Discounted Cumulative Gain (DCG). Finally, we’ll evaluate the generated answers against established ground truths for both semantic and factual accuracy.

The primary focus of this guide is to build robust Vertex AI Search solutions, emphasizing its built-in techniques for advanced document understanding, particularly involving complex tables. This guide is meant to be comprehensive and you can use it as a starting blueprint for your own domain-specific document understanding RAG use cases with Vertex AI.

Building upon our previous exploration of accelerating document discovery with Vertex AI Search, we’ll now concentrate on how to leverage sourced documents to extract complex answers and valuable insights. All code for setup, experimentation, and evaluation is shared in this GitHub repository.

Document Understanding using Google Cloud Vertex AI
Image generated by Imagen 2 (Vertex AI Studio)

Generative AI presents a multitude of opportunities for developers and enterprises, revolutionizing business processes, transforming customer experiences, and unlocking new revenue streams. However, to fully realize this potential, builders and IT leaders must navigate a complex landscape, balancing rapid experimentation and iteration of AI models, applications, and agents with the critical considerations of cost management, governance, and scalability.

At the recent Next’24 Google Cloud event, we unveiled Vertex AI Agent Builder, a game-changing solution that integrates our powerful Vertex AI Search and Conversation products with an array of advanced developer tools. This comprehensive offering empowers developers to create and deploy sophisticated AI-driven agents that can seamlessly address complex tasks and inquiries, ultimately driving innovation and efficiency across various domains.

Understanding Vertex AI Search

Vertex AI Search is a comprehensive platform within Google Cloud designed to empower organizations to craft customized search solutions for their employees and customers. This platform enables a Google Search-like experience across a wide array of data sources, including websites, structured data (e.g., BigQuery tables, JSON lines), and unstructured data (e.g., PDFs, HTML, text).

In our previous blog post, we discussed using Vertex AI Search to gather targeted webpages from publicly indexed websites. This method leverages pre-existing Google indexes of these webpages. We used this approach to build a knowledge discovery pipeline for mining PDF documents.

Consider this article a follow-up to our previous post: here we work with the extracted PDF documents. These documents might already be readily available to you, or they could be proprietary documents confidential to your enterprise. We will explore how to ingest, handle, and comprehend the data in these documents to build a system that can answer complex queries, such as retrieving factual information or pulling numbers from financial tables in quarterly reports.

Primarily, Vertex AI Search is a fully managed platform from GCP that integrates Google Search-quality capabilities into your enterprise data, offering two key advantages:

  • Elevated Search Experiences: It transforms traditional keyword-based search into modern, conversational experiences, much like Google’s innovative generative search. This enhancement significantly improves the effectiveness of internal and customer-facing applications.
  • Enhanced Generative AI Applications: It also aids in answer generation within generative AI applications. By grounding generative AI in your enterprise data, Vertex AI Search ensures greater accuracy, reliability, and relevance, essential for real-world business use cases. It acts as a ready-to-use RAG system, simplifying the integration of search capabilities.

Building a custom RAG pipeline can be complex. Vertex AI Search simplifies this process by providing a ready-to-use solution. It streamlines every aspect of the search and discovery process, from data extraction and transformation to information retrieval and summarization, reducing it to a few simple clicks. As a result, you can swiftly develop powerful RAG-powered applications using Vertex AI Search as the retrieval engine.

While the ready-to-use solution offers considerable convenience, Vertex AI Search also grants developers detailed control. You can use the platform’s flexibility to customize each stage of the RAG pipeline, making it fit your unique needs. This hybrid approach allows you to strike the ideal balance between pre-built components and custom enhancements, ensuring your applications align perfectly with your specific use case.

Vertex AI Search achieves this via a comprehensive set of APIs. These APIs expose the underlying components of Vertex AI Search’s RAG system, enabling developers to cater to custom use cases or serve customers who need detailed control. These include the Document AI Layout Parser API, Ranking API, Grounded Generation API, and Check Grounding API.

Let’s get started! First, we will dive into understanding our dataset, the foundation for our RAG pipelines. We’ll then learn how to effectively ingest this data into Vertex AI Search, organizing it for seamless retrieval. A key focus will be on indexing strategies within Vertex AI Search to ensure our AI can access the most relevant information when needed. We’ll then explore techniques for querying indexed documents, experimenting with various pipeline approaches. Finally, we’ll collect results and learn how to evaluate them for both retrieval accuracy and the quality of the generated answers. This journey will equip you with the knowledge to build smarter, more informed AI pipelines using the power of RAG and Vertex AI Search.

The Dataset

The dataset we’ll use for our experiments comprises quarterly reports from three tech companies: Alphabet, Amazon, and Microsoft. Spanning Q1 2021 to Q4 2023, it encompasses 36 documents (12 per company) over a three-year period.

To facilitate experimentation, we derived 100 question-answer pairs from these documents. Each pair is directly linked to a single document, establishing a single-hop question-answering scenario. The questions and answers, meticulously crafted, focus on extracting information from tables and complex passages, presenting a substantial challenge to RAG systems. This set of 100 question-answer pairs will serve as the ground truth for evaluating the performance of the various RAG pipeline designs we will cover here. The dataset of PDF financial quarterly reports can be found here. The ground truth, which contains the question-answer pairs, can be found in a CSV file named ground_truth.csv. This file also includes the following metadata: i) the mapped document, ii) the company name, and iii) the time period. This metadata is captured in a single field under the document column in the CSV file.

A sample table from Alphabet’s Q1 2021 report, summarizing financial results primarily related to operating income and margin, is shown below.

A sample question from our ground truth CSV, along with the expected answer derived from the table seen above.

What was Google's operating income (in billions) at the end of March 2021, and
how did it compare to the same period of the previous year?
Google's operating income was $16.437 billion in Q1 2021. This was an increase
from $7.977 billion in Q1 2020.

To formulate this answer, we need to first infer specific details from the query to retrieve the right document. This involves navigating to the correct page, referencing the appropriate table, and parsing the column information. We then map the field to the column heading and find the correct pieces. Finally, we consolidate all this gathered information into a cohesive answer.

Note: Microsoft follows a fiscal year that doesn’t align with the traditional calendar year. Its fiscal first quarter, for instance, covers July through September, so a fiscal-quarter earnings report does not correspond to the same-numbered calendar quarter. This offset is already accounted for in the questions and document names.

Document Ingestion and Indexing

To effectively use Vertex AI Search for understanding financial documents and answering questions, we first need to prepare and ingest our data. This involves creating a dedicated data store in Vertex AI Search and importing our financial documents from Google Cloud Storage (GCS) into this repository. Luckily, Vertex AI Search handles the parsing, chunking, and indexing of the information automatically for you.

Next, we’ll configure a document search application that utilizes the ingested data to provide robust search and retrieval capabilities. By following these steps, we establish a solid foundation for efficient indexing and exploration of our financial documents. This enables us to quickly access the information we need for our experiments and to develop a robust pipeline for document question answering. Let’s explore each step in more detail.

I. Creating a Data Store:

A data store in Vertex AI Search is essentially a container where your processed documents are stored. To create it, we assign a unique identifier and a display name for easy recognition within your Vertex AI Search project. At this point, the data store does not contain any documents; we will push (ingest) documents into it as the next step. Note that the raw PDF documents themselves remain stored in GCS.

The code snippet below provides a glimpse of how it’s done using the REST API for Vertex AI Search. You can also use the Vertex AI Python SDK; refer to the Discovery Engine documentation here. The full code to create the data store can be found here.

url = f"https://discoveryengine.googleapis.com/v1alpha/projects/{config.PROJECT_ID}/locations/global/collections/default_collection/dataStores?dataStoreId={data_store_id}"

headers = {
'Authorization': f'Bearer {config.ACCESS_TOKEN}',
'Content-Type': 'application/json',
'X-Goog-User-Project': config.PROJECT_ID
}
data = {
'displayName': data_store_display_name,
'industryVertical': IndustryVertical.GENERIC,
'solutionTypes': SolutionType.SOLUTION_TYPE_SEARCH,
'contentConfig': DataStore.ContentConfig.CONTENT_REQUIRED,
'documentProcessingConfig': {
'defaultParsingConfig': {
'layoutParsingConfig': {}
}
}
}

response = requests.post(url, headers=headers, json=data)
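
If you prefer the Python SDK over raw REST calls, a roughly equivalent sketch using the Discovery Engine client library is shown below. This is a minimal sketch, assuming the same identifiers as the REST call above; the document processing (layout parser) configuration is omitted for brevity.

from google.cloud import discoveryengine_v1alpha as discoveryengine

# Minimal SDK sketch for creating the data store (layout parser config omitted).
client = discoveryengine.DataStoreServiceClient()

parent = f"projects/{config.PROJECT_ID}/locations/global/collections/default_collection"

data_store = discoveryengine.DataStore(
    display_name=data_store_display_name,
    industry_vertical=discoveryengine.IndustryVertical.GENERIC,
    solution_types=[discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH],
    content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,
)

operation = client.create_data_store(
    request=discoveryengine.CreateDataStoreRequest(
        parent=parent,
        data_store=data_store,
        data_store_id=data_store_id,
    )
)
print(operation.result())  # blocks until the data store is created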

II. Ingesting Documents from GCS:

Once the data store is created, we can begin ingesting the financial documents from the specified GCS bucket. This process involves specifying the URI of the GCS bucket where all the original raw PDF documents of our dataset are stored. Prior to this, we will also need to create a manifest file. This is a JSON file that captures all the metadata of the documents we are going to ingest into Vertex AI Search. A sample row from this file, named metadata.json, is shown below.

{
  "id": "1",
  "jsonData": "{\"company\": \"alphabet\", \"time_period\": \"Q1 2021\"}",
  "content": {
    "mimeType": "application/pdf",
    "uri": "gs://vais-rag-patterns/raw_docs/alphabet-q1-2021.pdf"
  }
}
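
To give a feel for how such a manifest might be produced, here is a small illustrative sketch that derives the company and time_period fields from file names following the alphabet-q1-2021.pdf convention used in this dataset. The build_manifest helper and its defaults are hypothetical; adapt them to however your files are named and stored.

import json

# Hypothetical sketch: build metadata.json (one JSON object per line) from
# file names such as "alphabet-q1-2021.pdf".
def build_manifest(pdf_names, bucket="gs://vais-rag-patterns/raw_docs"):
    lines = []
    for idx, name in enumerate(pdf_names, start=1):
        company, quarter, year = name.removesuffix(".pdf").split("-")
        record = {
            "id": str(idx),
            "jsonData": json.dumps(
                {"company": company, "time_period": f"{quarter.upper()} {year}"}
            ),
            "content": {"mimeType": "application/pdf", "uri": f"{bucket}/{name}"},
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


with open("metadata.json", "w") as f:
    f.write(build_manifest(["alphabet-q1-2021.pdf", "amazon-q1-2021.pdf"]))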

A sample preview of the required code for initiating ingestion is shown below. This utilizes the REST API, and the complete code can be found in the Git repository linked here.

url = f"https://discoveryengine.googleapis.com/v1/projects/{project_id}/locations/global/collections/default_collection/dataStores/{data_store_id}/branches/0/documents:import"

headers = {
"Authorization": f"Bearer {config.ACCESS_TOKEN}",
"Content-Type": "application/json"
}

data = {
"gcsSource": {
"inputUris": [gcs_input_uri]
}
}

response = requests.post(url, headers=headers, json=data)
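
The import call returns a long-running operation. A minimal sketch of how you might poll it over REST is shown below; the operation name comes from the JSON body of the response above, and the sleep interval is an arbitrary choice.

import time

# Poll the long-running import operation returned by the previous call.
operation_name = response.json()["name"]
poll_url = f"https://discoveryengine.googleapis.com/v1/{operation_name}"

while True:
    status = requests.get(poll_url, headers=headers).json()
    if status.get("done"):
        print("Import finished:", status.get("response", status.get("error")))
        break
    time.sleep(30)  # ingesting a few dozen PDFs typically takes a few minutes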

III. Creating a Document Search Application:

After the successful ingestion of our documents, the final step is to create a document search application. This application will interact with our indexed data, providing the necessary tools and functionalities for searching, retrieving, and analyzing our financial documents.

The sample code needed to create this app is shown below. Note that we should enable the enterprise tier for search and activate advanced search using LLM to effectively perform document question answering. This process utilizes the REST API, but it can also be accomplished using the Python SDK. The complete code for application creation can be found here.

url = f"https://discoveryengine.googleapis.com/v1alpha/projects/{config.PROJECT_ID}/locations/global/collections/default_collection/engines?engineId={data_store_id}"

headers = {
"Authorization": f"Bearer {config.ACCESS_TOKEN}",
"Content-Type": "application/json",
"X-Goog-User-Project": config.PROJECT_ID
}

data = {
"displayName": data_store_display_name,
"dataStoreIds": [data_store_id],
"solutionType": SolutionType.SOLUTION_TYPE_SEARCH,
"searchEngineConfig": {
"searchTier": SearchTier.SEARCH_TIER_ENTERPRISE,
"searchAddOns": SearchAddOn.SEARCH_ADD_ON_LLM
}
}

response = requests.post(url, headers=headers, json=data)

To facilitate the entire process outlined above, you can utilize the provided script here, which handles all the necessary steps involved in data ingestion and application setup. By following this structured approach, you’ll harness the power of Vertex AI Search to transform your financial documents into a valuable, easily accessible knowledge base.

Architecture Patterns for RAG Automation

With our raw PDF documents ingested and indexed within Vertex AI Search, querying processed documents and generating answers can now be easily streamlined. The provided Python SDK sample code demonstrates how to query the datastore via the previously configured search application. The full code is available here for reference.

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine_v1 as discoveryengine

import config  # project-level settings: PROJECT_ID

# LOCATION, data_store_id, search_query, and filter_str come from earlier steps.
client_options = (
    ClientOptions(api_endpoint=f"{LOCATION}-discoveryengine.googleapis.com")
    if LOCATION != "global"
    else None
)

client = discoveryengine.SearchServiceClient(client_options=client_options)

serving_config = client.serving_config_path(
    project=config.PROJECT_ID,
    location=LOCATION,
    data_store=data_store_id,
    serving_config="default_config",
)

content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
    snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
        return_snippet=False  # snippets are not needed for our use case
    ),
    extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
        max_extractive_answer_count=3,
        max_extractive_segment_count=3,
    ),
    summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
        summary_result_count=5,
        include_citations=True,
        ignore_adversarial_query=False,
        ignore_non_summary_seeking_query=False,
    ),
)

request = discoveryengine.SearchRequest(
    serving_config=serving_config,
    query=search_query,
    filter=filter_str,
    page_size=5,
    content_search_spec=content_search_spec,
    query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
        condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
    ),
    spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
        mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
    ),
)

response = client.search(request)

Building on this, we can now readily establish RAG pipelines using Vertex AI Search APIs in various ways. Next, we will look into four common patterns, illustrating the flexibility and ease with which such pipelines can be implemented.

When constructing a search request within Vertex AI Search, it’s crucial to configure specifications for extracting valuable information. Enabling options for snippets, segments, and answers ensures a comprehensive retrieval of relevant content. Additionally, activating the LLM-powered summarization feature generates a concise summary (answer) of the search results, enhancing user experience. The resulting JSON response will contain both the summarized answer and the extracted segments and answers.

Vertex AI Search employs three distinct methods for segmenting and extracting text data:

  • Snippets: Brief excerpts from search result documents, offering a preview of the content and often incorporating hit highlighting.
  • Extractive Answers: Verbatim text directly extracted from the original document, providing concise, contextually relevant answers.
  • Extractive Segments: More verbose verbatim text extracts, suitable for answer presentation, post-processing tasks, and as input for large language models.

Furthermore, it’s possible to configure settings for automatic spell correction and query expansion, refining search accuracy and broadening potential results. For our use case, we disregard snippets.
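
For reference, here is a minimal sketch of how these pieces can be read from the SDK response. The extractive answers and segments live in each result's derived_struct_data; the key names below are the ones we have observed in the JSON response, so treat this as illustrative rather than a fixed contract.

from google.protobuf.json_format import MessageToDict

# Minimal sketch: collect extractive segments and answers from a search response.
def collect_contexts(response):
    segments, answers = [], []
    for result in response.results:
        doc = MessageToDict(result.document._pb)
        derived = doc.get("derivedStructData", {})
        segments += [s.get("content", "") for s in derived.get("extractive_segments", [])]
        answers += [a.get("content", "") for a in derived.get("extractive_answers", [])]
    return segments, answers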

Pattern I: Search with Out-of-the-Box (OOB) Answer Generation

Pattern I is a simple and common pipeline that can be implemented through the Vertex AI Agent Builder console or the Discovery Engine API. This pipeline uses Vertex AI Search to retrieve relevant pieces of information from search indexes that we have already set up previously via the datastore and the search app. The indices map to our original raw PDFs stored on GCS. The pipeline also generates concise answers based on the retrieved search results, using an internal system powered by an LLM. This eliminates the need for explicit calls to external LLMs. With just one API request, users can receive a detailed answer along with citations from supporting documents. This simplifies the process of retrieving and summarizing relevant information. You can find the code demonstrating this pipeline in our shared repository here.
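
As a rough sketch of what Pattern I boils down to in code: issue the search request shown earlier (with summary_spec enabled) and read the generated summary directly off the response. The client and request objects are the ones built in the SDK snippet above; citation handling is simplified here to just listing the matched documents.

# Pattern I in a nutshell: one search call; the answer is the LLM-generated summary.
response = client.search(request)

print(response.summary.summary_text)   # out-of-the-box generated answer
for result in response.results:        # supporting documents for citation purposes
    print(result.document.id)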

The diagram above illustrates this simple RAG pipeline powered by Vertex AI Search. The workflow begins with a collection of raw PDF documents stored in Cloud Storage. These documents are already ingested and processed by Vertex AI Search, which creates structured indexes to facilitate efficient search and retrieval. When a user submits a query, Vertex AI Search leverages these indexes to quickly identify the most relevant documents. It then extracts pertinent information from these documents and provides an answer to the user, along with citations indicating the source of the information.

A notable advantage of Pattern I is its simplicity and self-contained nature, eliminating the need for external LLM calls and consolidating the entire process into a single API request. However, this streamlined approach may introduce limitations in flexibility, potentially leading to the inclusion of extraneous information from less relevant documents in the final summary. Given our focus on single-hop question answering, where the answer is derived from a single source, ensuring the generation of answers solely from the topmost retrieved result becomes crucial.

In the subsequent pipeline iteration (Pattern II), we will explore the integration of metadata on the query understanding side, leveraging Gemini as the chosen LLM to enhance retrieval performance and refine results. This approach aims to minimize noise and optimize the summarized answer by focusing on the most relevant retrieved information.

Pattern II: Filtered Search with OOB Answer Generation

Pattern II enhances search by incorporating a pre-retrieval step focused on query understanding through filtering. This addresses scenarios where users prefer natural language queries, eliminating the need for manual filtering or accommodating situations where the desired filters might not be readily available.

The pipeline begins by processing the user’s natural language query using Gemini, which performs named entity recognition (NER) to extract key information such as company names and time periods. This extracted metadata, outputted as JSON, is then utilized to filter search results, aligning them more closely with the user’s intent. This process not only reduces noise but also significantly improves search metrics.

Also, more importantly, the extracted company name and time period information should conform to the formats and syntax established during the ingestion phase. In this implementation, metadata filtering leverages the company name and time period tags previously assigned to documents within Vertex AI Search.

Thus, this refinement builds upon Pattern I by introducing a query understanding step prior to retrieval, ultimately enhancing the accuracy and relevance of the retrieved results.

The workflow, as depicted in the architecture diagram above, begins with the user initiating a search using natural language, expressing their information need without the constraint of specific filters. The query is then passed to Gemini, which performs Named Entity Recognition (NER) to extract pertinent metadata such as company names and time periods. This extracted metadata is structured into JSON format for easy filtering and is subsequently utilized by Vertex AI Search to filter the document indexes, narrowing down the results to those that align with the extracted company and time period information. Finally, the filtered, and thus more relevant, documents are presented to the user, along with citations indicating the source of the information.

The code implementation of the described pipeline is available for reference here. Entity recognition within this implementation utilizes prompts, employing a zero-shot approach for company name extraction and a few-shot approach with positive and negative examples for time period extraction.

Given the query below, extract the company name from it.
The company name can be either `Microsoft`, `Alphabet`, or `Amazon`.
If company name is `LinkedIn`, translate to `Microsoft`.

IMPORTANT: The extracted company name must be a single word ONLY without any
line breaks, punctuation, or extra whitespace.

Given a query, extract the specific time period from it. A valid time period
should be in the form 'Q1 2021' only.

Examples of invalid formats include:
'Q2 2020 to Q2 2021'
'Q2 2020 - Q2 2021'
'Q2 2020, Q2 2021'

The extracted time period should represent only one quarter and one year,
corresponding to the present vs past.

IMPORTANT: Ignore past references when the query is comparing the present to
the past.

Examples
========
Translate 'first quarter of 2020' to 'Q1 2020'.
Translate 'increase in Q2 2021 compared to Q2 2020' to 'Q2 2021'.
Translate 'twelve months ending December 31, 2022,' to 'Q4 2022'.
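
Below is a minimal sketch of how the two prompts above might be wired together to produce a Vertex AI Search filter string. The Gemini model name, region, and the exact filter expression (field: ANY("value") over the company and time_period metadata fields defined at ingestion) are assumptions to adapt to your own setup.

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=config.PROJECT_ID, location="us-central1")  # assumed region
model = GenerativeModel("gemini-1.0-pro")  # assumed model id

def build_filter(query: str, company_prompt: str, period_prompt: str) -> str:
    # NER via Gemini: one call per entity type, using the prompts shown above.
    company = model.generate_content(f"{company_prompt}\n\nQuery: {query}").text.strip().lower()
    period = model.generate_content(f"{period_prompt}\n\nQuery: {query}").text.strip()
    # Filter over the metadata fields attached during ingestion (company, time_period).
    return f'company: ANY("{company}") AND time_period: ANY("{period}")'

filter_str = build_filter(
    "What was Google's operating income at the end of March 2021?",
    company_prompt,
    period_prompt,
)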

Pattern III: Filtered Search with Answer Generation using Extractive Segments and Gemini Pro

Previously, we learned about three different levels of granularity for retrieving relevant context in Vertex AI Search: snippets, extractive segments, and extractive answers. In this pattern, Pattern III, we leverage the extractive segments from the search results and use them to replace the out-of-the-box (OOB) answer generation step of Patterns I and II. This is especially important when you need a specific style, nuance, format, length, or structure for the generated answer. In such scenarios, we can take the extracted segments and explicitly pass them to an external LLM like Gemini as context, along with the query. We can design prompts flexibly in various ways to tailor the answer generation closer to our requirements. Pattern III encompasses this approach.

The architecture diagram shown above depicts Pattern III’s workflow. Here, the user first submits a natural language query, which is initially processed by Gemini to extract relevant metadata. This structured metadata, as in Pattern II, is then used by Vertex AI Search to efficiently search through document indexes and identify relevant documents. From these filtered documents, specific segments that directly answer the user’s query are extracted. In the final stage, Gemini processes these extracted segments to generate a comprehensive answer for the user, incorporating citation information to indicate the source documents. The user ultimately receives a final answer that addresses their query along with references to the documents from which the information was extracted. Code covering this pattern can be found here.

The prompt used for answer generation in this workflow can be as simple as the one shown below or more complex and elaborate based on your specific needs.

Based on the following context, provide a clear and concise answer to the 
question below:

Context: {context}

Question: {question}
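
A minimal sketch of how the extracted segments might be stitched into this prompt and passed to Gemini is shown below. It reuses the hypothetical collect_contexts helper and the GenerativeModel instance from the earlier sketches, along with the search response and user question.

# Pattern III sketch: extractive segments -> Gemini answer generation.
segments, _ = collect_contexts(response)     # hypothetical helper from earlier
context = "\n\n".join(segments[:3])          # top three segments as one context

prompt = (
    "Based on the following context, provide a clear and concise answer to the "
    f"question below:\n\nContext: {context}\n\nQuestion: {question}"
)
answer = model.generate_content(prompt).text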

Pattern IV: Filtered Search with Answer Generation using Extractive Answers and Gemini Pro

In this pipeline iteration, we largely maintain the structure of Pattern III, with a key modification: instead of utilizing extractive segments from retrieved documents, we leverage extractive answers. This alteration is made while retaining the previous prompt structure. To illustrate the impact of this change, we will examine a question from our test set and compare the generated answers derived from segments versus answers.
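
In code, the only change relative to the Pattern III sketch is which extracted field feeds the context. A tiny sketch, reusing the same hypothetical collect_contexts helper:

# Pattern IV sketch: same flow as Pattern III, but the context is built from
# extractive answers rather than extractive segments.
_, extractive_answers = collect_contexts(response)   # hypothetical helper from earlier
context = "\n\n".join(extractive_answers[:3])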

A sample question and the expected answer from the ground truth file are shown below.

What was LinkedIn's revenue increase in Q1 2021 according to Microsoft's 
earnings report, and what was the growth rate when adjusted for constant
currency?
In Q1 2021, LinkedIn's revenue increased by 25% 
year-over-year. When adjusted for constant currency, the growth rate
was 23%.

First, let’s examine the extractive segments returned by Vertex AI Search for the sample question. Below are the top three segments. We can see that the answer to our question can be derived from both Segment 1, under the Business Highlights, and Segment 2, as part of the parsed table content. To generate the final answer using Gemini post-retrieval, we simply concatenate the segments and pass them as a single context.

Extractive Segment 1

Business Highlights
Revenue in Productivity and Business Processes was $13.6 billion and increased
15% (up 12% in constant
currency), with the following business highlights:

• Office Commercial products and cloud services revenue increased 14%
(up 10% in constant currency) driven by Office 365 Commercial revenue
growth of 22% (up 19% in constant currency)
• Office Consumer products and cloud services revenue increased 5%
(up 2% in constant currency) and Microsoft 365 Consumer subscribers
increased to 50.2 million

• LinkedIn revenue increased 25% (up 23% in constant currency)
• Dynamics products and cloud services revenue increased 26% (up 22% in
constant currency)
driven by Dynamics 365 revenue growth of 45% (up 40% in constant currency)

Extractive Segment 2

Financial Performance Constant Currency Reconciliation
Three Months Ended March 31, ($ in millions, except per share amounts)

                                                     Revenue    Operating    Net        Diluted Earnings
                                                                Income       Income     per Share
2020 As Reported (GAAP)                              $35,021    $12,975      $10,752    $1.40
2021 As Reported (GAAP)                              $41,706    $17,048      $15,457    $2.03
2021 As Adjusted (non-GAAP)                          $41,706    $17,048      $14,837    $1.95
Percentage Change Y/Y (GAAP)                         19%        31%          44%        45%
Percentage Change Y/Y (non-GAAP)                     19%        31%          38%        39%
Constant Currency Impact                             $972       $634         $615       $0.08
Percentage Change Y/Y (non-GAAP) Constant Currency   16%        27%          32%        34%

Segment Revenue Constant Currency Reconciliation
Three Months Ended March 31, ($ in millions)

                                          Productivity and      Intelligent    More Personal
                                          Business Processes    Cloud          Computing
2020 As Reported                          $11,743               $12,281        $10,997
2021 As Reported                          $13,552               $15,118        $13,036
Percentage Change Y/Y                     15%                   23%            19%
Constant Currency Impact                  $366                  $367           $239
Percentage Change Y/Y Constant Currency   12%                   20%            16%

Selected Product and Service Revenue Constant Currency Reconciliation
Three Months Ended March 31, 2021

                                                          Percentage Change    Constant           Percentage Change
                                                          Y/Y (GAAP)           Currency Impact    Y/Y Constant Currency
Office Commercial products and cloud services             14%                  (4)%               10%
Office 365 Commercial                                      22%                  (3)%               19%
Office Consumer products and cloud services                5%                   (3)%               2%
LinkedIn                                                   25%                  (2)%               23%
Dynamics products and cloud services                       26%                  (4)%               22%
Dynamics 365                                               45%                  (5)%               40%
Server products and cloud services                         26%                  (3)%               23%
Azure                                                      50%                  (4)%               46%
Windows OEM                                                10%                  0%                 10%
Windows Commercial products and cloud services             10%                  (3)%               7%
Xbox content and services                                  34%                  (2)%               32%
Surface                                                    12%                  (5)%               7%
Search advertising excluding traffic acquisition costs     17%                  (3)%               14%

About Microsoft

Microsoft (Nasdaq "MSFT" @microsoft) enables digital transformation for the
era of an intelligent cloud and an intelligent edge. Its mission is to
empower every person and every organization on the planet to achieve more.

Extractive Segment 3

Revenue in Intelligent Cloud was $15.1 billion and increased 23% 
(up 20% in constant currency), with the following business highlights:
• Server products and cloud services revenue increased 26% (up 23% in
constant currency) driven by Azure revenue growth of 50% (up 46% in
constant currency)
Revenue in More Personal Computing was $13.0 billion and increased 19%
(up 16% in constant currency), with the following business highlights:

• Windows OEM revenue increased 10%
• Windows Commercial products and cloud services revenue increased 10%
(up 7% in constant currency)

• Xbox content and services revenue increased 34% (up 32% in constant currency)
• Search advertising revenue excluding traffic acquisition costs increased 17% (up 14% in constant currency)
• Surface revenue increased 12% (up 7% in constant currency)

Microsoft returned $10.0 billion to shareholders in the form of share
repurchases and dividends in the third quarter of fiscal year 2021, an
increase of 1% compared to the third quarter of fiscal year 2020.

Business Outlook

Microsoft will provide forward-looking guidance in connection with this
quarterly earnings announcement on its earnings conference call and webcast.
Quarterly Highlights, Product Releases, and Enhancements
Every quarter Microsoft delivers hundreds of products, either as new releases,
services, or enhancements to current products and services. These releases
are a result of significant research and development investments, made over
multiple years, designed to help customers be more productive and secure and
to deliver differentiated value across the cloud and the edge.
Here are the major product releases and other highlights for the quarter,
organized by product categories, to help illustrate how we are accelerating
innovation across our businesses while expanding our market opportunities.

Responding to COVID-19
At Microsoft, our focus remains on ensuring the safety of our employees,
striving to protect the health and well-being of the communities in which
we operate, and providing technology and resources to our customers and
partners to help them do their best work while remote. Additional information
about Microsoft's COVID-19 response can be found here.

Environmental, Social, and Governance (ESG)
To better execute on Microsoft's mission, we focus our Environmental,
Social, and Governance (ESG) efforts where we can have the most positive
impact. To learn more about our latest initiatives and priorities,
please visit our investor relations ESG website.

Below is the answer generated using the top three extractive segments. As you can see, the answer is more detailed and lengthy compared to our expected answer.

LinkedIn's Revenue Increase in Q1 2021:

According to Microsoft's earnings report, LinkedIn's revenue increased
by 25% in Q1 2021 compared to the same period in 2020.

When adjusted for constant currency, the growth rate was 23%.
This means that the increase in revenue was not solely due to fluctuations
in currency exchange rates.

Here's a breakdown of the information:

* Reported Revenue Increase: 25%
* Constant Currency Growth Rate: 23%

This information can be found in the "Financial Performance" and
"Selected Product and Service Revenue Constant Currency Reconciliation"
sections of the provided context.

Now, let’s take a look at the extractive answers returned by Vertex AI Search, available as part of our search response. In the search request sent previously, we set the search spec to return the top three matching extractive answers for the top relevant document. You can see the returned extractive answers below. They are spans of text similar to the segments, derived from the same top-matching document (Microsoft Q1 2021), but shorter and more concise.

Interestingly, we can see that the information needed to answer our question is only contained in the last (third) extractive answer. To generate the final answer, we concatenate all three extractive answers into one single context, feed it to Gemini alongside the original question, and generate the answer.

Extractive Answer 1

Microsoft Cloud Fuels Third Quarter Results REDMOND, Wash. - April 27, 2021 
- Microsoft Corp. today announced the following results for the quarter ended
March 31, 2021, as compared to the corresponding period of last fiscal year:
• Revenue was $41.7 billion and increased 19% • Operating income was $17.0
billion and increased 31% • Net income was $15.5 billion GAAP and $14.8
billion non-GAAP, and increased 44% and 38%, respectively • Diluted earnings
per share was $2.03 GAAP and $1.95 non-GAAP, and increased 45% and 39%,
respectively • GAAP results include a $620 million net income tax benefit
explained in the Non-GAAP Definition section below "Over a year into the
pandemic, digital adoption curves aren't slowing down. They're
accelerating, and it's just the beginning," said Satya Nadella,
chief executive officer of Microsoft. "We are building the cloud for the
next decade, expanding our addressable market and innovating across every
layer of the tech stack to help our customers be resilient and transform."
"The Microsoft Cloud, with its end-to-end solutions, continues to provide
compelling value to our customers generating $17.7 billion in commercial
cloud revenue, up 33% year over year," said Amy Hood, executive vice
president and chief financial officer of Microsoft.

Extractive Answer 2

Revenue in Intelligent Cloud was $15.1 billion and increased 23% 
(up 20% in constant currency), with the following business highlights:
• Server products and cloud services revenue increased 26% (up 23% in
constant currency) driven by Azure revenue growth of 50% (up 46% in
constant currency) Revenue in More Personal Computing was $13.0 billion
and increased 19% (up 16% in constant currency), with the following
business highlights: • Windows OEM revenue increased 10% • Windows Commercial
products and cloud services revenue increased 10% (up 7% in constant currency)
• Xbox content and services revenue increased 34% (up 32% in constant currency)
• Search advertising revenue excluding traffic acquisition costs increased
17% (up 14% in constant currency) • Surface revenue increased 12% (up 7% in
constant currency) Microsoft returned $10.0 billion to shareholders in the
form of share repurchases and dividends in the third quarter of fiscal year
2021, an increase of 1% compared to the third quarter of fiscal year 2020.
Business Outlook Microsoft will provide forward-looking guidance in connection
with this quarterly earnings announcement on its earnings conference call
and webcast.

Extractive Answer 3

Financial Performance Constant Currency Reconciliation Three Months Ended 
March 31, ($ in millions, except per share amounts) Revenue Operating Income
Net Income Diluted Earnings per Share 2020 As Reported (GAAP) $35021 $12975
$10752 $1.40 2021 As Reported (GAAP) $41706 $17048 $15457 $2.03 2021 As
Adjusted (non-GAAP) $41706 $17048 $14837 $1.95 Percentage Change Y/Y (GAAP)
19% 31% 44% 45% Percentage Change Y/Y (non-GAAP) 19% 31% 38% 39% Constant
Currency Impact $972 $634 $615 $0.08 Percentage Change Y/Y (non-GAAP)
Constant Currency 16% 27% 32% 34% Segment Revenue Constant Currency
Reconciliation Three Months Ended March 31, ($ in millions) Productivity
and Business Processes Intelligent Cloud More Personal Computing 2020 As
Reported $11743 $12281 $10997 2021 As Reported $13552 $15118 $13036
Percentage Change Y/Y 15% 23% 19% Constant Currency Impact $366 $367 $239
Percentage Change Y/Y Constant Currency 12% 20% 16% Selected Product and
Service Revenue Constant Currency Reconciliation Three Months Ended
March 31, 2021 Percentage Change Y/Y (GAAP) Constant Currency Impact
Percentage Change Y/Y Constant Currency Office Commercial products and
cloud services 14% (4)% 10% Office 365 Commercial 22% (3)% 19% Office
Consumer products and cloud services 5% (3)% 2% LinkedIn 25% (2)% 23%
Dynamics products and cloud services 26% (4)% 22% Dynamics 365 45% (5)%
40% Server products and cloud services 26% (3)% 23% Azure 50% (4)% 46%
Windows OEM 10% 0% 10% Windows Commercial products and cloud services 10%
(3)% 7% Xbox content and services 34% (2)% 32% Surface 12% (5)% 7% Search
advertising excluding traffic acquisition costs 17% (3)% 14% About Microsoft
Microsoft (Nasdaq “MSFT” @microsoft) enables digital transformation for
the era of an intelligent cloud and an intelligent edge.

We can see that the final answer generated by Gemini is shorter and more concise than our previous answer generated using extractive segments.

Based on the provided context, LinkedIn's revenue increase in Q1 2021, 
according to Microsoft's earnings report, was 25%. When adjusted for
constant currency, the growth rate was 23%.

Source code covering this pattern can be found here.

Alternative Patterns

  1. Beyond standard workflows discussed above, several advanced techniques can significantly enhance answer generation from PDF documents. One such method is query expansion, which broadens the initial search query with related terms or synonyms. This can be easily enabled with Vertex AI Search by setting the query expansion condition to AUTO. Alternatively, a DIY pre-retrieval step can be designed using an LLM to generate query variants, followed by parallel calls to Vertex AI Search. Query expansion is a critical technique for enhancing the quality of information retrieval in question-answering systems. It not only improves search relevance by generating diverse query variants but also plays a pivotal role in ensuring the representativeness of the top-retrieved documents, which is essential for generating accurate answers.
  2. Keyword boosting is another powerful technique to improve relevance, and it is supported out of the box by Vertex AI Search. By boosting documents that match certain terms or conditions, the relevance of retrieved results can be improved; a minimal boosting sketch is shown after this list.
  3. Additionally, we can improve retrieval performance by using search tuning. This approach is particularly beneficial for industry-specific or company-specific queries not well addressed by general language models. Search tuning is supported out of the box by Vertex AI Search.
  4. Retrieval relevance can also be improved by choosing the appropriate type of pre-processing for PDF documents using different document parsers. Vertex AI Search primarily supports three types: layout parser, OCR parser, and digital parser. For our use case, we used the layout parser. This is recommended for PDF documents if you plan to use Vertex AI Search for RAG. Alternatively, we can employ other sophisticated methodologies, like extracting tables from documents using Document AI in markdown format. We can then convert the PDF, along with the extracted table, to text format and ingest it into Vertex AI Search instead of treating them as PDFs.
  5. Lastly, for enterprises requiring custom embedding-based information retrieval, Vertex AI offers robust vector search capabilities. Vertex AI Vector Search can scale to billions of vectors and identify nearest neighbors within milliseconds. Vector Search (previously known as Matching Engine) sits alongside Vertex AI Search and is also part of Agent Builder. Think of Agent Builder as an umbrella over both of these search options: what we covered extensively in this article was Vertex AI Search, while Vertex AI Vector Search is a vector store with supporting functionalities that you can use to develop fully customizable DIY RAG pipelines. This is a good alternative if you want full control over everything from the chunking strategy to the choice of embedding model and the scoring algorithm for semantic similarity retrieval. Agent Builder also includes standalone APIs, such as the Check Grounding, Grounded Generation, and Ranking APIs. These can be used to build custom RAG pipelines in conjunction with Vector Search.
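
As referenced in point 2 above, here is a minimal sketch of document boosting with the SDK's BoostSpec. The condition shown (boosting one company's documents) is purely illustrative, and boost values range from -1 (demote) to 1 (promote); serving_config, search_query, and content_search_spec are the objects built earlier.

# Illustrative boosting sketch: nudge documents matching a condition up the ranking.
boost_spec = discoveryengine.SearchRequest.BoostSpec(
    condition_boost_specs=[
        discoveryengine.SearchRequest.BoostSpec.ConditionBoostSpec(
            condition='company: ANY("alphabet")',  # illustrative condition
            boost=0.5,                             # between -1 and 1
        )
    ]
)

request = discoveryengine.SearchRequest(
    serving_config=serving_config,
    query=search_query,
    boost_spec=boost_spec,
    content_search_spec=content_search_spec,
)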

Evaluating RAG Pipelines

Scalpel of Search (Image generated by Imagen 2)

Next, let’s explore how to evaluate retrieval performance and generated answer quality. We’ll begin by experimenting with the various metrics suitable for evaluating retrieval systems, followed by metrics for assessing answer quality in RAG pipelines. This evaluation process will help us refine and optimize our RAG systems, as well as understand which approach is more effective.

I. Evaluating Retrieval

a) Precision @ K

Precision @ K is a metric that quantifies the proportion of relevant results in the top K retrieved documents. This measure is particularly important in scenarios where the quality of the initial results takes precedence over the exhaustive retrieval of information. A prime example of this is a web search engine, where users predominantly focus on the first page of search results.

Consider a scenario where one requests the top 5 instructional videos on “how to craft a paper airplane” on YouTube. If three out of these five videos successfully provide the intended instruction, the Precision @5 would be 3 out of 5, or 60%. This metric provides a means to quantify the relevance of the initial search results.

In the context of our use case — single-hop question answering — our objective is to procure one relevant item (document) per query. Thus, it is logical to calculate Precision @ K=1. In this instance, the result is binary: we either locate the desired document at the top of the retrieved items as returned by the Vertex AI Search or we do not.

b) Recall @ K

In the realm of information retrieval, the metric Recall @ K is utilized to measure the proportion of all relevant documents that are retrieved within the top K results from a search. This differs from Precision in that Recall prioritizes the system’s ability to identify all potential relevant documents, constrained to the top K documents considered. This measure is paramount in fields such as legal or academic research, where the omission of key documents could lead to severe consequences.

To illustrate, assume a scenario where there are 10 pertinent videos on the construction of paper airplanes and the interest is in the top 5 results. If 4 out of these 10 videos are present in the top 5 results, then the Recall @ 5 would be 4 out of 10, or 40%. This percentage is indicative of the search system’s effectiveness in capturing all possible relevant results within the top K results.

Considering the unique context of our use case, it is logical to compute Recall @ K=1, given that there is only one relevant document. Analogous to precision, this will yield a binary outcome, which indicates whether the particular item was retrieved or not.

Note: Our use case is single-hop question answering, meaning the answer is always mapped to only one document, unlike multi-hop, where an answer can be derived from multiple documents. Therefore, both recall @ k and precision @ k are computed at k = 1. For multi-hop scenarios, higher values such as k = 3 and k = 5 can be more useful and beneficial.

In our use case, both precision and recall at k=1 for standard search without filters stand at 51% (Pattern I). However, these metrics increase to 90% when we use Named Entity Recognition (NER) and apply filters (Pattern II).
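
Since each query maps to exactly one relevant document, these metrics reduce to a simple hit test over the top result. A minimal sketch, assuming hypothetical per-query lists of retrieved document IDs (all_retrieved) and expected documents from ground_truth.csv (all_expected):

# Precision@1 == Recall@1 in the single-relevant-document setting: a binary hit.
def hit_at_1(retrieved_doc_ids, expected_doc_id):
    return 1.0 if retrieved_doc_ids[:1] == [expected_doc_id] else 0.0

precision_at_1 = sum(
    hit_at_1(r, e) for r, e in zip(all_retrieved, all_expected)
) / len(all_expected)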

c) MRR (Mean Reciprocal Rank)

Mean Reciprocal Rank (MRR) is a statistic that measures the average inverse rank of the first correct answer in a list of responses. MRR is particularly useful in situations where the placement of the first relevant document is more important than the presence of additional relevant documents. This metric is commonly used in question-answering systems and other search contexts where the user is likely to be satisfied by the first correct answer they encounter.

Imagine you’re using a search engine to find the perfect recipe for macadamia cookies. If the very first recipe you click on is exactly what you need, the search engine scores a perfect Mean Reciprocal Rank (MRR) of 1, indicating it provided the best result right away. However, if the ideal recipe isn’t the first one you check, but rather the third one you find appealing, the MRR for that search drops to 1/3.

The formula for MRR is:

MRR = (1 / 𝑄) × Σ (1 / rᵢ), summed over i = 1 … 𝑄

where 𝑄 is the total number of queries, and rᵢ is the rank position of the first relevant answer for the i-th query.
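
A compact sketch of this computation, using the same hypothetical per-query lists as above:

# MRR: average of 1/rank of the expected document across all queries (0 if absent).
def reciprocal_rank(retrieved_doc_ids, expected_doc_id):
    if expected_doc_id in retrieved_doc_ids:
        return 1.0 / (retrieved_doc_ids.index(expected_doc_id) + 1)
    return 0.0

mrr = sum(
    reciprocal_rank(r, e) for r, e in zip(all_retrieved, all_expected)
) / len(all_expected)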

For our use case, with Pattern I (document search without filters), the MRR equals 64%. However, with Pattern II, when we apply the filters, this number increases to 91%.

d) DCG (Discounted Cumulative Gain)

DCG measures the usefulness, or “gain,” of a document based on its position in the result list. The assumption here is that documents appearing earlier in the search results are more relevant to the user than those appearing later. The “discounted” part refers to reducing the relevance score of each document by a logarithmic factor proportional to its position in the result list. This reflects the decreasing likelihood of a user checking each subsequent result as they move down the list. The formula is as follows:

DCG𝑝 = Σ relᵢ / log₂(𝑖 + 1), summed over i = 1 … 𝑝

where 𝑝 is the rank position, relᵢ is the relevance score of the result at position 𝑖, and log₂(𝑖 + 1) is the logarithm base 2 of (𝑖 + 1), which serves as the discount factor. To compute relᵢ, we simply use binary relevance: 1 for a relevant document and 0 for a non-relevant document.

Compared to MRR, DCG offers a more comprehensive view of search quality as it considers the relevance of multiple results rather than just the top one. MRR and DCG offer different perspectives on search performance with MRR focusing on the accuracy of the top result, while DCG takes into account the relevance of the entire result list. By monitoring both metrics, you can achieve a detailed understanding of the effectiveness of your retrieval strategy.

e) NDCG (Normalized Discounted Cumulative Gain)

While DCG is a measure of the total relevance of a ranked list, NDCG is a normalized version of DCG that allows for comparison across different lists. NDCG is generally preferred over DCG because it provides a more standardized and interpretable metric for evaluating ranking systems. The formula for NDCG is:

NDCG𝑝 = DCG𝑝 / IDCG𝑝

where:

  • DCG𝑝 is the DCG value at position 𝑝 using the original formula.
  • IDCG𝑝 is the Ideal DCG, which is the maximum possible DCG value at position 𝑝 if all results were perfectly ordered by relevance. The formula for IDCG𝑝 is:

IDCG𝑝 = Σ relᵢ / log₂(𝑖 + 1), summed over i = 1 … 𝑝 with the relevance scores sorted in descending order.
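
A small sketch of DCG and NDCG with binary relevance, matching the formulas above (positions are 1-indexed, hence log2(i + 2) over a 0-indexed enumerate):

import math

def dcg(relevances):
    # relevances: binary relevance per rank position, e.g. [0, 1, 0, 0, 0]
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))  # IDCG: best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([0, 1, 0, 0, 0]))  # expected document at rank 2 -> ~0.63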

For our use case, we measure NDCG for patterns I and II. For pattern I, the average NDCG stands around 64%. With the application of filters, this increases to around 91%.

f) Average Precision (AP)

AP measures how well a system, like a search engine, ranks relevant items. It considers both how many relevant items are found and how high they are ranked. Let’s say you requested the top 5 instructional videos on how to craft a paper airplane, and here’s the order of the videos you get:

  1. Video A: Perfect instructions (relevant)
  2. Video B: Totally unrelated (not relevant)
  3. Video C: Decent instructions (relevant)
  4. Video D: Another unrelated video (not relevant)
  5. Video E: Great instructions (relevant)

To calculate AP, we’ll look at the precision at each point where a relevant video is found:

  • After video A: 1/1 = 100%
  • After video C: 2/3 = 66.7%
  • After video E: 3/5 = 60%

Now we average these precision values: (100% + 66.7% + 60%) / 3 = 75.6%

So the AP for this search result is 75.6%. This means that, on average, you’re getting relevant results pretty early on in your search. While Precision @ 5 only focuses on the relevance of the first 5 results (60% in this case), AP takes into account the order in which relevant videos are found, rewarding higher positions for relevant videos. This gives a more nuanced picture of how well the search engine is performing in finding relevant results for you. The formula to compute AP is as follows:

AP = (1 / |𝐻|) × Σ 𝑃(𝑖), summed over i ∈ 𝐻

where

  • 𝐻 is the set of positions of relevant documents.
  • |𝐻| is the number of relevant documents.
  • 𝑃(𝑖) is the precision at position 𝑖, i.e., the fraction of relevant documents among the top 𝑖 results. When there is only a single relevant document, 𝑃(𝑖) reduces to 1/i.
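
A short sketch of AP over a list of binary relevance judgments; running it on the paper-airplane example above reproduces the 75.6% figure:

def average_precision(relevances):
    hits, precisions = 0, []
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 0, 1]))  # ~0.756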

g) Mean Average Precision (MAP)

MAP extends the concept of Average Precision (AP) to evaluate multiple searches or queries. While AP measures how well a single search ranks relevant items, MAP averages the AP scores across several searches, giving an overall performance measure for a system over multiple queries. It’s like getting an average grade across multiple tests instead of just one.

While MAP is often favored for its ability to handle complex scenarios with multiple relevant documents per query, it undergoes an interesting simplification when each query has only one relevant result. When dealing with single-relevant-document queries, MAP essentially boils down to the following:

  1. Average Precision becomes Precision: AP for each query is simply the precision achieved at the rank where the single relevant document appears.
  2. Precision equals Reciprocal Rank: Since there’s only one relevant document, precision is the inverse of its rank (e.g., if the document is at rank 3, precision is 1/3). This is exactly the value used in MRR calculations.
  3. MAP Mirrors MRR: MAP, being the average of these AP values across all queries, ends up averaging the reciprocal ranks of the relevant documents. This is precisely what MRR does. Given this overlap, using MAP does not provide additional insight beyond what MRR already offers in our specific setup. MAP is generally more informative in scenarios involving multiple relevant documents per query (multi-hop question answering), where it can provide a nuanced view of how well all relevant documents are retrieved across different ranks.

In our experiments for patterns I and II, MAP essentially equals MRR. It stands at 64% for Pattern I and increases to about 91% for Pattern II.

All supporting code needed to replicate the above evaluations can be found here.

Evaluating Answers

In the next phase, let’s focus on evaluating the answer generation component of our RAG pipelines. Given the availability of ground truth answers for each question, our objective is to assess the quality of the generated answers, ensuring they are both accurate and semantically similar to the expected responses in our test set.

To achieve this, we employ two distinct metrics. The first quantifies semantic similarity by utilizing cosine similarity, a measure derived from sklearn.metrics.pairwise. The second leverages an LLM as a judge: by presenting both the generated and the human-authored answers to the LLM, we can evaluate the degree to which the model’s output aligns with human expectations. This approach allows us to assess the factual accuracy and overall coherence of the generated answers relative to the human-authored references.

a) Semantic similarity

This metric measures the cosine of the angle between vector representations of the generated and expected answers. It refers to the assessment of semantic likeness between the generated response and the ground truth (expected answer). This evaluation, based on the ground truth and the answer, yields values ranging from 0 to 1. Higher scores indicate better alignment between the generated response and the ground truth.

Both the expected and generated answers are encoded using the text embedding model text-embedding-003. This model is available via the Vertex AI API. You can find the code implementation for semantic similarity here.
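
A minimal sketch of this metric is shown below. It assumes the Vertex AI SDK's TextEmbeddingModel and uses the embedding model name mentioned above; swap in whichever embedding model your project has access to.

from sklearn.metrics.pairwise import cosine_similarity
from vertexai.language_models import TextEmbeddingModel

embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-003")  # model named above

def semantic_similarity(expected_answer: str, generated_answer: str) -> float:
    # Embed both answers in one call, then compare with cosine similarity.
    expected_vec, generated_vec = embedding_model.get_embeddings(
        [expected_answer, generated_answer]
    )
    return float(
        cosine_similarity([expected_vec.values], [generated_vec.values])[0][0]
    )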

b) Factual Correctness

To assess the accuracy of the generated answers in our previous RAG pipelines, we leverage Gemini for comparison. The question, expected answer (ground truth), and generated answer are passed to Gemini. A prompt template guides Gemini to categorize the generated answer as “correct” (fully aligned with the expected answer), “partially correct” (containing some accurate information but incomplete or partially incorrect), or “incorrect” (not aligned with the expected answer). This categorization allows for granular evaluation of answer quality, identifying areas where answer generation models may need refinement. Implementation details for this validation process can be found here.

Given the question, expected and generated answers as shown below, compare the
answers and classify them into one of the three classes - `correct`,
`partially correct`, or `incorrect`.

If the answer is partially correct or incorrect, provide the rationale.
The output should be two things - class and rationale as a Python dictionary.
For class, it should be one word ONLY (which is the expected class), and for
rationale, provide the reason succinctly, especially ONLY focusing on numbers
and facts.

DO NOT focus on the semantics between the expected and predicted answers.

IMPORTANT: Compare only numbers and facts.

If the units are different, normalize them before comparing.
E.g., 1 billion = 1000 million.

Question: {question}

Expected Answer: {expected_ans}

Predicted Answer: {predicted_ans}

Compare the predicted answer with the expected answer.
Determine if the predicted answer is factually correct and satisfies the
given question.

Provide your response in the following format:
{format_instructions}

The figure below shows the overall answer accuracy, comparing all four patterns (RAG pipelines) we previously experimented with. To compute the accuracy, we assign a score of 1.0 to fully correct answers, 0.5 to partially correct answers, and 0 to answers classified as incorrect by the LLM factual-correctness judge. The figure reveals that Pattern IV, which leverages filtered search with an external LLM pass to generate answers from extractive answers, outperforms all other approaches, with an accuracy nearing 70%.
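
For clarity, the scoring described above amounts to something like the following sketch, where judgements is a hypothetical list of class labels produced by the Gemini judge:

SCORES = {"correct": 1.0, "partially correct": 0.5, "incorrect": 0.0}

def overall_accuracy(judgements):
    # judgements: e.g. ["correct", "incorrect", "partially correct", ...]
    return sum(SCORES.get(label, 0.0) for label in judgements) / len(judgements)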

We can also break down the distribution of classes across the four different approaches (pipelines) to gain a better understanding of how improvements gradually occur with enhancements.

The box plots below show the distribution of semantic similarity scores across different classes (correct, partially correct, incorrect) for the four different document question-answering RAG pipelines we previously created. The x-axis represents the different classes, while the y-axis signifies the semantic similarity score, ranging from 0 (no similarity) to 1 (perfect similarity). These box plots display the median, quartiles, and range of the semantic similarity scores within each class. Overall, the figure provides a concise and informative visual representation of the performance of various question-answering approaches in terms of semantic similarity.

Distribution of Semantic Similarity Scores by Class

The distribution reveals distinct patterns:

  1. OOB (Pattern I): Shows a wide range of scores across all classes, with a higher concentration of partially correct answers. This suggests that the out-of-the-box (OOB) approach struggles to consistently produce accurate answers on our test set.
  2. OOB + Filters (Pattern II): There’s a noticeable improvement over the OOB approach, particularly in reducing incorrect answers. The distribution leans more towards higher similarity scores, indicating increased accuracy after applying filters. An interesting observation for the incorrect category is that the semantic similarity variance decreases and centers around 0.5 to 0.6 for Pattern II, a narrower spread than in Pattern I, where the scores ranged more widely from 0.5 to 1.
  3. Extractive Segments (Pattern III): Demonstrates a further improvement, with a higher proportion of correct and partially correct answers compared to the previous two approaches. This suggests that extracting relevant segments from the context is a more effective strategy than relying solely on the OOB model or basic filtering.
  4. Extractive Answers (Pattern IV): Achieves the best overall performance, with the highest concentration of correct answers and the fewest incorrect ones. This indicates that extracting complete answers directly from the context leads to the most semantically similar and accurate responses.

So far, we’ve discussed how to evaluate retrieval and answer generation, the two main phases of a RAG pipeline. For retrieval, we focused on evaluating the relevance of the retrieved documents. This could be extended further to evaluate the relevance of the pages or the retrieved context. However, you need to ensure you have the ground truth information for this.

For answer quality, we can also use other open-source alternative frameworks like Ragas, or even leverage the Rapid Evaluation API from Vertex AI, if applicable. The Rapid Evaluation service allows you to evaluate your LLMs, both pointwise and pairwise, across several metrics. You can provide inference-time inputs, LLM responses, and additional parameters, and the service returns metrics specific to the evaluation task.

Conclusion

In this guide, we explored the use of Vertex AI Search in creating enterprise-grade RAG pipelines within the financial domain. We detailed the ingestion and indexing of a financial dataset, and leveraged these indexes to perform searches using various methods. We accessed different context types provided by Vertex AI Search and created four distinct RAG pipelines for comparison. We also examined alternative pipeline configurations.

We then evaluated retrieval performance metrics and answer quality assessment techniques. By comparing results, we gleaned valuable insights into the effectiveness of different approaches.

Our primary finding is that Vertex AI Search offers a comprehensive set of functions for building both standard and fully customizable RAG solutions. This platform significantly streamlines information retrieval and question-answering tasks within any chosen domain. In future posts, we will explore other patterns involving unexplored functionalities of Vertex AI.

As a recommendation, to fully understand the content of this guide, set up the shared code repository, follow the instructions to prepare your working environment, and replicate the experiments and results. This way, you can easily adapt it for your own use cases and potentially extend it!

Thanks for reading the article and your engagement. Your follow and claps mean a lot. If you have any questions about the article or the shared source code, feel free to reach out to me at arunpshankar@google.com or shankar.arunp@gmail.com. You can also find me on https://www.linkedin.com/in/arunprasath-shankar/

I welcome all feedback and suggestions. If you’re interested in large-scale ML, NLP/NLU, and eager to collaborate, I’d be delighted to connect. Moreover, if you’re an individual, startup, or enterprise seeking to comprehend Google Cloud, VertexAI, and the various Generative AI components and their applications in NLP/ML, I’m here to assist. Please feel free to ping me on LinkedIn!
