Building and Evaluating Basic and Advanced RAG Applications with LlamaIndex and Gemini-pro in Google Cloud — Part 2

Ishmeet Mehta
Google Cloud - Community
7 min read · Mar 23, 2024

Let’s look at some advanced RAG retrieval strategies that can help improve the retrieval performance for our RAG pipeline.

The standard RAG pipeline uses the same text chunk for both embedding and synthesis. The issue with this approach is that embedding-based retrieval works best with smaller chunks, whereas the LLM needs more context, and therefore bigger chunks, to synthesize a good answer.

There are two advanced retrieval techniques we are going to explore in this tutorial.

Sentence-window Retrieval

In this method, we retrieve based on smaller sentences to better match the relevant context, and then synthesize based on an expanded context window around each sentence. We first embed individual sentences (or small chunks) and store them in a vector database.

We also store, as metadata, the sentences that occur before and after each chunk. During retrieval, we use similarity search to find the sentences most relevant to the question and then replace each of them with its full surrounding context. This expands the context that is fed to the LLM.
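Under the hood, LlamaIndex implements this pattern with a SentenceWindowNodeParser, which stores each sentence as its own node and keeps the surrounding sentences in the node's metadata. A minimal sketch (window_size=3 is simply the library default; the full working setup follows in the steps below):

from llama_index.node_parser import SentenceWindowNodeParser

# Each sentence becomes a node; the 3 sentences on either side are
# stored in the node's metadata under the "window" key.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

At query time, a MetadataReplacementPostProcessor swaps each retrieved sentence for its stored window before the text is sent to the LLM.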

Let's check how to set this up in our Colab notebook.

This notebook assumes you have gone through the prerequisites mentioned in Part 1 for this setup.

Step 1. Import the Gemini Pro model

from llama_index.llms import Gemini

llm = Gemini(model="models/gemini-pro", temperature=0.1)

Step 2. Now we load the same document we used in Part 1. We ingest the PDF, chunk it into smaller pieces, create embeddings with our embedding model, and index them using the sentence window index.

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

Next, we merge all 41 pages of the PDF into a single document to improve text-splitting accuracy when using the more advanced retrieval methods.

from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

Step 3. We build a sentence window index over the document. We also define the LLM and the embedding model to use.

Note: We are using a Hugging Face embedding model (BAAI/bge-small-en-v1.5).

We are using helper functions from the utils file in the repository to build the sentence window index.

from utils import build_sentence_window_index

sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index",
)

Step 4. We get a query engine from the sentence window index.

from utils import get_sentence_window_query_engine

sentence_window_engine = get_sentence_window_query_engine(sentence_index)
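The helpers in the utils file are not reproduced in this post. As a rough sketch of what they plausibly wrap (the _sketch function names are mine, and the repository implementation may differ), they combine a SentenceWindowNodeParser for indexing with a MetadataReplacementPostProcessor plus re-ranker for querying:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

def build_sentence_window_index_sketch(document, llm, embed_model, save_dir):
    # Parse the document into per-sentence nodes, each carrying a window
    # of surrounding sentences in its metadata.
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    ctx = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model, node_parser=node_parser
    )
    index = VectorStoreIndex.from_documents([document], service_context=ctx)
    index.storage_context.persist(persist_dir=save_dir)
    return index

def get_sentence_window_query_engine_sketch(index, similarity_top_k=6, rerank_top_n=2):
    # Replace each retrieved sentence with its stored window, then re-rank.
    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    rerank = SentenceTransformerRerank(top_n=rerank_top_n, model="BAAI/bge-reranker-base")
    return index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )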

Step 5. Now let’s run a basic query against our model to check if the index works as expected.

window_response = sentence_window_engine.query(
    "how do I get started on a personal project in AI?"
)
print(str(window_response))

Step 6. Now let's benchmark our results with TruLens, as we did with the basic RAG in Part 1. We import the prebuilt TruLens recorder for the sentence window retriever.

from utils import get_prebuilt_trulens_recorder

tru_recorder_sentence_window = get_prebuilt_trulens_recorder(
    sentence_window_engine,
    app_id="Sentence Window Query Engine",
)

Step 7. Now benchmark against the same set of evaluation questions used in Part 1. We run the sentence window retriever on these questions and compare its performance on the RAG triad.
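Here, eval_questions is the same list of evaluation questions assembled in Part 1. If you saved those questions to a text file (the eval_questions.txt filename below is just an assumption), you can reload them like this:

# Assumes the Part 1 questions were saved one per line; adjust the
# filename to match your setup.
eval_questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        eval_questions.append(line.strip())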

with tru_recorder_sentence_window as recording:
    for question in eval_questions:
        response = sentence_window_engine.query(question)

Step 8. Query the TruLens leaderboard. You will see results for both applications: the Direct Query Engine (basic RAG) and the Sentence Window Query Engine (advanced RAG option 1).

tru.get_leaderboard(app_ids=[])
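Beyond the leaderboard, TruLens can also show per-record feedback scores and serve a local dashboard, which is useful for digging into individual queries:

# Optional: inspect individual records and their feedback scores,
# or launch the TruLens dashboard in the browser.
records, feedback = tru.get_records_and_feedback(app_ids=[])
print(records.head())

tru.run_dashboard()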

Auto-Merging Retrieval

Another issue with the naive approach is that it retrieves fragmented context chunks to put into the LLM context window, and the fragmentation gets worse with smaller chunk sizes. For instance, you might get back two or more retrieved chunks from roughly the same section, but there are no guarantees on the ordering of these chunks.

This can potentially impact the LLM’s ability to synthesize the information over this retrieved context within its context window.

This is how auto-merging retrieval solves this problem:

  1. It first defines a hierarchy of smaller chunks linked to parent chunks.
  2. If the set of retrieved smaller chunks linking to a parent chunk exceeds some threshold, it “merges” them into the bigger parent chunk. So overall we retrieve a larger parent chunk, giving the LLM a more coherent context (see the simplified sketch below).
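As a simplified illustration of that merge decision (modeled on the simple_ratio_thresh parameter of LlamaIndex's AutoMergingRetriever, which defaults to 0.5; this is not the library's actual code):

# Merge into the parent chunk when enough of its children were retrieved.
def should_merge(num_retrieved_children, num_total_children, ratio_threshold=0.5):
    return (num_retrieved_children / num_total_children) > ratio_threshold

# e.g. 3 of 4 children retrieved -> replace them with the parent chunk
print(should_merge(3, 4))  # True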

Let’s check how to set this up in our Colab notebook.

This notebook assumes you have gone through the prerequisites mentioned in Part 1 for this setup.

We set up our notebook for auto-merging retrieval and later compare its performance with TruLens.

Step 1. As in the other notebooks, we import the Gemini model.

import warnings
warnings.filterwarnings('ignore')
from llama_index.llms import Gemini

llm = Gemini(model="models/gemini-pro", temperature=0.1)

Step 2. Now we load the same document we used in Part 1. We ingest the PDF, chunk it into smaller pieces, create embeddings with our embedding model, and index them using the auto-merging retrieval index.

Note: As mentioned in Part 1, you can try this example with your own PDF by loading it in the step below.

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

Next, we merge all 41 pages of the PDF into a single document to improve text-splitting accuracy when using the more advanced retrieval methods.

from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

Now we can define our auto-merging retriever in a few steps below.

Step 3. Define a hierarchical node parser

To use the auto-merging retriever, we need to parse our nodes hierarchically (parent and child). Nodes are parsed into decreasing sizes and contain relationships to their parent nodes.

We create a node_parser with three chunk sizes. You can change the chunk sizes to whatever you like; here the chunk size decreases by a factor of 4 at each level.

from llama_index.node_parser import HierarchicalNodeParser

# create the hierarchical node parser w/ default settings
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)

Step 4. Get all nodes from the document. This returns all leaf, intermediate, and parent nodes; there will be information overlap between them.

nodes = node_parser.get_nodes_from_documents([document])

Step 5. If we are only interested in the leaf nodes, we can retrieve them using the get_leaf_nodes function.

from llama_index.node_parser import get_leaf_nodes

leaf_nodes = get_leaf_nodes(nodes)
print(leaf_nodes[30].text)

Step 6. You can explore the relationship between parent and leaf nodes by printing out a parent node. A single parent node is 512 tokens (covering 4 leaf nodes), compared to a leaf node of 128 tokens.

nodes_by_id = {node.node_id: node for node in nodes}

parent_node = nodes_by_id[leaf_nodes[30].parent_node.node_id]
print(parent_node.text)

Step 7. After defining our node hierarchy, we can construct our auto-merging retrieval index.

We are using the Gemini Pro model as the LLM and the Hugging Face bge-small model for embeddings. We wrap all of this in a service context object that contains the LLM, the embedding model, and the hierarchical node parser.

from llama_index import ServiceContext
from llama_index.llms import Gemini


llm = Gemini(model="gemini-pro", temperature=0.1)

auto_merging_context = ServiceContext.from_defaults(
llm=llm,
embed_model="local:BAAI/bge-small-en-v1.5",
node_parser=node_parser,
)

Step 8. Construct the index for the auto-merging retriever.

We construct the vector index specifically on the leaf nodes of our document. All other nodes are stored in a storage context object in the (in-memory) docstore and retrieved dynamically as needed by index queries later.

During retrieval we initially fetch the top K leaf nodes, which are the ones we embed with the embedding model.

When constructing the VectorStoreIndex, we pass in both a storage_context and a service_context.

The storage_context holds the node information (parent and intermediate nodes), and the service_context knows which LLM, embedding model, and parser to use to build the index on the leaf nodes. We persist this index to disk.

from llama_index import VectorStoreIndex, StorageContext
from llama_index import set_global_service_context

set_global_service_context(auto_merging_context)

storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

automerging_index = VectorStoreIndex(
    leaf_nodes,
    storage_context=storage_context,
    service_context=auto_merging_context,
)

automerging_index.storage_context.persist(persist_dir="./merging_index")
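Because the index is persisted, a later session can reload it from disk instead of rebuilding and re-embedding everything. A minimal sketch, assuming the same service context is in scope:

from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./merging_index")
automerging_index = load_index_from_storage(
    storage_context, service_context=auto_merging_context
)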

Step 9. The last step is to set up the retriever and run the query engine.

Note: If the majority of a parent's child nodes are retrieved for a query, they are swapped out for the parent node.

For auto-merging retrieval to work well, we set a large top K for the leaf nodes (128 tokens each). We apply the re-ranker after the merging has taken place to reduce token usage.

For example, we might retrieve the top 12 nodes, merge them down to 10, and then re-rank to a top 6.

The top K for the retriever may seem large, but remember that a leaf node is only 128 tokens and a parent is 512 tokens. We combine the auto-merging retriever and the rerank module into an auto_merging engine to run our queries.

from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.llms import Gemini


automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=12
)

# Wrap the base retriever so retrieved child nodes get merged into their parents
retriever = AutoMergingRetriever(
    automerging_retriever,
    automerging_index.storage_context,
    verbose=True,
)

rerank = SentenceTransformerRerank(top_n=6, model="BAAI/bge-reranker-base")

# Build the query engine on top of the auto-merging retriever and the re-ranker
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank], verbose=True
)

Step 10. Let's test an eval question with this query engine.

auto_merging_response = auto_merging_engine.query(
    "How do I build a portfolio of AI projects?"
)
print(str(auto_merging_response))

Step 11. Now let's import the TruLens recorder from the utils module, as shown in Part 1.


from utils import get_prebuilt_trulens_recorder

tru_recorder_automerging = get_prebuilt_trulens_recorder(
    auto_merging_engine,
    app_id="Automerging Query Engine",
)

Step 12. Let's benchmark against the same set of evaluation questions as in Part 1. We run the auto-merging retriever on these questions and compare its performance on the RAG triad.

for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = auto_merging_engine.query(question)

Step 13. When you query the TruLens leaderboard, you will see results for all three applications: the Direct Query Engine (basic RAG), the Sentence Window Query Engine (advanced RAG option 1), and the Automerging Query Engine (advanced RAG option 2).

tru.get_leaderboard(app_ids=[])
