Agentic RAG With Llama-index | Router Query Engine #01

Prince Krampah
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
10 min read · May 14, 2024

Tired of the good old RAG (Retrieval Augmented Generation) systems we have extensively covered in my blog posts? Well, I am tired of them. Let’s do something fun to take things to the next level: building your own Agentic RAG system by introducing agents into a well-defined RAG workflow.

Image By Code With Prince

Last year the buzzword was all about RAG systems; this year things have taken a turn, and it’s all about agents. If you missed the RAG buzzword era, that’s fine, because we can introduce agents into RAG systems as well. The good thing is, it’s even better.

In this article, we’ll go over how to implement a basic Agentic RAG application using Llama-index. This is the first article in a series of articles I’ll be posting in the upcoming weeks on Agentic RAG architectures.

Basic Retrieval Augmented Generation (RAG) Pipeline

Before we move on, I want to give a quick refresher on what a traditional RAG architecture looks like and how it works. This knowledge will be useful later on, and also to beginners who don’t yet know how a basic RAG pipeline works.

Image By Code With Prince

From the above image of a simple RAG system, we work with the following components:

  1. Documents: The external context you want to augment your LLM with, which is fed into the LLM. This could be a PDF, any other text document, or even images for a multimodal LLM.
  2. Chunks: The larger document is broken down into smaller pieces, typically called chunks and sometimes also called nodes.
  3. Embeddings: Once we have the smaller chunks, we create vector embeddings for them. When a user query is received, a similarity search is performed and the most similar chunk(s) are retrieved (the retrieval part of RAG). The retrieved chunks are sent to the LLM alongside the user query, with the retrieved chunk(s) acting as context, and the LLM generates a response from them.

The above explanation is how a typical traditional RAG system works; a minimal code sketch of this pipeline follows below.
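To make the three components above concrete, here is a minimal sketch of a traditional RAG pipeline in Llama-index. This is my own illustration rather than part of the original walkthrough, and it assumes a local data/ folder of documents plus an OpenAI API key in your environment:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Documents: load external context (PDFs, text files, ...)
documents = SimpleDirectoryReader("data").load_data()

# 2. Chunks + 3. Embeddings: the index chunks the documents into nodes
# and embeds them under the hood using the default models
index = VectorStoreIndex.from_documents(documents)

# Retrieval + generation: similar chunks are fetched and passed to the LLM
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(str(response))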

Why Agentic RAG

We have seen the implementation of a simple RAG above. This approach is suitable for simple QA tasks over one or a few documents, but it is not suitable for complex QA tasks or summarization over larger sets of documents.

This is where agents come into play, to help take the simple RAG implementation to a whole new level. With agentic RAG systems, more complex tasks such as document summarization, complex QA, and a host of others can be carried out much more easily. Agentic RAG also gives you the ability to incorporate tool calling into your RAG system, and these tools can be custom functions that you define yourself.

In this series of articles, we’ll go over the following:

  1. Router Query Engines: This is the simplest form of agentic RAG. It gives us the ability to add routing logic that helps the LLM decide which query engine to route a given task to, depending on the task(s) that need to be carried out and the set of tools made available to the LLM.
  2. Tool Calling: Here we’ll go over how to add our own custom tools to the agentic RAG architecture. We’ll implement interfaces for agents to select one tool from a set of tools we provide, and let the LLM supply the arguments needed to call them, since these tools are simply Python functions (at least the ones you define yourself).
  3. Agentic RAG With Multi-step Reasoning Capabilities
  4. Agentic RAG With Multi-step Reasoning Capabilities Over Multiple Documents

Router Query Engine

This is the simplest form of agentic RAG, in Llama-index at least. In this approach we simply have a router engine that, with the help of an LLM, determines which tool or query engine to use to address a given user query.

This is the basic implementation of how a router query engine works.

Image By Code With Prince

Project Environment Setup

To set up your development environment, create a folder called agentic_rag. Inside this folder, create another folder called basics. Once done, navigate into the basics folder and initialize a Python Poetry project:

$ poetry init

To get started, make sure you have your OpenAI API key ready; you can get your key from here if you don’t already have it. Once you have your API key ready, add it to your .env file:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

So where is this .env file? Well, I created a development environment setup as follows:

Image By Code With Prince
Image By Code With Prince

Follow this directory structure and add in your files as shown in the images above.

Installing Packages

We’ll use Llama-index for this. Let’s install it along with some other libraries we’ll make use of:

$ poetry add python-dotenv ipykernel llama-index nest_asyncio

Downloading Dataset

We’ll need a PDF file to experiment with. You can download this PDF from here. Again, feel free to use any PDF file of your liking.

Loading And Splitting The Document Into Nodes

Now we are ready to get started. Let’s first load in our environment variables using the python-dotenv library we just installed:

import dotenv
# load the dotenv IPython extension that ships with python-dotenv
%load_ext dotenv
# read variables (such as OPENAI_API_KEY) from the .env file
%dotenv
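If you are running this as a plain Python script instead of a notebook, the %dotenv magic is not available; a small equivalent using python-dotenv directly looks like this:

from dotenv import load_dotenv

# reads the .env file in the current directory and exports its variables
load_dotenv()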

We’ll also use the nest-asyncio library since Llama-index uses a lot of asyncio functionality in the background:

import nest_asyncio
nest_asyncio.apply()

Now, let’s load in our data:

from llama_index.core import SimpleDirectoryReader

# load lora_paper.pdf documents
documents = SimpleDirectoryReader(input_files=["./datasets/lora_paper.pdf"]).load_data()
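As a quick optional check (my addition, not from the original post), you can confirm how many Document objects were loaded and peek at their metadata; with the default PDF reader, you typically get one Document per page:

# number of loaded Document objects (one per PDF page by default)
print(len(documents))
# file name, page label, and other metadata of the first Document
print(documents[0].metadata)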

Creating Document Chunks

Once we have the data loaded successfully, let’s move ahead and break the document down into chunks with a chunk size of 1024:

from llama_index.core.node_parser import SentenceSplitter

# chunk_size of 1024 is a good default value
splitter = SentenceSplitter(chunk_size=1024)
# Create nodes from documents
nodes = splitter.get_nodes_from_documents(documents)

We can get more info about each of these nodes using:

# "all" includes the node's metadata alongside its text content
node_metadata = nodes[1].get_content(metadata_mode="all")
print(node_metadata)
Image By Code With Prince
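It is also worth printing how many nodes the splitter produced; we will refer back to this count later when we inspect the summary response:

# total number of chunked nodes (38 for the Lora paper at chunk_size=1024)
print(len(nodes))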

Creating LLM And Embedding Models

We’ll use the OpenAI gpt-3.5-turbo model as the LLM and the text-embedding-ada-002 embedding model to create the embeddings.

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# LLM model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# embedding model
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
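Settings acts as a global default, so every index and query engine created afterwards picks up these models. As a quick sanity check (my own addition, not in the original post), you can embed a short string and inspect the vector dimensionality, which is 1536 for text-embedding-ada-002:

# embed a short string with the configured embedding model
vec = Settings.embed_model.get_text_embedding("LoRA fine-tuning")
# 1536 dimensions for text-embedding-ada-002
print(len(vec))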

Creating Indexes

As shown in the images earlier on, we’ll use two main indexes:

  1. Summary Index: I got this explanation from the official Llamaindex docs:

The summary index is a simple data structure where nodes are stored in a sequence. During index construction, the document texts are chunked up, converted to nodes, and stored in a list.

During query time, the summary index iterates through the nodes with some optional filter parameters, and synthesizes an answer from all the nodes.

2. Vector Index: This is a regular index built from vector embeddings of the nodes, over which we can perform a similarity search to retrieve the n most similar nodes.

We can use the code below to create these two indexes:

from llama_index.core import SummaryIndex, VectorStoreIndex

# summary index
summary_index = SummaryIndex(nodes)
# vector store index
vector_index = VectorStoreIndex(nodes)
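Both indexes live in memory by default. If you want to avoid re-embedding the paper on every run, the vector index can optionally be persisted to disk and reloaded later using Llama-index’s standard storage utilities; the sketch below assumes a ./storage directory and is not part of the original walkthrough:

from llama_index.core import StorageContext, load_index_from_storage

# save the vector index (nodes, embeddings, index metadata) to disk
vector_index.storage_context.persist(persist_dir="./storage")

# ... and reload it in a later session without re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
vector_index = load_index_from_storage(storage_context)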

Turning Indexes Into Query Engines

Now that we have the indexes created, we need to create query engines from them, which we’ll then convert into tools, aka query tools, that our agents can use later on.

# summary query engine
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)

# vector query engine
vector_query_engine = vector_index.as_query_engine()

In the case above, we have two different query engines. We’ll place each of these query engines under a router query engine, which will then decide which query engine to route to depending on the user query.

Image By Code With Prince

In the above code, we are specifying the use_async parameter for faster querying; this is also one of the reasons we had to use the nest_asyncio library.
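Before wiring these engines into a router, you can also query each one directly; this is just a quick optional check of my own, not something the router requires:

# query the vector engine directly, bypassing the router
print(str(vector_query_engine.query("What base models were used in the Lora experiments?")))

# query the summary engine directly
print(str(summary_query_engine.query("Give a brief summary of the paper.")))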

Query Tools

A query tool is simply a query engine with metadata attached, specifically a description of what the query engine can be used for. This description helps the router query engine decide which query engine tool to route to, depending on the query it receives.

from llama_index.core.tools import QueryEngineTool


summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to the Lora paper."
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the Lora paper."
    ),
)

Router Query Engine

Finally, we can go ahead and create the router query engine. This will enable us to use all the query tools we created from the query engines defined above, specifically the summary_tool and the vector_tool.

Image By Code With Prince
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector


query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True,
)

LLMSingleSelector: This is a selector that uses the LLM to select a single choice from a list of choices. You can read more about it from here.
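Llama-index ships other selectors as well, for example PydanticSingleSelector, which uses OpenAI function calling to produce the selection, and LLMMultiSelector, which can pick more than one tool. Swapping one in is a one-line change; a small sketch, assuming you keep an OpenAI LLM configured:

from llama_index.core.selectors import PydanticSingleSelector

# same router as before, but selection is done via OpenAI function calling
query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
    verbose=True,
)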

Testing Out The Router Query Engine

Let’s go ahead and use the following piece of code to test out the router query engine:

response = query_engine.query("What is the summary of the document?")
print(str(response))
Image By Code With Prince

Above is the summary of the paper, synthesized over all the context in the Lora paper we passed to the summarization query engine.

Since we are using the summary index, which stores all nodes in a sequential list, all nodes are visited and a general summary is synthesized from all of them to produce the final answer.

You can confirm this by checking the number of source nodes in the response; the source_nodes attribute returns the sources used to generate the summary.

print(len(response.source_nodes))
Image By Code With Prince

You will notice that the number 38 is the same as the number of nodes we got after chunking the document. This means all the chunked nodes were used to generate the summary.

Let’s ask another question that does not involve the use of the summary tool.

response = query_engine.query("What is the long form of Lora?")
print(str(response))
Image By Code With Prince

This query is routed to the vector index tool; the response is nevertheless not very accurate.
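When a response looks off like this, it helps to inspect which chunks the vector engine actually retrieved and their similarity scores. A small sketch of my own using the response’s source_nodes:

# inspect which chunks were retrieved and how similar they were
for node_with_score in response.source_nodes:
    print(node_with_score.score)
    print(node_with_score.node.get_content()[:200])
    print("-" * 40)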

Putting It All Together

Now that we have understood this basic pipeline, let’s move ahead and convert it into a pipeline function that we can utilize later.

async def create_router_query_engine(
    document_fp: str,
    verbose: bool = True,
) -> RouterQueryEngine:
    # load the document (e.g. lora_paper.pdf)
    documents = SimpleDirectoryReader(input_files=[document_fp]).load_data()

    # chunk_size of 1024 is a good default value
    splitter = SentenceSplitter(chunk_size=1024)
    # create nodes from documents
    nodes = splitter.get_nodes_from_documents(documents)

    # LLM model
    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    # embedding model
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    # summary index
    summary_index = SummaryIndex(nodes)
    # vector store index
    vector_index = VectorStoreIndex(nodes)

    # summary query engine
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )

    # vector query engine
    vector_query_engine = vector_index.as_query_engine()

    summary_tool = QueryEngineTool.from_defaults(
        query_engine=summary_query_engine,
        description=(
            "Useful for summarization questions related to the Lora paper."
        ),
    )

    vector_tool = QueryEngineTool.from_defaults(
        query_engine=vector_query_engine,
        description=(
            "Useful for retrieving specific context from the Lora paper."
        ),
    )

    query_engine = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[
            summary_tool,
            vector_tool,
        ],
        verbose=verbose,
    )

    return query_engine

We can then call this function like so:

query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
response = query_engine.query("What is the summary of the document?")
print(str(response))
Image By Code With Prince

Let’s move on ahead, create a utils.py file, and put the following inside of it:

Image By Code With Prince
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader


async def create_router_query_engine(
    document_fp: str,
    verbose: bool = True,
) -> RouterQueryEngine:
    # load the document (e.g. lora_paper.pdf)
    documents = SimpleDirectoryReader(input_files=[document_fp]).load_data()

    # chunk_size of 1024 is a good default value
    splitter = SentenceSplitter(chunk_size=1024)
    # create nodes from documents
    nodes = splitter.get_nodes_from_documents(documents)

    # LLM model
    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    # embedding model
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    # summary index
    summary_index = SummaryIndex(nodes)
    # vector store index
    vector_index = VectorStoreIndex(nodes)

    # summary query engine
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )

    # vector query engine
    vector_query_engine = vector_index.as_query_engine()

    summary_tool = QueryEngineTool.from_defaults(
        query_engine=summary_query_engine,
        description=(
            "Useful for summarization questions related to the Lora paper."
        ),
    )

    vector_tool = QueryEngineTool.from_defaults(
        query_engine=vector_query_engine,
        description=(
            "Useful for retrieving specific context from the Lora paper."
        ),
    )

    query_engine = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[
            summary_tool,
            vector_tool,
        ],
        verbose=verbose,
    )

    return query_engine

We can then import and utilize this function from that file later on:

from utils import create_router_query_engine

query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
response = query_engine.query("What is the summary of the document?")
print(str(response))
Image By Code With Prince
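Note that the top-level await above only works inside a notebook. To run the same pipeline as a plain Python script, you can wrap it in an async main function and drive it with asyncio.run; a small sketch assuming the same utils.py and dataset path:

import asyncio

from utils import create_router_query_engine


async def main() -> None:
    query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
    # use the async query method since we are already inside an event loop
    response = await query_engine.aquery("What is the summary of the document?")
    print(str(response))


if __name__ == "__main__":
    asyncio.run(main())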

Conclusion

Congratulations on making it this far. That’s all we’ll cover in this article. In the next article, we’ll go over Tool Calling, aka Function Calling.

Other platforms where you can reach out to me:

  1. YouTube
  2. Twitter
  3. LinkedIn
  4. Discord

Happy coding! And see you next time, the world keeps spinning.

References

  1. LlamaIndex Summary Index
  2. LlamaIndex Vector Store Index
  3. LlamaIndex LLMSingleSelector
