Agentic RAG With Llama-index | Router Query Engine #01
Tired of the good old RAG (Retrieval Augmented Generation) systems we have extensively covered in my blog posts? Well, I am tired of them. Let’s do something fun to take things to the next level: building your own Agentic RAG system, introducing the idea of agents into a well-defined RAG workflow.
Last year the buzzword was RAG systems; this year things have taken a turn, and it’s all about agents. If you miss the RAG buzzword era, that’s fine, because we can introduce agents into RAG systems as well. The good news is that the result is even better.
In this article, we’ll go over how to implement a basic Agentic RAG application using Llama-index. This is the first in a series of articles I’ll be posting over the upcoming weeks on Agentic RAG architectures.
Basic Retrieval Augmented Generation (RAG) Pipeline
Before we move on, I just want to give a quick refresher on what a traditional RAG architecture looks like and how it works. This knowledge will be useful later on, and also to beginners who don’t yet know how a basic RAG pipeline works.
As the above image of a simple RAG system shows, we work with the following components:
- Documents: The external information you want to augment your LLM with, which is fed into the LLM as context. This could be a PDF, any other text document, or even images for a multimodal LLM.
- Chunks: The larger document is broken down into smaller pieces, typically called chunks and sometimes also called nodes.
- Embeddings: Once we have the smaller chunks, we create vector embeddings for them. When a user query is received, a similarity search is performed and the most similar chunk(s) are retrieved (the retrieval part of RAG). The retrieved chunks are then sent to the LLM alongside the user query, acting as context, and the LLM generates a response from them.
The above explanation is how a typical traditional RAG system works.
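To make the retrieve-then-augment flow concrete, here is a minimal toy sketch in plain Python. The bag-of-words "embedding" and cosine scoring are hypothetical stand-ins for a real embedding model, purely for illustration:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # toy "embedding": bag-of-words counts (real systems use dense vectors)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "LoRA freezes the pretrained weights and injects low-rank matrices.",
    "The Transformer architecture relies on self-attention.",
]
query = "What does LoRA inject into the model?"

# retrieval: pick the chunk most similar to the query
best = max(chunks, key=lambda c: cosine(embed(c), embed(query)))

# augmentation: the retrieved chunk becomes context in the LLM prompt
prompt = f"Context: {best}\n\nQuestion: {query}"
print(prompt)
```

A real pipeline swaps the toy pieces for an embedding model, a vector store, and an LLM call, but the control flow is the same.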
Why Agentic RAG
We have seen the implementation of a simple RAG above. This approach is suitable for simple QA tasks over one or a few documents, but not for complex QA tasks or summarization over larger sets of documents.
This is where agents come into play, to help take the simple RAG implementation to a whole new level. With agentic RAG systems, more complex tasks such as document summarization, complex QA, and a host of other tasks can be carried out much more easily. Agentic RAG also gives you the ability to incorporate tool calling into your RAG system, and these tools can be custom functions that you define yourself.
In this series of articles, we’ll go over the following:
- Router Query Engines: The simplest form of agentic RAG. This gives us the ability to add logic that helps the LLM decide which route to send a specific task down, depending on the task(s) that need to be carried out and the set of tools we made available to the LLM.
- Tool Calling: Here we’ll go over how to add our own custom tools to the agentic RAG architecture. We implement interfaces that let agents select one tool from the set of tools we provide, and let the LLM supply the arguments needed to call it, since these tools are simply Python functions (at least the ones you define yourself).
- Agentic RAG With Multi-step Reasoning Capabilities:
- Agentic RAG With Multi-step Reasoning Capabilities With Multiple Documents
Router Query Engine
This is the simplest form of agentic RAG, in Llama-index at least. In this approach, we simply have a router engine that, with the help of an LLM, determines which tool or query engine to use to address a given user query.
This is the basic implementation of how a router query engine works.
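Conceptually, the router’s job can be sketched in a few lines of plain Python. Here a keyword check is a hypothetical stand-in for the LLM-based selection that the real router query engine performs against the tool descriptions:

```python
# descriptions the router chooses between, mirroring the tools built later
tools = {
    "summary_tool": "Useful for summarization questions about a paper.",
    "vector_tool": "Useful for retrieving specific context from a paper.",
}

def select_tool(query: str) -> str:
    # hypothetical selector: a keyword check standing in for the LLM,
    # which would read the tool descriptions above and pick one
    if "summary" in query.lower() or "summarize" in query.lower():
        return "summary_tool"
    return "vector_tool"

print(select_tool("What is the summary of the document?"))  # summary_tool
print(select_tool("What is the long form of LoRA?"))        # vector_tool
```

The real selector we use later sends the query and tool descriptions to the LLM instead of matching keywords, but the routing decision it returns plays the same role.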
Project Environment Setup
To set up your development environment, create a folder called agentic_rag, and inside of this folder, create another folder called basics. Once done, navigate into the basics folder and initialize a Python Poetry project:
$ poetry init
To get started, make sure you have your OpenAI API key ready; you can get your key from here if you don’t already have it. Once you have your API key ready, add it to your .env file:
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So where is this .env file? Well, I created a development environment setup as follows:
Follow this directory structure and add in your files as shown in the images above.
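For reference, based on the setup steps above, the layout looks roughly like this (the notebook name is an assumption; the datasets folder matches the paths used later):

```
agentic_rag/
└── basics/
    ├── .env
    ├── pyproject.toml
    ├── datasets/
    │   └── lora_paper.pdf
    └── basics.ipynb
```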
Installing Packages
We’ll use Llama-index for this. Let’s install it along with some other libraries we’ll make use of:
$ poetry add python-dotenv ipykernel llama-index nest_asyncio
Downloading Dataset
We’ll need a PDF file to experiment with. You can download this PDF from here. Again, feel free to use any PDF file of your liking.
Loading And Splitting The Document Into Nodes
Now we are ready to get started. Let’s first load in our environment variables using the python-dotenv library we just installed:
import dotenv
%load_ext dotenv
%dotenv
We’ll also use the nest-asyncio library, since Llama-index uses a lot of asyncio functionality in the background:
import nest_asyncio
nest_asyncio.apply()
Now, let’s load in our data:
from llama_index.core import SimpleDirectoryReader
# load lora_paper.pdf documents
documents = SimpleDirectoryReader(input_files=["./datasets/lora_paper.pdf"]).load_data()
Creating Document Chunks
Once we have the data loaded successfully, let’s move ahead and break the document down into chunks with a chunk size of 1024:
from llama_index.core.node_parser import SentenceSplitter
# chunk_size of 1024 is a good default value
splitter = SentenceSplitter(chunk_size=1024)
# Create nodes from documents
nodes = splitter.get_nodes_from_documents(documents)
We can get more info about each of these nodes using:
from llama_index.core.schema import MetadataMode

node_metadata = nodes[1].get_content(metadata_mode=MetadataMode.ALL)
print(node_metadata)
Creating LLM And Embedding Models
We’ll use the OpenAI gpt-3.5-turbo model as the LLM and the text-embedding-ada-002 embedding model to create the embeddings.
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# LLM model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# embedding model
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Creating Indexes
As shown in the images earlier on, we’ll have two main indexes that we’ll use:
- Summary Index: I got this explanation from the official Llama-index docs:
The summary index is a simple data structure where nodes are stored in a sequence. During index construction, the document texts are chunked up, converted to nodes, and stored in a list.
During query time, the summary index iterates through the nodes with some optional filter parameters, and synthesizes an answer from all the nodes.
- Vector Index: This is just a regular index store created from vector embeddings, from which we can perform similarity searches to retrieve the n most similar nodes.
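As a rough illustration of the two access patterns, here is a toy sketch. The word-overlap score is a hypothetical stand-in for embedding similarity, and the real indexes synthesize answers with LLM calls:

```python
# toy sketch of the two access patterns
nodes = [
    "LoRA injects trainable low-rank matrices into each layer",
    "Full fine-tuning updates all model parameters",
    "LoRA greatly reduces the number of trainable parameters",
]

# summary-index style: every node is visited and folded into one answer
summary_context = " ".join(nodes)

def overlap(node: str, query: str) -> int:
    # word-overlap score, a stand-in for embedding similarity
    return len(set(node.lower().split()) & set(query.lower().split()))

# vector-index style: only the top-n most similar nodes are retrieved
query = "how many trainable parameters does lora have"
top_n = sorted(nodes, key=lambda n: overlap(n, query), reverse=True)[:1]
print(top_n)
```

This is why the summary index suits whole-document summarization (it touches every node) while the vector index suits specific lookups (it touches only the most relevant ones).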
We can use the code below to create these two indexes:
from llama_index.core import SummaryIndex, VectorStoreIndex
# summary index
summary_index = SummaryIndex(nodes)
# vector store index
vector_index = VectorStoreIndex(nodes)
Turning Indexes Into Query Engines
Now that we have the indexes created and stored, we’ll move ahead to creating the query engines, which we’ll later convert into query tools that our agents can use.
# summary query engine
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
# vector query engine
vector_query_engine = vector_index.as_query_engine()
In the code above, we have two different query engines. We’ll place each of these under a router query engine, which will then decide which query engine to route to depending on the user query.
We specify the use_async parameter for faster querying; this is one of the reasons we also had to use the nest_asyncio library.
Query Tools
A query tool is simply a query engine with metadata, specifically a description of what the query tool can be used for. This helps the router query engine decide which query engine tool to route to, depending on the query it receives.
from llama_index.core.tools import QueryEngineTool
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to the LoRA paper."
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the LoRA paper."
    ),
)
Router Query Engine
Finally, we can go ahead and create the router query engine. This will enable us to use all the query tools we created from the query engines defined above, specifically the summary_tool and the vector_tool.
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True,
)
LLMSingleSelector: This is a selector that uses the LLM to select a single choice from a list of choices. You can read more about it here.
Testing Out The Router Query Engine
Let’s go ahead and use the following piece of code to test out the router query engine:
response = query_engine.query("What is the summary of the document?")
print(str(response))
Above is the summary of the paper, synthesized over all the context in the LoRA paper we passed to the summarization query engine.
Since we are using the summary index, which stores all nodes in a sequential list, every node is visited and a general summary is synthesized from all of them to produce the final answer.
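The tree_summarize response mode we set earlier builds that final answer hierarchically: groups of node responses are summarized, those summaries are summarized again, and so on until one summary remains. A toy sketch of the folding, with a hypothetical combine step standing in for the LLM summarization calls:

```python
def combine(texts):
    # hypothetical stand-in for an LLM summarization call over a group
    return " | ".join(texts)

def tree_summarize(chunks, fanout=2):
    # repeatedly fold groups of chunks until a single summary remains
    while len(chunks) > 1:
        chunks = [
            combine(chunks[i:i + fanout])
            for i in range(0, len(chunks), fanout)
        ]
    return chunks[0]

print(tree_summarize(["a", "b", "c", "d"]))  # a | b | c | d
```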
You can confirm this by checking the length of the response’s source_nodes attribute, which returns the sources used to generate the summary.
print(len(response.source_nodes))
You will notice that the number 38 is the same as the number of nodes we got after chunking the document. This means all the chunked nodes were used to generate the summary.
Let’s ask another question that does not involve the use of the summary tool.
response = query_engine.query("What is the long form of LoRA?")
print(str(response))
This routes to the vector index tool; the response is nevertheless not very accurate.
Putting It All Together
Now that we understand this basic pipeline, let’s convert it into a pipeline function that we can utilize later.
async def create_router_query_engine(
    document_fp: str,
    verbose: bool = True,
) -> RouterQueryEngine:
    # load documents
    documents = SimpleDirectoryReader(input_files=[document_fp]).load_data()

    # chunk_size of 1024 is a good default value
    splitter = SentenceSplitter(chunk_size=1024)
    # create nodes from documents
    nodes = splitter.get_nodes_from_documents(documents)

    # LLM model
    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    # embedding model
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    # summary index
    summary_index = SummaryIndex(nodes)
    # vector store index
    vector_index = VectorStoreIndex(nodes)

    # summary query engine
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )
    # vector query engine
    vector_query_engine = vector_index.as_query_engine()

    summary_tool = QueryEngineTool.from_defaults(
        query_engine=summary_query_engine,
        description=(
            "Useful for summarization questions related to the LoRA paper."
        ),
    )
    vector_tool = QueryEngineTool.from_defaults(
        query_engine=vector_query_engine,
        description=(
            "Useful for retrieving specific context from the LoRA paper."
        ),
    )

    query_engine = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[
            summary_tool,
            vector_tool,
        ],
        verbose=verbose,
    )

    return query_engine
We can then call this function as so:
query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
response = query_engine.query("What is the summary of the document?")
print(str(response))
Let’s move on ahead and create a utils.py
file and have the following inside of it:
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader
async def create_router_query_engine(
    document_fp: str,
    verbose: bool = True,
) -> RouterQueryEngine:
    # load documents
    documents = SimpleDirectoryReader(input_files=[document_fp]).load_data()

    # chunk_size of 1024 is a good default value
    splitter = SentenceSplitter(chunk_size=1024)
    # create nodes from documents
    nodes = splitter.get_nodes_from_documents(documents)

    # LLM model
    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    # embedding model
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    # summary index
    summary_index = SummaryIndex(nodes)
    # vector store index
    vector_index = VectorStoreIndex(nodes)

    # summary query engine
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )
    # vector query engine
    vector_query_engine = vector_index.as_query_engine()

    summary_tool = QueryEngineTool.from_defaults(
        query_engine=summary_query_engine,
        description=(
            "Useful for summarization questions related to the LoRA paper."
        ),
    )
    vector_tool = QueryEngineTool.from_defaults(
        query_engine=vector_query_engine,
        description=(
            "Useful for retrieving specific context from the LoRA paper."
        ),
    )

    query_engine = RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[
            summary_tool,
            vector_tool,
        ],
        verbose=verbose,
    )

    return query_engine
We can then import and call this function from the utils.py file later on:
from utils import create_router_query_engine
query_engine = await create_router_query_engine("./datasets/lora_paper.pdf")
response = query_engine.query("What is the summary of the document?")
print(str(response))
Conclusion
Congratulations on making it this far! That’s all we’ll cover in this article. In the next article, we’ll go over how to use Tool Calling, aka Function Calling.
Happy coding! And see you next time, the world keeps spinning.