RAG IV: Agentic RAG with LlamaIndex

Router Query Engine, Tool Calling, Agent Reasoning Loop, Multi-Document Agent.

Sulaiman Shamasna
29 min read · Jun 12, 2024

Note: this article is built on top of the DeepLearning.AI short course Building Agentic RAG with LlamaIndex.

Note: The code and supporting material can be found on my GitHub repo, along with a demo that uses this technique.

Agentic RAG is a framework for building research agents capable of reasoning and decision-making over your data. For instance, suppose you have a set of research papers on a specific topic, and you want to pull out the parts relevant to a question and get a synthesis of what the papers say. That task is complex enough to require multiple processing steps, e.g., identifying the theme of one paper, then retrieving additional information about that theme from other papers.

In comparison, the standard RAG pipeline, popular as it is, is mostly good for simpler questions over a small set of documents: it retrieves some context, sticks it into the prompt, and then calls the LLM a single time to get a response.

This article introduces agentic RAG, which takes the idea of chatting over your data to the next level and shows you how to build an autonomous research agent. You’ll learn a progression of reasoning ingredients for building a full agent. First, routing: decision-making is added to route requests to multiple tools. Second, tool use: an interface is created for the agent to select a tool and generate the right arguments for it. Finally, multi-step reasoning with tool use: an LLM performs multi-step reasoning with a range of tools while retaining memory throughout the process.

You’ll learn how to effectively interact with an agent and use its capabilities for detailed control and oversight. This not only lets you create a higher-level research assistant over your RAG pipeline, but also gives you more effective ways to guide its actions. Specifically, to ensure debuggability, we’ll look at how to step through what your agent is doing and how to use that to improve it.

Router Query Engine

In this section, you’ll build a router over a single document that can handle both question answering as well as summarization. The following is a step-by-step approach supported with an illustrative code.

  • Set up the environment, load the documents, and specify the model
# Working Environment Setup

from helper import get_openai_api_key

OPENAI_API_KEY = get_openai_api_key()

import nest_asyncio

# make async play nice with Jupyter notebooks
nest_asyncio.apply()

# Load Data/ Documents
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=["metagpt.pdf"]).load_data()

# Define LLM and Embedding Model

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)  # Split documents into nodes.

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
  • Define Summary index and Vector Index over the same Data

Now we’re ready to start building some indexes. Here, we define two indexes over the nodes: a summary index and a vector index. As a refresher, think of an index as a set of metadata over our data. You can query different indexes, and each will have different retrieval behavior.

Vector Index vs. Summary Index

A vector index indexes nodes via text embeddings and is a core abstraction in LlamaIndex for building any sort of RAG system. Querying a vector index returns the most similar nodes by embedding similarity. A summary index, on the other hand, is a very simple index: querying it returns all the nodes currently in the index, so the result doesn’t depend on the user query.

from llama_index.core import SummaryIndex, VectorStoreIndex

summary_index = SummaryIndex(nodes)
vector_index = VectorStoreIndex(nodes)
  • Define Query Engine and set Metadata

These indexes will now be turned into query engines and then query tools. Each query engine represents an overall query interface over the data stored in its index and combines retrieval with LLM synthesis. Each query engine is good for a certain type of question, which is a great use case for a router that can route dynamically between these different query engines.

A query tool is just the query engine with metadata, specifically a description of what types of questions the tool can answer.

summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
vector_query_engine = vector_index.as_query_engine()

We can see that a query engine is derived from each of the indexes. For the summary query engine, we set use_async=True to enforce faster query generation by leveraging async capabilities. Now, a query tool is just a query engine with metadata, specifically a description of the types of questions the tool can answer. We’ll define a query tool for both the summary and vector query engines.

from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to MetaGPT"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from the MetaGPT paper."
    ),
)

From the code above, the summary tool’s description marks it as useful for summarization questions related to MetaGPT, and the vector tool’s description marks it as useful for retrieving specific context from the MetaGPT paper.

  • Define Router Query Engine — Selectors

Now that we have our query engines and tools, we’re ready to define our router. LlamaIndex provides several different types of selectors to enable you to build a router, and each of these selectors has distinct attributes. The LLM selector is one option: it prompts an LLM to output JSON, which is then parsed, and the corresponding index is queried.

Another option is to use Pydantic selectors. Instead of directly prompting the LLM with text, we use the function calling APIs supported by models like OpenAI to produce Pydantic selection objects, rather than parsing raw JSON.
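For illustration, here is a minimal, hedged sketch of that second option. It assumes PydanticSingleSelector is importable from llama_index.core.selectors in your installed version (it typically relies on the OpenAI function-calling program integration); the main example below sticks with the LLM selector.

# Hypothetical variant: route with a Pydantic (function-calling) selector
# instead of the text-based LLM selector used in the main example below.
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector

pydantic_router = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
    verbose=True,
)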

from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector


query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
    verbose=True
)

response = query_engine.query("What is the summary of the document?")
print(str(response))

'''
OUTPUT:
Selecting query engine 0: Useful for summarization questions related to MetaGPT.
The document introduces MetaGPT, a meta-programming framework that enhances
the problem-solving capabilities of multi-agent systems using Large Language
Models (LLMs). It models a group of agents as a simulated software company,
emphasizing role specialization, workflow management, and efficient sharing
mechanisms. MetaGPT incorporates Standardized Operating Procedures (SOPs)
to streamline collaboration, improve code generation quality, and achieve
top-tier performance in evaluations. The framework is utilized in software
development processes, starting from user input to the creation of functional
applications like the "Drawing App." The document also discusses the
performance of GPT models in benchmarks, ethical concerns related to MetaGPT,
and the potential benefits of natural language programming.
'''
response = query_engine.query(
    "How do agents share information with other agents?"
)

print(str(response))

'''
OUTPUT:
Selecting query engine 1: This choice is more relevant as it focuses
on retrieving specific context from the MetaGPT paper, which may provide
insights on how agents share information with other agents..

Agents share information with other agents by utilizing a shared message pool.
This shared message pool allows all agents to exchange messages directly.
Agents publish their structured messages in the pool and can also access
messages from other entities transparently. This system enables any agent
to retrieve required information directly from the shared pool, eliminating
the need to inquire about other agents and wait for their responses, thus
enhancing communication efficiency.
'''
  • Put it all Together

To put everything together, all the previous code can be consolidated into a single helper function (check the GitHub repo) that takes in a file path and builds a router query engine with both vector search and summarization over it.
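The repo’s implementation isn’t reproduced in this article, but a minimal sketch of what such a helper could look like, assembled purely from the steps shown above (so an approximation, not the repo’s exact code), might be:

# Hypothetical sketch of a get_router_query_engine helper, built from the steps above.
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector


def get_router_query_engine(file_path: str):
    """Build a router query engine (vector search + summarization) over one file."""
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

    summary_index = SummaryIndex(nodes)
    vector_index = VectorStoreIndex(nodes)

    summary_tool = QueryEngineTool.from_defaults(
        query_engine=summary_index.as_query_engine(
            response_mode="tree_summarize", use_async=True
        ),
        description="Useful for summarization questions related to the document.",
    )
    vector_tool = QueryEngineTool.from_defaults(
        query_engine=vector_index.as_query_engine(),
        description="Useful for retrieving specific context from the document.",
    )

    return RouterQueryEngine(
        selector=LLMSingleSelector.from_defaults(),
        query_engine_tools=[summary_tool, vector_tool],
        verbose=True,
    )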

from utils import get_router_query_engine

query_engine = get_router_query_engine("metagpt.pdf")
response = query_engine.query("Tell me about the ablation study results?")

print(str(response))

'''
OUTPUT:
Selecting query engine 1: The ablation study results are specific context
from the MetaGPT paper, making choice 2 the most relevant..

The ablation study results provide insights into the impact of removing
certain components or features from a system or model. This analysis helps
in understanding the contribution and significance of individual elements
towards the overall performance or functionality of the system.

'''

Tool Calling

In a basic RAG pipeline, LLMs are only used for synthesis. The previous section showed how to make a decision by picking between different pipelines. That, however, is a simplified form of tool calling.

  • This section shows how to use an LLM not only to pick a function to execute, but also to infer the arguments to pass to that function.
  • Tool calling enables LLMs to interact with external environments through a dynamic interface: it not only helps choose the appropriate tool but also infers the necessary arguments for execution.
  • Tool calling adds a layer of query understanding on top of a RAG pipeline, enabling users to ask complex queries and get back more precise results.
  • In other words, the LLM learns how to use the vector database, not just consume its output.

# 1. Setup
from helper import get_openai_api_key
OPENAI_API_KEY = get_openai_api_key()

import nest_asyncio
nest_asyncio.apply()

# 2. Define a Simple Tool (to show how a tool calling works)
from llama_index.core.tools import FunctionTool

def add(x: int, y: int) -> int:
    """Adds two integers together."""
    return x + y

def mystery(x: int, y: int) -> int:
    """Mystery function that operates on top of two numbers."""
    return (x + y) * (x + y)

add_tool = FunctionTool.from_defaults(fn=add)
mystery_tool = FunctionTool.from_defaults(fn=mystery)

# 3. Call and define the model
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
response = llm.predict_and_call(
    [add_tool, mystery_tool],
    "Tell me the output of the mystery function on 2 and 9",
    verbose=True
)
print(str(response))

"""
OUTPUT

=== Calling Function ===
Calling function: mystery with args: {"x": 2, "y": 9}
=== Function Output ===
121
121
"""

The FunctionTool wraps any Python function you feed it. Note that the functions above, add and mystery, have type annotations for the x and y variables as well as a docstring. This is not just for stylistic purposes; it matters because these are used as part of the prompt for the LLM.

Note that the example above is an expanded version of the router: not only does the LLM pick the tool, it also decides what parameters to give to the tool. This key concept will be used next to define a slightly more sophisticated agentic layer on top of vector search, i.e., not only can the LLM choose vector search, it can also infer metadata filters, a structured list of tags that helps return a more precise set of search results.

  • Define an Auto-Retrieval Tool — Load Data
# 4. Load Data/ Documents

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_files=["metagpt.pdf"]).load_data()

# 5. Split into chunks
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

print(nodes[0].get_content(metadata_mode="all")) # Look at the very first chunk

"""
OUTPUT

page_label: 1
file_name: metagpt.pdf
file_path: metagpt.pdf
file_type: application/pdf
file_size: 16911937
creation_date: 2024-05-22
last_modified_date: 2024-04-23

Preprint
METAGPT: M ETA PROGRAMMING FOR A
MULTI -AGENT COLLABORATIVE FRAMEWORK
Sirui Hong1∗, Mingchen Zhuge2∗, Jonathan Chen1, Xiawu Zheng3, Yuheng Cheng4,
Ceyao Zhang4,Jinlin Wang1,Zili Wang ,Steven Ka Shing Yau5,Zijuan Lin4,
Liyang Zhou6,Chenyu Ran1,Lingfeng Xiao1,7,Chenglin Wu1†,J¨urgen Schmidhuber2,8
1DeepWisdom,2AI Initiative, King Abdullah University of Science and Technology,
3Xiamen University,4The Chinese University of Hong Kong, Shenzhen,
5Nanjing University,6University of Pennsylvania,
7University of California, Berkeley,8The Swiss AI Lab IDSIA/USI/SUPSI
ABSTRACT
...
†Chenglin Wu (alexanderwu@fuzhi.ai) is the corresponding author, affiliated with DeepWisdom.
1
"""

# 6. Define a vector score index (to build a RAG indexing pipeline over the nodes, and add embedding to each node, and get back a query engine.)
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine(similarity_top_k=2)

# 7. Query the RAG pipeline via metadata filters
from llama_index.core.vector_stores import MetadataFilters

query_engine = vector_index.as_query_engine(
    similarity_top_k=2,
    filters=MetadataFilters.from_dicts(
        [
            {"key": "page_label", "value": "2"}
        ]
    )
)

response = query_engine.query(
    "What are some high-level results of MetaGPT?",
)

print(str(response))

"""
OUTPUT

Some high-level results of MetaGPT include achieving a new state-of-the-art
in code generation benchmarks with 85.9% and 87.7% in Pass@1, outperforming
other popular frameworks like AutoGPT, LangChain, AgentVerse, and ChatDev.
Additionally, MetaGPT demonstrates robustness and efficiency by achieving
a 100% task completion rate in experimental evaluations, highlighting its
effectiveness in handling higher levels of software complexity and offering
extensive functionality.
"""

# 8. Print the metadata attached to the source nodes.
for n in response.source_nodes:
    print(n.metadata)

"""
OUTPUT

{'page_label': '2', 'file_name': 'metagpt.pdf', 'file_path': 'metagpt.pdf',
'file_type': 'application/pdf', 'file_size': 16911937,
'creation_date': '2024-05-22', 'last_modified_date': '2024-04-23'}
"""
  • Define the Auto-Retrieval Tool / Enhancing Data Retrieval

In this subsection, we wrap the overall retrieval tool into a function. This function enables more precise retrieval by accepting a query string and optional metadata filters, e.g., a page number. The LLM can intelligently infer relevant metadata filters (e.g., page numbers) based on the user’s query, instead of having the user manually specify them. Note that one can define different types of metadata filters, like section IDs, headers, footers, etc.

from typing import List
from llama_index.core.vector_stores import FilterCondition


def vector_query(
    query: str,
    page_numbers: List[str]
) -> str:
    """Perform a vector search over an index.

    query (str): the string query to be embedded.
    page_numbers (List[str]): Filter by set of pages. Leave BLANK if we want to perform
        a vector search over all pages. Otherwise, filter by the set of specified pages.

    """

    metadata_dicts = [
        {"key": "page_label", "value": p} for p in page_numbers
    ]

    query_engine = vector_index.as_query_engine(
        similarity_top_k=2,
        filters=MetadataFilters.from_dicts(
            metadata_dicts,
            condition=FilterCondition.OR
        )
    )
    response = query_engine.query(query)
    return response

vector_query_tool = FunctionTool.from_defaults(
    name="vector_tool",
    fn=vector_query
)

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
response = llm.predict_and_call(
    [vector_query_tool],
    "What are the high-level results of MetaGPT as described on page 2?",
    verbose=True
)

"""
OUTPUT

=== Calling Function ===
Calling function: vector_tool with args: {"query": "high-level results of MetaGPT", "page_numbers": ["2"]}
=== Function Output ===
MetaGPT achieves a new state-of-the-art (SoTA) in code generation benchmarks with 85.9% and 87.7% in Pass@1. It stands out in handling higher levels of software complexity and offering extensive functionality, demonstrating robustness and efficiency in task completion.
"""
# Print the metadata attached to the source nodes.
for n in response.source_nodes:
    print(n.metadata)

"""
OUTPUT

{'page_label': '2', 'file_name': 'metagpt.pdf', 'file_path': 'metagpt.pdf',
'file_type': 'application/pdf', 'file_size': 16911937,
'creation_date': '2024-05-22', 'last_modified_date': '2024-04-23'}

"""
  • Add some other Tools
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool

summary_index = SummaryIndex(nodes)
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
summary_tool = QueryEngineTool.from_defaults(
    name="summary_tool",
    query_engine=summary_query_engine,
    description=(
        "Useful if you want to get a summary of MetaGPT"
    ),
)

response = llm.predict_and_call(
    [vector_query_tool, summary_tool],
    "What are the MetaGPT comparisons with ChatDev described on page 8?",
    verbose=True
)

"""
OUTPUT
=== Calling Function ===
Calling function: vector_tool with args: {"query": "MetaGPT comparisons with ChatDev", "page_numbers": ["8"]}
=== Function Output ===
MetaGPT outperforms ChatDev on the challenging SoftwareDev dataset in nearly all metrics. For example, MetaGPT achieves a higher score in executability, takes less time for software generation, uses more tokens but requires fewer tokens to generate one line of code compared to ChatDev. Additionally, MetaGPT demonstrates better code statistic and human revision cost performance when compared to ChatDev.
"""
for n in response.source_nodes:
    print(n.metadata)

"""
OUTPUT

{'page_label': '8', 'file_name': 'metagpt.pdf', 'file_path': 'metagpt.pdf',
'file_type': 'application/pdf', 'file_size': 16911937,
'creation_date': '2024-05-22', 'last_modified_date': '2024-04-23'}
"""

response = llm.predict_and_call(
    [vector_query_tool, summary_tool],
    "What is a summary of the paper?",
    verbose=True
)

"""
OUTPUT

=== Calling Function ===
Calling function: summary_tool with args: {"input": "Please provide a summary of the paper."}
=== Function Output ===
The paper introduces MetaGPT, a meta-programming framework that enhances multi-agent collaboration in software engineering tasks based on Large Language Models (LLMs). MetaGPT incorporates Standard Operating Procedures (SOPs) to streamline workflows, assign specific roles to agents, and improve task decomposition. It models a group of agents as a simulated software company, emphasizing role specialization, workflow management, and efficient communication mechanisms. The framework achieves state-of-the-art performance in code generation benchmarks and offers a robust platform for developing LLM-based multi-agent systems. Additionally, MetaGPT enables collaborative software development through active teamwork, recursive self-improvement, and a self-referential mechanism that modifies prompts based on project feedback. The system leverages LLMs to enhance prompts and performance on downstream tasks, showcasing superior performance in generating functional applications and high-quality code. The paper also discusses productivity ratios, challenges addressed by MetaGPT, limitations, ethical concerns, and the sensitivity of different GPT models to prompts and post-processing.

"""

Building an Agent Reasoning Loop

So far, queries have been done in a single forward pass: given the query, call the right tool with the right parameters, and get back the response. This, however, is still quite limiting. What if the user asks a complex question consisting of multiple steps, or a vague question that needs clarification?

This subsection shows how to define a complete agent reasoning loop. Instead of calling a tool in a single-shot setting, the agent is able to reason over tools across multiple steps. We’ll use the function calling agent, an agent that natively integrates with the function-calling capabilities of LLMs.

In LlamaIndex, an agent consists of two main components: an AgentWorker and an AgentRunner. Think of the AgentWorker as responsible for executing the next step of a given agent, while the AgentRunner is the overall task dispatcher, responsible for creating tasks, orchestrating runs of the agent worker on top of a given task, and returning the final response to the user.

# Setup
from helper import get_openai_api_key
OPENAI_API_KEY = get_openai_api_key()

import nest_asyncio
nest_asyncio.apply()


# Setup the Query Tools
from utils import get_doc_tools

vector_tool, summary_tool = get_doc_tools("metagpt.pdf", "metagpt")

# Setup Function Calling Agent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    [vector_tool, summary_tool],
    llm=llm,
    verbose=True
)
agent = AgentRunner(agent_worker)

response = agent.query(
    "Tell me about the agent roles in MetaGPT, "
    "and then how they communicate with each other."
)

print(response.source_nodes[0].get_content(metadata_mode="all"))
"""
OUTPUT

page_label: 9
file_name: metagpt.pdf
file_path: metagpt.pdf
file_type: application/pdf
file_size: 16911937
creation_date: 2024-05-25
last_modified_date: 2024-04-23

Preprint
Table 2: Comparison of capabilities for MetaGPT and other approaches. ‘!’ indicates the
presence of a specific feature in the corresponding framework, ‘ %’ its absence.
...
"""

Full Agent Reasoning Loop

Calling agent.query allows you to query the agent in a one-off manner but doesn’t preserve state. In the following, conversation history will be maintained over time.

The agent is capable of maintaining chat history in a conversational memory buffer. The memory module can be customized, but by default it’s a flat list of items in a rolling buffer whose size depends on the context window of the LLM. Therefore, when the agent decides to use a tool, it uses not only the current chat but also the previous conversation history to take the next step or perform the next action.
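As a minimal, hedged sketch of that customization, the snippet below swaps in an explicit ChatMemoryBuffer with a smaller token limit. It assumes AgentRunner accepts a memory argument in your LlamaIndex version; treat it as illustrative rather than the course’s code.

# Hypothetical example: cap the agent's rolling conversation memory at ~3000 tokens.
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
agent = AgentRunner(agent_worker, memory=memory)  # assumption: memory kwarg is supported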

response = agent.chat(
    "Tell me about the evaluation datasets used."
)

"""
OUTPUT

Added user message to memory: Tell me about the evaluation datasets used.
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "evaluation datasets used in MetaGPT"}
=== Function Output ===
The evaluation datasets used in MetaGPT include HumanEval, MBPP, and SoftwareDev.
=== LLM Response ===
The evaluation datasets used in MetaGPT include HumanEval, MBPP, and SoftwareDev.

"""
response = agent.chat("Tell me the results over one of the above datasets.")

"""
OUTPUT

Added user message to memory: Tell me the results over one of the above datasets.
=== Calling Function ===
Calling function: vector_tool_metagpt with args: {"query": "results over HumanEval dataset", "page_numbers": ["7"]}
=== Function Output ===
MetaGPT achieved 85.9% and 87.7% Pass rates over the HumanEval dataset.
=== LLM Response ===
MetaGPT achieved 85.9% and 87.7% Pass rates over the HumanEval dataset.

"""

Lower-Level: Debuggability and Control

This subsection shows capabilities that let you step through and control the agent in a much more granular fashion.

Key benefits:

  • Decoupling of Task Creation and Execution: Users gain the flexibility to schedule task execution according to their needs.
  • Enhanced Debuggability: Offers deeper insights into each step of the execution process, improving troubleshooting capabilities.
  • Steerability: Allows users to directly modify intermediate steps and incorporate human feedback for refined control.
agent_worker = FunctionCallingAgentWorker.from_tools(
    [vector_tool, summary_tool],
    llm=llm,
    verbose=True
)
agent = AgentRunner(agent_worker)

task = agent.create_task(
    "Tell me about the agent roles in MetaGPT, "
    "and then how they communicate with each other."
)

step_output = agent.run_step(task.task_id)

"""
OUTPUT

Added user message to memory: Tell me about the agent roles in MetaGPT, and then how they communicate with each other.
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "agent roles in MetaGPT"}
=== Function Output ===
The agent roles in MetaGPT include Product Manager, Architect, Project Manager, Engineer, and QA Engineer. Each role has specific responsibilities and expertise tailored to different aspects of the collaborative software development process within the framework. The Product Manager analyzes user requirements and formulates a detailed PRD, the Architect translates requirements into system design components, the Project Manager handles task distribution, Engineers execute designated classes and functions, and the QA Engineer formulates test cases to ensure code quality. These roles are designed to facilitate the breakdown of complex tasks into smaller, specialized components, ensuring efficient workflow and effective problem-solving capabilities in the software development process within MetaGPT.
"""

completed_steps = agent.get_completed_steps(task.task_id)
print(f"Num completed for task {task.task_id}: {len(completed_steps)}")
print(completed_steps[0].output.sources[0].raw_output)

"""
OUTPUT

Num completed for task d72fe7a9-9614-4460-81b8-9b592aaf285e: 1
The agent roles in MetaGPT include Product Manager, Architect, Project Manager, Engineer, and QA Engineer. Each role has specific responsibilities and expertise tailored to different aspects of the collaborative software development process within the framework. The Product Manager analyzes user requirements and formulates a detailed PRD, the Architect translates requirements into system design components, the Project Manager handles task distribution, Engineers execute designated classes and functions, and the QA Engineer formulates test cases to ensure code quality. These roles are designed to facilitate the breakdown of complex tasks into smaller, specialized components, ensuring efficient workflow and effective problem-solving capabilities in the software development process within MetaGPT.
"""

upcoming_steps = agent.get_upcoming_steps(task.task_id)
print(f"Num upcoming steps for task {task.task_id}: {len(upcoming_steps)}")
upcoming_steps[0]

"""
OUTPUT

Num upcoming steps for task d72fe7a9-9614-4460-81b8-9b592aaf285e: 1
TaskStep(task_id='d72fe7a9-9614-4460-81b8-9b592aaf285e', step_id='97d1e688-e131-4811-9aac-469551408f54', input=None, step_state={}, next_steps={}, prev_steps={}, is_ready=True)
"""

step_output = agent.run_step(
    task.task_id, input="What about how agents share information?"
)

"""
OUTPUT

Added user message to memory: What about how agents share information?
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "how agents share information in MetaGPT"}
=== Function Output ===
Agents in MetaGPT share information through a structured communication protocol that includes a shared message pool and a publish-subscribe mechanism. This allows agents to publish structured messages, access messages from other entities, and subscribe to relevant information based on their profiles. The subscription mechanism ensures that agents receive task-related information, enhancing communication efficiency and relevance. Additionally, agents review previous feedback, adjust constraint prompts, and summarize information for future projects, enabling continuous learning and improvement within the multi-agent system. The structured workflow involves the Product Manager generating a Product Requirement Document, which is then passed to the Architect for system design, followed by task breakdown and assignment to Engineers, code review by QA Engineers, and unit test code generation to ensure high-quality software.

"""

step_output = agent.run_step(task.task_id)
print(step_output.is_last)

"""
OUTPUT

=== LLM Response ===
Agents in MetaGPT share information through a structured communication protocol that includes a shared message pool and a publish-subscribe mechanism. This allows agents to publish structured messages, access messages from other entities, and subscribe to relevant information based on their profiles. The subscription mechanism ensures that agents receive task-related information, enhancing communication efficiency and relevance. Additionally, agents review previous feedback, adjust constraint prompts, and summarize information for future projects, enabling continuous learning and improvement within the multi-agent system. The structured workflow involves the Product Manager generating a Product Requirement Document, which is then passed to the Architect for system design, followed by task breakdown and assignment to Engineers, code review by QA Engineers, and unit test code generation to ensure high-quality software.
True
"""

response = agent.finalize_response(task.task_id)
print(str(response))

"""
OUTPUT

assistant: Agents in MetaGPT share information through a structured communication protocol that includes a shared message pool and a publish-subscribe mechanism. This allows agents to publish structured messages, access messages from other entities, and subscribe to relevant information based on their profiles. The subscription mechanism ensures that agents receive task-related information, enhancing communication efficiency and relevance. Additionally, agents review previous feedback, adjust constraint prompts, and summarize information for future projects, enabling continuous learning and improvement within the multi-agent system. The structured workflow involves the Product Manager generating a Product Requirement Document, which is then passed to the Architect for system design, followed by task breakdown and assignment to Engineers, code review by QA Engineers, and unit test code generation to ensure high-quality software.
"""

Building a Multi-Document Agent

In the previous section, we built an agent that can reason over a single document and is capable of answering complex questions over it while maintaining memory.

This section shows how to extend that agent to handle multiple documents, in increasing degrees of complexity.

Setting up an Agent over 3 Papers

Below is a step-by-step illustration of the process, supported with code. We set up a function calling agent over these three papers by combining the vector and summary tools for each document into a flat list and passing them to an agent, which then has six tools in total.

  • Download the three papers.
  • Convert each paper into tools. Recall the helper function get_doc_tools (see the sketch after this list), which automatically builds both a vector index tool and a summary index tool over a given paper. The vector tool performs vector search, and the summary tool performs summarization over the entire document. For each paper we get back both tools, which are put into an overall dictionary mapping each paper name to its vector tool and summary tool.
  • Next, we flatten these tools into a single list.
  • Then we define the OpenAI model to use.
  • Since we have three papers with two tools each, we have six tools in total.
  • The next step is to construct the overall agent worker, which includes the six tools as well as the LLM that we pass in.
  • Then we can ask questions across these three documents or within a single document.
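The repo’s get_doc_tools isn’t reproduced in this article. A minimal sketch of what it could look like, assembled from the auto-retrieval and summary-tool code shown earlier (and therefore an assumption about the repo’s exact implementation), is:

# Hypothetical sketch of get_doc_tools, based on the earlier sections of this article.
from typing import List
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.vector_stores import MetadataFilters, FilterCondition
from llama_index.core.tools import FunctionTool, QueryEngineTool


def get_doc_tools(file_path: str, name: str):
    """Return a vector query tool and a summary tool for one document."""
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)
    vector_index = VectorStoreIndex(nodes)

    def vector_query(query: str, page_numbers: List[str]) -> str:
        """Vector search over the document, optionally filtered to specific pages."""
        filters = MetadataFilters.from_dicts(
            [{"key": "page_label", "value": p} for p in page_numbers],
            condition=FilterCondition.OR,
        )
        engine = vector_index.as_query_engine(similarity_top_k=2, filters=filters)
        return str(engine.query(query))

    vector_tool = FunctionTool.from_defaults(
        name=f"vector_tool_{name}", fn=vector_query
    )
    summary_tool = QueryEngineTool.from_defaults(
        name=f"summary_tool_{name}",
        query_engine=SummaryIndex(nodes).as_query_engine(
            response_mode="tree_summarize", use_async=True
        ),
        description=f"Useful for summarization questions related to {name}",
    )
    return vector_tool, summary_tool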
from helper import get_openai_api_key
OPENAI_API_KEY = get_openai_api_key()

import nest_asyncio
nest_asyncio.apply()

urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=hSyW5go0v8",
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "selfrag.pdf",
]

from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

initial_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools,
    llm=llm,
    verbose=True
)
agent = AgentRunner(agent_worker)

response = agent.query(
    "Tell me about the evaluation dataset used in LongLoRA, "
    "and then tell me about the evaluation results"
)

"""
OUTPUT

Added user message to memory: Tell me about the evaluation dataset used in LongLoRA, and then tell me about the evaluation results
=== Calling Function ===
Calling function: vector_tool_longlora with args: {"query": "evaluation dataset"}
=== Function Output ===
PG19 test split
=== Calling Function ===
Calling function: vector_tool_longlora with args: {"query": "evaluation results"}
=== Function Output ===
The evaluation results show that the models achieve better perplexity with longer context sizes. Increasing the context window size leads to improved perplexity values. Additionally, the models are fine-tuned on different context lengths, such as 8192, 32768, 100k, 65536, and 32768, showcasing promising results on extremely large settings. However, there is some perplexity degradation observed on small context sizes for the extended models, which is a recognized limitation of Position Interpolation.
=== LLM Response ===
The evaluation dataset used in LongLoRA is the PG19 test split.

Regarding the evaluation results, the models in LongLoRA achieve better perplexity with longer context sizes. Increasing the context window size leads to improved perplexity values. The models are fine-tuned on different context lengths, such as 8192, 32768, 100k, 65536, and 32768, showcasing promising results on extremely large settings. However, there is some perplexity degradation observed on small context sizes for the extended models, which is a recognized limitation of Position Interpolation.
"""
response = agent.query("Give me a summary of both Self-RAG and LongLoRA")
print(str(response))

"""
OUTPUT

Added user message to memory: Give me a summary of both Self-RAG and LongLoRA
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Self-RAG"}
=== Function Output ===
Self-RAG is a framework that enhances the quality and factuality of a large language model through retrieval and self-reflection. It improves the generation process by training a single LM to adaptively retrieve passages on-demand, generate and reflect on retrieved passages, and its own generations using special tokens called reflection tokens. This framework enables the LM to tailor its behavior to diverse task requirements during the inference phase, leading to significant performance improvements on various tasks compared to state-of-the-art LLMs and retrieval-augmented models.
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "LongLoRA"}
=== Function Output ===
LongLoRA is an efficient method for extending the context length of Large Language Models (LLMs) while minimizing computational costs. It incorporates shifted sparse attention (S2-Attn) during fine-tuning to achieve this extension, demonstrating strong empirical results on various tasks. By combining improved LoRA with S2-Attn, LongLoRA efficiently extends LLMs' context while maintaining their original architectures and compatibility with existing techniques like Flash-Attention2.
=== LLM Response ===
Self-RAG is a framework that enhances the quality and factuality of a large language model through retrieval and self-reflection. It improves the generation process by training a single LM to adaptively retrieve passages on-demand, generate and reflect on retrieved passages, and its own generations using special tokens called reflection tokens. This framework enables the LM to tailor its behavior to diverse task requirements during the inference phase, leading to significant performance improvements on various tasks compared to state-of-the-art LLMs and retrieval-augmented models.

LongLoRA is an efficient method for extending the context length of Large Language Models (LLMs) while minimizing computational costs. It incorporates shifted sparse attention (S2-Attn) during fine-tuning to achieve this extension, demonstrating strong empirical results on various tasks. By combining improved LoRA with S2-Attn, LongLoRA efficiently extends LLMs' context while maintaining their original architectures and compatibility with existing techniques like Flash-Attention2.
assistant: Self-RAG is a framework that enhances the quality and factuality of a large language model through retrieval and self-reflection. It improves the generation process by training a single LM to adaptively retrieve passages on-demand, generate and reflect on retrieved passages, and its own generations using special tokens called reflection tokens. This framework enables the LM to tailor its behavior to diverse task requirements during the inference phase, leading to significant performance improvements on various tasks compared to state-of-the-art LLMs and retrieval-augmented models.

LongLoRA is an efficient method for extending the context length of Large Language Models (LLMs) while minimizing computational costs. It incorporates shifted sparse attention (S2-Attn) during fine-tuning to achieve this extension, demonstrating strong empirical results on various tasks. By combining improved LoRA with S2-Attn, LongLoRA efficiently extends LLMs' context while maintaining their original architectures and compatibility with existing techniques like Flash-Attention2.
"""

Setting up an Agent over 11 Papers

  • Similar to the previous subsection, we’ll build a dictionary mapping each paper to its vector and summary tools.
  • At this point we need a slightly more advanced agent and tool architecture. Suppose we try to index all eleven papers, which means 22 tools, or suppose we want to index 100 or 1,000 papers or more. Even though LLM context windows are getting longer, stuffing too many tool selections into the LLM prompt leads to the following issues:

The tools may not all fit in the prompt, especially if the number of documents is large and you’re modeling each document as a separate tool or set of tools. Costs and latency spike because you’re increasing the number of tokens in the prompt, and the LLM can also get confused: it may fail to pick the right tool when the number of choices is too large.

The solution is that when the user asks a query, we perform retrieval augmentation, but not at the level of text; rather, at the level of tools. We first retrieve a small set of relevant tools and feed only those tools to the agent reasoning prompt, instead of all of them. This retrieval process is similar to the retrieval used in RAG; at its simplest, it can just be top-k vector search.

urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    "https://openreview.net/pdf?id=9WD9KwssyT",
    "https://openreview.net/pdf?id=yV6fD7LYkF",
    "https://openreview.net/pdf?id=hnrB5YHoYu",
    "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    "https://openreview.net/pdf?id=TpD2aG1h0D"
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "loftq.pdf",
    "swebench.pdf",
    "selfrag.pdf",
    "zipformer.pdf",
    "values.pdf",
    "finetune_fair_diffusion.pdf",
    "knowledge_card.pdf",
    "metra.pdf",
    "vr_mcl.pdf"
]

# To download these papers, below is the needed code:

# for url, paper in zip(urls, papers):
#     !wget "{url}" -O "{paper}"

from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

"""
OUTPUT

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: loftq.pdf
Getting tools for paper: swebench.pdf
Getting tools for paper: selfrag.pdf
Getting tools for paper: zipformer.pdf
Getting tools for paper: values.pdf
Getting tools for paper: finetune_fair_diffusion.pdf
Getting tools for paper: knowledge_card.pdf
Getting tools for paper: metra.pdf
Getting tools for paper: vr_mcl.pdf
"""

Extend the Agent with Tool Retrieval

Of course, you can layer on any retrieval techniques you want to filter down to a relevant set of tools. Our agents let you plug in a tool retriever, which allows you to accomplish exactly this.

The following shows how to get this done.

  1. We have to index the tools. LlamaIndex already has extensive indexing capabilities over general text documents, but since these tools are actually Python objects, we need a way to convert and serialize them to a string representation and back. This is solved through the object index abstraction in LlamaIndex.
  2. We use the VectorStoreIndex, our standard interface for indexing text, and then wrap it with an object index.
  3. To construct the object index, we plug the Python tools directly into the index as input.
  4. You can retrieve from an object index through an object retriever. This calls the underlying retriever from the index and returns the output directly as objects; in this case, the objects are tools.
  5. After defining the object retriever, we walk through a very simple example.
all_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

obj_retriever = obj_index.as_retriever(similarity_top_k=3)

tools = obj_retriever.retrieve(
    "Tell me about the eval dataset used in MetaGPT and SWE-Bench"
)

tools[2].metadata

"""
OUTPUT
ToolMetadata(description='Useful for summarization questions related to swebench', name='summary_tool_swebench', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
"""

from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm,
    system_prompt=""" \
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

""",
    verbose=True
)
agent = AgentRunner(agent_worker)

response = agent.query(
    "Tell me about the evaluation dataset used "
    "in MetaGPT and compare it against SWE-Bench"
)


print(tools[0].metadata)

"""
OUTPUT:

ToolMetadata(description='Useful for summarization questions related to metagpt', name='summary_tool_metagpt', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
"""

print(tools[1].metadata)

"""
OUTPUT:

ToolMetadata(description='Useful for summarization questions related to metagpt', name='summary_tool_metagpt', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)"""

print(tools[2].metadata)

"""
OUTPUT:

ToolMetadata(description='Useful for summarization questions related to swebench', name='summary_tool_swebench', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
"""

print(str(response))

"""
OUTPUT

Added user message to memory: Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "evaluation dataset used in MetaGPT"}
=== Function Output ===
The evaluation dataset used in MetaGPT includes three benchmarks: HumanEval, MBPP, and SoftwareDev. HumanEval consists of 164 handwritten programming tasks, MBPP comprises 427 Python tasks, and SoftwareDev is a collection of 70 representative software development tasks covering various scopes like mini-games, image processing algorithms, and data visualization.
=== Calling Function ===
Calling function: summary_tool_swebench with args: {"input": "evaluation dataset used in SWE-Bench"}
=== Function Output ===
The evaluation dataset used in SWE-Bench consists of task instances collected from real GitHub issues and pull requests across popular Python repositories. It includes task instructions, issue text, retrieved files and documentation, example patch files, and prompts for generating patch files. The dataset is challenging and designed to be easily extended to new programming languages and repositories, providing a realistic and diverse environment for enhancing language models with software engineering tools and practices.
=== LLM Response ===
The evaluation dataset used in MetaGPT includes three benchmarks: HumanEval, MBPP, and SoftwareDev. HumanEval consists of 164 handwritten programming tasks, MBPP comprises 427 Python tasks, and SoftwareDev is a collection of 70 representative software development tasks covering various scopes like mini-games, image processing algorithms, and data visualization.

On the other hand, the evaluation dataset used in SWE-Bench consists of task instances collected from real GitHub issues and pull requests across popular Python repositories. It includes task instructions, issue text, retrieved files and documentation, example patch files, and prompts for generating patch files. The dataset is challenging and designed to be easily extended to new programming languages and repositories, providing a realistic and diverse environment for enhancing language models with software engineering tools and practices.
assistant: The evaluation dataset used in MetaGPT includes three benchmarks: HumanEval, MBPP, and SoftwareDev. HumanEval consists of 164 handwritten programming tasks, MBPP comprises 427 Python tasks, and SoftwareDev is a collection of 70 representative software development tasks covering various scopes like mini-games, image processing algorithms, and data visualization.

On the other hand, the evaluation dataset used in SWE-Bench consists of task instances collected from real GitHub issues and pull requests across popular Python repositories. It includes task instructions, issue text, retrieved files and documentation, example patch files, and prompts for generating patch files. The dataset is challenging and designed to be easily extended to new programming languages and repositories, providing a realistic and diverse environment for enhancing language models with software engineering tools and practices.
"""

Let’s have a look at the retrieved tools. We see that we directly retrieved a set of tools, and that the first tool (tools[0].metadata) is the summary tool for MetaGPT. In this run the second tool (tools[1].metadata) is also a MetaGPT summary tool; with a larger paper set, the retrieved tools can include ones for papers unrelated to the query, since the quality of retrieval depends on the embedding model. The last tool (tools[2].metadata) is indeed the summary tool for SWE-Bench.

Now we’re ready to set up our function calling agent. The setup is pretty similar to the previous section, but as an additional feature we show that you can add a system prompt to the agent. This is optional, and you don’t need to specify it, but it gives additional guidance if you want to prompt the agent to output things in a certain way, or to take certain factors into account when it reasons over these tools.

response = agent.query(
    "Compare and contrast the LoRA papers (LongLoRA, LoftQ). "
    "Analyze the approach in each paper first. "
)

"""
OUTPUT

Added user message to memory: Compare and contrast the LoRA papers (LongLoRA, LoftQ). Analyze the approach in each paper first.
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "LongLoRA"}
=== Function Output ===
LongLoRA is an efficient method for extending the context length of Large Language Models (LLMs) while maintaining computational efficiency. It combines shifted sparse attention (S2-Attn) with LoRA to effectively fine-tune models to longer context lengths. LongLoRA demonstrates strong empirical results on various tasks and models, showcasing its effectiveness in extending the context window of LLMs. Additionally, it aims to bridge the performance gap between LoRA and full fine-tuning when adapting LLMs from short to long context lengths, enhancing the adaptation process for long contexts and showing improved performance compared to standard LoRA in experiments.
=== Calling Function ===
Calling function: summary_tool_loftq with args: {"input": "LoftQ"}
=== Function Output ===
LoftQ is a novel quantization framework that combines quantization with Low-Rank Adaptation (LoRA) fine-tuning for pre-trained models. It integrates low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights, improving alignment with the original weights and providing a beneficial initialization point for subsequent LoRA fine-tuning. LoftQ has demonstrated superior performance compared to existing quantization methods, especially in challenging low-bit scenarios, and has been effective across various downstream tasks with different quantization methods.
=== LLM Response ===
LongLoRA is an efficient method for extending the context length of Large Language Models (LLMs) while maintaining computational efficiency. It combines shifted sparse attention (S2-Attn) with LoRA to effectively fine-tune models to longer context lengths. LongLoRA demonstrates strong empirical results on various tasks and models, showcasing its effectiveness in extending the context window of LLMs. Additionally, it aims to bridge the performance gap between LoRA and full fine-tuning when adapting LLMs from short to long context lengths, enhancing the adaptation process for long contexts and showing improved performance compared to standard LoRA in experiments.

LoftQ is a novel quantization framework that combines quantization with Low-Rank Adaptation (LoRA) fine-tuning for pre-trained models. It integrates low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights, improving alignment with the original weights and providing a beneficial initialization point for subsequent LoRA fine-tuning. LoftQ has demonstrated superior performance compared to existing quantization methods, especially in challenging low-bit scenarios, and has been effective across various downstream tasks with different quantization methods.
"""

Conclusion

In this article you’ve learnt about agentic RAG, starting from building a router query engine, to tool calling, to building your own agent that can reason not just over a single document but over multiple documents.

My Demo

GitHub repo for my app using this approach.


Sulaiman Shamasna

An experienced Data Scientist and Machine Learning Engineer with a main focus on LLMs and MLOps, in addition to a deep background in Philosophy, Physics, and Maths.