Building A Universal Data Agent in 15 Minutes with LlamaIndex and Apache Gravitino (incubating)

Lisa N. Cao
Published in Datastrato · 9 min read · Jul 23, 2024
Blogpost by Jerry Shao & Lisa N. Cao

In this new era of data and generative AI, data infrastructure teams are struggling to serve company data in a way that is convenient, efficient, and compliant with regulations. This is especially crucial in the development of Large Language Models (LLMs) and Agentic Retrieval-Augmented Generation (RAG), which have taken the analytics world by storm. In this article, we will cover how you can build a data agent from scratch and interact with it using an open source data catalog.

What is an LLM Agent?

Before we get started, we should review the role of agents in RAG pipelines. LLMs on their own provide a general ability to understand and generate language but lack advanced reasoning capabilities. Agents take this a step further: they accept instructions and perform more complex, domain-specific reasoning, whose results are then fed back into the LLM.

Original image by Jerry Shao, inspired by Role of LLM agents at a glance — Source: LinkedIn

Agents can be used for different purposes in different areas: mathematical problem solving, retrieval-augmented chat, personal assistants, and so on. A data agent is typically designed for an extractive goal and interacts directly with the data itself. By offloading these assistive reasoning tasks to an agent, overall application performance can improve greatly and responses become more accurate.

Below is the general architecture of a data agent.

Original image by Jerry Shao, inspired by Role of LLM agents at a glance — Source: LinkedIn

As you can see, the agent takes instructions from the LLM and, depending on the design, interfaces with the user or the LLM through a set of APIs or other agents. It breaks large tasks down into smaller ones through planning, with some capacity for reflection and refinement. Combined with memory, the agent retains and recalls information over long context windows through the use of vector stores and retrieval methods. Agents can also call external APIs to fill in missing information from alternate data sources, which is extremely useful.
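To make that loop concrete, here is a minimal sketch of a tool-calling agent built with LlamaIndex's ReActAgent. It is illustrative only: the lookup_population helper and its return value are hypothetical stand-ins for a real data source, and the model choice is an assumption rather than part of the demo later in this post.

# A minimal sketch of an agent loop (assumes an OpenAI API key is configured;
# the tool and its data are hypothetical placeholders, not part of the demo below).
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def lookup_population(city: str) -> str:
    """Hypothetical tool: look up a city's population from some external source."""
    return f"Population data for {city} would be fetched here."

population_tool = FunctionTool.from_defaults(fn=lookup_population)

# The agent plans, calls the tool, observes the result, and feeds it back to the LLM.
agent = ReActAgent.from_tools(
    [population_tool], llm=OpenAI(model="gpt-3.5-turbo"), verbose=True
)
print(agent.chat("Which city has the highest population?"))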

Production Issues in RAG Development

There already exist plenty of demos, POCs, and tutorials describing how to build a simple data agent, but when we turn to production usage, we still face several challenges.

Data quality and integrity

Regardless of which LLM you use, data quality and integrity will directly affect the accuracy of the answers. For structured data, the quality of the metadata directly determines the accuracy of the generated SQL statements, and inaccurate SQL can have unintended effects. Regardless of your chunking strategy, poor source data and documents will pollute your vector embeddings and produce retrieval results that can be nonsensical or hallucinatory. The saying “garbage in, garbage out” matters more than ever in the age of generative AI.

Retrieve information from a wide range of sources

In any organization or company, data will likely be ingested from a wide range of sources. On top of the variety of formats and storage solutions, data may also need to move across data centers and regions, ending up in cross-regional, cross-cloud distributions. If we cannot efficiently connect to and retrieve data from the organization’s full range of sources, we put ourselves at a huge disadvantage: missing key data and relationships makes it hard to build a knowledge graph or map out the similarities our LLM needs to provide accurate answers. Meanwhile, the traditional approach of centralizing data through ETL also lowers the effectiveness of the answers, since you usually need until T+1 to get the data prepared.

Data privacy, security, and compliance

Data privacy, security, and compliance are paramount when building any production-level data system, including data agents and APIs. The problem becomes more challenging with LLMs because they tend to be incredibly high dimensional and complex at scale, which makes it hard to trace their outputs back to the source. Troubleshooting such systems, especially when they make many calls to external tools and APIs, is very hard to do, let alone while retaining privacy and security. It is important to design our data infrastructure and end-to-end systems for continuous visibility, observability, measurability, and robustness.

What is Apache Gravitino (incubating)?

Apache Gravitino (incubating) is a high-performance, geo-distributed, and federated metadata lake. By using a technical data catalog and metadata lake, you can manage access and perform data governance for all your data sources (including filestores, relational databases, and event streams) while safely using multiple engines like Spark, Trino, or Flink on multiple formats across different cloud providers. This makes it very useful to plug into our data architecture when we want to get LlamaIndex up and running quickly on top of numerous data sources at the same time.

Apache Gravitino (incubating)’s Architecture at a Glance

With Gravitino, you can achieve:

  • Single Source of Truth for multi-regional data with geo-distributed architecture support.
  • Unified Data and AI asset management for both users and engines.
  • Security in one place, centralizing the security for different sources.
  • Built-in data management and data access management.
  • An AI-ready, low-cost metadata fabric that standardizes across all your data stores.

For more details about Gravitino, please refer to our blogpost Gravitino — the unified metadata lake.

Without Gravitino, a typical agentic RAG system would look like this:

Image by Jerry Shao, inspired by LlamaIndex flow — Source: LlamaIndex

Users would need different readers to connect to the various sources one by one, and the difficulties multiply when data is distributed across clouds with varying security policies.

With Gravitino, the new architecture is streamlined:

Image by Jerry Shao, inspired by LlamaIndex flow — Source: LlamaIndex

Using Gravitino and LlamaIndex to build a Universal Data Agent

Now, let’s show how you can build a data agent in 15 minutes. This data agent will have several advantages:

  • No data movement: data stays where it is, and there’s no need to preprocess or aggregate it beforehand.
  • Answers drawn from both structured and unstructured data.
  • A natural language interface: you ask the data questions in natural language, and the agent automatically decomposes them into subqueries and generates SQL as required.

Environment Setup

Below we have abstracted out the code you will need to reproduce this on your own. If you are interested in running this step by step with us, we have a prepared setup that can be run locally. Keep in mind, to run this demo you will need an OpenAI API key.

To learn more about the playground, see here: Apache Gravitino Demo Playground

git clone git@github.com:apache/gravitino-playground.git
cd gravitino-playground
./launch-playground.sh

From there, you will need to navigate to the Jupyter Notebook through the following steps:

  1. Open the Jupyter Notebook in the browser at http://localhost:8888
  2. Open the gravitino_llamaIndex_demo.ipynb notebook
  3. Start the notebook and run the cells

The overall architecture of the demo that is included in the local playground looks like this:

Manage datasets using Gravitino

First, we’ll need to set up our first catalog and connect it to our filesets. In our case, the data source is Hadoop. We’ll then need to define the schemas and provide the storage location.
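The snippet below assumes that a Gravitino client and the catalog, schema, and fileset identifiers were already created earlier in the notebook. A rough sketch of that setup is shown here; the server URI, metalake name, and the exact NameIdentifier construction are assumptions based on the playground defaults, so refer to the notebook for the authoritative values.

# Assumed earlier setup (values and identifier construction are illustrative assumptions)
from gravitino import GravitinoClient, NameIdentifier

gravitino_url = "http://localhost:8090"   # Gravitino server in the playground (assumption)
metalake_name = "metalake_demo"           # demo metalake name (assumption)
catalog_name = "catalog_fileset"          # matches the fileset path used later in this post

gravitino_client = GravitinoClient(uri=gravitino_url, metalake_name=metalake_name)
schema_ident = NameIdentifier.of(metalake_name, catalog_name, "countries")
fileset_ident = NameIdentifier.of(metalake_name, catalog_name, "countries", "cities")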

demo_catalog = None
try:
    demo_catalog = gravitino_client.load_catalog(name=catalog_name)
except Exception as e:
    demo_catalog = gravitino_client.create_catalog(name=catalog_name,
                                                   catalog_type=Catalog.Type.FILESET,
                                                   comment="demo",
                                                   provider="hadoop",
                                                   properties={})

# Create schema and fileset
schema_countries = None
try:
    schema_countries = demo_catalog.as_schemas().load_schema(ident=schema_ident)
except Exception as e:
    schema_countries = demo_catalog.as_schemas().create_schema(ident=schema_ident,
                                                               comment="countries",
                                                               properties={})

fileset_cities = None
try:
    fileset_cities = demo_catalog.as_fileset_catalog().load_fileset(ident=fileset_ident)
except Exception as e:
    fileset_cities = demo_catalog.as_fileset_catalog().create_fileset(ident=fileset_ident,
                                                                      fileset_type=Fileset.Type.EXTERNAL,
                                                                      comment="cities",
                                                                      storage_location="/tmp/gravitino/data/pdfs",
                                                                      properties={})

Build a Gravitino structured data reader

Once our data sources are connected, we’ll need a way to query them. We’ve decided to use Trino, connected via SQLAlchemy, to help us out in this case. You could also use PySpark instead if that is what your team already uses.

from sqlalchemy import create_engine
from trino.sqlalchemy import URL
from sqlalchemy.sql.expression import select, text

trino_engine = create_engine('trino://admin@trino:8080/catalog_mysql/demo_llamaindex')

with trino_engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM catalog_mysql.demo_llamaindex.city_stats")
    print(cursor.fetchall())
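The text import above also allows parameterized queries against the same engine. Here is a small optional example, not part of the original demo; the column names follow the city_stats table described later in this post.

# Optional: a parameterized query using SQLAlchemy's text() construct
with trino_engine.connect() as connection:
    result = connection.execute(
        text("SELECT city_name FROM catalog_mysql.demo_llamaindex.city_stats "
             "WHERE population > :min_pop"),
        {"min_pop": 1000000},
    )
    print(result.fetchall())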

Build a Gravitino unstructured data reader

Once our basic data infrastructure has been set up, we can read the data directly into LlamaIndex. Gravitino will use a virtual file system to serve the data as a directory that LlamaIndex can take as input.

from llama_index.core import SimpleDirectoryReader
from gravitino import gvfs

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri=gravitino_url,
    metalake_name=metalake_name
)

fileset_virtual_location = "fileset/catalog_fileset/countries/cities"

reader = SimpleDirectoryReader(
    input_dir=fileset_virtual_location,
    fs=fs,
    recursive=True)
wiki_docs = reader.load_data()
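As a quick sanity check (not part of the original notebook), you can confirm that the documents actually came through the virtual file system:

# Verify what was loaded through gvfs
print(f"Loaded {len(wiki_docs)} documents")
if wiki_docs:
    print(wiki_docs[0].metadata)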

Build SQL metadata index from the structured data connection

With the readers in place, we can now build our index and vector stores from the metadata alone.

from llama_index.core import SQLDatabase
sql_database = SQLDatabase(trino_engine, include_tables=["city_stats"])
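As an optional check, and assuming a recent LlamaIndex release that exposes this helper on SQLDatabase, you can list the tables the wrapper can see:

# Optional sanity check on the SQL wrapper (helper name assumes a recent llama-index release)
print(sql_database.get_usable_table_names())  # should include "city_stats"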

Build vector index from unstructured data

from llama_index.core import VectorStoreIndex
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Insert documents into vector index
# Each document has metadata of the city attached

vector_indices = {}
vector_query_engines = {}

for city, wiki_doc in zip(cities, wiki_docs):
    vector_index = VectorStoreIndex.from_documents([wiki_doc])

    query_engine = vector_index.as_query_engine(
        similarity_top_k=2, llm=OpenAI(model="gpt-3.5-turbo")
    )

    vector_indices[city] = vector_index
    vector_query_engines[city] = query_engine
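The Settings import above is not strictly required for this loop. If you prefer not to pass an llm to every engine, one option (an assumption about your setup rather than part of the demo) is to configure global defaults once:

# Optional: global defaults so individual engines don't each need an explicit llm
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# Settings.embed_model can be set similarly if you don't want the default OpenAI embeddings.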

Define query engines and ask the questions

To make this a fully functioning chat application, we will need to provide a text-to-SQL interface to pull it all together. In this case we will use LlamaIndex’s native functions to interface directly with the indexes we defined in the previous steps.

from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.core.query_engine import SQLJoinQueryEngine

# Define the NL to SQL engine
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["city_stats"],
)

# Define the vector query engines for each city
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = []
for city in cities:
    query_engine = vector_query_engines[city]

    query_engine_tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name=city, description=f"Provides information about {city}"
        ),
    )
    query_engine_tools.append(query_engine_tool)

s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools, llm=OpenAI(model="gpt-3.5-turbo")
)

# Convert engines to tools and combine them together
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        "Useful for translating a natural language query into a SQL query over"
        " a table containing: city_stats, containing the population/country of"
        " each city"
    ),
)
s_engine_tool = QueryEngineTool.from_defaults(
    query_engine=s_engine,
    description=(
        "Useful for answering semantic questions about different cities"
    ),
)

query_engine = SQLJoinQueryEngine(
    sql_tool, s_engine_tool, llm=OpenAI(model="gpt-4")
)

# Issue query
response = query_engine.query(
    "Tell me about the arts and culture of the city with the highest"
    " population"
)

The final answer is assembled from two parts:

The first is the answer from the SQL engine: the data agent generates the SQL statement “SELECT city_name, population, country FROM city_stats ORDER BY population DESC LIMIT 1” from the natural language question and learns from the structured data that Tokyo has the highest population.

Based on that first answer, the data agent then generates three sub-questions regarding arts and culture in Tokyo, for example “Can you provide more details about the museums, theaters, and performance venues in Tokyo?”

The final answer combines the two parts and is shown below:

Final response: The city with the highest population is Tokyo, Japan. Tokyo is known for its vibrant arts and culture scene, with a mix of traditional and modern influences. From ancient temples and traditional tea ceremonies to cutting-edge technology and contemporary art galleries, Tokyo offers a diverse range of cultural experiences for visitors and residents alike. The city is also home to numerous museums, theaters, and performance venues showcasing the rich history and creativity of Japan. Unfortunately, I cannot provide more details about the museums, theaters, and performance venues in Tokyo based on the context information provided.

So what’s next?

The demo here shows how to use Gravitino for data ingestion and LlamaIndex for efficient data retrieval. With Gravitino’s production-ready features, it is easy for users to build a universal data agent. We’re continually improving Gravitino to make it a key component of building data agents that meet enterprise-grade standards.

Ready to take your data agent to the next level? Dive into the guides and join our ASF Community Slack Channel for support.

Huge thanks to co-writer Jerry Shao for collaborating with me on this.
