LangGraph AI Agents with Knowledge Graph
A RAG System Built for Both Semantic Search and Structured Data Queries
Retrieval-Augmented Generation (RAG) has transformed the way we leverage large language models (LLMs) by augmenting them with external knowledge. A typical RAG system indexes documents (or other data) as vector embeddings, performs similarity search, and injects relevant information into the LLM’s context. This is powerful for dealing with unstructured text. However, real-world enterprise data often involve structured components — like numeric attributes, aggregations, and relationships — which text embeddings do not inherently handle well.
In this blog, we will explore how integrating Knowledge Graphs (specifically, Neo4j) and structured tools enables us to handle more complex queries, while preserving the strengths of embedding-based search for unstructured text. By the end, you'll see how to build a robust RAG application that can handle both unstructured and structured queries effectively. Everything you need to learn and experiment with this approach is included in this blog, from the data to the complete code to explanations.
Link for the data: Data
Link for creating the Neo4j instance used in this blog: Graph Instance
Understanding RAG Applications
A RAG application typically involves:
- Vector Store / Embeddings: Convert unstructured text into vector embeddings.
- Retrieval: Find the most relevant documents or knowledge pieces using similarity search.
- Generation: Use an LLM to process and format an answer to the user’s query.
However, the Achilles’ heel of purely embedding-centric approaches is dealing with structured or relational data (e.g., numeric filters, sorting, aggregations). This is where knowledge graphs excel.
Limitations of Text Embeddings
Text embeddings are widely used in RAG applications, but they come with notable limitations when it comes to processing structured data. These limitations can hinder the accuracy and effectiveness of your queries, particularly when dealing with real-world data scenarios.
Filtering and Sorting
Text embeddings excel at understanding semantic content, but they cannot naturally process operations like filtering or sorting. For example, when you want to retrieve suppliers with a capacity greater than 40,000, text embeddings fall short because they aren’t designed to handle numeric filters. They cannot apply conditions like “greater than” or “less than” on numeric fields, which are essential in many data-driven queries.
Aggregations
Another area where text embeddings struggle is in performing aggregations. Consider queries like “How many suppliers are in Europe?” that require counting or grouping data. Text embeddings, on their own, do not provide straightforward ways to aggregate data. While they are great for semantic search, tasks like counting, summing, or grouping require structured operations that embeddings cannot perform directly.
Data Complexity
In real-world production systems, data often have complex relationships that go beyond simple associations. Embeddings are designed to capture semantic meaning but can miss intricate, relational connections between entities. For example, in a knowledge graph, entities like suppliers, locations, and products are interconnected in a way that requires understanding both the entities and their relationships. Embeddings alone do not address these complexities effectively, making it difficult to extract accurate, relationship-based insights.
Why Knowledge Graphs Matter
A Knowledge Graph is a powerful data structure that represents knowledge in a way that captures relationships between different entities. Instead of representing data in isolated tables or lists, a Knowledge Graph models data using nodes (representing entities) and edges (representing relationships between those entities). This approach is especially useful for capturing the complex, interconnected nature of real-world data.
For instance, in a business context, entities like suppliers, locations, and products can be modeled as nodes, with relationships such as “supplies” or “located_in” linking them together. This structure enables a more intuitive understanding of how different elements of data are related, making it ideal for tasks that require navigating complex data connections.
Store Structured Data
In a Knowledge Graph, each node can have properties, such as supply capacity, name, or location, which describe the entity it represents. Relationships between nodes can also have properties, enabling rich, contextual data representation. For example, a supplier node may be connected to a location node with a “located_in” relationship that includes properties like the city or country.
Query with Cypher
Neo4j, a popular graph database, uses Cypher, a declarative query language, to interact with the graph. Cypher is specifically designed to work with graphs, making it easier to perform complex queries that involve filtering, aggregations, and path matching. For example, you can use Cypher to find suppliers with a certain supply capacity, or find the shortest path between two locations in the graph.
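As a small illustration (using the supplier schema we build later in this post), here are sketches of a numeric filter and an aggregation in Cypher, both of which are trivial for the graph but out of reach for pure embedding search:

// Filtering and sorting: suppliers with capacity above 40,000, largest first
MATCH (s:Supplier)
WHERE s.supply_capacity > 40000
RETURN s.name, s.supply_capacity
ORDER BY s.supply_capacity DESC

// Aggregation: how many suppliers per location?
MATCH (s:Supplier)
RETURN s.location AS location, count(s) AS supplier_count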
Combine Unstructured and Structured
One of the biggest advantages of Knowledge Graphs is their ability to combine structured and unstructured data. By adding embedding properties to nodes, you can store vector embeddings (representing unstructured data such as text descriptions) alongside the structured data in the graph. This enables you to perform semantic searches (for example, finding suppliers related to “raw materials”) alongside traditional, structured queries like “find suppliers with a capacity greater than 40,000.” This fusion of structured and unstructured data within a single database allows you to run comprehensive queries and make better-informed decisions.
Marrying Knowledge Graphs with Language Models
The key to integrating Knowledge Graphs (KGs) with Language Models (LLMs) is to leverage the strengths of each technology. LLMs are exceptional at understanding complex language, interpreting context, and generating coherent, human-like responses. However, when it comes to executing structured queries, such as filtering, sorting, or performing aggregations on large datasets, LLMs are not optimal. This is where Knowledge Graphs, like Neo4j, shine by providing efficient query handling for structured data.
By offloading the complexity of generating database queries (such as Cypher for Neo4j) to dedicated tool functions, you can ensure the application remains robust, reliable, and accurate. This division of labor allows LLMs to focus on their natural language processing strengths, while Knowledge Graphs handle structured data efficiently. As a result, you can build a system that combines the best of both worlds, capable of handling both unstructured language data and structured, relationship-based queries seamlessly.
Project Overview: Supplier Management
We’ll implement a system to store, search, and retrieve suppliers from around the world, each with:
- A name
- A location
- A supply capacity (numeric)
- A short description
The system must handle:
- Filtering based on supply capacity.
- Counting and grouping.
- Vector similarity search based on descriptions.
- Aggregations (e.g., grouping by location).
Installing Required Libraries
Open a notebook or terminal and install the following libraries:
!pip install --quiet neo4j pyvis langchain-community langchain-openai langgraph
These include:
- neo4j, the Python driver used to connect to Neo4j.
- pyvis for visualizing graphs (optional).
- langchain-community for community tools around LangChain.
- langchain-openai for using OpenAI LLMs and embeddings.
- langgraph for building state machines (like ReAct agents) with graph logic.
Configuring Environment Variables
To securely connect to Neo4j and OpenAI, we store the required credentials as environment variables. This allows the code to access sensitive information without hardcoding it. Here’s how we set them up:
import os
NEO4J_URI = "neo4j+s://<your-instance-id>.databases.neo4j.io"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "<your-password>"
OPENAI_API_KEY = "<your-openai-key>"
os.environ["NEO4J_URI"] = NEO4J_URI
os.environ["NEO4J_USERNAME"] = NEO4J_USER
os.environ["NEO4J_PASSWORD"] = NEO4J_PASSWORD
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
This code sets up the Neo4j and OpenAI credentials as environment variables, ensuring they can be securely accessed throughout the application.
Connecting to Neo4j
We will need to connect to a Neo4j instance to perform data ingestion as well as querying; the code below takes care of that.
from langchain_community.graphs import Neo4jGraph
graph = Neo4jGraph(refresh_schema=False)
Neo4jGraph takes care of establishing a connection and can optionally refresh the schema if needed.
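As an optional sanity check (not required for the walkthrough), you can run a trivial query through the connection; on a fresh instance the count will simply be 0:

# Optional: confirm the connection works before ingesting data
print(graph.query("MATCH (n) RETURN count(n) AS node_count"))
# e.g. [{'node_count': 0}] on an empty instance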
Building the Knowledge Graph in Neo4j
Data Ingestion with CSV
You have two CSV files:
- nodes.csv with columns like id:ID, name, type, location, and description
- relationships.csv with columns like :START_ID, :END_ID, :TYPE, and product
Below is the data ingestion script:
import csv
import numpy as np
from neo4j import GraphDatabase

NODES_CSV = "nodes.csv"
RELATIONSHIPS_CSV = "relationships.csv"

def get_label_for_type(node_type):
    # Map the CSV's type column to a node label, defaulting to Entity
    mapping = {
        "Supplier": "Supplier",
        "Manufacturer": "Manufacturer",
        "Distributor": "Distributor",
        "Retailer": "Retailer",
        "Product": "Product"
    }
    return mapping.get(node_type, "Entity")

def ingest_nodes(driver):
    with driver.session() as session:
        with open(NODES_CSV, mode='r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                node_id = row['id:ID']
                name = row['name']
                node_type = row['type']
                location = row['location']
                # Assign a random supply capacity between 1,000 and 50,000
                supply_capacity = np.random.randint(1000, 50001)
                description = row['description']
                label = get_label_for_type(node_type)
                if location.strip():
                    query = f"""
                    MERGE (n:{label} {{id:$id}})
                    SET n.name = $name, n.location = $location,
                        n.description = $description, n.supply_capacity = $supply_capacity
                    """
                    params = {
                        "id": node_id,
                        "name": name,
                        "location": location,
                        "description": description,
                        "supply_capacity": supply_capacity
                    }
                else:
                    query = f"""
                    MERGE (n:{label} {{id:$id}})
                    SET n.name = $name
                    """
                    params = {"id": node_id, "name": name}
                session.run(query, params)

def ingest_relationships(driver):
    with driver.session() as session:
        with open(RELATIONSHIPS_CSV, mode='r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                start_id = row[':START_ID']
                end_id = row[':END_ID']
                rel_type = row[':TYPE']
                product = row['product']
                if product.strip():
                    query = f"""
                    MATCH (start {{id:$start_id}})
                    MATCH (end {{id:$end_id}})
                    MERGE (start)-[r:{rel_type} {{product:$product}}]->(end)
                    """
                    params = {
                        "start_id": start_id,
                        "end_id": end_id,
                        "product": product
                    }
                else:
                    query = f"""
                    MATCH (start {{id:$start_id}})
                    MATCH (end {{id:$end_id}})
                    MERGE (start)-[r:{rel_type}]->(end)
                    """
                    params = {
                        "start_id": start_id,
                        "end_id": end_id
                    }
                session.run(query, params)
- The ingest_nodes function reads entity data from the CSV and creates nodes in Neo4j. Each node represents an entity (supplier, manufacturer, etc.) with properties like name, location, and supply_capacity.
- The MERGE command ensures nodes are created or updated without duplication.
- The ingest_relationships function reads relationships from the CSV and creates connections between nodes, with properties like product where available.
- MERGE likewise ensures relationships are unique.
Creating Indexes
Creating indexes or constraints in Neo4j is crucial for performance and uniqueness:
def create_indexes(driver):
    with driver.session() as session:
        # One uniqueness constraint per label keeps ids unique and lookups fast
        for label in ["Supplier", "Manufacturer", "Distributor", "Retailer", "Product"]:
            session.run(f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) REQUIRE n.id IS UNIQUE")
- This code ensures that each node in the graph has a unique identifier by creating a unique constraint on the id property for each node type (e.g., Supplier, Manufacturer).
- The CREATE CONSTRAINT statement optimizes querying and avoids duplicate data.
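Once you have run create_indexes (see the next step), you can double-check the constraints. Recent Neo4j versions, including AuraDB, support listing them; a quick sketch:

# Optional: verify the uniqueness constraints after they are created
for row in graph.query("SHOW CONSTRAINTS"):
    print(row.get("name"), row.get("labelsOrTypes"), row.get("properties"))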
Running the Ingestion
The code given below brings everything defined earlier together and starts the data ingestion process in our Neo4j instance.
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
create_indexes(driver)
ingest_nodes(driver)
ingest_relationships(driver)
print("Data ingestion complete.")
driver.close()
Once the ingestion is complete, you can visualize the schema in your Neo4j AuraDB instance.
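If you prefer a quick local rendering, pyvis (the optional library installed earlier) can draw a sample of the graph. This is a minimal sketch assuming nodes carry id and name properties; the query, sample size, and output file name are all illustrative:

from pyvis.network import Network

# Sample up to 50 relationships to keep the rendering readable
rows = graph.query(
    "MATCH (a)-[r]->(b) "
    "RETURN a.id AS src, coalesce(a.name, a.id) AS src_name, "
    "type(r) AS rel, b.id AS dst, coalesce(b.name, b.id) AS dst_name "
    "LIMIT 50"
)
net = Network(notebook=True, directed=True)
for row in rows:
    net.add_node(row["src"], label=row["src_name"])
    net.add_node(row["dst"], label=row["dst_name"])
    net.add_edge(row["src"], row["dst"], title=row["rel"])
net.show("supply_chain.html")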
Vector Embeddings in Neo4j
Even though we have structured data, we still want to incorporate text embeddings for descriptions. By storing vector embeddings in Neo4j, you can do semantic queries (e.g., find suppliers similar to a query text).
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Neo4jVector

embedding = OpenAIEmbeddings(model="text-embedding-3-small")

neo4j_vector = Neo4jVector.from_existing_graph(
    embedding=embedding,
    index_name="supply_chain",
    node_label="Supplier",
    text_node_properties=["description"],
    embedding_node_property="embedding",
)
With this configuration:
- node_label="Supplier" restricts embedding storage to supplier nodes.
- text_node_properties=["description"] indicates which properties are embedded.
- embedding_node_property="embedding" is where the resulting embedding vectors are stored.
We can now do semantic similarity queries inside Neo4j, bridging structured data with unstructured text searches.
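For example, a quick standalone semantic query against the index looks like this (the query text is illustrative):

# Pure semantic search over the embedded supplier descriptions
docs = neo4j_vector.similarity_search("suppliers of industrial raw materials", k=3)
for doc in docs:
    print(doc.page_content)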
Tooling for Structured Queries
To handle structured queries (like counting or listing suppliers with numeric filters), we define tools. Each tool is essentially a function that the LLM can invoke. We use Pydantic to specify the expected inputs, ensuring clarity and type-checking.
Supplier Count Tool
We need a tool to count suppliers based on optional filters such as minimum supply capacity, maximum supply capacity, or grouping by a property (e.g., location). The following code defines the input schema using Pydantic and the function implementation to query Neo4j and return the count of suppliers.
Input Schema
We need to specify the input schema to ensure the right information is sent to the tool.
from pydantic import BaseModel, Field
from typing import Optional, Dict, List

class SupplierCountInput(BaseModel):
    min_supply_amount: Optional[int] = Field(
        description="Minimum supply amount of the suppliers"
    )
    max_supply_amount: Optional[int] = Field(
        description="Maximum supply amount of the suppliers"
    )
    grouping_key: Optional[str] = Field(
        description="The key to group by the aggregation",
        enum=["supply_capacity", "location"]
    )
SupplierCountInput defines the input structure for the tool. It includes optional fields for filtering suppliers by supply capacity, along with the ability to group the results (e.g., by location).
Function Implementation
The code block below defines the function that returns the supplier count.
import re
from langchain_core.tools import tool

def extract_param_name(filter: str) -> str:
    # Pull the parameter name (without the $) out of a condition like
    # "t.supply_capacity >= $min_supply_amount"
    pattern = r'\$\w+'
    match = re.search(pattern, filter)
    if match:
        return match.group()[1:]
    return None

@tool("supplier-count", args_schema=SupplierCountInput)
def supplier_count(
    min_supply_amount: Optional[int],
    max_supply_amount: Optional[int],
    grouping_key: Optional[str],
) -> List[Dict]:
    """Calculate the count of Suppliers based on particular filters"""
    filters = [
        ("t.supply_capacity >= $min_supply_amount", min_supply_amount),
        ("t.supply_capacity <= $max_supply_amount", max_supply_amount)
    ]
    # Keep only the filters the caller actually supplied
    params = {
        extract_param_name(condition): value
        for condition, value in filters
        if value is not None
    }
    where_clause = " AND ".join(
        [condition for condition, value in filters if value is not None]
    )
    cypher_statement = "MATCH (t:Supplier) "
    if where_clause:
        cypher_statement += f"WHERE {where_clause} "
    # Group the count by the requested key, if any
    return_clause = (
        f"t.{grouping_key}, count(t) AS supplier_count"
        if grouping_key
        else "count(t) AS supplier_count"
    )
    cypher_statement += f"RETURN {return_clause}"
    print(cypher_statement)  # Debugging output
    return graph.query(cypher_statement, params=params)
How It Works
- extract_param_name: This helper function extracts parameter names from the filter conditions (e.g., $min_supply_amount).
- supplier_count: This tool function constructs a Cypher query to count suppliers based on the given filters and optionally groups the result by a property like location.
- It dynamically builds the query, adds the appropriate filters, and executes it using Neo4j's query system.
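You can also invoke the tool directly, without the agent, to confirm it behaves as expected. Note that the schema fields have no defaults, so every key must be supplied (use None for unused filters); the values below are illustrative:

# Direct tool invocation, bypassing the LLM
result = supplier_count.invoke({
    "min_supply_amount": 20000,  # only count suppliers at or above this capacity
    "max_supply_amount": None,   # no upper bound
    "grouping_key": "location",  # group counts per location
})
print(result)  # e.g. [{'t.location': 'Oslo', 'supplier_count': 3}, ...]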
Supplier List Tool
We need a tool to list suppliers, optionally sorted and filtered by supply capacity, and possibly performing a vector search if a description is provided.
Input Schema
We need to specify the input schema to ensure the right information is sent to the tool.
class SupplierListInput(BaseModel):
    sort_by: str = Field(description="How to sort Suppliers by supply capacity", enum=['supply_capacity'])
    k: Optional[int] = Field(description="Number of Suppliers to return")
    description: Optional[str] = Field(description="Description of the Suppliers")
    min_supply_amount: Optional[int] = Field(description="Minimum supply amount of the suppliers")
    max_supply_amount: Optional[int] = Field(description="Maximum supply amount of the suppliers")
SupplierListInput defines the input schema for this tool, with parameters for filtering by supply capacity, sorting by a specified key (e.g., supply_capacity), and an optional description for vector-based search.
Function Implementation
The code block below defines the function that returns the supplier list.
@tool("supplier-list", args_schema=SupplierListInput)
def supplier_list(
sort_by: str = "supply_capacity",
k : int = 4,
description: Optional[str] = None,
min_supply_amount: Optional[int] = None,
max_supply_amount: Optional[int] = None,
) -> List[Dict]:
"""List suppliers based on particular filters"""
# Handle vector-only search when no prefiltering is applied
if description and not min_supply_amount and not max_supply_amount:
return neo4j_vector.similarity_search(description, k=k)
filters = [
("t.supply_capacity >= $min_supply_amount", min_supply_amount),
("t.supply_capacity <= $max_supply_amount", max_supply_amount)
]
params = {
key.split("$")[1]: value for key, value in filters if value is not None
}
where_clause = " AND ".join([condition for condition, value in filters if value is not None])
cypher_statement = "MATCH (t:Supplier) "
if where_clause:
cypher_statement += f"WHERE {where_clause} "
# Sorting and returning
cypher_statement += " RETURN t.name AS name, t.location AS location, t.description as description, t.supply_capacity AS supply_capacity ORDER BY "
if description:
cypher_statement += (
"vector.similarity.cosine(t.embedding, $embedding) DESC "
)
params["embedding"] = embedding.embed_query(description)
elif sort_by == "supply_capacity":
cypher_statement += "t.supply_capacity DESC "
else:
# Fallback or other possible sorting
cypher_statement += "t.year DESC "
cypher_statement += " LIMIT toInteger($limit)"
params["limit"] = k or 4
print(cypher_statement) # Debugging output
data = graph.query(cypher_statement, params=params)
return data
supplier_list: This tool lists suppliers, optionally filtering by supply capacity and sorting by supply capacity or other criteria. If a description is provided, it performs a vector similarity search using Neo4j's vector store to find suppliers matching the description. The function also handles query formation and executes it in Neo4j.
Key Points:
- Vector Search: If only a description is given, it relies solely on the Neo4j vector store.
- Combined Structured + Vector: If other filters are also provided, it builds a Cypher query that orders by vector similarity or supply_capacity.
- Robustness: By strictly encoding how queries are formed, it minimizes errors and maintains clarity.
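As with supplier-count, the tool can be exercised directly before wiring it into the agent; again, all keys are required by the schema, and the values here are illustrative:

# Structured-only call: top 3 suppliers by capacity, at or above 30,000
rows = supplier_list.invoke({
    "sort_by": "supply_capacity",
    "k": 3,
    "description": None,  # no semantic component for this call
    "min_supply_amount": 30000,
    "max_supply_amount": None,
})
for row in rows:
    print(row["name"], row["supply_capacity"])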
Integrating LangChain + LangGraph
In this section, we create a ReAct-style agent that uses LangGraph to decide when to invoke tools like supplier-count and supplier-list. LangChain is used for managing the LLM interface, while LangGraph defines the flow of the agent.
Constructing the Agent
Below is the code snippet that creates the agent by binding the tools to the large language model.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOpenAI(model='gpt-4-turbo')
tools = [supplier_count, supplier_list]
llm_with_tools = llm.bind_tools(tools)
sys_msg = SystemMessage(content="You are a helpful assistant tasked with finding and explaining relevant information about Supply chain")
- llm_with_tools: binds the ChatOpenAI model to the tools (supplier_count and supplier_list), enabling the LLM to invoke these tools as needed during its operation.
- sys_msg: sets the initial system message for the agent, instructing it on its role.
Defining the Flow with LangGraph
We define two nodes:
- assistant: uses the LLM to parse the messages and decide if a tool call is needed.
- tools: executes any tool requests.
A conditional edge checks whether the LLM's last message includes a tool call. If so, control passes to the tools node; if not, the run terminates.
from langgraph.graph import StateGraph, START, MessagesState
from langgraph.prebuilt import tools_condition, ToolNode
from IPython.display import Image, display

def assistant(state: MessagesState):
    # Prepend the system message and let the tool-bound LLM respond
    return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))

# Define edges:
builder.add_edge(START, "assistant")
# If there's a tool call, go to 'tools'; else finish
builder.add_conditional_edges("assistant", tools_condition)
builder.add_edge("tools", "assistant")

react_graph = builder.compile()
display(Image(react_graph.get_graph(xray=True).draw_mermaid_png()))
Demonstration and Testing
We will send various queries to the agent and observe how it decides to call the tools (supplier-count or supplier-list).
Counting Suppliers
The supplier-count tool is invoked when the LLM identifies that a count of suppliers is needed based on the given filters. The result is then returned to the user.
from langchain_core.messages import HumanMessage

messages = [
    HumanMessage(
        content="How many suppliers have supply capacity more than 20000 and are located in Oslo?"
    )
]
result = react_graph.invoke({"messages": messages})
for m in result["messages"]:
    m.pretty_print()
- The LLM identifies it needs a count of suppliers → invokes supplier-count.
- The tool returns the number of suppliers.
- The LLM relays this to the user.
Listing Suppliers
The supplier-list tool is called when the query asks for suppliers above a certain supply capacity. The agent processes the request and retrieves the relevant suppliers.
messages = [
    HumanMessage(
        content="What are the suppliers having capacity above 40000?"
    )
]
result = react_graph.invoke({"messages": messages})
for m in result["messages"]:
    m.pretty_print()
Here the LLM calls supplier-list to retrieve suppliers with capacity above 40,000.
Combined Queries
If you provide a query that includes a description plus numeric filters, the agent will combine vector similarity with a structured query in Neo4j:
messages = [
    HumanMessage(
        content="Find suppliers that deal with steel and have at least 20000 supply capacity."
    )
]
result = react_graph.invoke({"messages": messages})
for m in result["messages"]:
    m.pretty_print()
The tool will build a WHERE clause for supply capacity >= 20000 and also use the stored embedding property for semantic matching on descriptions related to dealing with steel.
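Because the tool prints the Cypher it builds, you can see roughly what the combined statement looks like (shape taken from the supplier_list code above; parameter values are bound at run time):

MATCH (t:Supplier)
WHERE t.supply_capacity >= $min_supply_amount
RETURN t.name AS name, t.location AS location, t.description AS description, t.supply_capacity AS supply_capacity
ORDER BY vector.similarity.cosine(t.embedding, $embedding) DESC
LIMIT toInteger($limit)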
Conclusion
By moving beyond embedding-only queries and incorporating knowledge graphs + structured tools, we unlock a powerful synergy:
- Precision: Numeric filters, counts, and grouping are handled flawlessly by Neo4j’s Cypher queries.
- Semantic Relevance: Text embeddings still excel at matching descriptions semantically.
- Stability: We keep generation-based query creation to a minimum, instead relying on deterministic function calls. This approach greatly reduces errors in production.
- Scalability: As your schema grows, you can expand your toolkit with more specialized tools. The LLM orchestrates them rather than having to generate complicated queries autonomously.
You’ve just built a fully functional RAG system that combines structured and unstructured data into a single pipeline, ensuring the best of both worlds — semantic search via text embeddings and robust, precise queries via Neo4j’s knowledge graph.
Feel free to expand on these ideas for your own use cases: add more tools for advanced analytics, incorporate additional vector indexes, or apply complex relationship traversals (e.g., supply chain path analysis). The possibilities are limitless, and now you have the blueprint for building reliable, production-grade RAG applications.