Building graph vector search using VectorETL

Babajide Ogunjobi
Context Data

We launched VectorETL just over two months ago, and every few days we stumble on new and interesting ways to use it to make vector data processing faster. As we explore use cases, we are always looking for opportunities to integrate VectorETL into existing AI-focused workflows (RAG, enterprise search, etc.).

One of the more interesting vector target integrations we have is Neo4j, one of the most widely used graph databases in the world. Using graphs (and by extension Neo4j) for generative AI use cases has been a hot topic in the last few years because of the search and retrieval capabilities that graph relationships add on top of plain vector search.

In this article, we’ll explore how to use VectorETL to quickly and easily build a powerful graph search application using Neo4j. We’ll cover the entire process from data ingestion to querying the graph using natural language, leveraging the power of vector embeddings and large language models.

Building Steps:

1). Use VectorETL to define a graph schema in a YAML configuration file and load data into Neo4j

2). Write a simple script that lets us ask a chat LLM (OpenAI) questions about the graph in natural language

Prerequisite (Get a Neo4j database)

Neo4j offers a free tier where you can create an entry-level graph. Just go to the Neo4j site here, create an account and a database, and grab your connection details.
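Once you have those details, it is worth sanity-checking them before running any pipelines. Here is a minimal sketch using the official neo4j Python driver (the URI and credentials are placeholders, matching the ones used in the configuration later on):

from neo4j import GraphDatabase

# Placeholders: substitute the URI, username and password from your Aura console
driver = GraphDatabase.driver(
    "bolt+s://my-neo4j-uri:7687",
    auth=("neo4j", "my-neo4j-password"),
)
driver.verify_connectivity()  # raises an exception if the URI or credentials are wrong
print("Connected to Neo4j")
driver.close()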

Defining the graph structure/model

Using sample data we retrieved from Kaggle, we start by defining the graph model. One of the fantastic things about VectorETL is the ability to define the schema of your Neo4j graph without writing multiple lines of Cypher. You define it in a YAML or JSON configuration file, and VectorETL builds the graph from there.
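For a sense of what the configuration file saves you, setting this up by hand would mean writing Cypher like the following for every uniqueness constraint and vector index (a sketch based on Neo4j 5.x syntax; the constraint and index names are illustrative):

// One of these per node label with a unique property
CREATE CONSTRAINT user_id_unique IF NOT EXISTS
FOR (u:User) REQUIRE u.UserID IS UNIQUE;

// Plus a vector index for similarity search over the embeddings
CREATE VECTOR INDEX product_embedding IF NOT EXISTS
FOR (p:Product) ON p.embedding
OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}};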

For example, using the fashion data we retrieved, we created the configuration file below.

source:
  source_data_type: "Local File"
  file_path: "/Downloads/fashion_products.csv" # absolute location of file
  file_type: "csv"
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: "open-ai-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Neo4j"
  neo4j_uri: "bolt+s://my-neo4j-uri:7687"
  username: "neo4j"
  password: "my-neo4j-password"
  vector_property: "embedding"
  vector_dimensions: 1536 # Adjust based on your embedding size
  similarity_function: "cosine" # or "euclidean"
  graph_structure:
    nodes:
      - label: "Product"
        properties:
          - "ProductID"
          - "ProductName"
          - "Brand"
          - "Price"
          - "Rating"
      - label: "User"
        properties:
          - "UserID"
        unique: true
      - label: "Brand"
        properties:
          - "Brand"
        unique: true
      - label: "Size"
        properties:
          - "Size"
        unique: true
      - label: "Color"
        properties:
          - "Color"
        unique: true
    relationships:
      - start_node: "Product"
        end_node: "User"
        type: "BOUGHT_BY"
      - start_node: "Product"
        end_node: "Brand"
        type: "MADE_BY"
        unique: true
      - start_node: "Product"
        end_node: "Color"
        type: "HAS_COLOR"
        unique: true
      - start_node: "Product"
        end_node: "Size"
        type: "HAS_SIZE"
        unique: true
      - start_node: "User"
        end_node: "Size"
        type: "BOUGHT_SIZE"
        unique: true
      - start_node: "User"
        end_node: "Color"
        type: "LIKES_COLOR"
        unique: true

embed_columns: []

1). The source key defines the location of the source data. In this case, the source_data_type is “Local File”, and we also define the path to the local file

2). The embedding key defines which embedding model we want to apply to the extracted data. VectorETL supports OpenAI, Cohere, Google Gemini and Hugging Face, but we’re going to use OpenAI for this demonstration

3). The target key is where we define the vector database or store the embeddings will be written to. VectorETL supports all major vector databases, but in this case we’re writing the data to Neo4j.

  • Within the graph_structure key, you’ll see where we define the graph nodes and relationships
  • We define nodes for the product, user, size, color and brand
  • For the edges, we define relationships using the start_node, end_node and relationship type keys

4). Lastly, we define the embed_columns key. Given that the source is file-based, we’ll just leave it as an empty list; for database sources, this is where the columns to embed would go, as sketched below.
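A hypothetical sketch of what that key would look like for a database source (the column names are illustrative):

embed_columns:
  - "ProductName"
  - "Brand"
  - "Color"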

For more information on how to build a VectorETL configuration file, check out the documentation here.

Now that the configuration file has been created, we can simply run the VectorETL process. There are two ways to run VectorETL with the YAML file (command line and Python import), but we’ll use the Python import option here.

from vector_etl import create_flow

flow = create_flow()
flow.load_yaml('/path/to/my/config_file.yaml')
flow.execute()
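For completeness, the command-line route is a one-liner along these lines (a sketch; the exact flag name is an assumption, so check the VectorETL docs):

vector-etl -c /path/to/my/config_file.yaml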

That’s it! This is how easy it is to run the VectorETL process (literally 4 lines!!)

If you have VectorETL installed on your computer and your configuration file is accurate, the whole process should run to completion in a few minutes.

If you go to your Neo4j Aura database, you should see a graph similar to this

Building a simple RAG pipeline

Now that our data is in Neo4j, let’s write the script that will let us query the graph in natural language. My thought process in designing this script was as follows:

1). Using the graph structure defined earlier in the YAML file, we first pass it to a chat LLM to give it context on the structure of the graph.

  • When building graph-based RAG applications, this is one of the hardest parts: people tend to ask open-ended questions and expect the LLM to figure out the graph structure on its own.
  • Here, that part is made easy for the LLM because we already defined the structure in the configuration file.

2). Now that the LLM knows exactly what the graph looks like, I can send it my question. It parses the question against the graph’s nodes, properties and relationships, then generates an accurate (or close to accurate) Cypher query, which we execute against the database (see the illustrative query after this list).

3). Using the results generated from the query together with the original question, the chat LLM compares both and uses them to answer the question.
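To make step 2 concrete, here is the kind of query the LLM might produce for a question like “Which brands has user 42 bought?” given our schema (an illustrative, hand-written query, not actual model output):

MATCH (p:Product)-[:BOUGHT_BY]->(u:User {UserID: 42})
MATCH (p)-[:MADE_BY]->(b:Brand)
RETURN DISTINCT b.Brand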

Here’s how I implemented this logic in Python. You’ll notice that I’m still making use of the YAML config file, especially for the graph structure and the OpenAI credentials.

import yaml
from openai import OpenAI
from neo4j import GraphDatabase


class RAGNeo4j:
    def __init__(self, config_path):
        # Reuse the same YAML config that drove the VectorETL load
        with open(config_path, 'r') as file:
            self.config = yaml.safe_load(file)

        self.neo4j_driver = GraphDatabase.driver(
            self.config['target']['neo4j_uri'],
            auth=(self.config['target']['username'], self.config['target']['password'])
        )

    def generate_cypher_query(self, question):
        # Pass the graph structure from the config so the LLM knows the schema
        prompt = f"""
        Given the following graph structure:
        {self.config['target']['graph_structure']}

        And the user's question: "{question}"

        Generate a Cypher query to retrieve relevant information from the graph and ONLY return the cypher query in text format. DO NOT ADD ```cypher
        """

        client = OpenAI(api_key=self.config['embedding']['api_key'])
        response = client.chat.completions.create(
            model='gpt-4o',
            messages=[
                {"role": "system",
                 "content": "You are a helpful assistant that generates Cypher queries based on natural language questions and a given graph structure."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.8
        )

        return response.choices[0].message.content

    def execute_cypher_query(self, query):
        # Run the generated Cypher against Neo4j and collect the records
        with self.neo4j_driver.session() as session:
            print(query)
            result = session.run(query)
            return [record.data() for record in result]

    def generate_answer(self, question, query_results):
        # Ask the LLM to answer the original question using the query results
        prompt = f"""
        Question: {question}

        Graph database results: {query_results}

        Please provide a concise answer to the question based on the given information.
        """

        system_message = "You are a helpful assistant that answers questions about fashion clothing based on the provided graph database results."
        client = OpenAI(api_key=self.config['embedding']['api_key'])
        response = client.chat.completions.create(
            model='gpt-4o',
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": prompt}
            ],
            temperature=0.8
        )

        return response.choices[0].message.content

    def answer_question(self, question):
        # End-to-end: question -> Cypher -> results -> natural language answer
        cypher_query = self.generate_cypher_query(question)
        query_results = self.execute_cypher_query(cypher_query)
        answer = self.generate_answer(question, query_results)
        return answer

    def close(self):
        self.neo4j_driver.close()


# Example usage
if __name__ == "__main__":
    config_path = "/path/to/my/config_file.yaml"
    rag = RAGNeo4j(config_path)

    while True:
        question = input("Enter your question (or 'quit' to exit): ")
        if question.lower() == 'quit':
            break

        try:
            answer = rag.answer_question(question)
            print(f"\nQuestion: {question}")
            print(f"Answer: {answer}\n")
        except Exception as e:
            print(f"An error occurred: {str(e)}")

    rag.close()
    print("Thank you for using the RAG system. Goodbye!")

Here’s how it looked when I executed the script:

QED

You can access the full codebase here

Over the next few weeks, we’ll publish more great use cases and recipes using VectorETL, along with new integrations.

If you have any suggestions or requests, don’t hesitate to send them to the team at info@contextdata.ai
