Mastering Advanced RAG Methods — GraphRAG with Neo4j | Implementation with LangChain

JingleMind.Dev
5 min read · Jul 30, 2024


Graph retrieval-augmented generation (GraphRAG) is gaining momentum and becoming a powerful addition to traditional vector search retrieval methods. This approach leverages the structured nature of graph databases, which organize data as nodes and relationships, to enhance the depth and contextuality of retrieved information.

Here we will jump straight into the implementation of GraphRAG using Neo4j and LangChain.

Hybrid Retrieval for RAG

Installing Python dependencies:

Before diving into the code, we need to install the necessary libraries. This implementation relies on langchain, unstructured, neo4j, openai, yfiles_jupyter_graphs, and several other dependencies.

%pip install langchain unstructured[all-docs] pydantic lxml pytesseract json-repair
%pip install openai wikipedia tiktoken neo4j python-dotenv yfiles_jupyter_graphs
%pip install --upgrade --quiet langchain langchain-community langchain-openai langchain-experimental neo4j

Importing Required Libraries

We start by importing all necessary libraries to work with document parsing, vector embeddings, chat models, graph transformations, and Neo4j.

import os
import uuid

from dotenv import load_dotenv
from neo4j import GraphDatabase
from unstructured.partition.auto import partition_pdf
from yfiles_jupyter_graphs import GraphWidget

from langchain.chains import GraphCypherQAChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.graphs import Neo4jGraph
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.output_parser import StrOutputParser
from langchain.storage import InMemoryStore
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import Neo4jVector
from langchain_core.documents import Document
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_experimental.graph_transformers import LLMGraphTransformer

Setting Up Environment Variables

Ensure your environment variables are correctly set up for the OpenAI API and Neo4j database credentials.

load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
os.environ["NEO4J_URI"] = os.getenv('NEO4J_URI')
os.environ["NEO4J_USERNAME"] = os.getenv('NEO4J_USERNAME')
os.environ["NEO4J_PASSWORD"] = os.getenv('NEO4J_PASSWORD')

PDF Parsing Using Unstructured Libraries

We use the unstructured library to parse a PDF document into its core components.

# Base directory for the input PDF and extracted images
# (not defined in the original snippet; adjust to your layout).
path = "./"

raw_pdf_elements = partition_pdf(
    filename=path + "input/2024q2-alphabet-earnings-release.pdf",
    # Extract embedded image blocks from the PDF
    extract_images_in_pdf=True,
    # Use a layout model (YOLOX) to get bounding boxes (for tables) and find titles;
    # titles are any sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # start a new chunk around 3800 chars, combine chunks under 2000 chars,
    # hard max of 4000 chars per chunk
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path + 'img',
)
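
Before building documents, it is worth taking a quick look at what partition_pdf actually returned. A small sanity check (not part of the original flow) could be:

from collections import Counter

# How many of each element type did partition_pdf produce, and what does a chunk look like?
print(Counter(type(el).__name__ for el in raw_pdf_elements))
print(str(raw_pdf_elements[0])[:500])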

Process and categorize the parsed elements:

id_key = "doc_id"
categorized_elements = []
documents = []
for element in raw_pdf_elements:
if "unstructured.documents.elements.Table" in str(type(element)):
documents.append(Document(page_content=str(element), metadata={id_key: str(uuid.uuid4())}))
elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
documents.append(Document(page_content=str(element), metadata={id_key: str(uuid.uuid4())}))

table_elements = [e for e in categorized_elements if e.type == "table"]
text_elements = [e for e in categorized_elements if e.type == "text"]

Generating Graph Nodes and Relationships

Using the LLMGraphTransformer from langchain_experimental, we generate graph nodes and relationships from the parsed documents.

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
llm_transformer = LLMGraphTransformer(llm=llm)

graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")

Storing Graph Documents

Initialize a Neo4j instance and store the generated graph documents.

graph = Neo4jGraph()
graph.add_graph_documents(
    graph_documents,
    baseEntityLabel=True,
    include_source=True
)
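
Optionally, you can sanity-check what was written straight from Python. With baseEntityLabel=True each extracted entity also carries an __Entity__ label, and include_source=True links entities back to their source Document nodes via MENTIONS relationships, so a couple of counts (a quick check, not part of the original flow) tell you whether ingestion worked:

# Count extracted entities and the MENTIONS links back to source chunks.
print(graph.query("MATCH (e:__Entity__) RETURN count(e) AS entities"))
print(graph.query("MATCH (:Document)-[r:MENTIONS]->() RETURN count(r) AS mentions"))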

Visualize the stored graph nodes:

default_cypher = "MATCH (s)-[r:!MENTIONS]->(t) RETURN s,r,t LIMIT 50"

def showGraph(cypher: str = default_cypher):
    driver = GraphDatabase.driver(
        uri=os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])
    )
    session = driver.session()
    widget = GraphWidget(graph=session.run(cypher).graph())
    widget.node_label_mapping = 'id'
    return widget

showGraph()

Querying the Graph Database

The GraphCypherQAChain helps form queries to fetch data from the graph database based on the provided schema and question prompts.

CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
...
"""
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

CYPHER_QA_TEMPLATE = """You are an assistant that helps to form nice and human understandable answers.
...
"""
CYPHER_QA_PROMPT = PromptTemplate(
    input_variables=["context", "question"], template=CYPHER_QA_TEMPLATE
)
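
Both templates are truncated above with "...". If you need a concrete starting point, an illustrative pair is sketched below; these are not the article's original prompts, only minimal placeholders that satisfy the declared input_variables ({schema} and {question} for Cypher generation, {context} and {question} for answer formation). Define them before constructing the PromptTemplate objects above.

# Illustrative templates only; replace with your own wording.
CYPHER_GENERATION_TEMPLATE = """Task: Generate a Cypher statement to query a graph database.
Use only the relationship types and properties provided in the schema below.
Schema:
{schema}
Return only the Cypher statement, with no explanations or apologies.

The question is:
{question}"""

CYPHER_QA_TEMPLATE = """You are an assistant that helps to form nice and human understandable answers.
Use the information below to answer the question. If the information is empty, say you don't know.
Information:
{context}

Question: {question}
Helpful Answer:"""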

graph_chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=graph,
    cypher_prompt=CYPHER_GENERATION_PROMPT,
    qa_prompt=CYPHER_QA_PROMPT,
    verbose=True
    # Recent langchain versions may additionally require allow_dangerous_requests=True here.
)

Graph nodes and relationships are now stored. Next, let's store the unstructured text itself as vector embeddings.

Storing Document Vector Embeddings

Store the vector embeddings of the documents using Neo4jVector. Creating the store with search_type="hybrid" and explicit index names also builds the full-text (keyword) index that the hybrid retriever in the next step loads:

from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
db = Neo4jVector.from_documents(
    documents,
    embeddings,
    # Index names and search type must match what we load with from_existing_index below.
    index_name="vector",
    keyword_index_name="keyword",
    search_type="hybrid",
)

Creating a Hybrid Retriever

Create a hybrid retriever for combining graph database search, semantic similarity search, and full text search:

index_name = "vector"
keyword_index_name = "keyword"
store = Neo4jVector.from_existing_index(
OpenAIEmbeddings(),
index_name=index_name,
keyword_index_name=keyword_index_name,
search_type='hybrid'
)
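
A quick way to confirm the hybrid index is usable is to run a similarity search directly against the store. The query string is just an illustrative example for the earnings-release document:

# Illustrative query; returns the top matching chunks from the hybrid index.
for doc in store.similarity_search("revenue growth", k=2):
    print(doc.page_content[:200], "\n---")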

# Sample question answered purely from the graph via generated Cypher.
graph_chain.run("most used topic in the given dataset?")

Define a retriever chain function:

def retriever(question: str):
    print(f"Search query: {question}")
    # Structured context from the graph chain, unstructured context from the hybrid vector store
    structured_data = graph_chain.run(question)
    unstructured_data = [el.page_content for el in store.similarity_search(question)]
    final_data = f"""Structured data:
{structured_data}
Unstructured data:
{"#Document ".join(unstructured_data)}
"""
    print(f"Final context: {final_data}")
    return final_data

Define LangChain Retriever Chain

Define the answer prompt and wire the retriever into the final LangChain retrieval chain:

template = """Answer the question based only on the following context:
{context}

Question: {query}
...

suite = ChatPromptTemplate.from_template(template)

retrieval_chain = (
{"context": retriever, "query": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)

retrieval_chain.invoke("What are the top themes in this dataset?")
retrieval_chain.invoke("Most discussed clauses?")

Everything is in place: the graph database holds the extracted nodes and relationships, and the hybrid index retriever combines semantic similarity search with full-text search. It's time to query and enjoy.

retrieval_chain.invoke("What are the top themes in this dataset?")

As demonstrated, integrating GraphRAG with Neo4j offers a robust pipeline for parsing, storing, and querying unstructured documents. Traditional vector-only retrieval often falls short on corpus-level questions like "What are the top themes in this dataset?". GraphRAG handles them better by explicitly modelling and traversing the relationships within the data.

This makes GraphRAG particularly valuable for scenarios requiring deep data analysis and the extraction of meaningful relationships from large datasets. Whether for academic research, enterprise data management, or enhancing AI applications, GraphRAG can significantly outperform standard RAG methods, making it a valuable tool for modern data analysis. Embrace the power of GraphRAG and elevate your data insights. Happy learning!
