Documents Are Property Graphs

Fanghua (Joshua) Yu
12 min read · Oct 20, 2023


Leveraging a semantically rich & schema-flexible graph database for successful RAG solutions

The dome of Chadstone Shopping Centre in Melbourne. Photo by author.

Abstract

Retrieval Augmented Generation (RAG) combines the generative capabilities of large language models (LLMs) like GPT with dynamic information retrieval from external databases, allowing for up-to-date, factual, domain-specific, and detailed responses. This makes RAG versatile for tasks like question answering over vast and evolving data sources. So far, a lot of research, blog posts, and solutions have focused on improving retrieval and generation, but few on how the knowledge itself is stored. In this post, I will walk through the key features of a RAG solution and explain why a property graph database is in fact the best choice for a successful implementation.

Knowledge Store — The Less Mentioned But Critical Piece of RAG

So far, a lot of research, blog posts, and solutions have focused on improving retrieval and generation, using innovative approaches like text embedding, vector indexes, chunking strategies, etc., but few on how the knowledge is stored.

A knowledge store (or knowledge base) is a centralized, structured, and frequently updated repository of knowledge. For RAG, which relies on retrieval-augmented mechanisms, having a robust knowledge store ensures that the information being retrieved is comprehensive, current, and accurate. Without a well-maintained knowledge store, RAG’s potential is limited, as it might pull outdated or incomplete data, undermining the effectiveness of the generated responses.

While traditional document and relational databases have their strengths, the inherent interconnectedness of document data makes graph databases an even more attractive option. The graph data model can natively represent, query, and analyze the complex relationships found in and between documents. As a knowledge store, a native property graph has clear advantages:

1. Rich Representation:

Documents are inherently hierarchical and interconnected, with entities, metadata, references, and other relationships.
A graph database, with its nodes and edges, can natively represent the complex interrelations found in documents. For example, a research paper with authors, citations, institutions, and topics can be easily represented as a graph.

2. Flexible Schema:

The structure of documents can evolve over time. Some documents may have fields or relationships that others don't. Unlike relational databases, which may require schema alterations to accommodate changes, graph databases can easily accommodate evolving data models, allowing new nodes and relationships to be added without disruptive schema changes (see the sketch after this list).

3. Efficient Relationship Queries:

Document databases can store hierarchical data but may not be as efficient when traversing relationships. In published benchmarks, graph databases like Neo4j have traversed relationships orders of magnitude faster than relational and document databases.

4. Semantic Search Capabilities:

Documents often have implicit relationships and semantics that can be uncovered using graph algorithms. With graph-based storage, algorithms like PageRank, community detection, or shortest path can be applied to discover hidden relationships or rank document relevance. There are also various index types available for both keyword-based and vector search, such as those recently released by Neo4j.

5. Enhanced Data Linking:

In interconnected data, one piece of information can be related to another in multiple ways. Using a graph, you can connect a document to an author, an institution, a topic, a timestamp, etc., enabling multi-faceted queries and analytics. This is evident in knowledge graph implementations where articles, entities, and concepts are interlinked.

6. Improved Data Integrity:

Graph databases treat relationships as first-class citizens. By enforcing relationships at the database level (like ensuring every "author" node is connected to a "document" node), data integrity issues can be minimized.

7. Intuitive Visualization:

Humans often find graph-based visualizations more intuitive when exploring complex data sets. Tools like Neo4j Bloom and Browser provide visualization capabilities for graph data, helping users understand and explore the relationships between documents and their associated entities.

8. Scalability and Performance:

Modern graph databases are designed to scale horizontally, accommodating large volumes of interconnected data. Databases like Neo4j have demonstrated scalability to billions of nodes and relationships while maintaining efficient query performance.
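
To make points 2 and 3 above concrete, here is a minimal Cypher sketch over hypothetical data: new labels, properties, and relationship types are added on the fly without any migration, and relationships are traversed directly instead of being joined.

// Schema flexibility: new labels, properties and relationship types
// are created on the fly (hypothetical data for illustration)
CREATE (d:Document {title: 'A Sample Paper'})
CREATE (a:Author {fullname: 'Jane Doe'})
CREATE (d) -[:HAS_AUTHOR]-> (a)
// A field that only some documents have is simply set where needed
SET d.doi = '10.0000/sample'
WITH d
// Relationship query: traverse edges directly instead of joining tables
MATCH (author:Author {fullname: 'Jane Doe'}) <-[:HAS_AUTHOR]- (doc:Document)
RETURN doc.title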

If a native property graph still sounds unfamiliar to you, here is an easy-to-follow guide to start with:

Bringing Structure to Unstructured Data in a Graph

Documents, which are commonly considered unstructured, are actually more structured than we might think. Let's take a closer look, using a paper from arXiv and Neo4j as the graph database.

Document Metadata as Graph

Below is the first page of a paper in PDF format from arXiv. The metadata of this paper can obviously be extracted into well-structured data using existing libraries like PyPDF2, so I will skip this step.

Extracting metadata of a paper. Source: https://arxiv.org/abs/2305.14449

What I am showing here is how the metadata, once identified and extracted, is loaded into a Neo4j graph database. Of course, you'll need a Neo4j database to begin with, which can be created on Aura for FREE:

Here are the Cypher statements needed to create a graph for the paper metadata:

// -----------------------------
// 1. Create nodes
// 1.1 Create Document node
CREATE (d:Document{url:'https://arxiv.org/abs/2305.14449'})
// 1.2 Create Subject Node
CREATE (s:Subject{text:'Graph Meets LLM: A Novel Approach to Collaborative Filtering for Robust Conversational Understanding'})
// 1.3 Create Author nodes (for the first 3)
CREATE (author1:Author{fullname:'Zheng Chen', email:'zgchen@amazon.com'})
CREATE (author2:Author{fullname:'Ziyan Jiang', email:'ziyjiang@amazon.com'})
CREATE (author3:Author{fullname:'Fan Yang', email:'ffanyang@amazon.com'})
// 1.4 Create Abstract node
CREATE (abstract:Abstract)
// 1.5 Create Keyword nodes
CREATE (kw1:Keyword{text:'Collaborative Filtering'})
CREATE (kw2:Keyword{text:'Large Language Models'})
CREATE (kw3:Keyword{text:'Query Rewriting'})
// -----------------------------
// 2. Connect nodes
CREATE (d) -[:HAS_SUBJECT]-> (s)
CREATE (d) -[:HAS_AUTHOR]-> (author1)
CREATE (d) -[:HAS_AUTHOR]-> (author2)
CREATE (d) -[:HAS_AUTHOR]-> (author3)
CREATE (d) -[:HAS_ABSTRACT]-> (abstract)
CREATE (d) -[:HAS_KEYWORD]-> (kw1)
CREATE (d) -[:HAS_KEYWORD]-> (kw2)
CREATE (d) -[:HAS_KEYWORD]-> (kw3)
RETURN *
// DONE

After running the code above, you should be able to see a visualization like this:

Metadata graph for a paper shown in Neo4j Browser.

You may have noticed several things:

i. Neo4j creates the data schema implicitly during the creation process, which makes the ingestion of any form of data quite easy and straightforward (see the snippet after this list for one way to inspect it).

ii. The property graph model is very intuitive to both technical and non-technical audiences.

iii. The visualization makes the graph model highly expressive and illustrative.
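
To check what schema Neo4j has inferred from the data so far, there is a built-in procedure:

// Inspect the node labels and relationship types inferred from the data
CALL db.schema.visualization()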

And these are not all. Let's continue our journey with the actual contents of the document.

Text Corpus as Graph

There have been numerous methods developed so far to extract useful entities and relationships from free-form text, and more are to come with better and better performance.

Named entities, like people, organizations, locations, dates, and other proper nouns, provide context and specificity to the document and are often crucial for tasks like entity linking and knowledge graph construction. Relationships are connections or associations between entities, topics, or other document components, often expressed as verbs.

Extracting structure from free-form text.

Once extraction is done, the Cypher script below can be run to create the graph and connect the new nodes with the existing abstract and document nodes.

// 1. Find the Document node by its url, and its Abstract node
MATCH (doc:Document{url:'https://arxiv.org/abs/2305.14449'}) -[:HAS_ABSTRACT]-> (abstract)
WITH doc, abstract
// 2. Create nodes for paragraph, sentence and keyword
// 2.1 Create Sentence nodes from the first paragraph of abstract
CREATE (sentence1:Sentence{text:'Conversational AI systems like Alexa, Siri, and Google Assistant require an understanding of defective queries to ensure robust conversational functionality and minimize user friction.', seq:1})
SET sentence1.texthash = apoc.hashing.fingerprint(sentence1.text)
CREATE (sentence2:Sentence{text:'Such defective queries often stem from user ambiguities, errors, or inaccuracies in automatic speech recognition (ASR) and natural language understanding (NLU).', seq:2})
SET sentence2.texthash = apoc.hashing.fingerprint(sentence2.text)
CREATE (kw4:Keyword{text:'conversational AI'})
CREATE (kw5:Keyword{text:'automatic speech recognition'})
CREATE (para:Paragraph)
SET para.texthash = apoc.hashing.fingerprint(sentence1.text + sentence2.text)
// 3. Connect nodes
CREATE (abstract) -[:HAS_PARAGRAPH]-> (para)
CREATE (para) -[:HAS_SENTENCE]-> (sentence1)
CREATE (para) -[:HAS_SENTENCE]-> (sentence2)
CREATE (sentence1) -[:HAS_KEYWORD]-> (kw4)
CREATE (sentence1) -[:HAS_KEYWORD]-> (kw5)
CREATE (sentence2) -[:HAS_KEYWORD]-> (kw4)
CREATE (sentence1) -[:HAS_NEXT_SENTENCE]-> (sentence2)
RETURN *

The results look like this:

Turn text corpus into a graph.

The text corpus graph holds the original text content, stored in the text property of Sentence nodes, as well as the hierarchical structure of the document, i.e. Document > Abstract > Paragraph > Sentence > Keyword. The same process can be applied to the whole document quite straightforwardly.
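
As a quick illustration, the whole hierarchy can be walked with a single path pattern (a sketch against the graph built above):

// Walk the hierarchy to reassemble the abstract's text in order
MATCH (d:Document {url:'https://arxiv.org/abs/2305.14449'})
      -[:HAS_ABSTRACT]-> (:Abstract)
      -[:HAS_PARAGRAPH]-> (:Paragraph)
      -[:HAS_SENTENCE]-> (s:Sentence)
RETURN s.text ORDER BY s.seq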

The document graph has enabled a few more benefits:

i. Hash codes are generated for the text at sentence and paragraph level. They help identify future changes to the text in a granular and efficient way, simplifying incremental data ingestion down the track.

ii. Adjacent sentences are connected by a HAS_NEXT_SENTENCE relationship. This approach preserves the correlation and relevance among sentences, which greatly benefits text embedding and semantic retrieval.

iii. Keywords mentioned in sentences are extracted as needed. More significantly, the same keyword, e.g. conversational AI, is only created once and referenced by two sentences. In a real implementation, this is usually achieved by using MERGE instead of CREATE, as sketched below.
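
Here is a minimal sketch of the MERGE pattern against the toy graph above: the Keyword node is created only if it does not already exist, then linked to the sentence.

// MERGE creates the Keyword node only when it does not already exist
MATCH (s:Sentence {seq: 1})
MERGE (kw:Keyword {text: 'conversational AI'})
MERGE (s) -[:HAS_KEYWORD]-> (kw)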

Adding Embedding to Text

Now it is the LLM's turn to add semantic representations to the document graph. In one of my previous posts below, I showcased how to store a text embedding / vector as a property of an Embedding node.
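
In essence, the pattern boils down to something like this (a sketch; the vector is passed in as parameter $embedding, e.g. generated by an external embedding API):

// Attach an embedding, passed in as parameter $embedding,
// to a sentence via a dedicated Embedding node
MATCH (s:Sentence {seq: 1})
CREATE (e:Embedding {value: $embedding})
CREATE (s) -[:HAS_EMBEDDING]-> (e)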

After loading text embeddings for the sentences, this is what the graph model looks like so far:

Our first draft graph model for document.

It shouldn’t be hard to continue building up the model by adding every section of the document into this graph, incl. main content, conculsion, references, images (metadata and labels) and even tables, together with their embeddings, and eventually to make this a multi-modal graph.

Creating Text Index

A full-text index can be added to Sentence nodes:

CREATE FULLTEXT INDEX sentenceText FOR (n:Sentence) ON EACH [n.text];

Creating Vector Index

A vector index can be added to Embedding nodes, with COSINE similarity selected for matching. OpenAI-generated embeddings have 1536 dimensions; other models may vary.

CALL db.index.vector.createNodeIndex('embeddingIndex', 'Embedding', 'value', 1536, 'COSINE');

Using Both Text and Vector Indexes for Hybrid Search

With the two indexes at hand, we can make retrieval more accurate and relevant. The idea can be described in two steps:

1) use the embedding for a semantic match;

2) for the returned results, do a text search to rerank the final results.

A pseudo-Cypher query would look like this:

// Sample hybrid search
// Assume: $question = 'what is LLM?'
// $qembedding is the text embedding of $question

// 1. Search for most similar sentence using vector index
CALL db.index.vector.queryNodes("embeddingIndex", 50, $qembedding) YIELD node, score
WITH node AS embnode, score AS semanticScore
// 2. Get sentences of the returned embedding nodes
MATCH (embnode) <-[:HAS_EMBEDDING]- (sentence:Sentence)
// 3. Store results in collections
WITH collect(sentence) AS sentences, collect(semanticScore) AS semanticScores
// 4. Get text search score
CALL db.index.fulltext.queryNodes("sentenceText", $question) YIELD node, score
WITH sentences, semanticScores, node AS sentence2, score AS textScore
WHERE sentence2 IN sentences // keep sentences returned in both searches only
// 5. Rerank the results based on both semantic and text scores
... ...
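
The reranking in step 5 is elided above. One possible way to finish it, shown below as my own sketch rather than an official recipe, is a weighted combination of the two scores; the 0.7/0.3 weights are arbitrary, and since vector similarities live in [0, 1] while full-text scores are unbounded Lucene scores, they should be normalized in practice.

// A possible reranking: weighted sum of semantic and text scores
CALL db.index.vector.queryNodes('embeddingIndex', 50, $qembedding) YIELD node, score
MATCH (node) <-[:HAS_EMBEDDING]- (sentence:Sentence)
WITH sentence, score AS semanticScore
CALL db.index.fulltext.queryNodes('sentenceText', $question) YIELD node AS n2, score AS textScore
WITH sentence, semanticScore, n2, textScore
WHERE n2 = sentence // keep sentences returned by both searches
RETURN sentence.text, 0.7 * semanticScore + 0.3 * textScore AS hybridScore
ORDER BY hybridScore DESC LIMIT 10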

Generation with Better Relevance & Accuracy

Embedding-based search does very well at capturing the semantics of the question; however, it also has limitations, e.g.:

  • it performs poorly on rare concepts, i.e. long-tail words that the LLM did not see often enough during the pre-training stage;
  • it performs worse than text-based search for exact matches.

Hybrid search helps keep accuracy and relevance relatively stable in spite of the factors above.

Another common challenge of RAG is deciding how much content to retrieve for generation at the later stage. The reality is that LLM APIs have prompt size limits, and according to this study, a longer context does not necessarily mean better results.

Source: Lost in the Middle (link).

With the document stored as a graph, there are simple but effective strategies to tackle this challenge too; e.g. the vector search can be based on a single sentence, but when returning the content it is better to also return adjacent sentences to include more context. This can easily be done with the Cypher below, which returns up to 3 sentences before and after a matched sentence:

// assume $sentenceId identifies the sentence with the highest similarity score
MATCH (sentence:Sentence) WHERE elementId(sentence) = $sentenceId
MATCH (sentence) -[:HAS_NEXT_SENTENCE*..3]- (adjsentence:Sentence)
RETURN collect(adjsentence.text) + sentence.text AS context

Document Graph — Roadmap of Further Enrichment

Beyond the flexible schema, powerful query capability, and efficient search execution, treating documents as a graph means many existing approaches from the graph data science space can be applied to enrich the knowledge graph and make it ready for more advanced use cases too.

Co-author Graph

From the document graph, authors of the same paper can be connected with a new relationship, e.g. CO_AUTHOR_WITH, to create a co-author graph that further improves search relevance and supports more use cases (to list just a few; see the sketch after this list):

  1. Identifying Collaboration Patterns: Understand which authors frequently collaborate and potentially uncover research teams or groups.
  2. Determining Influential Authors: By analyzing node centrality or other graph metrics, one can identify authors who are central or influential in a research network.
  3. Exploring Research Communities: Detect tightly-knit communities or clusters of researchers who often work together, revealing subfields or specialized research groups. It is then possible to suggest potential collaborators to researchers based on past co-authorships and mutual connections.
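
A minimal sketch of inferring the co-author relationships, plus a crude influence query based on co-author counts (the relationship name follows the example above; everything else is my own choice):

// Infer CO_AUTHOR_WITH from papers with shared authorship
MATCH (a1:Author) <-[:HAS_AUTHOR]- (:Document) -[:HAS_AUTHOR]-> (a2:Author)
WHERE elementId(a1) < elementId(a2) // avoid duplicate pairs
MERGE (a1) -[:CO_AUTHOR_WITH]- (a2);
// A crude influence measure: number of distinct co-authors
MATCH (a:Author) -[:CO_AUTHOR_WITH]- (co:Author)
RETURN a.fullname, count(DISTINCT co) AS coauthors
ORDER BY coauthors DESC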

Co-reference Graph

When papers share the same keywords, were authored by the same (group of) authors, or cited the same sources, a co-reference graph can be inferred too. For RAG, besides adjacent sentences, adjacent papers can then also be used during reranking and returned as part of the context.
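
For instance, a shared-keyword relationship could be inferred like this (the SHARES_KEYWORD_WITH name and the weight property are hypothetical):

// Connect documents that mention the same keyword, weighted by overlap
MATCH (d1:Document) -[:HAS_KEYWORD]-> (kw:Keyword) <-[:HAS_KEYWORD]- (d2:Document)
WHERE elementId(d1) < elementId(d2)
WITH d1, d2, count(kw) AS shared
MERGE (d1) -[r:SHARES_KEYWORD_WITH]- (d2)
SET r.weight = shared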

To experiment with this idea, you may find another post of mine helpful:

Enhancing Accuracy & Relevance through Ontology

Ontologies provide a structured representation of knowledge, defining the concepts, relationships, and hierarchies within a specific domain. Incorporating an ontology into Retrieval-Augmented Generation (RAG) can significantly enhance its performance and capabilities.
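
As a tiny illustration (the Class label and the relationship names here are hypothetical, not from any standard), ontology concepts can live in the same graph and be linked to extracted keywords:

// A hypothetical ontology fragment living alongside the documents
MERGE (ml:Class {name: 'Machine Learning'})
MERGE (nlp:Class {name: 'Natural Language Processing'})
MERGE (nlp) -[:SUBCLASS_OF]-> (ml)
WITH nlp
MATCH (kw:Keyword {text: 'Large Language Models'})
MERGE (kw) -[:INSTANCE_OF]-> (nlp)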

Again, below is a great resource (thanks to J Barrasa) for you to dig deeper into this.

Fine-tuning over Graph (rather than LLMs)

The document graph is stored in a database, which makes it very easy to query and update. In fact, to improve embedding quality, fine-tuning an LLM is not always the only, or even the better, option.

I highly recommend this post by my colleague Tomaz Bratanic, who explains how word embeddings generated by a public pre-trained LLM can be improved without fine-tuning the LLM itself (which is trickier and more expensive).

Conclusion

A knowledge store is a critical component of RAG applications. Storing and modeling documents as native property graphs provides several advantages. Graph databases and the graph data model have distinct strengths that make them well-suited to representing, querying, and analyzing document-based information, in both textual and semantic senses.

Before I conclude, I should mention that there is another popular type of graph store, the Resource Description Framework (RDF) graph. Here is a comprehensive comparison between the two, also by J Barrasa. I believe once you read through it, you will reach your own conclusion about which is the better choice: property graph or RDF.


Fanghua (Joshua) Yu

I believe our lives become more meaningful when we are connected, so is data. Happy to connect and share: https://www.linkedin.com/in/joshuayu/