Adding Structure-Aware Retrieval to GenAI Stack

Fanghua (Joshua) Yu
9 min read · Nov 3, 2023

Make RAG Work Better for Complex Documents

The Globe Hung in the Hall of St. Charles Church, Vienne, 2019. Photo by author.

Abstract

The real success of a Retrieval Augmented Generation (RAG) solution relies heavily on the quality and performance of the retrieval process. So far, much research and engineering effort has focused on improving text embeddings, prompt engineering and the reranking of vector search results. However, without high-quality content, well-understood context and efficient retrieval, those practices can only contribute so much. This article discusses a structure-aware chunking strategy combined with a graph-based retrieval method, and provides a sample implementation to address these needs.

The Challenges of RAG

Overview of The RAG Solution

Among all the content written about RAG, I found the illustration below provides just enough context in the simplest possible form:

Source: Neo4j, Inc.

Apparently, apart from the generation performance of the LLM, the key to a successful RAG implementation lies in the performance of the retrieval process, i.e. the relevance and accuracy of the documents / text corpus retrieved for a given question. This process can be implemented using traditional text-based search, semantic search over text embeddings, or a combination of both (hybrid search).

So far, a lot of research and improvement has focused on the embedding model, the reranking of search results, and prompt fine-tuning. However, without high-quality content, well-understood context and efficient retrieval, those practices can only contribute so much.

In this article, I’d like to explore the opportunities on the content side. Specifically, this article:

  1. adopts a structure-aware chunking strategy for documents
  2. leverages Neo4j graph traversals for structure-aware retrieval
  3. demonstrates how to implement this strategy using the GenAI Stack

Existing Chunking & Retrieval Methods

Embedding models (and LLMs in general) have a size limit on their input text. When source documents are long, chunking is necessary to break the content down into smaller parts.

There are common chunking strategies available in most NLP / LLM frameworks, e.g. LangChain and LlamaIndex, which fall into the categories below:

a. Fixed-size chunks: Define a fixed size that’s sufficient for semantically meaningful paragraphs (for example, 200 words) and allows for some overlap (for example, 10–15% of the content). This is the simplest method and can perform quite well.

b. Variable-sized chunks based on content: Chunking data based on content characteristics, such as end-of-sentence punctuation marks or end-of-line markers. Documents with markup, e.g. HTML tags, can use those tags to split the data according to its structure in the document.

c. Customize or iterate over one of the above techniques. For example, when dealing with large documents, it is useful to create variable-sized chunks and append the document title to each chunk, so that chunks from the middle of the document do not suffer from context loss.
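
As a rough illustration of strategies (a) and (b), here is a minimal sketch using LangChain’s text splitters; the chunk sizes and the sample markdown are arbitrary values chosen purely for demonstration.

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

long_document_text = "..."  # raw text extracted from a long document
markdown_text = "# Manual\n## 1. Operating the bed\nPress and hold the HEAD UP button..."

# (a) Fixed-size chunks with ~15% overlap to preserve context across boundaries
fixed_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
fixed_chunks = fixed_splitter.split_text(long_document_text)

# (b) Variable-sized chunks driven by document structure (markdown headings here)
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
structured_chunks = header_splitter.split_text(markdown_text)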

However, those strategies still face challenges when dealing with documents of complex structure, or with use cases that require guaranteed accuracy. Think of an operational manual or medical instructions: the chunking strategy has to consider:

  • completeness of chunks retrieved: otherwise some steps may be missing.
  • complex layout: dual- or multi-column, with unusual sequence of blocks.
  • embedded text to be ignored: e.g. header, footer, watermark etc.
  • rich contents, e.g. pictures, photos, tables and so on.
Source: https://assuredcomfortbed.com/

In those scenarios, simple fixed-size chunking will never be able to give complete or accurate answers. This isn’t something that can be solved by improving the embedding model, reranking, or prompt engineering.

We need a solution that keeps the structural relationships between chunks, and the capability to leverage that structural context for better retrieval.

Structure-aware Chunking & Retrieval

Documents Are Property Graphs

In my previous blog post, I explained why documents are property graphs. In fact, storing and modeling documents as native property graphs provides several advantages, including a flexible schema, powerful query capabilities and efficient search execution. Graph databases and the graph data model have distinct strengths that make them well suited to representing, querying, and analyzing knowledge from documents, in both the text-based and the semantic sense.

Our documents are not simply text. To make them easy for humans to read and understand, documents are prepared in a visually interpretable manner, i.e. as the so-called Visually Structured Document (VSD). The structure often takes the form of tables, charts, graphs, diagrams, maps, and other visual formats that display relationships, patterns, and trends. This type of presentation allows us to quickly grasp complex information that would be less accessible if presented in an unstructured format, like raw text or a series of numbers.

Using existing PDF parsing packages, we can easily extract both the structure and the text, and load them into a knowledge graph, the data model of which is illustrated below.

A graph model for documents stored in Neo4j.

A graph handles the nested structure of chunks very well. Take, for example, a highlighted sentence inside a paragraph: the highlighted text is a chunk with a HAS_PARENT relationship pointing to the parent paragraph, which is another chunk.

The Section node preserves the document layout, aligning with how the contents of the document were originally organized. There can be a hierarchy of sections at various levels.

Text embeddings are created and stored for each chunk, and a vector index is populated for efficient search.

With the base model in place, more data can be connected to it. For example, for a given user question, which chunks were retrieved and what score each of them was given, as shown in the diagram by the RELATED_TO relationship.

Generated answers are also stored and linked to questions, which provides the necessary support for auditing and explainability purposes.
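
To make the model concrete, below is a minimal, hypothetical sketch that loads a single Document → Section → Chunk → Embedding path and creates the vector index, using the official neo4j Python driver. The labels, relationship types, index name and property names mirror those used by the retrieval code later in this article; the loading query itself, and the sample values, are simplified assumptions rather than the exact loader used.

from neo4j import GraphDatabase

# Same placeholder credentials as in the .env file shown later
driver = GraphDatabase.driver(
    "neo4j+s://AURADB-INSTANCE:7687", auth=("neo4j", "AURADB-PASSWORD")
)

# One Document <- Section <- Chunk -> Embedding path, following the model above
load_query = """
MERGE (d:Document {url: $url})
MERGE (s:Section {key: $url + '#' + $section_title}) SET s.title = $section_title
MERGE (s)-[:HAS_PARENT]->(d)
MERGE (c:Chunk {block_idx: $block_idx}) SET c.page_idx = $page_idx, c.sentences = $text
MERGE (c)-[:HAS_PARENT]->(s)
MERGE (e:Embedding {id: $block_idx}) SET e.value = $vector
MERGE (c)-[:HAS_EMBEDDING]->(e)
"""

with driver.session(database="neo4j") as session:
    # Vector index over Embedding.value; run once (1536 dims for text-embedding-ada-002)
    session.run(
        "CALL db.index.vector.createNodeIndex($name, $label, $prop, $dim, $sim)",
        name="chunkVectorIndex", label="Embedding", prop="value", dim=1536, sim="cosine",
    )
    session.run(
        load_query,
        url="https://example.com/manual.pdf",
        section_title="1. Operating the bed",
        block_idx=42,
        page_idx=3,
        text="Press and hold the HEAD UP button to raise the head section.",
        vector=[0.0] * 1536,  # replace with the real embedding of the chunk text
    )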

Semantic + Structure-aware Retrieval

Once we have the text, the embeddings of the text, and the document structure stored in the same knowledge graph, retrieval becomes much more flexible and powerful. Below are the steps:

  i. For a given question, send it to the embedding API to generate its vector.

  ii. Do a similarity search of the question embedding vector over the chunk embeddings in the knowledge graph.

  iii. For the top chunks returned, traverse along the HAS_PARENT relationships to locate the sections containing them.

  iv. Retrieve all sub-chunks of those sections, and use them for generation.

This approach guarantees the completeness of the retrieved content for a specific question: it is not limited by the individual chunk size, and is able to provide just enough content.

If the retrieved content exceeds the size limit of the generation model, a pagination method can also be implemented to break answers into several parts, as sketched below.
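
As a rough sketch of such a pagination step (this helper is hypothetical, not part of the GenAI Stack), the concatenated section text can be re-split on chunk boundaries so that each part stays under a token budget:

def paginate(chunk_texts, max_tokens=7000, tokens_per_word=1.3):
    """Group consecutive chunk texts into pages that stay under a rough token budget."""
    pages, current, used = [], [], 0
    for text in chunk_texts:
        cost = int(len(text.split()) * tokens_per_word)  # crude token estimate
        if current and used + cost > max_tokens:
            pages.append(" ".join(current))
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        pages.append(" ".join(current))
    return pages

# Each page can then be sent to the generation model as a separate call.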

Now let’s see how this can be done using the GenAI Stack.

GenAI Stack

Overview

The GenAI Stack was announced at DockerCon last month, and is a great way to quickly get started building GenAI-backed applications. It includes Neo4j as the default database for vector search and knowledge graphs, LangChain as the LLM development framework, Ollama for LLM hosting and Docker for fast deployment.

Neo4j’s Announcement on GenAI Stack. Source: link.

You can also download the project from GitHub.

Configuration

If you just want to test ideas using LLM APIs, Ollama is not required. For Neo4j, AuraDB is the DBaaS option, which removes the need to run Docker; all that is then required is some Python code.

In the GenAI Stack project folder, look for the .env file and make some changes:

OPENAI_API_KEY=sk-YOUR-OPENAI-KEY
#OLLAMA_BASE_URL=http://host.docker.internal:11434
NEO4J_URI="neo4j+s://AURADB-INSTANCE:7687"
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=AURADB-PASSWORD
NEO4J_DATABASE=neo4j
LLM=gpt-4
EMBEDDING_MODEL=openai

LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
# A LangSmith account is required if tracing is enabled
LANGCHAIN_TRACING_V2=false #true/false
LANGCHAIN_PROJECT=#your-project-name
LANGCHAIN_API_KEY=#your-api-key ls_...

LLM & Embedding Model

Here I chose to use the gpt-4 API, with the text-embedding-ada-002 model from OpenAI as the embedding model.

Neo4j AuraDB

For steps to get your own Neo4j AuraDB instance FOR FREE, you may refer to this article.

Steps to Add Structure-aware Retrieval

With GenAI Stack, adding new features is quite easy. Here we only need to change two Python files.

  1. Create A New Chain in chains.py

Let’s call our new chain configure_qa_structure_rag_chain.

This is done simply by passing a retrieval_query parameter when initialising the Neo4jVector store. The retrieval_query implements the query logic that finds the containing sections of the matched chunks, as described in the sections above.

# Imports as used by the LangChain version bundled with the GenAI Stack at the time
# of writing (paths may differ in newer LangChain releases)
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.vectorstores.neo4j_vector import Neo4jVector


def configure_qa_structure_rag_chain(llm, embeddings, embeddings_store_url, username, password):
    # RAG response based on vector search and retrieval of structured chunks

    general_system_template = """
    You are a customer service agent that helps a customer with answering questions about a service.
    Use the following context to answer the question at the end.
    Where possible, do not make any changes to the context when preparing answers, so as to provide accurate responses.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    ----
    {summaries}
    ----
    At the end of each answer you should include metadata for the relevant document in the form of (source, page).
    For example, if the context has `metadata`:(source:'doc_url', page:1), you should display ('doc_url', 1).
    """
    general_user_template = "Question:```{question}```"
    messages = [
        SystemMessagePromptTemplate.from_template(general_system_template),
        HumanMessagePromptTemplate.from_template(general_user_template),
    ]
    qa_prompt = ChatPromptTemplate.from_messages(messages)

    qa_chain = load_qa_with_sources_chain(
        llm,
        chain_type="stuff",
        prompt=qa_prompt,
    )

    # Initialise Neo4j as Vector + Knowledge Graph store
    kg = Neo4jVector.from_existing_index(
        embedding=embeddings,
        url=embeddings_store_url,
        username=username,
        password=password,
        database="neo4j",                 # neo4j by default
        index_name="chunkVectorIndex",    # vector index name
        node_label="Embedding",           # embedding node label
        embedding_node_property="value",  # embedding value property
        text_node_property="sentences",   # text by default
        retrieval_query="""
        WITH node AS answerEmb, score
        ORDER BY score DESC LIMIT 10
        // 1 - Locate the sections of the matched chunks
        MATCH (answerEmb) <-[:HAS_EMBEDDING]- (answer) -[:HAS_PARENT*]-> (s:Section)
        WITH s, answer, score
        // 2 - Find out which documents they belong to
        MATCH (d:Document) <-[*]- (s) <-[:HAS_PARENT*]- (chunk:Chunk)
        WITH d, s, answer, chunk, score ORDER BY d.url_hash, s.title, chunk.block_idx ASC
        // 3 - Prepare results by concatenating the sentences of the chunks in each section
        WITH d, s, collect(answer) AS answers, collect(chunk) AS chunks, max(score) AS maxScore
        RETURN {source: d.url, page: chunks[0].page_idx+1, matched_chunk_id: id(answers[0])} AS metadata,
            reduce(text = "", x IN chunks | text + x.sentences + '.') AS text, maxScore AS score LIMIT 3;
        """,
    )

    kg_qa = RetrievalQAWithSourcesChain(
        combine_documents_chain=qa_chain,
        retriever=kg.as_retriever(search_kwargs={"k": 25}),
        reduce_k_below_max_tokens=False,
        max_tokens_limit=7000,  # gpt-4
    )
    return kg_qa

2. Use the new chain in a chatbot


import os
import logging

from dotenv import load_dotenv

# Helpers provided by the GenAI Stack project (defined alongside the chains),
# plus the new chain added above; a plain logger is used here for simplicity
from chains import load_embedding_model, load_llm, configure_qa_structure_rag_chain

logger = logging.getLogger(__name__)

load_dotenv(".env")

url = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")
ollama_base_url = os.getenv("OLLAMA_BASE_URL")
embedding_model_name = os.getenv("EMBEDDING_MODEL")
llm_name = os.getenv("LLM")

embeddings, dimension = load_embedding_model(
    embedding_model_name, config={"ollama_base_url": ollama_base_url}, logger=logger
)

llm = load_llm(llm_name, logger=logger, config={"ollama_base_url": ollama_base_url})

# rag_chain: KG augmented response, using structure-aware retrieval
rag_chain = configure_qa_structure_rag_chain(
    llm, embeddings, embeddings_store_url=url, username=username, password=password
)

In the project, there is a Q&A chatbot (bot.py) for the Stack Overflow dataset, which can be used as a template to test this new chain.
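
For a quick test outside the chatbot UI, the new chain can also be invoked directly. This is a minimal sketch using standard LangChain chain invocation; the question text is of course just an example.

# Direct invocation of the new chain for a quick test
question = "How do I raise the head section of the bed?"

result = rag_chain({"question": question}, return_only_outputs=True)

print(result["answer"])   # generated answer, with (source, page) metadata appended
print(result["sources"])  # sources collected by the QA-with-sources chain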

Further Discussion

Storing the text, embeddings and structural data of documents in the same knowledge graph improves the accuracy, completeness and relevance of the retrieval process.

By using a graph query language like Cypher, we gain more flexible, practical and powerful capabilities for retrieving content for a better generation process, without any fine-tuning of LLMs.

To learn more about Neo4j’s vector index, or Cypher query language, you can check the links below.

Vector Index

Learn Cypher

GraphAcademy — Free, Self-Paced, Hands-on Online Training
We’re here to guide you on a fun and engaging journey to mastering Neo4j with free, hands-on courses.
