RAG-boosted with Knowledge Graph

An efficient, streamlined approach with the open-source GraphRAG

An Truong
TotalEnergies Digital Factory
9 min read · Jul 19, 2024


RAG (Retrieval Augmented Generation) has been a popular approach for LLM-based applications thanks to its versatility and adaptability. It is based on retrieving context relevant to a given query and then synthesizing an answer using an LLM engine. This helps bypass the limited context window, overcome the knowledge cutoff with up-to-date information, and reduce hallucination. Moreover, it improves traceability with source references, ensuring that each response is not only accurate but also verifiable.
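As a refresher, a vanilla RAG loop boils down to three steps. Here is a minimal sketch; the embed, vector_store, and llm objects are placeholders for your embedding model, vector index, and LLM client, not any specific library API.

# Minimal vanilla RAG sketch; `embed`, `vector_store`, and `llm` are
# placeholders, not a specific library API.

def rag_answer(query: str, embed, vector_store, llm, top_k: int = 5) -> str:
    # 1. Retrieve: find the text chunks most similar to the query embedding.
    chunks = vector_store.search(embed(query), top_k=top_k)

    # 2. Augment: pack the retrieved chunks into the prompt as context.
    context = "\n\n".join(chunk.text for chunk in chunks)

    # 3. Generate: let the LLM synthesize a grounded answer.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm.complete(prompt)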

However, depending on the use case, RAG users might face some major issues:

  • Missing an overview of the context: RAG can only answer based on a limited number of top relevant pieces of information, obtained from the retrieval step, which may not provide a holistic understanding of the problem.
  • At scale: as the data corpus grows over time, the retrieval step becomes increasingly challenging due to the large amount of chunked text, which can lead to a potential drop in overall performance.
  • Domain-specific knowledge: retrieval by keywords or semantic similarity often struggles with domain-specific words or abbreviations, as those pieces of information may not have been seen by the embedding model, leading to less accurate or relevant results.

Example of a knowledge graph from a dynamic view.

In early July 2024, Microsoft announced the open-sourcing of GraphRAG, a graph-based approach to retrieval-augmented generation (RAG). GraphRAG revolutionizes the way we interact with large language models by incorporating a knowledge graph to enhance the retrieval process. This means that when you ask a question, GraphRAG doesn’t just generate an answer from the chunks of text most semantically similar to the query; it actually retrieves relevant information from a knowledge graph, a distilled network of interconnected information (nodes), before crafting its response. Let’s dive in and explore the potential of this exciting technology together with a hands-on example. But first, back to basics: how does this work?

How GraphRAG works

In a nutshell, GraphRAG works by having the LLM process a private dataset into a knowledge graph of the entities and interrelations found in the source data. This knowledge graph then supports a bottom-up clustering that organizes the data into semantic clusters based on the similarity of the identified entities. When a query is made, the LLM leverages both the knowledge graph and these semantic clusters (called communities) as its context to provide a relevant and informed response.

The major difference from a classical Knowledge Graph approach is the added steps to summarize and cluster the high-level entities and concepts. These high-level graph details are used at query time (in global mode) to bring overall context into the LLM-based answer. This allows for a more comprehensive understanding of the query, as it takes into account the broader themes and ideas represented in the knowledge graph, rather than just the individual entities.

Detailed steps are shown in the diagram below, as discussed in the original paper from Microsoft Research.

GraphRAG workflow — Microsoft Research, 2024.

In the indexing phase, similar to Vanilla RAG, input texts are split into chunks. An LLM is then used to extract entities from those chunks, which are then clustered recursively at different levels. A definition and the raw source text are also attached as attributes to each extracted entity, allowing for traceability in the final answer.
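Conceptually, the indexing phase can be sketched as below. This is illustrative pseudocode, not GraphRAG’s internal API; the helper names (split_into_chunks, leiden_communities, the llm.extract_* calls) are assumptions.

# Illustrative sketch of GraphRAG-style indexing; helper names are
# assumptions, not the library's actual internals.

def build_index(documents, llm, graph, chunk_size=1200):
    for doc in documents:
        for chunk in split_into_chunks(doc, chunk_size):
            # The LLM extracts entities and relations from each chunk; the
            # definition and raw source text are kept for traceability.
            for ent in llm.extract_entities(chunk):
                graph.add_node(ent.name, definition=ent.definition, source=chunk)
            for rel in llm.extract_relations(chunk):
                graph.add_edge(rel.source, rel.target, description=rel.text)

    # Cluster the graph recursively into a community hierarchy (the paper
    # uses the Leiden algorithm) and pre-summarize each community.
    hierarchy = leiden_communities(graph, levels=3)
    reports = {c.id: llm.summarize(c) for level in hierarchy for c in level}
    return graph, reports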

At query time, depending on the level of abstraction expected for the answer, there are different query methods, corresponding to the levels of communities that will be used to enrich the context for the answer.

For a question focused on a specific entity, the local search mode uses the connected entities from the Knowledge Graph to build the context for the final answer.

Credit: Microsoft Research repository.

When a user submits a query, the system uses a knowledge graph — a network of interconnected information — to find entities (key points) that are closely related to the user’s input. These entities act as entry points to pull more information from the knowledge graph, such as related entities, their connections, and additional details like reports from the community. It then organizes this information, prioritizing and filtering it to fit within a predefined space, called a context window. This organized content is what the system uses to craft a response to the user’s query, ensuring it is relevant and concise.
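In pseudocode, the local search context assembly might look like the sketch below. Again, the helper names (nearest_entities, rank_by_relevance, pack_until_budget, reports_covering) are assumptions for illustration, not the actual implementation.

# Illustrative sketch of local search context assembly; helper names
# are assumptions, not GraphRAG internals.

def local_search_context(query, graph, reports, embed, token_budget=8000):
    # Entry points: entities whose embeddings are closest to the query.
    entry_points = graph.nearest_entities(embed(query), top_k=10)

    candidates = []
    for entity in entry_points:
        candidates.append(entity.description)                 # the entity itself
        candidates += [n.description for n in graph.neighbors(entity)]
        candidates += [r.summary for r in reports_covering(reports, entity)]

    # Prioritize and filter so the assembled context fits the window.
    ranked = rank_by_relevance(candidates, query)
    return pack_until_budget(ranked, token_budget)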

However, for a question that aims at a higher level of abstraction and requires a holistic view, the global search mode is the way to go.

Credit: Microsoft Research repository.

The global search method operates in two main steps to handle user queries. In the first step, known as the ‘map’ step, the system uses the reports generated by the LLM during the indexing phase for a specific level of the graph’s community hierarchy. These reports are broken down into smaller text chunks of a set size. Each chunk is then processed to create an intermediate response, which includes a list of points, each given a numerical rating to show its relevance or importance.

In the second step, called the ‘reduce’ step, the system filters and selects the most important points from the intermediate responses. These selected points are then combined and used as the context for generating the final response to the user’s query. This method ensures that the response is focused and based on the most significant information related to the query.
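Put together, the two steps look roughly like the sketch below. The helper names (at_level, split_into_chunks, extract_rated_points) are assumptions for illustration, not GraphRAG internals.

# Illustrative sketch of global search as map-reduce over community
# reports; helper names are assumptions, not GraphRAG internals.

def global_search(query, community_reports, llm, level=1, top_n=50):
    # Map: each report chunk yields rated points of relevance to the query.
    points = []
    for report in community_reports.at_level(level):
        for chunk in split_into_chunks(report, chunk_size=1000):
            # Assumed helper: asks the LLM for scored (text, score) points.
            points += llm.extract_rated_points(chunk, query)

    # Reduce: keep the highest-rated points as context for the final answer.
    points.sort(key=lambda p: p.score, reverse=True)
    context = "\n".join(p.text for p in points[:top_n])
    return llm.complete(f"Context:\n{context}\n\nQuestion: {query}")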

Example use case walk through: GraphRAG on EU AI Act

Example of a knowledge graph from EU AI Act.

In this use case, we’re going to build a query engine with GraphRAG to help us better understand the EU AI Act, the European regulation on artificial intelligence.

  1. Installation
    Create a new virtual environment and install the graphrag package, e.g. using the poetry or pip dependency management tool: poetry add graphrag or pip install graphrag.
  2. Data preparation
    At the time of writing, we can only use text or csv input. As the official AI Act was published as a PDF, we will need to convert it into .txt. First, let’s download the pdf file locally.
mkdir input
curl https://www.europarl.europa.eu/doceo/document/TA-9-2024-0138_EN.pdf > ./input/europarl.pdf

For the conversion, we can use pypdf: poetry add pypdf.

from pypdf import PdfReader
from tqdm import tqdm

pdf_file_path = '../input/europarl.pdf'
reader = PdfReader(pdf_file_path)

# Print the number of pages in the pdf file
print(len(reader.pages))

# Concatenate the extracted text of every page
text = ''
for page in tqdm(reader.pages):
    text += page.extract_text()

# Save the text for GraphRAG to consume
with open('../input/europarl.txt', 'w') as f:
    f.write(text)

# Sanity check: print the last 100 characters
print(text[-100:])

3. Setup

Now that we have graphrag installed and the data ready, let’s set up the embedding engine and LLM. To initialize the project:

python -m graphrag.index --init --root .

This will generate the following structure.
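At the time of writing (graphrag 0.1.x), the generated structure looks roughly like this; exact contents may differ between versions:

.
├── .env            # holds GRAPHRAG_API_KEY
├── settings.yaml   # pipeline, model, and prompt configuration
└── prompts/        # default prompts (entity extraction, community
                    # reports, summarization, claims)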

The prompts folder contains the default prompts that will be used for knowledge graph creation. For now, let’s take a look at settings.yaml.

As you can see, this file contains the settings for the connection to the LLM as well as the prompts. The connection key GRAPHRAG_API_KEY can be set in the .env file. The prompts provided by default can be customized as needed. For simplicity, we will use the default settings and only change the connection, in our case to Azure OpenAI instead of OpenAI.

Note: to visualize the graph, you need to set graphml: true (see the excerpt below).
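For reference, the relevant excerpt of our settings.yaml ends up looking roughly like this. Placeholders are in angle brackets; exact keys and defaults may differ between graphrag versions.

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat        # default is openai_chat
  api_base: https://<your-instance>.openai.azure.com
  api_version: "2024-02-15-preview"
  deployment_name: <your-deployment>
  model: gpt-4-turbo-preview

snapshots:
  graphml: true                  # export .graphml files for visualization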

4. Knowledge Graph

We can now build our knowledge graph from the input text file europarl.txt.

python -m graphrag.index --root .

After this step, we will have the indexed data in the output folder:

Taking a look at the artifacts, there are three file types: .graphml, .parquet, and .json.

The .graphml files are for visualization of the extracted graph. The .json files are for logs, statistics, and raw extracted data. The graph data itself is recorded principally in the .parquet files, as illustrated above.
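To peek inside, the parquet artifacts can be loaded with pandas. A quick sketch; the artifact name and columns below match the version we used and may differ in newer releases:

import pandas as pd

# Inspect one artifact; the run folder is timestamped and the artifact
# names (e.g. create_final_entities) may vary between graphrag versions.
entities = pd.read_parquet(
    './output/<run-timestamp>/artifacts/create_final_entities.parquet'
)
print(len(entities), 'entities extracted')
print(entities[['name', 'type', 'description']].head())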

5. Query

All set! Let’s query the knowledge graph in the different modes.
First, let’s test the global mode.

python -m graphrag.query --root . --method global "which AI systems will be banned by the AI Act?"

Example output of a global graph search.

To get information from a more specific point of view, we can try the local mode.

python -m graphrag.query --root . --method local "which AI systems will be banned by the AI Act?"

Example output of a local graph search.

6. Compare the performance

In terms of performance, depending on the question, we can observe differences between the approaches. For comparison purposes, we will test Vanilla RAG against GraphRAG in local and global modes.

For most of the questions tested, the three methods gave more or less reasonable answers. To my surprise, even when asked about the “main themes of the document” — the type of question mentioned in the original paper as a point of failure for Vanilla RAG — all three approaches managed to find an answer. Of course, the quality of the answer is another aspect to be considered. Still, to be objective, the improvement in performance is not binary: success or failure.

What are the main themes?

Answer extracts — When Vanilla RAG can answer but is less precise.

Now, if we try questions that require a certain level of understanding of the holistic context and reflection on top of that knowledge, we begin to observe more distinct differences in performance.

Below are the responses for a query on:

Which elements can be self-conflicting in the AI Act?

Answer extracts — When Vanilla RAG fails.

Another practical aspect to consider is the cost of running this method. Due to the multiple LLM calls required to extract, summarize, and cluster the entities and relations in order to build and query the knowledge graph, the cost of indexing and running GraphRAG is expectedly much higher than Vanilla RAG. For our example, we observed at least a 10–20x increase in tokens used. For a more complex corpus, this number is likely to be even higher. Regarding running time, given the small size of our graph, we only observed a modest difference, around 2x, in query time.

Final Thoughts

Overall, I was impressed with the performance and the ease of use of the package. In terms of performance, to be fair, Vanilla RAG can also be improved with tuning of the chunking, hierarchical indexing, and metadata methods. So our point here is not to say that GraphRAG is the solution. Yet, given that GraphRAG is now open source and well documented, both technically and theoretically, it is a promising way to improve our RAG solutions.

As discussed, GraphRAG significantly simplifies and enhances the creation of Knowledge Graphs (KG) compared to other methods I’ve seen, whether through ad-hoc prompt engineering or other LLM tooling. However, in this example, we are working with very simple input data. This raises several questions that would require further trials and thought:

  • How to efficiently update the KG if the input files change? This is crucial for maintaining the accuracy and relevance of the KG over time.
  • How to update the KG with additional data (files)? Integrating new data can expand the KG’s scope and depth.
  • How to resolve conflicting information in the input data? Conflicts could arise from mismatches or cross-referencing of different data versions.
  • How to implement in production? This means deciding on the tech stack and approach for scaling from a Proof of Concept (POC) to production at various scalability levels.

These topics will be addressed in a follow-up article. Stay tuned and thanks a lot for reading!

References
https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/
https://github.com/microsoft/graphrag
https://microsoft.github.io/graphrag/
http://graphml.graphdrawing.org/
https://arxiv.org/pdf/2404.16130


An Truong
TotalEnergies Digital Factory

Senior data scientist with a passion for code. Follow me for more practical data science tips from industry.