How Does GraphRAG Work?

Understanding GraphRAG’s internal workings with an example

Mehul Gupta
Data Science in your pocket
5 min read · Jul 10, 2024


In my previous post, I touched upon a recent advancement in Generative AI, GraphRAG: a major improvement over baseline RAG that uses Knowledge Graphs. I also covered its basics there. In this post, we will dive a little deeper and understand how it works internally. In case you missed the previous post on GraphRAG, you can check it here.

In this post, we will walk through how GraphRAG works, along with the difference between Global and Local Search.

My debut book, LangChain in your Pocket, is out!

Let’s get going!

Step 1: Knowledge Graph Creation

1. Data Collection:

Gather a large corpus of text data containing the information needed for the knowledge graph. This could include articles, books, research papers, etc., all as plain text.

2. Entity and Relation Extraction:

Entity Extraction: Use LLMs and NER tools to identify and extract entities (e.g., people, places, events) from the text. State-of-the-art (SOTA) LLMs are a must here; otherwise you may face quality issues in your results.

Can LLMs be used for NER? Absolutely. I tried a POC some time back, which may give you an idea of how it works.

Relation Extraction: Determine relationships between these entities using LLMs and rule-based systems to extract meaningful connections.
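To make the extraction step a bit more concrete, here is a minimal sketch of what happens after the LLM call. It assumes you prompt the model to return one relationship per line in a made-up `source | relation | target` format. Both this format and the hard-coded `llm_output` string are my assumptions for illustration, not something GraphRAG prescribes.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    source: str
    relation: str
    target: str


def parse_triples(llm_output: str) -> list[Triple]:
    """Parse lines of the assumed form 'source | relation | target'."""
    triples = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip blank or malformed lines instead of guessing at their meaning.
        if len(parts) != 3 or not all(parts):
            continue
        triples.append(Triple(*parts))
    return triples


# Hypothetical model response, hard-coded so the example runs offline.
llm_output = """
Protein A | interacts with | Protein B
Protein B | influences | Gene Y
this line is malformed and gets skipped
Gene Y | implicated in | Disease X
"""

triples = parse_triples(llm_output)
# Every name that appears on either side of a relation is an entity.
entities = sorted({t.source for t in triples} | {t.target for t in triples})
```

In practice you would replace `llm_output` with the actual model response and add validation suited to your prompt.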

Now that we have all the ingredients to build a graph database, let’s construct our graph.

3. Graph Construction:

Use the above extractions to create a graph object with the following details (Neo4j can be helpful):

Nodes: Represent extracted entities as nodes in the graph.

Edges: Represent relationships between entities as edges connecting the nodes.

Attributes: Add attributes to nodes and edges to store additional information, such as type of entity, type of relationship, and context. These are extracted along with the entities.

You can use this Graph Analytics series to get started with Neo4j
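If you want to see the nodes/edges/attributes idea before touching a graph database, a plain-Python sketch is enough. The `KnowledgeGraph` class below is a hypothetical helper I made up for this post (it is not a Neo4j or GraphRAG API), and the attribute values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # entity name -> attribute dict
    edges: list = field(default_factory=list)  # (source, target, attribute dict)

    def add_entity(self, name, **attrs):
        # Create the node if needed, then merge in any new attributes.
        self.nodes.setdefault(name, {}).update(attrs)

    def add_relation(self, source, target, **attrs):
        # Make sure both endpoints exist as nodes before linking them.
        self.add_entity(source)
        self.add_entity(target)
        self.edges.append((source, target, attrs))

    def neighbors(self, name):
        """Names directly connected to `name`, ignoring edge direction."""
        found = set()
        for source, target, _ in self.edges:
            if source == name:
                found.add(target)
            elif target == name:
                found.add(source)
        return found


kg = KnowledgeGraph()
kg.add_entity("Protein A", type="protein")
kg.add_entity("Disease X", type="disease")
# 'context' is a made-up attribute showing where extra information could live.
kg.add_relation("Protein A", "Gene Y", type="regulates", context="illustrative")
kg.add_relation("Gene Y", "Disease X", type="implicated in")
```

Here `kg.neighbors("Gene Y")` gives back both "Protein A" and "Disease X", which is the kind of lookup the retrieval step relies on later.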

4. Graph Storage:

Store the constructed graph in a graph database (e.g., Neo4j), which supports efficient querying and manipulation of graph structures.
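As a rough idea of what storing one relationship could look like, the sketch below builds a parameterized Cypher statement in Python. The `Entity` label, the `RELATED_TO` relationship type and the property names are my own choices for illustration. Cypher does not accept a relationship type as a query parameter, which is why the relation name is stored as a property here. The function only builds the query, so you still need a Neo4j client and a running database to execute it.

```python
def relation_to_cypher(source: str, relation: str, target: str):
    """Return a (query, parameters) pair for one extracted relationship."""
    # MERGE creates the node or relationship only if it does not exist yet,
    # so re-running the ingestion does not produce duplicates.
    query = (
        "MERGE (a:Entity {name: $source}) "
        "MERGE (b:Entity {name: $target}) "
        "MERGE (a)-[:RELATED_TO {type: $relation}]->(b)"
    )
    parameters = {"source": source, "relation": relation, "target": target}
    return query, parameters


query, parameters = relation_to_cypher("Protein A", "interacts with", "Protein B")
```

Keeping values in `parameters` instead of formatting them into the query string avoids quoting problems with entity names.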

Step 2: Community Summaries Generation

1. Community Detection:

Use hierarchical community detection algorithms to identify clusters or communities within the graph.

Note: Communities are groups of nodes (entities) that are densely connected with each other but sparsely connected with other nodes.

I’ve already explained some Community Detection algorithms in my previous post.
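To get a feel for community detection without any library, here is a small sketch using the Girvan-Newman idea: keep removing the edge that the most shortest paths run through until the graph falls apart into more pieces. I picked it only because it is short and naturally hierarchical. It is a stand-in, not necessarily the algorithm a GraphRAG implementation uses, and the toy graph is made up.

```python
from collections import defaultdict, deque


def components(adj):
    """Connected components of an undirected graph {node: set(neighbors)}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adj):
    """Brandes-style count of shortest paths running through each edge."""
    score = defaultdict(float)
    for s in adj:
        dist, sigma = {s: 0}, defaultdict(float)
        sigma[s] = 1.0
        preds, order, queue = defaultdict(list), [], deque([s])
        while queue:  # BFS from s, counting shortest paths to every node
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):  # push path counts back towards s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score


def split_once(edges):
    """Remove highest-betweenness edges until the graph splits further."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        score = edge_betweenness(adj)
        a, b = max(score, key=score.get)
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)


edges = [
    ("Protein A", "Protein B"), ("Protein B", "Gene Y"), ("Protein A", "Gene Y"),
    ("Gene Y", "Gene Z"),  # the only link between the two groups
    ("Gene Z", "Disease X"), ("Disease X", "Protein C"), ("Gene Z", "Protein C"),
]
communities = split_once(edges)
```

On this toy graph the bridge between Gene Y and Gene Z is removed first, leaving two communities of three nodes each. Calling `split_once` again on the edges inside one community would give the next, finer level of the hierarchy.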

2. Summary Extraction:

Identify central nodes (entities with high connectivity or importance within the community). How? Using Graph Analytics.

Aggregate relevant information and relationships associated with these central nodes.

Summarize the extracted information to generate community summaries using LLMs.
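The three steps above can be sketched as follows. I use plain degree (number of connections inside the community) as the measure of "central", which is just one simple option from graph analytics. The prompt wording and the helper names are hypothetical, and the actual LLM call is left out.

```python
from collections import Counter


def central_nodes(edges, community, top_k=1):
    """Rank a community's nodes by degree, counting only edges inside it."""
    degree = Counter()
    for source, _, target in edges:
        if source in community and target in community:
            degree[source] += 1
            degree[target] += 1
    return [name for name, _ in degree.most_common(top_k)]


def summary_prompt(edges, community, top_k=1):
    """Collect the facts touching the central nodes and wrap them in a prompt."""
    hubs = set(central_nodes(edges, community, top_k))
    facts = [
        f"{source} {relation} {target}"
        for source, relation, target in edges
        if source in community and target in community
        and (source in hubs or target in hubs)
    ]
    return "Summarize this community in two sentences:\n" + "\n".join(facts)


edges = [
    ("Protein A", "interacts with", "Gene Y"),
    ("Protein B", "interacts with", "Gene Y"),
    ("Gene Y", "causes", "Disease X"),
    ("Protein A", "interacts with", "Protein B"),
]
community = {"Protein A", "Protein B", "Gene Y", "Disease X"}

hubs = central_nodes(edges, community)
prompt = summary_prompt(edges, community)  # this string would be sent to an LLM
```

Gene Y has three connections here, more than any other node, so it is picked as the central node and only the facts involving it end up in the prompt.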

Step 3: Retrieval in GraphRAG

1. Query Processing:

Process the user query to identify key entities and relationships relevant to the query. Again, LLMs are used to identify the entities and relationships of interest in the user query.
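An LLM does this step in practice, but the input and output are easy to picture with a much cruder stand-in: plain string matching of the query against entity names already in the graph. The sketch below is that stand-in, not the LLM-based approach, and the `known` list is made up.

```python
import re


def entities_in_query(query, known_entities):
    """Return known entity names mentioned in the query (case-insensitive)."""
    found = []
    for name in known_entities:
        # \b keeps 'Protein A' from matching inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, query, flags=re.IGNORECASE):
            found.append(name)
    return found


known = ["Protein A", "Protein B", "Gene Y", "Disease X"]
query = "Explain the relationship between Protein A and Disease X."
seeds = entities_in_query(query, known)
```

The resulting `seeds` list is what the graph-based retrieval step starts from.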

2. Graph-Based Retrieval:

Subgraph Extraction: Extract a subgraph from the knowledge graph that contains the entities and relationships relevant to the query. This subgraph includes nodes and edges directly connected to the key entities mentioned in the query.

Context Expansion: Expand the context by including additional nodes and edges closely related to the subgraph, providing a richer context for retrieval. How?

This can be done in multiple ways: extract all the nodes to which the subgraph entities have a path, extract the entire community, extract nodes at a distance < X from the subgraph nodes (where X is a distance threshold), or use other relationship/link prediction graph algorithms.
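Of these options, the distance threshold is the easiest to show in code. Below is a minimal breadth-first search sketch that keeps every node whose hop distance from the query entities is less than X. The function name, the adjacency-dict layout and the toy chain are my assumptions, and it expects X to be at least 1.

```python
from collections import deque


def expand_context(adj, seeds, max_distance):
    """Return {node: hops} for every node with hop distance < max_distance."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] + 1 >= max_distance:
            continue  # neighbors would fall outside the threshold
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Toy chain: Protein A - Protein B - Gene Y - Disease X - Gene Z
adj = {
    "Protein A": {"Protein B"},
    "Protein B": {"Protein A", "Gene Y"},
    "Gene Y": {"Protein B", "Disease X"},
    "Disease X": {"Gene Y", "Gene Z"},
    "Gene Z": {"Disease X"},
}

context = expand_context(adj, ["Protein A"], max_distance=3)
```

With X = 3, the context contains Protein A (0 hops), Protein B (1 hop) and Gene Y (2 hops), while Disease X at 3 hops is left out. A larger X gives richer context at the cost of more text to feed the LLM.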

3. Information Aggregation:

Collect relevant facts and relationships from the nodes and edges in the subgraph using Graph Analytics.

Summarize the collected information to address the query effectively using LLMs.

4. Response Generation:

Generate a natural language response based on the aggregated information. This response leverages the rich context provided by the subgraph to deliver a more accurate and comprehensive answer.
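Steps 3 and 4 boil down to turning the subgraph into text and handing it to the LLM along with the question. Here is a minimal sketch of that hand-off. The prompt template and function names are hypothetical, and the LLM call itself is not shown.

```python
def subgraph_facts(edges, nodes):
    """Keep only relationships whose two endpoints are both in the subgraph."""
    return [
        f"{source} {relation} {target}."
        for source, relation, target in edges
        if source in nodes and target in nodes
    ]


def answer_prompt(query, facts):
    """Combine the collected facts and the user query into a single prompt."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Answer using only these facts:\n{context}\n\nQuestion: {query}"


edges = [
    ("Protein A", "interacts with", "Protein B"),
    ("Protein B", "influences", "Gene Y"),
    ("Gene Y", "is implicated in", "Disease X"),
    ("Protein C", "interacts with", "Gene Q"),  # outside the retrieved subgraph
]
subgraph_nodes = {"Protein A", "Protein B", "Gene Y", "Disease X"}

facts = subgraph_facts(edges, subgraph_nodes)
prompt = answer_prompt("How is Protein A related to Disease X?", facts)
```

Only the three facts inside the subgraph make it into the prompt, so the model answers from graph-derived context rather than from the whole corpus.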

Example Workflow

1. Knowledge Graph Creation:

Process a large corpus of scientific literature to extract entities such as “protein”, “gene”, and “disease”, and relationships like “interacts with” and “causes”.

Build a knowledge graph where nodes represent proteins, genes, and diseases, and edges represent their interactions and causal relationships.

2. Community Summaries Generation:

Analyze the graph using hierarchical community detection algorithms to identify clusters of closely related proteins, genes, and diseases.

For each cluster, identify central nodes (e.g., key proteins) and summarize their interactions and effects to provide an overview of the cluster.

3. Retrieval in GraphRAG:

Process a user query about the relationship between a specific protein and a disease.

Extract a subgraph containing the queried protein, the disease, and related entities.

Include additional related entities and relationships to expand the context.

Aggregate information from the subgraph to generate a comprehensive response explaining the relationship between the protein and the disease.

Before we wrap up, I need to touch upon the two types of search GraphRAG provides: Global and Local Search.

Global Search vs. Local Search in GraphRAG

Global Search involves querying the entire knowledge graph to find relevant entities and relationships. It searches across the entire graph to provide a comprehensive set of results. Global Search is used when the query requires a broad understanding and connections from various parts of the graph. It’s useful for queries that are not highly specific and benefit from a wider context.

Local Search focuses on a specific subgraph or a local neighborhood of nodes within the knowledge graph. It limits the search to a smaller, more defined area of the graph. Local Search is used when the query is highly specific and pertains to a particular part of the graph. It’s useful for queries that require detailed and focused information from a specific context.

Let’s understand with an example. Assume the query below for GraphRAG:

Query: “Explain the relationship between Protein A and Disease X.”

Global Search

  • Process: The entire knowledge graph is searched to find all possible connections between Protein A and Disease X.
  • Results: The search might reveal not only the direct interactions between Protein A and Disease X but also indirect connections through other proteins, genes, pathways, and related diseases.

Example Output:

Protein A interacts with Protein B, which is known to influence Gene Y.

Gene Y has been implicated in Disease X.

Protein A is also part of a pathway that is often disrupted in Disease X.

Local Search

  • Process: The search is limited to a subgraph around Protein A and Disease X, focusing on immediate connections.
  • Results: The search might reveal direct interactions between Protein A and Disease X and immediate neighbors in the graph.

Example Output:

Protein A directly interacts with Disease X.

Protein A and Disease X are both related to Gene Z within a specific biological pathway.
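The contrast between the two outputs can be mimicked on a toy graph. The sketch below follows the description in this post only, not the internals of any particular GraphRAG library. It treats Local Search as "connections of at most two hops" and Global Search as "all connections up to a larger hop limit". The hop limits and the toy graph are my assumptions.

```python
def paths_between(adj, start, goal, max_len):
    """All simple paths from start to goal that use at most max_len edges."""
    paths = []

    def walk(path):
        node = path[-1]
        if node == goal:
            paths.append(path)
            return
        if len(path) > max_len:  # path already uses max_len edges
            return
        for neighbor in sorted(adj.get(node, ())):
            if neighbor not in path:  # never revisit a node
                walk(path + [neighbor])

    walk([start])
    return paths


adj = {
    "Protein A": {"Protein B", "Gene Z", "Disease X"},
    "Protein B": {"Protein A", "Gene Y"},
    "Gene Y": {"Protein B", "Disease X"},
    "Gene Z": {"Protein A", "Disease X"},
    "Disease X": {"Protein A", "Gene Y", "Gene Z"},
}

# Local: direct link plus connections through one shared neighbor.
local_paths = paths_between(adj, "Protein A", "Disease X", max_len=2)
# Global: also picks up the longer, indirect chain through Protein B and Gene Y.
global_paths = paths_between(adj, "Protein A", "Disease X", max_len=4)
```

The local result contains the direct link and the path through Gene Z, while the global result additionally surfaces the Protein A, Protein B, Gene Y, Disease X chain from the Global Search example output.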

With this, I will wrap up this post. See you next time!
