Exploring GraphRAG: Improving Retrieval in RAGs — Part 1

Kanishk Tyagi
Yugen.ai Technology Blog
10 min read · Aug 14, 2024

This is a multi-part series covering our explorations into the potential of GraphRAG. In this first part, we go through the official repo and paper and investigate certain sections of the workflow code. In future posts, we'll share learnings from practical experiments done at Yugen or from upcoming functionality releases at Canso, our ML/AI Platform.

By Kanishk Tyagi

The problem

Every LLM shares an innate limitation: it is held back by the data it was trained on. As user requirements change and companies want their systems to reference recent data with minimal delay, it becomes essential to equip the model with context that is relevant to the situation at hand.

That is where non-parametric knowledge comes in. LLMs memorize much of the world's knowledge within their parameters (parametric knowledge), but adapting to dynamic and specialized contexts calls for retrieval-augmented generation (RAG). RAG frameworks let an LLM retrieve and integrate non-parametric knowledge from external knowledge sources at inference time, improving its capacity to generate accurate and contextually relevant responses.

GraphRAG allows for dynamic context integration and thus helps models adapt to new information and changing requirements. As explained above, LLMs lack contextual and implicit knowledge; knowledge graphs fill this gap with a structured representation of the data, where clearly defined nodes and edges capture key information and insights extracted from it. Because retrieval operates over this graph, GraphRAG can handle and retrieve large amounts of interconnected data to provide more accurate, context-sensitive responses, keeping models relevant and useful amid a constantly evolving landscape of information and user needs.

Where GraphRAG comes in

GraphRAG is an open-source library from Microsoft designed to enhance the capabilities of traditional retrieval-augmented generation (RAG) models by offering a more structured and contextually rich approach to knowledge retrieval. It dynamically integrates context to help models adapt to new information, crucial for evolving user requirements.

Impact:

  • Dynamic Context Integration: GraphRAG lets models dynamically integrate and retrieve context-specific information, helping them provide more precise and contextualized responses. Graph-based retrieval localizes relevant knowledge more effectively than traditional approaches.
  • Handling Interconnected Data: The library is designed to handle and navigate large, interconnected datasets efficiently. Information is organized as a graph, which optimizes retrieval by keeping related entities easily accessible and preserving the relationships between data points.
  • Adaptability: GraphRAG enhances a model's adaptability by continuously integrating new data and evolving contexts, ensuring that models remain relevant in a fast-changing information environment. This makes it especially well suited to applications that need real-time data updates and complex information synthesis.

Key Elements:

  • Entities: Represent the core units of information within the graph, such as specific concepts, items, or attributes. These entities and the relationships between them are the building blocks of the graph structure, allowing the model to understand and retrieve relevant pieces of information.
  • Communities: These are clusters of interrelated entities within the graph, representing groups or categories that share common characteristics. Communities help the model identify and focus on the most relevant subsets of data, improving the efficiency and accuracy of the retrieval process (a rough sketch of both elements follows this list).
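As a mental model, these elements can be pictured with a few simple structures. The dataclasses below are purely illustrative; GraphRAG has its own data model, and the field names here are assumptions made for the sake of the sketch.

from dataclasses import dataclass, field

# Illustrative only: these dataclasses sketch the kind of structure GraphRAG
# builds; the real library uses its own data model (names/fields are assumptions).

@dataclass
class Entity:
    name: str                 # e.g. "quantum computing"
    type: str                 # e.g. "concept", "organization", "person"
    description: str          # short LLM-generated description
    source_chunks: list[str] = field(default_factory=list)  # ids of chunks it appears in

@dataclass
class Relationship:
    source: str               # name of the source entity
    target: str               # name of the target entity
    description: str          # how the two entities are related
    weight: float = 1.0       # strength / frequency of the relationship

@dataclass
class Community:
    id: int
    level: int                # depth in the hierarchical community tree
    entities: list[str] = field(default_factory=list)  # member entity names
    summary: str = ""         # LLM-generated community report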

Deep dive

Pipeline Overview

Indexing

  1. Chunking
  2. Entity extraction
  3. Community detection

Querying

  1. Local search
  2. Global search

Indexing Architecture

The Indexing package is a configurable data pipeline designed to extract meaningful, structured data from unstructured input using Large Language Models (LLMs). This pipeline comprises various components such as workflows, steps, and prompts, which work together to perform several key functions by default:

  1. Extracting Entities and Their Relationships: Identify entities within the text and map out relationships
  2. Detecting Communities within Extracted Entities: Group related entities into communities
  3. Summarizing and Reporting on Communities: Summarize each community and generate reports for deep-dives.

Key Steps in the Pipeline

1. Text Chunk Generation

The source document is split into appropriately sized chunks, a crucial step that affects both the number of LLM calls required and the quality of the results.

Chunk Size Impact:

  • Larger Chunks: Fewer LLM calls are needed, reducing costs, but recall can degrade because the LLM must extract from a longer context in each call.
  • Smaller Chunks: Although more expensive due to increased LLM calls, smaller chunks yield better results, extracting significantly more entity references.
  • Empirical Findings: Experimentation has shown that a chunk size of 600 tokens can extract nearly twice as many entity references as a chunk size of 2400. However, balancing recall and precision is essential for optimal performance (a minimal chunking sketch follows this list).
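As a rough illustration of this trade-off, here is a minimal token-based chunking helper. It is not GraphRAG's own chunker (which is configurable within the pipeline); tiktoken's cl100k_base encoding is assumed here purely for counting tokens.

import tiktoken

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split `text` into overlapping chunks of roughly `chunk_size` tokens.

    Illustrative sketch only, not GraphRAG's implementation.
    """
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Smaller chunk sizes (e.g. 600 vs 2400 tokens) mean more chunks and LLM calls,
# but were observed to surface roughly twice as many entity references.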

2. Element Instances and Disambiguation

Proper extraction of entities and their relationships is critical for constructing a superior knowledge graph, which directly impacts the relevance and accuracy of query responses.

  • Multipart LLM Prompt: The extraction process uses a multipart LLM prompt that first identifies all entities in the text, capturing their names, types, and brief descriptions. It then extracts relationships between clearly related entities.
  • Tailored Prompts: The default prompt can be customized for specific domains, enabling the extraction of domain-specific entities.
  • Iterative Refinement: The process involves multiple passes over the text chunks, continuing until the LLM confirms that no entities were missed (sketched in the loop after this list). This method allows the use of larger chunk sizes while still maintaining high quality.
  • Disambiguation Challenges: Despite the redundancy-reducing steps in the GraphRAG library, we noticed disambiguation between similar entities to be a challenge during our preliminary experiments. Existing approaches may not fully address the nuances between contextually overlapping entities, leading to occasional duplication. This complexity arises from the dynamic nature of language and context. Future research should focus on developing more advanced algorithms capable of better distinguishing entities and their contexts, ultimately improving the accuracy and reliability of entity extraction in complex datasets.
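To make the iterative-refinement idea concrete, here is a schematic of such an extraction loop. The prompt wording, the gleaning limit, and the llm callable (any function from prompt string to completion string) are placeholders, not GraphRAG's actual prompts or API.

# Schematic of a multi-pass ("gleaning") extraction loop; placeholders throughout.

EXTRACT_PROMPT = (
    "Identify all entities in the text below. For each, give its name, type, "
    "and a one-sentence description. Then list relationships between clearly "
    "related entities.\n\nText:\n{chunk}"
)
CONTINUE_PROMPT = "Many entities were missed in the last pass. Add them."
CHECK_PROMPT = "Were any entities missed? Answer YES or NO."

def extract_elements(llm, chunk: str, max_gleanings: int = 2) -> list[str]:
    history = [EXTRACT_PROMPT.format(chunk=chunk)]
    results = [llm(history[-1])]
    for _ in range(max_gleanings):
        # Ask the LLM whether anything was missed before doing another pass.
        if "YES" not in llm("\n".join(history + results + [CHECK_PROMPT])).upper():
            break
        results.append(llm("\n".join(history + results + [CONTINUE_PROMPT])))
    return results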

3. Graph Communities

After the extraction and summarization of the element instances, the next step is to model these elements as a graph, where entity nodes are connected by relationship edges. The current implementation employs the Leiden community detection algorithm due to its ability to recover hierarchical community structures in large-scale graphs, a common scenario in various use cases warranting the use of GraphRAG. The hierarchical levels within these communities facilitate a divide-and-conquer approach to global summarization, enhancing the overall effectiveness of the summarization process.
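Reusing the illustrative Entity/Relationship structures sketched earlier, the community-detection step might look roughly like the following. GraphRAG builds on graspologic's hierarchical Leiden implementation; the exact call signature and the attributes on the returned cluster records should be confirmed against your installed versions.

import networkx as nx
from graspologic.partition import hierarchical_leiden

def build_entity_graph(entities, relationships) -> nx.Graph:
    # Nodes are entities; edges carry the relationship description and weight.
    graph = nx.Graph()
    for entity in entities:
        graph.add_node(entity.name, type=entity.type, description=entity.description)
    for rel in relationships:
        graph.add_edge(rel.source, rel.target, weight=rel.weight, description=rel.description)
    return graph

def detect_communities(graph: nx.Graph, max_cluster_size: int = 10) -> dict[int, dict[str, int]]:
    """Return {level: {entity_name: community_id}} for each hierarchy level.

    hierarchical_leiden yields one record per node per level; the attribute
    names (node, cluster, level) are assumed from graspologic's documented API.
    """
    levels: dict[int, dict[str, int]] = {}
    for record in hierarchical_leiden(graph, max_cluster_size=max_cluster_size):
        levels.setdefault(record.level, {})[record.node] = record.cluster
    return levels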

Possible Metrics and Performance Indicators

  • Entity Extraction Accuracy: Measured by the percentage of relevant entities correctly identified within the text.
  • Relationship Mapping Precision: Evaluated by the accuracy of relationships identified between entities.
  • Chunk Size Efficiency: Balanced by the trade-off between recall (number of entities extracted) and precision (accuracy of extraction).
  • Disambiguation Effectiveness: Monitored by the reduction in duplication of similar entities across the dataset.
  • Community Detection Quality: Assessed by the coherence and relevance of the detected hierarchical community structures.

These metrics can help in assessing the performance of the Indexing pipeline and provide insights into areas requiring further optimization or research.
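With a small hand-labeled sample, some of these metrics can be approximated directly. A minimal sketch for entity extraction accuracy, assuming entity names have been normalized so that string equality is a fair match criterion:

def entity_extraction_metrics(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Precision/recall/F1 of extracted entity names against a hand-labeled set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: compare one chunk's extraction against a labeled reference.
print(entity_extraction_metrics({"acme corp", "berlin"}, {"acme corp", "berlin", "q3 report"}))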

Querying Pipeline

The query engine is the second major component of the GraphRAG library. Within the scope of this article, we'll go over the following two features of the query engine:

Local Search

The local search method generates responses by fusing relevant information from the AI-extracted knowledge graph with text fragments from the original documents. It is a good fit for questions that require a deep understanding of particular entities referenced in the documents, such as "What are the main features of quantum computing?"

First, entities that may be relevant to the query are extracted based on similarity. The context builder then gathers all occurrences of those entities across text units, graph communities, summaries, and so on. These are compiled and sent to the LLM along with the conversation history.
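To make the flow concrete, here is a heavily simplified sketch of a local-search style query. The embedding function, the entity_index, the store accessors (text_units_for, community_summaries_for, relationship_descriptions_for), and the llm callable are all placeholders for this illustration, not GraphRAG's actual query-engine API.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def local_search(query: str, embed, entity_index, store, llm, history: str = "", top_k: int = 10) -> str:
    # 1. Find entities whose descriptions are most similar to the query.
    #    entity_index: list of (entity_name, embedding) pairs built at indexing time.
    query_vec = embed(query)
    ranked = sorted(entity_index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    relevant = [name for name, _ in ranked[:top_k]]

    # 2. Gather everything the index knows about those entities.
    context_parts: list[str] = []
    for name in relevant:
        context_parts.extend(store.text_units_for(name))           # raw chunks mentioning the entity
        context_parts.extend(store.community_summaries_for(name))  # summaries of its communities
        context_parts.extend(store.relationship_descriptions_for(name))

    # 3. Compile the context with the conversation history and ask the LLM.
    prompt = f"{history}\n\nContext:\n" + "\n".join(context_parts) + f"\n\nQuestion: {query}"
    return llm(prompt)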

Global Search

The global search method returns answers by scanning across all AI-generated community reports using a map-reduce approach. While resource-intensive, it is very useful when an answer requires knowledge of the whole dataset, for example, "What are the most important trends identified in this market analysis?"

Similar to local search, a context builder is used to construct the context that is sent to the LLM. The context builder follows a map-reduce approach. In the map step, community reports are divided into text chunks of a predefined size. Each chunk generates an intermediate response, listing key points with numerical ratings indicating their importance. During the reduce step, the most significant points from these intermediate responses are filtered and aggregated to create the context for the final response.
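The map-reduce flow can be sketched as follows. The prompt wording, the rating format, and the naive parser are placeholders (the library's actual map step uses a structured response format), and llm is any prompt-to-completion callable.

import re

MAP_PROMPT = (
    "List key points from the community report below that help answer the question, "
    "each on its own line as '<importance 0-100>: <point>'.\n\n"
    "Question: {query}\n\nReport:\n{report}"
)
REDUCE_PROMPT = (
    "Using only the rated key points below, write a final answer to the question.\n\n"
    "Question: {query}\n\nKey points:\n{points}"
)

def parse_rated_points(response: str) -> list[tuple[int, str]]:
    # Naive parser for '85: some key point' lines; placeholder only.
    points = []
    for line in response.splitlines():
        match = re.match(r"\s*(\d+)\s*[:\-]\s*(.+)", line)
        if match:
            points.append((int(match.group(1)), match.group(2).strip()))
    return points

def global_search(query: str, community_reports: list[str], llm, chunk_size: int = 4000, top_n: int = 20) -> str:
    # Map: each report chunk yields intermediate key points with importance ratings.
    intermediate: list[tuple[int, str]] = []
    for report in community_reports:
        for start in range(0, len(report), chunk_size):
            chunk = report[start:start + chunk_size]
            intermediate.extend(parse_rated_points(llm(MAP_PROMPT.format(query=query, report=chunk))))

    # Reduce: keep the highest-rated points and synthesize the final answer from them.
    best = sorted(intermediate, reverse=True)[:top_n]
    points = "\n".join(f"[{rating}] {point}" for rating, point in best)
    return llm(REDUCE_PROMPT.format(query=query, points=points))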

Code snippets

The run_pipeline function is a key component in the GraphRAG library, designed to execute a series of workflows on a dataset. This function allows you to chain together multiple operations, making it highly adaptable to various data processing and indexing tasks. Let’s break down the function and its arguments to understand how it operates:

async def run_pipeline(
    workflows: list[PipelineWorkflowReference],
    dataset: pd.DataFrame,
    storage: PipelineStorage | None = None,
    cache: PipelineCache | None = None,
    callbacks: WorkflowCallbacks | None = None,
    progress_reporter: ProgressReporter | None = None,
    input_post_process_steps: list[PipelineWorkflowStep] | None = None,
    additional_verbs: VerbDefinitions | None = None,
    additional_workflows: WorkflowDefinitions | None = None,
    emit: list[TableEmitterType] | None = None,
    memory_profile: bool = False,
    is_resume_run: bool = False,
    **_kwargs: dict,
) -> AsyncIterable[PipelineRunResult]:

Now, let’s go over these arguments and understand them:

workflows: list[PipelineWorkflowReference]

This argument takes a list of PipelineWorkflowReference objects, which define the specific workflows to be executed. Workflows are sequences of operations or steps that process the dataset, such as filtering, transforming, or indexing data. Each workflow is referenced by its identifier, allowing the pipeline to know which operations to execute.

The built-in workflow implementations can be found in the GraphRAG repository.

dataset: pd.DataFrame

The dataset argument is the data you want to process, represented as a pandas DataFrame. This is the core dataset that will undergo the various workflows defined in the workflows argument.

storage: PipelineStorage | None = None

PipelineStorage is an abstract class that handles the storage of intermediate and final results produced by the pipeline. If provided, this argument allows you to specify where and how the pipeline results should be stored, whether in memory, on disk, or in Azure Blob Storage. At the time of writing, in-memory, file-based, and blob-storage implementations are supported.

cache: PipelineCache | None = None

PipelineCache is another abstract class that manages caching within the pipeline. By caching intermediate results, the pipeline can avoid redundant computations, improving efficiency and reducing processing time. This is particularly useful in scenarios where the same data or steps are reused across multiple workflows. At the time of writing, in-memory, file-based, and blob-storage cache implementations are available.

callbacks: WorkflowCallbacks | None = None

This argument takes an instance of WorkflowCallbacks, an abstract class defined in the DataShaper library that defines callbacks or hooks to be triggered at various stages of the pipeline execution. Callbacks can be used to monitor progress, log information, or even modify data as it flows through the pipeline.

progress_reporter: ProgressReporter | None = None

The ProgressReporter class, if provided, is responsible for tracking and reporting the progress of the pipeline execution. This could be useful in long-running operations where it’s important to keep track of how much of the task has been completed.

input_post_process_steps: list[PipelineWorkflowStep] | None = None

This argument allows you to specify a list of additional steps to be applied to the dataset after the initial input processing. PipelineWorkflowStep is an abstract class that represents individual steps in the workflow. These steps might include tasks like normalization, validation, or enrichment of data.

additional_verbs: VerbDefinitions | None = None

VerbDefinitions is an abstract class that defines additional operations (or “verbs”) that can be applied within the workflows. Verbs are essentially commands that define what action to take on the data. This argument allows you to extend the set of available operations beyond the predefined ones.

additional_workflows: WorkflowDefinitions | None = None

Similar to additional_verbs, WorkflowDefinitions is an abstract class that allows you to define extra workflows that can be integrated into the pipeline. This is useful for modularizing and reusing workflow logic across different pipelines.

emit: list[TableEmitterType] | None = None

The emit argument specifies the output mechanisms or formats for the pipeline’s results. TableEmitterType determines how and where the final tables are written, whether as CSV, JSON, or (by default) Parquet files.

How It All Comes Together

When the user calls run_pipeline, the function orchestrates the execution of the specified workflows on the provided dataset. It utilizes the storage and caching mechanisms to handle intermediate results efficiently and uses callbacks and progress reporters to keep track of the pipeline’s execution. The function is designed to be highly modular, allowing the user to customize and extend its functionality through additional verbs, workflows, and post-processing steps.

The function is fairly simple to use once we grasp what each argument signifies and configure the pipeline according to the needs of our use case.
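As a sketch of how a minimal invocation might look, mirroring the single-verb example in the repository; the import paths, the PipelineWorkflowReference schema, and the fields on PipelineRunResult can shift between versions, so verify against your installed release.

import asyncio

import pandas as pd

from graphrag.index import run_pipeline
from graphrag.index.config import PipelineWorkflowReference

# A toy two-column dataset for the illustration.
dataset = pd.DataFrame([{"col1": 2, "col2": 4}, {"col1": 5, "col2": 10}])

# One workflow with a single DataShaper "derive" verb that multiplies two columns.
workflows = [
    PipelineWorkflowReference(
        steps=[
            {
                "verb": "derive",
                "args": {"column1": "col1", "column2": "col2", "to": "col_multiplied", "operator": "multiply"},
            }
        ],
    ),
]

async def main() -> None:
    # run_pipeline is an async generator that yields one PipelineRunResult per workflow.
    async for result in run_pipeline(workflows=workflows, dataset=dataset):
        print(result.workflow, result.result, result.errors)

asyncio.run(main())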

Source: https://microsoft.github.io/graphrag/posts/index/overview/

Conclusion and Future Scope

In this exploration of GraphRAG, we went through how the library extends a traditional RAG model with structured, graph-based knowledge retrieval. Dynamic context integration, efficient handling of interconnected data, and adaptability make GraphRAG powerful for applications requiring real-time updates and complex information synthesis.

We will be diving into specific use cases that demonstrate practical implementations of GraphRAG. Future blog posts will explore how GraphRAG can be applied to extract meaningful insights from game reviews and analyse entity data from Amazon product reviews. These examples will provide hands-on guidance, illustrating how to leverage GraphRAG’s capabilities in real-world scenarios.
