Chat with Your Graph

Graph Retrieval-Augmented Generation (Graph RAG)

Xiaoxin He
13 min read · Mar 12, 2024

Imagine being able to ‘chat with your graph’ 💬. We introduce G-Retriever, a flexible graph question-answering framework that makes this possible!

Chat with your graph. Image by authors.

This blog post is based on our recent paper “G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering” (arXiv Preprint, GitHub Code), and was written together with Xavier Bresson.

Image by authors.

1. Introduction

The landscape of artificial intelligence has been dramatically transformed by the advent of Large Language Models (LLMs), such as ChatGPT and Llama 2, which are now employed across a wide variety of tasks. Graphs, meanwhile, are a ubiquitous type of data that represents a set of entities and the relationships between them — for example, the Web itself is a massive graph consisting of pages and the hyperlinks between them. Many such graphs carry text attributes on their nodes and edges; we refer to these as “textual graphs.”

While LLMs excel at language processing, they cannot directly process graph structures. In contrast, graph techniques, particularly Graph Neural Networks (GNNs), are adept at modeling graph structures but lack language understanding. A question naturally arises: Can we synthesize LLMs and GNNs to take the best of both worlds? This question has spurred interest in combining GNNs with LLMs, potentially enabling a wider range of applications involving textual graphs.

LLMs and GNNs can be synthesized for a wide range of textual graph applications. Image by authors.

Why G-Retriever? A Leap Beyond the Basics

In recognizing the potential of combining the power of LLMs with GNNs for a deeper understanding of textual graphs, we introduce G-Retriever, a novel framework for graph question answering (GraphQA). Unlike previous methods that were limited to basic classifications or straightforward question-answering on simple graphs, G-Retriever is designed for the complex, real-world networks we encounter daily. This represents a leap towards intuitive interaction with graph data.

Imagine being able to ‘chat with your graph’: asking your mind map for an argument essay, requesting an advertisement for a place based on a scene graph, or designing an educational tour through a knowledge graph — all through a chat window 💬. G-Retriever makes this possible by understanding the complexities of textual graphs and responding in a user-friendly manner.

We develop a flexible question-answering framework targeting real-world textual graph applications. Presented here are examples showcasing the model’s adeptness in handling generative and creative queries in practical graph-related tasks: common sense reasoning, scene understanding, and knowledge graph reasoning, respectively. Image by authors.

2. Challenges

The idea of developing a flexible GraphQA framework is undoubtedly attractive, but it comes with its share of challenges.

Hallucination in Graph LLMs

LLMs are prone to hallucination, a phenomenon where the generated content is factually inaccurate or nonsensical. We validate the presence of this issue in graph settings. In particular, we employ a baseline method that adapts MiniGPT-4 to graphs, where a frozen LLM interacts with a trainable GNN that encodes graph data as a soft prompt.

LLM + Graph Prompt Tuning: A baseline method that adapts MiniGPT-4 to graphs. Image by authors.

Our findings, as shown below, indicate that hallucination, an important problem in text-based LLMs, is also prevalent in Graph LLMs.

Observation and mitigation of hallucination: comparative analysis of baseline method and our G-Retriever. Image by authors.

Lack of Scalability

Recent research endeavors have explored translating graphs into natural language, such as by flattening nodes and edges into a text sequence, enabling their processing by LLMs for graph-based tasks.

Graph textualized by listing nodes and edges (highlighted in yellow), as proposed by Chen et al. Image by authors.
Graph textualized as an edge list (highlighted in yellow), as proposed by Wang et al. Image by authors.

However, these methods face critical scalability issues. Converting a graph with thousands of nodes and edges into a text sequence results in an excessive number of tokens, surpassing the input capacity of many LLMs. For example, when textualizing a graph as an edge list (as in the second example above), each edge, i.e., (source node_id, destination node_id), requires at least five tokens. A graph with 1,000 edges therefore needs at least 5,000 tokens, exceeding the context window limit of many LLMs (e.g., a 2,048-token window).
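As a quick back-of-the-envelope check, the arithmetic looks like this (a minimal sketch; the five-tokens-per-edge figure is the lower bound quoted above):

```python
# Back-of-the-envelope scaling: textualizing a graph as an edge list
# '(src_id, dst_id)' costs at least ~5 tokens per edge.
def edge_list_token_lower_bound(num_edges: int, tokens_per_edge: int = 5) -> int:
    return num_edges * tokens_per_edge

print(edge_list_token_lower_bound(1_000))  # 5000, already beyond a 2,048-token window
```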

Our Solution: Graph Retrieval-Augmented Generation (Graph RAG)

To recap, in developing a framework for GraphQA, we have identified two major challenges: hallucination and lack of scalability. Our strategy to overcome these is using a technique called retrieval-augmented generation (RAG), specifically designed for textual graphs.

Understanding RAG and Graph RAG

Illustration of RAG and Graph RAG. Image by authors.

What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is a machine learning technique that enhances the generation of text by incorporating information retrieved from a large dataset or knowledge base into the generative process.
— ChatGPT4

Imagine a student writing a history essay who knows the basics of an event but lacks detailed facts. Instead of writing down unreliable information, the student visits a library to research exact dates, key figures, and locations of the event. By integrating this information, the student’s essay becomes accurate and rich with details. This mirrors how an LLM with RAG operates: it starts with a broad understanding and then consults a vast database for specifics, ensuring the generated content is not only fluent but also factually precise, much like the student’s well-researched essay.

Why Do We Need Graph RAG?

Though RAG has been successfully applied in various areas, it is primarily designed for simpler data types and struggles with the complexities of graphs. Therefore, we introduce Graph RAG, a general approach that should be applicable to all textual graphs.

Unlike traditional RAG methods, which query the knowledge base for relevant information, Graph RAG is designed to query the graph base and retrieve a relevant subgraph (including nodes and edges). This is done to augment the generative process of LLMs.

By integrating Graph RAG, G-Retriever can access a broader range of information, ensuring responses are both accurate and richly detailed, addressing the issue of hallucination. Additionally, the size of the retrieved subgraph will be much smaller compared to the entire graph base, which enhances scalability.

3. G-Retriever

With Graph RAG as the core, we will next introduce how G-Retriever integrates the strengths of GNNs, LLMs, and Graph RAG.

A Glimpse at the Key Components

We start by introducing the key components:

  • LLMs. We opt to use LLMs due to their excellent language processing capability, which can help us better understand the semantics of textual attributes on nodes and edges. Given the large size of LLMs (e.g., 175B parameters for GPT-3), Parameter-Efficient Fine-Tuning (PEFT) is necessary. Specifically, we freeze the LLM ❄️ and adopt a soft prompting approach on the output of the GNN (see the sketch after this list). This allows for PEFT while maintaining the LLM’s pre-trained language capabilities.
  • GNNs. We utilize GNNs to model and comprehend the complex structure of textual graphs. Specifically, we feed the textual graph into a GNN 🔥 and use the output as a soft prompt for the LLM, aiming to inject structural knowledge captured by the GNN into the LLM.
  • Graph RAG. As mentioned above, Graph RAG is the core technique that is vital for mitigating hallucinations and improving scalability. To adapt RAG for graphs, we formulate subgraph retrieval as a Prize-Collecting Steiner Tree (PCST) optimization problem.
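To illustrate this division of labor, here is a minimal PyTorch sketch; the modules are placeholders (plain linear layers) standing in for the actual LLM and GNN, and the shapes are illustrative assumptions:

```python
import torch

# Placeholder modules; the real versions are a pre-trained LLM and a graph encoder.
llm = torch.nn.Linear(1024, 1024)         # stands in for the frozen LLM ❄️
gnn = torch.nn.Linear(64, 1024)           # stands in for the trainable GNN 🔥
projector = torch.nn.Linear(1024, 1024)   # maps the GNN output into the LLM's space

# Freeze the LLM: none of its weights receive gradients.
for p in llm.parameters():
    p.requires_grad = False

# Only the GNN and the projection layer are optimized (parameter-efficient fine-tuning).
optimizer = torch.optim.AdamW(
    list(gnn.parameters()) + list(projector.parameters()), lr=1e-4
)
```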

The G-Retriever Workflow

Now, let’s put these components together and walk through the pipeline. G-Retriever comprises four main steps: indexing, retrieval, subgraph construction, and generation, as depicted in the figure below.

Overview of the proposed G-Retriever. Image by authors.

Step 1: Indexing

We initiate the RAG approach by generating node and edge embeddings using a pre-trained (frozen) LM. These embeddings are then stored in a nearest neighbor data structure for efficient retrieval.

Step 1: Indexing; Graphs are indexed for efficient query processing. Image by authors.
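As an illustration, here is a minimal sketch of the indexing step; the encoder model and nearest-neighbour library are assumptions for the sketch, not necessarily the exact choices in the paper:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.neighbors import NearestNeighbors

# A frozen pre-trained LM used purely as a text encoder (model choice is illustrative).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

node_texts = ["justin bieber", "jeremy bieber", "justin bieber fan club"]
edge_texts = ["sibling", "friendship", "hangout"]

# Embed the textual attributes of every node and edge once, offline.
node_emb = encoder.encode(node_texts, normalize_embeddings=True)
edge_emb = encoder.encode(edge_texts, normalize_embeddings=True)

# Store the embeddings in nearest-neighbour structures for efficient retrieval.
node_index = NearestNeighbors(metric="cosine").fit(node_emb)
edge_index = NearestNeighbors(metric="cosine").fit(edge_emb)
```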

Step 2: Retrieval

For retrieval, we apply the same encoding strategy to the query, ensuring consistent treatment of textual information.

Next, to identify the most relevant nodes and edges for the current query, we use a k-nearest neighbors retrieval approach. This method yields a set of ‘relevant nodes/edges’ based on the similarity between the query and each node or edge.

For example, when posed with the question: What is the name of Justin Bieber’s brother? a set of relevant nodes is retrieved: justin bieber, this is justin bieber, jeremy bieber, justin bieber fan club, justin. Similarly, a set of relevant edges is retrieved: sibling, sibling(s), hangout, friendship, and friend.

Step 2: Retrieval; The most semantically relevant nodes and edges are retrieved, conditioned on the query. Image by authors.
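Continuing the sketch above, the query is encoded with the same frozen encoder and matched against the indexes:

```python
# Encode the query with the SAME frozen encoder used for indexing.
query = "What is the name of Justin Bieber's brother?"
q_emb = encoder.encode([query], normalize_embeddings=True)

# k-nearest-neighbour lookup over node and edge embeddings.
_, node_ids = node_index.kneighbors(q_emb, n_neighbors=2)
_, edge_ids = edge_index.kneighbors(q_emb, n_neighbors=2)

print([node_texts[i] for i in node_ids[0]])  # most query-relevant nodes
print([edge_texts[i] for i in edge_ids[0]])  # most query-relevant edges
```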

Step 3: Subgraph Construction

The goal here is to construct a subgraph that includes as many relevant nodes and edges as possible, keeping the overall size manageable.

This step serves two fundamental purposes:

  • 1️⃣ Enhancing Relevance: The primary goal here is to sift through the vast array of nodes and edges, selecting only those that are directly relevant to our query. This selective filtering is critical; it ensures that our LLM focuses on pertinent information, thereby preventing the dilution of valuable insights with irrelevant data.
  • 2️⃣ Improving Efficiency: By controlling the subgraph’s size, we make it feasible to convert the graph into a format understandable by LLMs. This step is pivotal for maintaining computational efficiency and ensuring that the process remains scalable, even as we deal with complex and large-scale data structures.

To achieve these goals, we formulate subgraph retrieval on textual graphs as a Prize-Collecting Steiner Tree (PCST) optimization problem. In this approach, we assign higher “prizes” to nodes and edges that exhibit stronger relevance to our query, as determined by the cosine similarity computed in the previous step. The objective is to identify a subgraph that maximizes the total prize of its nodes and edges, minus a cost associated with the size of the subgraph.

Step 3: Subgraph Construction; A connected subgraph is extracted, covering as many relevant nodes and edges as possible while maintaining a manageable graph size. Image by authors.
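To make this concrete, here is a minimal sketch of a PCST call using the pcst_fast library, which our released code builds on. The numbers are toy values, and only node prizes are shown for brevity; edges can also carry prizes, which the full method folds into the edge costs:

```python
import numpy as np
from pcst_fast import pcst_fast  # pip install pcst_fast

# Toy graph: 4 nodes and 3 edges given as (u, v) index pairs.
edges = np.array([[0, 1], [1, 2], [2, 3]])

# Node prizes: higher for nodes more similar to the query (Step 2's cosine scores).
prizes = np.array([1.5, 0.0, 2.0, 0.1])

# A fixed cost per edge penalizes subgraph size.
costs = np.array([0.5, 0.5, 0.5])

# root=-1 -> unrooted problem; request a single connected component.
vertices, kept_edges = pcst_fast(edges, prizes, costs, -1, 1, "gw", 0)
print(vertices, kept_edges)  # node ids and edge ids of the selected subtree
```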

Step 4: Answer Generation

  • Graph Encoder and Projection Layer. We use a graph encoder to model the structure of the retrieved subgraph, and then use a multilayer perceptron (MLP) to align the graph token with the vector space of the LLM.
  • Text Embedder. To leverage the text-reasoning capabilities of LLMs, we transform the retrieved subgraph S* into a textual format, denoted as textualize(S*). Subsequently, we combine the textualized graph with the query to generate a response.
  • LLM Generation with Graph Prompt Tuning. In the final stage, we generate the answer using the graph token as a soft prompt, with gradients back-propagated to optimize the parameters of both the graph encoder and the projection layer 🔥.

Step 4: Generation; An answer is generated using a ‘graph prompt’, a textualized graph, and the query. Image by authors.
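Putting these three bullets together, here is a minimal PyTorch sketch of the soft-prompt mechanics; the shapes are illustrative, and random tensors stand in for real embeddings:

```python
import torch

hidden = 1024                                  # LLM hidden size (illustrative)
graph_encoder = torch.nn.Linear(64, hidden)    # placeholder GNN 🔥
projector = torch.nn.Linear(hidden, hidden)    # MLP aligning the graph token 🔥

# Pooled representation of the retrieved subgraph -> one 'graph token'.
subgraph_feats = torch.randn(1, 64)
graph_token = projector(graph_encoder(subgraph_feats))   # shape (1, hidden)

# Token embeddings of textualize(S*) + query (random stand-ins here;
# normally produced by the frozen LLM's embedding layer ❄️).
text_embeds = torch.randn(1, 32, hidden)

# Prepend the graph token as a soft prompt; the LLM consumes `inputs_embeds`,
# and gradients flow only into the graph encoder and the projector.
inputs_embeds = torch.cat([graph_token.unsqueeze(1), text_embeds], dim=1)
```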

4. GraphQA Benchmark

In addition to G-Retriever, we also aim to establish a benchmark specifically for GraphQA.

A Step Back: Why Do We Need a GraphQA Benchmark?

Question answering is a fundamentally important task in natural language processing. It serves as a key benchmark for assessing LLMs and providing a unified interface for various capabilities. Despite extensive research in QA, there is a notable absence of a comprehensive and diverse benchmark specifically designed for the graph modality. Our study addresses this gap by introducing the GraphQA benchmark, which targets real-world graph applications such as common sense reasoning, scene understanding, and knowledge graph reasoning. This benchmark is crucial for measuring a model’s ability to answer a wide range of questions about graphs across diverse applications.

What is in the GraphQA Benchmark?

The GraphQA benchmark integrates three existing datasets: ExplaGraphs, SceneGraphs, and WebQSP.

💡 It is important to note that these datasets were not originally developed for this work. However, a significant contribution of our research is the standardization and processing of these diverse datasets into a uniform data format suitable for the GraphQA benchmark. These datasets, previously utilized in different contexts, are reintroduced with a new focus tailored for GraphQA.

Summary of datasets used in GraphQA benchmark. Image by authors.

Each entry in the GraphQA benchmark consists of a textual graph, a question related to the graph, and one or more corresponding answers.
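Concretely, an entry might look like the following sketch (field names and values are illustrative, not the benchmark’s exact schema):

```python
# A hypothetical GraphQA entry in a uniform format (schema is illustrative).
entry = {
    "graph": {
        "nodes": [{"id": 0, "attr": "justin bieber"}, {"id": 1, "attr": "jaxon bieber"}],
        "edges": [{"src": 0, "dst": 1, "attr": "sibling"}],
    },
    "question": "What is the name of Justin Bieber's brother?",
    "answers": ["jaxon bieber"],
}
```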

Illustrative examples from the GraphQA benchmark datasets. Image by authors.

5. Experiments

We evaluate G-Retriever on our GraphQA benchmark, and consider three model configurations: 1) Inference-Only: Using a frozen LLM for direct question answering; 2) Frozen LLM + Prompt Tuning (PT): Keeping the parameters of the LLM frozen and adapting only the prompt; 3) Tuned LLM: Fine-tuning the LLM with LoRA.
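As an illustration of configuration 3), here is a minimal LoRA sketch with the Hugging Face peft library; a small model stands in for the LLM used in the paper:

```python
from transformers import AutoModelForCausalLM  # pip install transformers peft
from peft import LoraConfig, get_peft_model

# A small model stands in for the 7B-scale LLM used in the paper.
model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of weights is trainable
```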

Main Result

G-Retriever surpasses all inference-only baselines and traditional prompt tuning across all datasets. The combination of our method with LoRA (i.e., G-Retriever + LoRA) achieves the best performance, outperforming standard LoRA fine-tuning by 5.62% on SceneGraphs and 13.56% on WebQSP.

Performance comparison across ExplaGraphs, SceneGraphs, and WebQSP datasets. This table presents mean scores and standard deviations (mean ± std) for different configurations, including Inference-Only, Frozen LLM, and Tuned LLM settings, across the three datasets. A dashed entry indicates the non-applicability of the ‘Question-Only’ method for SceneGraphs, where questions cannot be answered without the textual graph. Image by authors.

Improving Scalability

Implementing our Graph RAG led to a significant reduction in the average number of tokens and nodes across datasets, underscoring its potential in managing large-scale graph data.

Average number of tokens and nodes before and after implementing retrieval. Image by authors.

Mitigating Hallucination

To evaluate hallucination, we instructed the models to answer graph-related questions, specifically identifying supporting nodes or edges from the graph. We manually reviewed 100 responses from both our method and the baseline (i.e., LLM with graph prompt tuning), verifying the existence of the nodes and edges referenced in the model’s output within the actual graph. G-Retriever significantly reduces hallucinations by 54% compared to the baseline.
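The verification itself reduces to a membership check of the referenced entities against the actual graph. A simplified sketch (the review in the paper was done manually):

```python
def hallucinated_references(cited, graph_nodes, graph_edges):
    """Return cited node/edge names that do not exist in the graph."""
    known = {n.lower() for n in graph_nodes} | {e.lower() for e in graph_edges}
    return [c for c in cited if c.lower() not in known]

# Example: the model cites an edge that the graph does not contain.
print(hallucinated_references(
    ["justin bieber", "cousin"],                   # entities cited in the answer
    graph_nodes=["justin bieber", "jaxon bieber"],
    graph_edges=["sibling"],
))  # -> ['cousin']
```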

Quantitative comparison of hallucination on the SceneGraphs dataset. Image by authors.

👍 In summary, our experiments demonstrate G-Retriever’s superior performance across various settings, outpacing traditional models and highlighting its ability to manage large-scale data efficiently while significantly mitigating hallucination issues. These findings underscore G-Retriever’s potential as a powerful tool for textual graph understanding and question answering.

6. Related Work & Discussion

In the evolving landscape of textual graph understanding, our work is part of a broader discourse that includes notable projects such as GraphRAG (from Microsoft and NebulaGraph), GraphToken (from Google Research), GraphLLM and LLaGA. To elucidate the distinctions and parallels between our approach and these related efforts, we provide a comparative analysis.

Comparison with related work. Image by authors.

Compared with Graph RAG Based Methods

“Graph RAG” is a concept pioneered by NebulaGraph. It is a knowledge-enabled RAG approach that retrieves information from a Knowledge Graph (KG) for a given task. Typically, this means building context from the subgraph of entities related to the task. It begins by identifying entities related to the task, then retrieves a subgraph of these entities (defaulting to a depth of 2) from the KG, and finally constructs context based on this subgraph.

This technology harnesses the power of knowledge graphs in conjunction with Large Language Models (LLMs) to provide search engines with a more comprehensive contextual understanding. It assists users in obtaining smarter and more precise search results at a lower cost.
— NebulaGraph

For example, given the question: “Tell me about Peter Quill?”, GraphRAG first identifies the entity Peter Quill in the question, then retrieves a 2-hop subgraph of Peter Quill from the KG as follows:

RAG subgraph Query (depth=2). Image from https://graph-rag.streamlit.app/?utm_medium=oembed.

By incorporating the retrieved knowledge, it renders the answer: “Peter Quill is a character who has indicated willingness to continue playing and is portrayed by Chris Pratt. He is the leader of the Guardians of the Galaxy and is expected to return for a sequel. He also wrote the Guardians of the Galaxy movie and directed the sequel. Additionally, he is associated with the Guardians of the Galaxy Vol. 3. ”
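Conceptually, this depth-bounded retrieval resembles an ego-graph query. A toy sketch in networkx (illustrative data; not NebulaGraph’s actual implementation):

```python
import networkx as nx

# Toy KG; in NebulaGraph's GraphRAG the subgraph comes from the graph database.
kg = nx.Graph()
kg.add_edges_from([
    ("Peter Quill", "Guardians of the Galaxy", {"rel": "leader_of"}),
    ("Peter Quill", "Chris Pratt", {"rel": "portrayed_by"}),
    ("Guardians of the Galaxy", "Guardians of the Galaxy Vol. 3", {"rel": "sequel"}),
])

# Retrieve the 2-hop ego subgraph around the entity found in the question.
subgraph = nx.ego_graph(kg, "Peter Quill", radius=2)
print(subgraph.edges(data=True))  # context used to ground the LLM's answer
```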

Microsoft further extended and enhanced GraphRAG, building on the LlamaIndex implementation (from NebulaGraph) and unlocking LLM discovery on narrative private data.

How Are We Similar and Different?

Our G-Retriever shares the core idea of implementing Graph RAG, with the hope of improving question answering, mitigating hallucination, fitting the data within the LLM’s context window, and more. Despite this shared core idea, there are significant differences:

  • Tasks. They focus on pure NLP tasks and use Graph RAG as a tool to facilitate their question-answering, while in our case, we aim to solve question-answering related to textual graphs.
  • Type of Graph. They focus mainly on knowledge graphs (KG), whereas we consider a wider range of textual graphs, such as scene graphs and explanation graphs. Technically, our Graph RAG can be generalized to any textual graphs.
  • The Implementation of Subgraph Retrieval. Their k-hop subgraph retrieval, without subsequent pruning, yields a retrieved subgraph whose size can vary significantly depending on the entity’s degree within the KG. In contrast, our method assigns a cost to each edge, allowing us to keep the graph size manageable.

Compared with Graph Soft Prompt Tuning Based Methods

GraphToken shares a key similarity with our G-Retriever in that they are both Parameter-Efficient Fine-Tuning (PEFT) methods for representing structured data for LLMs. Both approaches freeze the LLM and learn an encoding function that generates fine-tuned soft-token prompts, which are prepended to the textual input. This allows us to train only a tiny fraction of parameters compared to the LLM’s total parameter budget.

Overview of GraphToken. Image from “Let Your Graph Do the Talking: Encoding Structured Data for LLMs”.

However, we also have differences:

  • The way structured data is encoded. GraphToken encodes the graph into a continuous representation, which serves as a soft prompt prepended to the textual input. In contrast, G-Retriever uses a hybrid encoding strategy: in addition to a soft prompt encoded by a GNN, we also textualize the graph, converting it into a textual format. We have experimentally validated that these two forms of graph representation complement each other, which boosts performance.
  • Graph RAG. Our research is the first to apply a retrieval-based approach to general graph tasks, which significantly improves scalability and mitigates hallucination. GraphToken does not involve any retrieval component.
  • Datasets & Downstream Tasks. We focus on different reasoning tasks. GraphToken focuses on pure graph applications such as node count, node degree, and shortest path (where the graph doesn’t have textual attributes). In our case, we focus on real-world textual graph applications, such as knowledge graph question answering and scene graph understanding.

Closing Remarks

The diversity in focus — ranging from the types of graphs and tasks targeted to the specific architectural decisions — highlights the vibrant and exploratory nature of current research in the field. We are inspired by the collective effort within the community to push the boundaries of graph understanding, particularly through the integration of LLMs 🎆.

Image adapted from a meme on the internet.

👩🏻‍💻 Homepage: https://xiaoxinhe.github.io/
📄 ArXiv preprint: https://arxiv.org/abs/2402.07630
🔧 GitHub repo: https://github.com/XiaoxinHe/G-Retriever
📧 Email: he.xiaoxin@u.nus.edu

Here’s a fun fact as a thank you for reading to the end :)

‘G-Retriever’ also stands for Golden Retriever 🐕. That’s why we are using an image of a golden puppy with an AI robot face to represent our model.
