GNN-RAG: combining LLMs' language abilities with GNNs' reasoning in RAG style

SACHIN KUMAR
8 min read · Jun 2, 2024


Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG.

In this paper [1], the authors introduce GNN-RAG, a novel method that combines the language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style.

Key contributions:

  • Framework: GNN-RAG repurposes GNNs for KGQA retrieval to enhance the reasoning abilities of LLMs.
  • Effectiveness & Faithfulness: GNN-RAG achieves state-of-the-art performance on two widely used KGQA benchmarks (WebQSP and CWQ).
  • Efficiency: GNN-RAG improves vanilla LLMs' KGQA performance without incurring the additional LLM calls that existing RAG systems for KGQA require.

Problem Statement & Background

i) KGQA

  • Given a KG G containing facts represented as triplets (v, r, v′), where v denotes the head entity, v′ the tail entity, and r the relation between the two entities.
  • Given G and a natural language question q, the task of KGQA is to extract a set of entities {a} ∈ G that correctly answer q.
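
To make the setup concrete, here is a minimal sketch (toy data, not from the paper) of a KG as a set of (head, relation, tail) triplets together with the answer set {a} a KGQA system must produce:

```python
# Minimal KGQA setup: a KG is a set of (head, relation, tail) triplets,
# and the task maps a natural-language question to a set of answer entities.
from typing import NamedTuple

class Triplet(NamedTuple):
    head: str      # v : head entity
    relation: str  # r : relation
    tail: str      # v': tail entity

kg = {
    Triplet("Jamaica", "human_language", "Jamaican English"),
    Triplet("Jamaican English", "dialect_of", "English"),
}

question = "Which languages do Jamaican people speak?"
answers = {"Jamaican English"}  # the target answer set {a}
```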

ii) Retrieval & Reasoning

  • As KGs usually contain millions of facts and nodes, a smaller question-specific subgraph Gq is retrieved for a question q, e.g., via entity linking and neighbor extraction.
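
A hedged sketch of one way such a subgraph Gq could be built, linking question entities to KG nodes and keeping their k-hop neighborhood (retrieve_subgraph is an illustrative helper, not the paper's code):

```python
from collections import deque

def retrieve_subgraph(kg, question_entities, k=2):
    """Keep all triplets reachable within k hops of the linked question entities."""
    frontier = deque((e, 0) for e in question_entities)
    seen = set(question_entities)
    subgraph = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for head, rel, tail in kg:
            if head == node:
                subgraph.add((head, rel, tail))
                if tail not in seen:
                    seen.add(tail)
                    frontier.append((tail, depth + 1))
    return subgraph

gq = retrieve_subgraph(kg, {"Jamaica"}, k=2)  # reuses the toy kg above
```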

iii) GNNs

  • KGQA can be regarded as a node classification problem, where KG entities are classified as answers vs. non-answers for a given question.
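
A minimal training sketch of this node-classification view, where the GNN's final scores are normalized with a softmax over the subgraph and pushed toward the answer set (the uniform target over answers is an assumption for illustration):

```python
import torch

# Stand-in for the GNN's final per-entity scores on a 5-node toy subgraph
# (requires_grad mimics the parameters of the full GNN forward pass).
scores = torch.randn(5, requires_grad=True)
probs = torch.softmax(scores, dim=0)               # normalize over subgraph nodes

answer_mask = torch.tensor([0., 1., 0., 0., 1.])   # 1 = answer entity
target = answer_mask / answer_mask.sum()           # spread target mass over answers

loss = -(target * probs.log()).sum()               # cross-entropy vs. the answer set
loss.backward()                                    # gradient step trains the GNN
```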

iv) LLMs

  • LLMs for KGQA use KG information to perform retrieval-augmented generation (RAG) as follows.
  • The retrieved subgraph is first converted into natural language so that it can be processed by the LLM.
  • The input given to the LLM contains the KG factual information along with the question and a prompt.
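
A small sketch of such an input, with an illustrative prompt wording (not necessarily the paper's exact template):

```python
def build_rag_prompt(triplets, question):
    # Verbalize KG facts into text, then combine them with the question.
    facts = "\n".join(f"{h} → {r} → {t}" for h, r, t in triplets)
    return ("Based on the following KG facts, please answer the question.\n"
            f"Facts:\n{facts}\n"
            f"Question: {question}")

print(build_rag_prompt([("Jamaica", "human_language", "Jamaican English")],
                       "Which languages do Jamaican people speak?"))
```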

v) LLM-based Retriever

  • The authors present an example of an LLM-based retriever (RoG [4]).
  • Given training question-answer pairs, RoG extracts the shortest paths from the question entities to the answers and uses them to fine-tune the retriever.
  • Based on the extracted paths, an LLM (LLaMA2-Chat-7B [Touvron et al., 2023]) is fine-tuned to generate k reasoning paths of relations for a question q: LLM(prompt, q) ⇒ {r1 → · · · → rt}k.
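
A hedged sketch of how such a retriever might be invoked with Hugging Face transformers; the prompt wording is an assumption, and the generated relation paths would still need to be grounded on the KG to fetch actual entities:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

prompt = ("Generate a relation path, separated by '→', that helps answer:\n"
          "Which languages do Jamaican people speak?")
inputs = tok(prompt, return_tensors="pt")

# Beam search returns k candidate relation paths {r1 → · · · → rt}k.
out = model.generate(**inputs, max_new_tokens=32,
                     num_beams=3, num_return_sequences=3)
paths = [tok.decode(o, skip_special_tokens=True) for o in out]
```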

GNN-RAG

i) Workflow

The workflow of GNN-RAG is as follows:

  • First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question.
  • Second, the shortest paths in the KG that connect question entities and GNN-based answers are extracted to represent useful KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG.
  • The GNN acts as a dense subgraph reasoner that extracts useful graph information, while the LLM leverages its natural language processing ability for the final KGQA.
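
A minimal orchestration sketch of this workflow; gnn_score, extract_paths, and llm_answer are hypothetical callables standing in for the paper's components:

```python
def gnn_rag(question, gnn_score, extract_paths, llm_answer, threshold=0.5):
    # Step 1: the GNN reasons over the dense subgraph -> {entity: probability}.
    probs = gnn_score(question)
    candidates = [e for e, p in probs.items() if p >= threshold]

    # Step 2: shortest KG paths from the question entities to the GNN answers.
    paths = extract_paths(question, candidates)

    # Step 3: verbalize the paths and let the LLM do the final reasoning (RAG).
    verbalized = "\n".join(" → ".join(path) for path in paths)
    return llm_answer(question, verbalized)
```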

ii) GNN

  • GNNs were used for retrieval due to their architectural benefit of exploring diverse reasoning paths that result in high answer recall.
  • When GNN reasoning is completed, all nodes in the subgraph are scored as answers vs. non-answers based on their final GNN representations h_v^(L), followed by the softmax(·) operation.
  • GNN parameters are optimized via node classification (answers vs. non-answers) using the training question-answer pairs. During inference, the nodes with the highest probability scores, e.g., above a probability threshold, are returned as candidate answers, along with the shortest paths connecting the question entities with the candidate answers (reasoning paths).
  • The retrieved reasoning paths are then used as input for LLM-based RAG.
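
A hedged sketch of this inference step using networkx and toy data: threshold the GNN's probabilities to get candidate answers, then extract the shortest paths that connect the question entity to them:

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("Jamaica", "Jamaican English", relation="human_language")
g.add_edge("Jamaican English", "English", relation="dialect_of")

# Stand-in for the GNN's softmax output over subgraph entities.
probs = {"Jamaica": 0.01, "Jamaican English": 0.62, "English": 0.88}
candidates = [v for v, p in probs.items() if p > 0.5]

reasoning_paths = []
for answer in candidates:
    for node_path in nx.all_shortest_paths(g, source="Jamaica", target=answer):
        # Interleave entities with the relation on each traversed edge.
        path = [node_path[0]]
        for u, v in zip(node_path, node_path[1:]):
            path += [g.edges[u, v]["relation"], v]
        reasoning_paths.append(" → ".join(path))
```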

iii) LLM

  • After obtaining the reasoning paths with GNN-RAG, they are verbalized and given as input to a downstream LLM, such as ChatGPT or LLaMA. However, LLMs are sensitive to the input prompt template and to the way the graph information is verbalized.
  • To address this issue, RAG prompt tuning [2][3] is used for LLMs that have open weights and are feasible to fine-tune.
  • A LLaMA2-Chat-7B model is fine-tuned on the training question-answer pairs to generate a list of correct answers, given the prompt: “Based on the reasoning paths, please answer the given question.\n Reasoning Paths: {Reasoning Paths} \n Question: {Question}”.
  • The reasoning paths are verbalized as “{question entity} → {relation} → {entity} → · · · → {relation} → {answer entity} \n”, as shown in the GNN-RAG figure above.
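
A small sketch of this prompt format, using the template quoted above with a toy reasoning path:

```python
def format_prompt(reasoning_paths, question):
    # Verbalize each path as an arrow-separated string, one per line.
    paths = "\n".join(" → ".join(p) for p in reasoning_paths)
    return ("Based on the reasoning paths, please answer the given question.\n"
            f"Reasoning Paths: {paths}\n"
            f"Question: {question}")

print(format_prompt(
    [["Jamaica", "human_language", "Jamaican English"]],
    "Which languages do Jamaican people speak?",
))
```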

iv) Retrieval Analysis: Why GNNs & Their Limitations

GNNs leverage the graph structure to retrieve relevant parts of the KG that contain multi-hop information.

  • The authors trained two different GNNs, a deep one (L = 3) and a shallow one (L = 1), and measured their retrieval capabilities.
  • They report the ‘Answer Coverage’ metric, which evaluates whether the retriever is able to fetch at least one correct answer for RAG.
  • The table below shows GNN retrieval results for single-hop and multi-hop questions of the WebQSP dataset compared to an LLM-based retriever, where ‘#Input Tokens’ denotes the median number of input tokens of the retrieved KG paths.
  • The results indicate that deep GNNs (L = 3) handle the complex graph structure and retrieve useful multi-hop information more effectively (%Ans. Cov.) and efficiently (#Input Tok.) than the LLM and the shallow GNN.
  • On the other hand, GNNs fall short on simple (1-hop) questions, where accurate question-relation matching matters more than deep graph search.

v) Retrieval augmentation (RA)

  • Retrieval augmentation (RA) combines the KG information retrieved by different approaches.
  • The authors present an RA technique (GNN-RAG+RA) that complements the GNN retriever with an LLM-based retriever, combining their respective strengths on multi-hop and single-hop questions to increase diversity and answer recall.
  • A downside of LLM-based retrieval is that it requires multiple generations (beam-search decoding) to retrieve diverse paths, which trades efficiency for effectiveness.
  • A cheaper alternative is to perform RA by combining the outputs of different GNNs that are equipped with different LMs.
  • The authors’ GNN-RAG+Ensemble takes the union of the retrieved paths of the two different GNNs (GNN+SBERT & GNN+LMSR) as input for RAG.
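
A hedged sketch of this augmentation step: take the union of the reasoning paths returned by different retrievers before verbalizing them for the LLM (the retrievers are hypothetical callables here):

```python
def augment(question, retrievers):
    """Union of reasoning paths from several retrievers, deduplicated."""
    paths = set()
    for retrieve in retrievers:
        paths.update(tuple(p) for p in retrieve(question))
    return [list(p) for p in paths]

# e.g. GNN-RAG+RA:       augment(q, [gnn_retriever, llm_retriever])
#      GNN-RAG+Ensemble: augment(q, [gnn_sbert_retriever, gnn_lmsr_retriever])
```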

Experimental Results

i) Main Results

  • The table below presents performance results of different KGQA methods.
  • The results show that GNN-RAG performs best overall, achieving state-of-the-art results on the two KGQA benchmarks in almost all metrics.
  • The results also show that equipping LLMs with GNN-based retrieval boosts their reasoning ability significantly (GNN+LLM vs. KG+LLM).
  • GNN-RAG+RA outperforms RoG by 5.0–6.1 percentage points at Hit, while it outperforms or matches ToG+GPT-4, using an LLM with only 7B parameters and far fewer LLM calls.
  • GNN-RAG+RA outperforms ToG+ChatGPT by up to 14.5 percentage points at Hit, and the best-performing GNN by 5.3–9.5 points at Hits@1 and by 0.7–10.7 points at F1.

ii) Multi-Hop & Multi-Entity KGQA

  • The table below compares performance on multi-hop questions, where answers are more than one hop away from the question entities, and on multi-entity questions, which have more than one question entity.
  • GNN-RAG leverages GNNs to handle complex graph information and outperforms RoG (LLM-based retrieval) by 6.5–17.2% points at F1 on WebQSP and by 8.5–8.9% points at F1 on CWQ
  • GNN-RAG+RA offers an additional improvement by up to 6.5% points at F1, showing that GNN-RAG is an effective retrieval method when deep graph search is important for successful KGQA.

iii) Retrieval Augmentation

  • The table below compares different retrieval augmentations for GNN-RAG.
  • The results show that GNN-based retrieval is more efficient (#LLM Calls, #Input Tokens) and more effective (F1) than LLM-based retrieval, especially for complex questions (CWQ).
  • Retrieval augmentation works best (F1) when combining GNN-induced reasoning paths with LLM-induced reasoning paths, as they fetch non-overlapping KG information (increased #Input Tokens) that improves retrieval for KGQA.
  • Augmenting all retrieval approaches does not necessarily improve performance (F1), as the long input (#Input Tokens) may confuse the LLM.
  • Although the two GNNs perform differently at KGQA (F1), they both improve RAG with LLMs; weak GNNs, however, are not effective retrievers.

iv) Retrieval Effect on LLMs

  • The table below presents performance results of various LLMs using GNN-RAG or LLM-based retrievers (RoG and ToG).
  • The authors report the Hit metric, as it is difficult to extract the exact number of answers from an LLM’s output.
  • GNN-RAG (+RA) is the retrieval approach that achieves the largest improvements for RAG.
  • GNN-RAG substantially improves the KGQA performance of weaker LLMs, such as Alpaca-7B and Flan-T5-xl. The improvement over RoG is up to 13.2 percentage points at Hit, and GNN-RAG outperforms LLaMA2-Chat-70B+ToG using a lightweight 7B LLaMA2 model.

v) Case Studies on Faithfulness

  • The two case studies below illustrate how GNN-RAG improves the LLM’s faithfulness. In both cases, GNN-RAG retrieves multi-hop information that is necessary for answering the complex questions.
  • The second case study illustrates the benefit of retrieval augmentation (RA): RA uses LLMs to fetch semantically relevant KG information that may have been missed by the GNN.

Paper: https://arxiv.org/abs/2405.20139

Code: https://github.com/cmavro/GNN-RAG

References

  1. Mavromatis et al. “GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning.” arXiv:2405.20139, 2024.
  2. Lin et al. “RA-DIT: Retrieval-Augmented Dual Instruction Tuning.” arXiv:2310.01352, 2023.
  3. Jin et al. “Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs.” arXiv:2404.07103, 2024.
  4. Luo et al. “Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning.” International Conference on Learning Representations, 2024.
