What is GraphRAG?

Advanced RAG using Knowledge Graphs and LLMs

Mehul Gupta
Data Science in your pocket

--

One of the most exciting applications of Generative AI and LLMs is Retrieval-Augmented Generation (RAG), which lets you interact with external documents like PDFs, text files, and YouTube videos.

GraphRAG crash course with code is live now

This post covers the below topics:

What are RAG and Knowledge Graphs?

Issues with baseline RAG

How does GraphRAG work?

Advantages of GraphRAG over naive RAG

Recently, a new advancement to improve naive RAG was introduced: GraphRAG, which uses Knowledge Graphs instead of Vector DBs to find relevant information in external documents when a user inputs a query. This post talks about GraphRAG and its advantages over baseline RAG.

My debut book, LangChain in your Pocket, is out now!

But before we jump into GraphRAG, you need to know two major concepts:

1. How does RAG work?

RAG takes a user’s query and:

Searches a vector database (prepared using external documents) for relevant information using vector similarity.

Selects the top relevant documents.

Extracts useful content.

Combines this content with an LLM to generate an answer.
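The four steps above can be sketched in a few lines of Python. Everything here is an illustrative stand-in: word overlap plays the role of vector similarity, and the final LLM call is stubbed out as a prompt string.

```python
# A minimal sketch of the four RAG steps, with word overlap standing in
# for vector similarity and a stubbed LLM call (all names are illustrative).

VECTOR_DB = [
    "RAG augments an LLM with retrieved context from external documents.",
    "Knowledge graphs store entities and the relationships between them.",
    "Vector databases rank stored documents by similarity to a query.",
]

def similarity(query, doc):
    # Stand-in for vector similarity: fraction of shared words (Jaccard).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def rag_answer(query, top_k=2):
    # 1. Search the store for relevant documents.
    scored = sorted(VECTOR_DB, key=lambda doc: similarity(query, doc), reverse=True)
    # 2. Select the top-k documents. 3. Extract their content as context.
    context = "\n".join(scored[:top_k])
    # 4. Combine context and question into a prompt for the LLM.
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
    return prompt  # a real system would return llm(prompt)

print(rag_answer("How do vector databases rank documents?"))
```

In a production pipeline the documents would be chunked, embedded with a learned embedding model, and stored in an actual vector database, but the control flow is the same.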

As I’ve already explained RAG in quite some detail in earlier posts, I’m skipping the details for now.

2. What is a Knowledge Graph?

A knowledge graph is a structured representation of information, capturing entities, their attributes, and relationships. It models complex data and highlights connections within a domain. Some key components of a Knowledge Graph are:

Entities: The fundamental units of a knowledge graph, representing real-world objects, concepts, or things (e.g., “Albert Einstein,” “Theory of Relativity,” “University”).

Attributes: Properties or characteristics of entities (e.g., “Albert Einstein” has an attribute “birthdate” with the value “March 14, 1879”).

Relationships: Connections between entities that describe how they are related to each other (e.g., “Albert Einstein” is related to “Theory of Relativity” by the relationship “developed”).

Nodes and Edges: In a graphical representation, entities are nodes, and relationships are edges connecting these nodes.

Consider a simple knowledge graph about scientific discoveries:

  1. Entities: “Albert Einstein,” “Theory of Relativity,” “Speed of Light,” “Photoelectric Effect”
  2. Attributes: “Albert Einstein” (birthdate: “March 14, 1879”), “Theory of Relativity” (published: “1905”)
  3. Relationships:
  • “Albert Einstein” developed “Theory of Relativity”
  • “Theory of Relativity” relates to the “Speed of Light”
  • “Albert Einstein” proposed the “Photoelectric Effect”
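The Einstein example above can be held in memory as a tiny knowledge graph: a dict of node attributes plus a list of (subject, relationship, object) triples for the edges. This is just a sketch of the data structure; real systems use a graph database or a graph library.

```python
# Nodes carry attributes; edges carry a relationship label.
attributes = {
    "Albert Einstein": {"birthdate": "March 14, 1879"},
    "Theory of Relativity": {"published": "1905"},
}

# (subject, relationship, object) triples - the edges of the graph.
triples = [
    ("Albert Einstein", "developed", "Theory of Relativity"),
    ("Theory of Relativity", "relates to", "Speed of Light"),
    ("Albert Einstein", "proposed", "Photoelectric Effect"),
]

def neighbors(entity):
    """All entities directly connected to `entity`, with the relationship."""
    outgoing = [(rel, obj) for subj, rel, obj in triples if subj == entity]
    incoming = [(rel, subj) for subj, rel, obj in triples if obj == entity]
    return outgoing + incoming

print(neighbors("Albert Einstein"))
```

Traversing edges like this, rather than comparing text, is what later lets a graph-based retriever connect facts that never appear in the same sentence.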


Coming back to GraphRAG

The basic RAG implementation has a serious issue. Consider this example:

Suppose a company has a large collection of internal documents, including research papers, technical reports, emails, and meeting notes.

The goal is to answer the question: “What are the recent advancements in our AI research department?”

Retrieval:

Searches the document collection for terms like “recent advancements” and “AI research department.”

Retrieves the top documents based on vector similarity (e.g., documents containing similar phrases).

Response:

Lists several documents or passages that mention advancements in AI research.

Struggles to connect insights across different documents, often presenting isolated pieces of information without synthesis.

The final output may retrieve these sentences:

  • Document 1: “Our team has recently developed a new AI model for natural language processing.”
  • Document 2: “In the past quarter, we made significant progress in AI-based image recognition.”
  • Document 3: “AI Advancements are at a rapid pace”

As you must have noticed, this approach misses out on:

  • Connecting the Dots: It may fail to link related advancements spread across documents when the connection is never stated directly. Retrieval is also driven purely by text similarity, so even filler text can rank highly (the third result above).
  • Holistic Understanding: It may miss overarching trends or themes because it matches similar phrases rather than understanding the broader context.
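The similarity-driven failure mode is easy to reproduce. In the toy ranking below (bag-of-words counts as a stand-in for real embeddings), the content-free filler sentence scores highest simply because it repeats the query's words:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real pipelines use learned dense vectors.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "Our team has recently developed a new AI model for natural language processing.",
    "In the past quarter, we made significant progress in AI-based image recognition.",
    "AI Advancements are at a rapid pace",  # filler with no real content
]

query = "What are the recent advancements in our AI research department?"
qv = embed(query)
ranked = sorted(docs, key=lambda d: cosine(qv, embed(d)), reverse=True)
print(ranked[0])  # the filler sentence wins on word overlap alone
```

Nothing in this scoring knows that the first two documents describe concrete advancements while the third says nothing; that gap is exactly what a graph-based approach targets.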

Here comes GraphRAG

GraphRAG

GraphRAG, as mentioned earlier, uses Knowledge Graphs instead of Vector DBs for information retrieval, so its output is more holistic and meaningful than baseline RAG's.

How does GraphRAG work?

GraphRAG uses an LLM to automatically extract a rich knowledge graph from a collection of text documents.

The knowledge graph captures the semantic structure of the data, detecting “communities” of densely connected nodes at different levels of granularity.

Each community is then summarized by the LLM. These community summaries provide an overview of the dataset, allowing the system to answer global queries that would be difficult for naive RAG approaches.

When answering a user’s question, GraphRAG retrieves the most relevant information from the knowledge graph and uses it to condition the LLM’s response, improving accuracy and reducing hallucinations.
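The indexing steps above can be sketched end to end. To keep it self-contained, the LLM extraction step is stubbed with pre-extracted triples, and connected components stand in for the community detection (Microsoft's implementation uses LLM-driven extraction and Leiden community detection, neither of which is shown here):

```python
# Toy sketch of GraphRAG indexing: triples an LLM might have extracted
# from company documents, turned into a graph and grouped into communities.
triples = [  # (subject, relation, object)
    ("NLP team", "developed", "new language model"),
    ("new language model", "improves", "summarization"),
    ("vision team", "advanced", "image recognition"),
]

# Build an undirected adjacency map from the triples.
graph = {}
for s, _, o in triples:
    graph.setdefault(s, set()).add(o)
    graph.setdefault(o, set()).add(s)

def communities(graph):
    """Connected components: a toy stand-in for Leiden community detection."""
    seen, result = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            n = stack.pop()
            if n not in group:
                group.add(n)
                stack.extend(graph[n] - group)
        seen |= group
        result.append(group)
    return result

# Each community would then be summarized by the LLM; a global query like
# "recent AI advancements" is answered from those community summaries.
for group in communities(graph):
    print(sorted(group))
```

Here the NLP work and the vision work land in separate communities, so a global query can draw on a summary of each, rather than on whichever isolated sentences happen to share words with the question.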

Some major advantages of using GraphRAG over baseline RAG are:

Uses knowledge graphs to give more complete and varied responses compared to basic RAG.

Generates responses that are better connected to the original data, and can show where the information comes from.

Provides overviews of the dataset at different levels, so users can understand the overall context without needing specific questions.

Can be more efficient than summarizing the full text, while still generating high-quality responses.

With this, I will wrap up this post. We will explore how GraphRAG can be implemented in my next post!

Until then, you can explore Microsoft's repo for the GraphRAG implementation here
