The 3-Step Process for creating RAG-Native Knowledge Graphs
At WhyHow.AI, we care about what deterministic and rules-based RAG-Native Graphs look like, the challenges in the graph creation and retrieval process, and how to solve them. This article provides a framework for getting to accurate knowledge graphs.
With Knowledge Graph RAG, there are 3 main aspects to care about when debugging and optimizing:
- Is the Schema correct? Is the schema appropriate to capture the information you care about?
- Is the Information in the graph? Does the information in the graph, defined by the schema, appropriately and exhaustively represent the raw unstructured data?
- Is the Question understood? Is the question being translated into the right graph extraction logic?
One of the things that attracted me to knowledge graphs as an instrument for information retrieval is that the process of knowledge graph construction and information representation is elegantly simple. These 3 aspects are simple and straightforward enough that they can also be identified and amended by non-technical domain experts. This allows more people to debug a system, facilitating a faster GTM for KG RAG systems.
Each of these 3 aspects is related to the others. When one aspect is suboptimal, even if the others are optimal, the answer returned may be incorrect, so granular tooling to understand, debug, and optimize each aspect is required.
Step 1: Is the Schema Correct?
Granular control of the schema allows you to control what you want to capture in the graph itself. If the schema is not correctly framed, the information structurally cannot be captured in the graph. We want to help streamline the brainstorming, creation, and editing of schemas.
You can see a few features of our Knowledge Graph Studio, currently in Beta, that will aid in this process:
Schema Descriptions
- Generate and granularly define your schema with descriptions. Control the entity and relationship types not just by naming the categories, but by defining the entities and relationships through descriptions. This especially helps to distinguish between categories and relationships that use similar words but need to be kept clearly distinct. This approach recognizes that capturing entities and relationships for RAG-Native graphs goes beyond the historically common types like ‘Person, Organization, Places’, and needs to reflect the things you care about in your private data, which overwhelmingly do not map to common entity types. A sketch of such a schema follows.
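For illustration, here is a minimal sketch of a description-driven schema. The dictionary format and field names are hypothetical, chosen to show the idea rather than the exact Studio format:

# Hypothetical format: descriptions disambiguate similar-sounding types.
schema = {
    "entities": [
        {"name": "drug", "description": "A pharmaceutical compound, including brand and generic names"},
        {"name": "adverse_event", "description": "A harmful patient outcome attributed to a drug, as opposed to a symptom of the underlying disease"},
    ],
    "relations": [
        {"name": "causes", "description": "Links a drug to an adverse event it is reported to trigger"},
    ],
}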
Schema Hierarchies
- Through the “Patterns” in our schema, you can define the hierarchical relationships between entity types. Such hierarchical structures are how a lot of data needs to be organized for retrieval; a sketch follows.
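A hedged sketch of what a pattern might look like, assuming a head/relation/tail form (the exact format may differ):

# Illustrative patterns: each constrains which entity types a relation may connect,
# which is how hierarchy (e.g. company -> department -> team) is expressed.
patterns = [
    {"head": "company", "relation": "has_department", "tail": "department"},
    {"head": "department", "relation": "has_team", "tail": "team"},
]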
Co-Pilot for Schema Generation — Generate Schema from Questions
- We know that schema generation is often an intuitive process. We are working on automated schema generation tools and processes. In the meantime, we are experimenting with schema generation co-pilot processes. Schema generation will require human-in-the-loop oversight because the schema reflects both what you care about and how you want to structure the information you care about. This context requires human input. One of the ways we are trying to automate context injection is through the ‘Generate Schema from Questions’ feature. By inserting a sample range of questions that you are interested in asking the system, you can generate a list of potentially relevant entities, relationships, and patterns to streamline and automate schema construction, as sketched below.
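A minimal sketch of what such a flow could look like from the SDK; the import path, client setup, and generate_schema method are hypothetical, used only to illustrate the shape of the feature:

from whyhow import WhyHow  # illustrative import; the exact path may differ by SDK version

client = WhyHow(api_key="<your-api-key>")  # illustrative setup

questions = [
    "Which suppliers does Acme Corp depend on for lithium?",
    "Which supplier contracts expire before 2025?",
]
candidate = client.schemas.generate_schema(questions=questions)  # hypothetical method
# A human then reviews and edits the proposed entities, relations, and patterns.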
Step 2: Is the Information in the graph?
The focus here must be on allowing users to quickly zoom in and out of the graph to understand whether the information is in the graph. Even when the correct schema is used, this can be a failure of entity extraction or entity resolution. The benefit of graphs over vectors is that we can easily see if information is missing (as a relation, or from a specific chunk).
There are a few reasons why LLMs and models are unable to accurately turn your unstructured text and schema into the graph you want. These reasons are:
- An inability to recognize that the terms in the text are what your schema is looking for, and to extract them. This may reflect a limitation of the model, particularly for domain-specific terms, or the fact that your schema is requesting something that requires complex reasoning (i.e. “entity is everything that Jack has mentioned twice before page 15”). With our SDK, you can plug your own set of preprocessed data into our Graph Studio (a sketch follows after this list).
- An inability to perform entity resolution. There may be instances where the model is unsure whether two different strings of text refer to the same entity. This can range from different spellings to different phrases that mean the same thing. In many instances this is also a use-case-specific opinion that the system designer must express: do they want to establish a distinction in the graph between two terms (e.g. ‘Downtown San Francisco’ and ‘San Francisco Business District’)?
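For the SDK route mentioned in the first bullet, a hedged sketch of pushing your own preprocessed triples into a graph; add_triples is a hypothetical method name used for illustration:

# client: the WhyHow SDK client from the sketch above (setup omitted).
# Pre-extracted triples from your own domain-specific pipeline.
preprocessed_triples = [
    ("GLP-1 agonist", "causes", "nausea"),
    ("GLP-1 agonist", "treats", "type 2 diabetes"),
]
client.graphs.add_triples(graph_id="<graph-id>", triples=preprocessed_triples)  # hypothetical method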
You can see a few features that will aid in the process:
Active Learning Named Entity Resolution (NER)
- With our entity resolution workflow, your input is used by the system to continually learn how you resolve and merge different entities, so that entities are resolved the way you define them. Entity resolution rules can be saved and automatically applied, allowing your graph creation process to get smarter over time; a sketch of such a rule follows.
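A hedged sketch of what a saved resolution rule could look like; the rule format is illustrative rather than the exact Studio representation:

# Illustrative merge rule: once saved, future occurrences of the listed
# variants resolve automatically to the canonical entity.
merge_rule = {
    "type": "merge_nodes",
    "from_values": ["SF", "San Fran", "San Francisco, CA"],
    "to_value": "San Francisco",
}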
Active Learning Entity Extraction & Chunk Visualization
- You can now look at the chunks directly, highlight words that belong to particular entities to have a node created directly, and then tie that node to a triple. Our system will learn over time how you specifically think about entity extraction, and will increasingly perform extraction the way you define it. The entity extraction rules and logic are only used to improve your own entity extraction workflows.
Search through the Graph
- Regex & Full Text Search: You can now search through the graph quickly to identify the sub-graphs and areas of the graph that you care about for specific entities and relations; a minimal sketch of the idea follows.
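As a minimal local sketch of the idea (plain Python, independent of our API), regex search over triples to pull out a sub-graph of interest:

import re

# Toy in-memory graph of (head, relation, tail) triples.
triples = [
    ("Acme Corp", "acquired", "Beta Labs"),
    ("Acme Corp", "headquartered_in", "San Francisco"),
    ("Gamma Inc", "partnered_with", "Beta Labs"),
]

pattern = re.compile(r"beta\s+labs", re.IGNORECASE)
subgraph = [t for t in triples if pattern.search(t[0]) or pattern.search(t[2])]
print(subgraph)  # every triple touching a node that matches the regex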
Multiplayer Graph Creation
- We believe that graph creation is a multiplayer experience that brings in multiple different stakeholders. Our UI/UX, Graph Sharing capabilities, and Chunk Dashboards, alongside a powerful SDK, are designed so that multiple types of stakeholders and personas can more easily collaborate on a range of knowledge bases. Developers, non-technical domain experts, and LLMs can collaborate on the same graph creation process through the platform and the SDK simultaneously.
Step 3: Is the Question understood?
Deconstructing a natural language question into the specific entities and relations that should be extracted can sometimes fail, even when the correct and complete information lies in the graph.
To overcome this, one can employ specific query deconstruction and entity recognition techniques. However, to first establish that the issue lies in query deconstruction, as opposed to missing information in the graph, it is important to be able to test graph extraction with the ‘right’ entities and relations and compare that against the natural language query. This shows which part of the process is failing.
You can see a few features that will aid in the process:
Structured Querying
- With Structured Querying, currently available in our SDK, you can specify the precise entities, relations, and values to extract from the graph:
query_structured(graph_id: str, entities: list[str] | None = None,
relations: list[str] | None = None, values: list[str] | None = None) -> Query
- The process of turning a question into the right entities and nodes to be extracted can be conceptually simple or extremely complicated, depending on the nature and purpose of the graph created. In a straightforward implementation of KG & RAG, we simply identify the relevant entities and relationships that the question mentions and retrieve accordingly. In a more complex implementation, which may require multi-hop retrieval, some level of reasoning about the relevant nodes and relationships may be required. Agentic systems may also be required if the graph represents specific types of data, such as SOPs or reasoning traversal pathways.
- As such, although we are building and providing a graph query engine, we focus on exposing the endpoints for folks to experiment with and implement their own graph query logic; a sketch of such custom logic follows.
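Putting the two together, a hedged sketch of naive custom query logic feeding the structured endpoint above. The matching logic is deliberately simple, and the client attribute path is illustrative rather than the exact SDK surface:

# client: the WhyHow SDK client from the earlier sketch (setup omitted).
# Naive query deconstruction: match known schema terms that appear in the question.
known_entities = ["Acme Corp", "Beta Labs", "San Francisco"]
known_relations = ["acquired", "headquartered_in"]

def deconstruct(question: str) -> tuple[list[str], list[str]]:
    q = question.lower()
    ents = [e for e in known_entities if e.lower() in q]
    rels = [r for r in known_relations if r.replace("_", " ") in q]
    return ents, rels

ents, rels = deconstruct("Who acquired Beta Labs?")
result = client.graphs.query_structured(  # attribute path is illustrative
    graph_id="<graph-id>",
    entities=ents,
    relations=rels,
)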
Vector Chunk Linking to Graph Nodes
- Reducing the full context of an unstructured piece of text to just triples (entity-relationship-entity) can strip away contextual information that exists within the text. Chunk Linking, a feature we have talked about before, therefore represents the vector chunk as a node that can be referenced and linked to existing triples, so that the full context can be brought in to answer a question. Tying knowledge graph nodes to vector chunks lowers the burden on the graph itself to represent all the underlying knowledge in the knowledge base. This reduces the precision needed to retrieve the right context from the graph, because there is an added margin of error in choosing the types of entities and relationships to identify for extraction. (A conceptual sketch of this linking appears at the end of this section.)
- A recent paper (HybridRAG) by BlackRock and NVIDIA shows that hybrid Vector and Graph RAG across a document outperforms Vector or Graph RAG alone. Having the flexibility to choose between constructing the answer from vector chunks or from graph triples is something that can easily be done through our SDK.
- In our SDK, you can call:
query_unstructured(graph_id: str, query: str,
return_answer: bool = True, include_chunks: bool = False) -> Query
By passing a natural language query to the graph, you can get back all the nodes, triples, and chunks in a structured manner, and can decide to construct the answer from either the vector chunks or the graph triples, allowing seamless interoperability between graph and vector retrieval in a single query.
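A hedged usage sketch of the call above; the attribute path is illustrative, and the exact shape of the returned Query object may differ:

# client: the WhyHow SDK client from the earlier sketch (setup omitted).
result = client.graphs.query_unstructured(
    graph_id="<graph-id>",
    query="Who acquired Beta Labs?",
    return_answer=True,
    include_chunks=True,  # also return the linked vector chunks
)
# Build the final answer from the graph triples, or fall back to the raw
# chunk text when the full source context is needed.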
- A key difference in our platform is that chunks are regarded as first-class citizens, given that they are ultimately the source of truth for the knowledge graph. Being able to look at the underlying vector chunk that a triple comes from allows us to better understand and manipulate the information that is represented in the graph.
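As a minimal conceptual sketch of chunk linking (a model of the idea, not our internal representation), each triple keeps references back to the chunk nodes it came from:

from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str  # the full-context source passage, represented as a node

@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    chunk_ids: list[str] = field(default_factory=list)  # links back to source chunks

t = Triple("Acme Corp", "acquired", "Beta Labs", chunk_ids=["chunk-17"])
# At answer time, return the triple itself or pull Chunk.text for full context.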
WhyHow.AI’s Knowledge Graph Studio Platform (currently in Beta) is the easiest way to build Agentic & RAG-Native Knowledge Graphs, combining workflows from developers and non-technical domain experts.
If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG for accuracy, memory and determinism, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our Discord.
Our Knowledge Graph Studio Platform is currently in Beta. If you are interested in helping to beta-test our platform and give us feedback, use one of our Beta Access Codes below (if a code is not working, it has already been claimed):
4e13e76e-11ca-4e51-a04b-60e1abda8e49
9462ba6f-a0af-4150-9c29-648393940fe7
108d24a5-2338-47d0-bcce-d0ca078c2a1f