Microsoft GraphRAG with an RDF Knowledge Graph — Part 2

Uploading the output from Microsoft’s GraphRAG into an RDF Store

Ian Ormesher
4 min read · Aug 17, 2024

If you’ve followed the steps in Part 1 of this series, you should now have some parquet files. We’re going to take those parquet files and transform them into a Turtle file that describes our linked data. You can read about the Turtle format here, and there’s a small example of it just after the list of steps below. We’ll then import that Turtle file into an RDF store, but before we do, we’ll import an ontology file I’ve created to describe the Microsoft GraphRAG output. We’ll also create a vector index that we can use for our semantic searching. Having done all that, we’ll be in a position to do RAG with the help of our RDF Knowledge Graph; that will be in Part 3. So, the steps in this part are:

  • Import Ontology into an RDF store
  • Create and import data that fits the Ontology into the RDF store
  • Create a vector index
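
To give a flavour of what we’re building towards, here’s a minimal, hypothetical Turtle fragment for a single entity. The namespace and property names are illustrative assumptions, not necessarily the ones used in the actual ontology file:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gr:   <http://example.org/msft-graphrag/> .   # hypothetical namespace

# One GraphRAG entity and a relationship, expressed as linked data
gr:entity-d3f1 a gr:Entity ;
    rdfs:label "ACME Corp" ;
    gr:description "A fictional company used here purely as an example." ;
    gr:relatedTo gr:entity-a9c2 .
```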

Import Ontology into an RDF store

An ontology is a formal description of knowledge as a set of concepts within a domain and the relationships that hold between them. Ontologies are an important basis for Knowledge Graphs, and you can read more about them here. The W3C has created a specification for defining them called OWL 2. I’ve created an ontology for the Microsoft GraphRAG output that we’ll be using, and it looks like this:

Microsoft GraphRAG ontology
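
In OWL 2 terms (serialised here as Turtle), a small fragment of an ontology like this might look roughly as follows. The class and property names below are my own illustrative guesses rather than the exact ones in the exported file:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix gr:   <http://example.org/msft-graphrag/> .   # hypothetical namespace

gr:Entity a owl:Class ;
    rdfs:label "Entity" .

gr:description a owl:DatatypeProperty ;
    rdfs:domain gr:Entity ;
    rdfs:range  xsd:string .
```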

I created this ontology using a great product called Metaphactory and then exported it as an OWL file. Check the link in the Resources section to get a copy of this ontology file together with all the other files (and notebooks) related to this part.

Ontologies are an important part of Knowledge Graphs. They allow the data and its relationships to be defined in a standardised way, and they make it possible to validate the data and to explain it. It’s really important that the systems we create aren’t simply black boxes that no one can understand. We need to be able to explain our systems and their outputs, and to verify the data that goes into them.

First we’ll create an empty repository in an RDF store and then we’ll import the ontology file. I’m using GraphDB as my RDF store (since they provide a free version), so you would create a new repository (I named it msft-graphrag) like this:

Create repository in GraphDB

Once you’ve created the repository you can then import the ontology OWL file I’ve provided:

import ontology into the repository

When you import it, make sure you specify “The default graph”:

import into the default graph
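
If you’d rather script the import than click through the Workbench, GraphDB also exposes the standard RDF4J REST API. Here’s a minimal Python sketch, assuming GraphDB is running on localhost:7200 and the repository is named msft-graphrag as above (the filename is hypothetical):

```python
import requests

# POST the ontology to the repository's statements endpoint.
# With no "context" parameter, the data goes into the default graph.
with open("msft-graphrag-ontology.ttl", "rb") as f:  # hypothetical filename
    resp = requests.post(
        "http://localhost:7200/repositories/msft-graphrag/statements",
        data=f,
        # Use "application/rdf+xml" instead if your OWL export is RDF/XML
        headers={"Content-Type": "text/turtle"},
    )
resp.raise_for_status()
```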

Create and import data that fits the Ontology into the RDF store

Now that we’ve imported the ontology into the RDF store, we can safely create and import the data.

I have created a Jupyter notebook that reads in the parquet files produced as output from Part 1 and then creates a corresponding Turtle file that can be imported into the graph. If all you want to do is import the data without worrying about how it was created, I have also provided two Turtle files that you can use: one produced with a chunk size of 300, and the other with a chunk size of 1200 plus correlation data. Check out the Resources section at the end for those files.
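
The conversion itself is straightforward with pandas and rdflib. Here’s a minimal sketch of the idea, assuming an entities parquet file with id, name and description columns (the exact file and column names depend on your GraphRAG version, and the namespace is again illustrative):

```python
import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF, RDFS

GR = Namespace("http://example.org/msft-graphrag/")  # hypothetical namespace

# File and column names may differ between GraphRAG versions
entities = pd.read_parquet("output/create_final_entities.parquet")

g = Graph()
g.bind("gr", GR)

for _, row in entities.iterrows():
    node = GR[f"entity-{row['id']}"]
    g.add((node, RDF.type, GR.Entity))
    g.add((node, RDFS.label, Literal(row["name"])))
    g.add((node, GR.description, Literal(row["description"])))

# Serialise the whole graph as Turtle, ready to import into GraphDB
g.serialize("entities.ttl", format="turtle")
```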

I have also created another notebook to analyse the data that is now in our Knowledge Graph. That’s also on my GitHub.
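
As a taste of the kind of analysis that becomes possible, here’s a small example that counts the instances of each class via GraphDB’s SPARQL endpoint, using the SPARQLWrapper library (endpoint URL and repository name as assumed above):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/repositories/msft-graphrag")
sparql.setReturnFormat(JSON)

# How many instances of each class does the Knowledge Graph contain?
sparql.setQuery("""
    SELECT ?class (COUNT(?s) AS ?count)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?count)
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["class"]["value"], row["count"]["value"])
```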

Create a Vector Index

In the last part of my Jupyter notebook I show how to create a vector index for the data using Elasticsearch. There is a free version of this product that you can use, which you can download from here.

We’ll create an index for the Entity class, storing the description_embedding field. This field is in the parquet file for the entities, so we’ll tie it to the id field.
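
Here’s a minimal sketch of what creating that index might look like with the official Elasticsearch Python client. The index name is hypothetical, and the embedding dimension is an assumption (GraphRAG’s default OpenAI embedding model produces 1536-dimensional vectors, so adjust dims to match your configuration):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One document per Entity: its id plus its description embedding
es.indices.create(
    index="entity",  # hypothetical index name
    mappings={
        "properties": {
            "id": {"type": "keyword"},
            "description_embedding": {
                "type": "dense_vector",
                "dims": 1536,  # match your embedding model's output size
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)
```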

Having created this index, we’ll use it in Part 3, together with our RDF Knowledge Graph, to generate better responses to users’ questions.

Resources
