LLMs, Knowledge Graphs and Property Graphs

Dean Allemang
Apr 9, 2024

In response to my last blog, LLMs and Knowledge — the study continues, I got a lot of questions on LinkedIn, which I decided to answer in another blog, since there was a lot of overlap among them. Just a reminder: that blog was in response to a YouTube video from Jesus Barrasa of Neo4j, in which he loads the data from our "chat with data" benchmark into Neo4j. Many of the questions came as private communications, so I won't thank the askers by name, but I do want to thank those who are following this on LinkedIn for your continued questions and discussion.

Review: Using Neo4j with the benchmark data

One of the questions asked for detailed instructions on how to load this data into Neo4j; my response is that the only instructions I have are what we see in the YouTube video from Jesus Barrasa that prompted this discussion. The video is pretty long, so I'll include timestamps for the relevant steps.

First, Jesus reviews the materials in the GitHub repository. At 43:16 he examines the R2RML mapping file that describes which classes and properties in the ontology map to which tables (or, as he emphasizes, parts of tables) and columns (respectively) in the data; at 45:20 he examines the ontology expressed in OWL (specifically as Turtle)¹. The data itself is a bunch of CSVs; I don't think he reviews those, since they are pretty straightforward and familiar to his viewers.

Now, Jesus has some facilities for processing semantic web data and metadata with Neo4j; he uses these facilities to process the OWL into a graph structure that he can show in the Neo4j viewer; you can see him run this at 46:00. He explains it in some detail, and you can see the graph view of the model at 48:10 in the video. He has limited the ontology to just three classes, because these are particularly interesting ones; one reason they are interesting is that PolicyHolder and Agent both correspond to the same table in the underlying database ("party"). That is, the business has two concepts (PolicyHolder and Agent), which the database models as a single table. You can check his work; the diagram snippet that he shows at 48:10

Screenshot from the Neo4j implementation

matches the model in our original paper

Snapshot from the original paper (highlight added) https://arxiv.org/abs/2311.07509

Or at least, nearly so; Jesus has chosen to model the relationship between a Policy and its PolicyHolder in the direction Policy hasPolicyHolder PolicyHolder, whereas in the original we did it the other way around, PolicyHolder hasPolicy Policy. This is a stylistic choice, and makes no essential difference to the representation.
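To make the direction question concrete, here is roughly what the two alternatives look like as OWL object properties in Turtle. The namespace and exact property names are illustrative; they are not copied from either model.

```turtle
@prefix in:   <http://example.org/insurance#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Direction used in the Neo4j demo: the Policy points at its PolicyHolder.
in:hasPolicyHolder a owl:ObjectProperty ;
    rdfs:domain in:Policy ;
    rdfs:range  in:PolicyHolder .

# Direction used in the original benchmark: the PolicyHolder points at the Policy.
in:hasPolicy a owl:ObjectProperty ;
    rdfs:domain in:PolicyHolder ;
    rdfs:range  in:Policy .
```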

Now that he has this model in Neo4j, he can map the CSV data into a graph. He spends some time setting up the mapping using a user interface that is made for this purpose; most of the mappings are pretty obvious, but he focuses some time on one that isn't: an Agent in the ontology maps to the party table only when the value of Party_Role_Code is "AG". You can see this right at 52:30. This information was of course already spelled out in the R2RML code that he showed way back at 43:16; I can't tell from the video whether he got the information for his mapping by looking at the R2RML code (i.e., we were "telling" him) or whether he "figured it out" again.
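For readers who haven't seen R2RML, a conditional mapping like that is typically expressed with a SQL filter on the logical table. The snippet below is a minimal sketch of what such a rule could look like; the URI template, the column name Party_Id, and the namespace are assumptions, not taken from the benchmark's actual mapping file.

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix in: <http://example.org/insurance#> .

# Hypothetical mapping: rows of the party table count as Agents
# only when Party_Role_Code is 'AG'.
<#AgentMap>
    rr:logicalTable [ rr:sqlQuery "SELECT * FROM party WHERE Party_Role_Code = 'AG'" ] ;
    rr:subjectMap  [ rr:template "http://example.org/insurance/agent/{Party_Id}" ;
                     rr:class    in:Agent ] .
```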

Now that he has the mappings, he can convert the data into Neo4j. He has a program that can take the output of the metadata converter we saw at 46:10, plus the mappings we saw at 52:30, and run them against the raw data. He comments at 53:58 that the run itself takes very little time, since the dataset is so small.

Now the data is in Neo4j, and it has been aligned to the ontology. That is, it no longer carries the original table and column names from the database; it is written in the business language of the ontology.
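To illustrate what "written in the business language" means, here is a hypothetical node-and-relationship pattern of the kind you would expect after the import. The names and the sells relationship are illustrative, not lifted from the video; the point is that labels and relationship types come from the ontology rather than from names like party or Party_Role_Code.

```cypher
// Hypothetical post-import data: ontology-level labels and relationship types,
// with no trace of the underlying table or column names.
CREATE (a:Agent {name: 'Pat Example'})
CREATE (p:Policy {policyNumber: 'P-1001'})
CREATE (h:PolicyHolder {name: 'Chris Example'})
CREATE (a)-[:sells]->(p)
CREATE (p)-[:hasPolicyHolder]->(h)
```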

Now he does something like what I described in my data vs. metadata blog: he makes a summary of the metadata description of this data. He actually does this twice; we see the summary of the whole model at 46:41, but he has also made a pared-down version, shown at 59:52, which appears in the expanded prompt sent to the LLM.

The rest of the video is the question answering itself; armed with this terse description of the model and a natural-language question, the LLM writes an accurate Cypher query to answer it.
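To give a flavor of the output, a generated query might look something like the sketch below. This is not the query from the video; the question, property names, and relationship types are all hypothetical, chosen only to match the business-level model described above.

```cypher
// Hypothetical question: "Which agents have sold the most policies?"
MATCH (a:Agent)-[:sells]->(p:Policy)
RETURN a.name AS agent, count(p) AS policiesSold
ORDER BY policiesSold DESC
```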

Enterprise Data Modeling and LLMs

Another question I got was about the difference between using a property graph vs. an ontology for enterprise data modeling. The first thing to point out is that the question itself poses a false dichotomy; Jesus makes very clear the role that an ontology plays in constructing a business-level data graph from a more 'physical' data representation. He even shows how it is possible to reverse engineer an ontology from such a data structure; the ontology is there in the property graph, whether you put it in explicitly (as Jesus did in this example) or it just grew up there organically. And when it came to talking to the LLM, Jesus, just as I did in our benchmark, sent a representation of the ontology to the LLM.

Now that I have that out of the way, there is one aspect of enterprise data management that I can opine on: whether it is valuable to express an ontology in a standard form that has a logical meaning, rather than a meaning given by the execution of some program. In this example, the ontology for the insurance benchmark was expressed in the Web Ontology Language OWL, a language whose specification is grounded in a fragment of first-order logic called Description Logic. This logical foundation means that the meaning of an OWL ontology can be understood independently of any program that processes it. We actually saw that happen in this example: we were able to communicate our business model to the Neo4j team without having to provide a program that can process it; in fact, Neo4j already had such a program of their own, which Jesus demonstrated at 45:33.
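As a tiny example of what "meaning independent of any program" looks like, consider an axiom like the one below (the class names are illustrative, not drawn from the benchmark ontology). Under OWL's formal semantics it says that every Agent is a Party, and any conformant tool, from any vendor, is obliged to read it the same way.

```turtle
@prefix in:   <http://example.org/insurance#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Illustrative axiom: its meaning is fixed by the OWL/RDFS specifications,
# not by whichever program happens to load it.
in:Agent rdfs:subClassOf in:Party .
```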

But the value of having an explicit ontology goes beyond the value of publishing self-describing metadata on the web (though that's certainly a big one; just ask industry groups like the EDMC how they publish their industry models). Within an enterprise, different divisions will have data that has been represented ("physically", if you will) in different ways; this would correspond to different sets of CSVs in the benchmark example. And different business units might have different ways to talk about that data. A sustainable enterprise data management policy needs to treat each of these as a resource in its own right: each database, each business ontology, and each mapping from one of these to another. Unless we want to take on the responsibility of dictating operational procedures to different lines of business (the specific divisions or sectors within the company), we need to provide a marketplace where all of these things can interact.

Then there’s the industry level. The EDMC had a wild idea way back in 2010 that in order to understand the banking failures that led to the crisis of 2008, you have to treat the data in an entire INDUSTRY as a governable entity. We can’t settle even for applications that manage to keep the data in line for a single banking institution, we need to know how that data is managed across the industry. When you think of the charter of the EDMC, which was, as early as 2010, to provide a means for managing data across the entire banking industry, you realize that this was a Big Audacious endeavor. This can’t even be imagined without treating structured business vocabularies and data models (let’s just call these things “ontologies” for short) as separate items to be managed in their own right, alongside data standards, reference data sets, and other industry-level data resources.

Impact on LLMs

So ontologies, particularly independently governed and managed ontologies, are essential for sustainable enterprise and industrial level data management. But what about LLMs?

In this example, Jesus showed how a simple summary of an ontology can inform an LLM well enough for it to produce a correct query in response to an English-language question. He did a lot of work to process the published ontology: he converted it to a form that Neo4j could use, used it to provision a Neo4j dataset aligned with it, and then ran the LLM-generated Cypher query against that dataset. There's clearly some valuable information in that ontology summary; the LLM wrote the query without seeing a single example of the data it was querying over.

This is the same thing we did in our original benchmark study; we provided the ontology to the LLM and asked it to build a SPARQL query. But we didn't process the ontology at all; we took the output directly from gra.fo (a general-purpose ontology editor, and the source of the model diagram I quoted above) and fed it right to the LLM. We mapped the ontology to the data, exactly as Jesus did (the UI we have for that mapping is a bit different from the one Jesus has, but it accomplishes the same result), and, just as in this example, the mapping itself is not visible to the LLM; it doesn't need to know about it. Our examples show that the LLMs (well, GPT-3.5 and GPT-4 at least) already understand OWL quite well.
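For comparison with the Cypher sketch above, here is the kind of SPARQL a model might produce for the same hypothetical question. Again, the prefix and property names are illustrative rather than taken from the benchmark ontology.

```sparql
PREFIX in: <http://example.org/insurance#>

# Hypothetical question: "Which agents have sold the most policies?"
SELECT ?agent (COUNT(?policy) AS ?policiesSold)
WHERE {
  ?agent a in:Agent .
  ?agent in:sells ?policy .
}
GROUP BY ?agent
ORDER BY DESC(?policiesSold)
```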

Another advantage of an explicit ontology artifact is the ability to manage it on its own. In his treatment, Jesus used his ontology management tools to condense the ontology (you see him do that when he selects classes to include at 45:00 and when he selects properties to include starting at 48:20). Nowadays, selecting the relevant parts of a large document to help an LLM focus is the subject of a process called RAG ("Retrieval Augmented Generation": you use some Retrieval mechanism to select the information you will use to Augment the Generative AI's performance). In this case, Jesus illustrated how the process works by doing it by hand ("Researcher Augmented Generation"?); in a larger-scale system, this process would be automated.

In our benchmark, we found that this ontology was small enough that we didn’t need to do any RAG on it, and we also found that the LLM already speaks OWL well enough that it was able to sort out what part of the ontology is relevant to the question on its own. But some of our customers have developed ontologies large enough that they don’t fit into context windows, so RAG has become important. Working alongside those customers, we’ve learned a lot about effective ways to do classic RAG on ontologies (using a vector database), as well as more knowledge-directed ways to do RAG that utilize the structure of OWL. None of this is possible without an explicit representation of the ontology.
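As a sketch of what "knowledge-directed" retrieval can look like, the query below runs over the ontology itself (not the data) and pulls out just the object properties whose labels mention a term from the user's question, along with their domains and ranges, so that only that fragment goes into the prompt. The filter term is illustrative, and a production system would be considerably more sophisticated.

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Extract the slice of the ontology relevant to a question about "policy".
CONSTRUCT {
  ?prop rdfs:label  ?label ;
        rdfs:domain ?domain ;
        rdfs:range  ?range .
}
WHERE {
  ?prop a owl:ObjectProperty ;
        rdfs:label  ?label ;
        rdfs:domain ?domain ;
        rdfs:range  ?range .
  FILTER ( CONTAINS(LCASE(STR(?label)), "policy") )
}
```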

So, in addition to allowing your business units to communicate with one another and allowing your industry trade organizations to communicate with multiple institutions, an explicit ontology also lets you communicate with an LLM more readily. LLMs already understand ontological language, and the structure of that language facilitates a variety of RAG strategies. This really isn't surprising, since LLMs are, above all else, masters of communication. Anything that helps people and organizations communicate is likely to be embraced by an LLM.

A New Horizon for Ontologies and LLMs

So my take on this is that the LLMs have really ushered in a new day for ontologies; whether you use them to provision a property graph database like we saw Jesus do here, or to organize industry-level cooperation like the EDM Council does, ontologies as first-class entities in sustainable data management are key. And don’t just take my word for it; the LLMs seem to agree.

— — — — — — — — — — — — — — — — — — — — — — — — — -

¹ For those of you not familiar with how OWL is expressed in a text file: first off, OWL is expressed as a graph, in RDF; that is, an OWL file is just a bunch of triples. There are several ways to write triples into a file; think of this like cursive vs. hand printing vs. a typeset font like Times New Roman: the message is exactly the same, but the writing looks very different. We have different "serializations" of RDF for different purposes. One of the most popular serializations of RDF is called Turtle; this is the one Jesus mentions in his video.
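For instance, here is the same single statement written in two different serializations; the triple is identical, only the notation changes (the class name is illustrative). First in Turtle:

```turtle
@prefix in:  <http://example.org/insurance#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

in:Agent a owl:Class .
```

and then the equivalent line in the more verbose N-Triples serialization:

```
<http://example.org/insurance#Agent> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
```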


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.