LLMs and Knowledge — the study continues

Dean Allemang
10 min read · Apr 1, 2024

We’ve been seeing a lot of folks writing about Knowledge Graphs and LLMs (Large Language Models); I got a list of links from my colleague Bryon Jacob from his readings over the past weekend. I made my first blog post about this almost a year ago. At that time, I wondered whether we (i.e., Knowledge Graph folks) were just jumping on a bandwagon, or whether we were really on to something that would help LLMs perform better.

A few months ago, my colleague Juan Sequeda and I published a paper on this topic; I’ve written about it at some length. The point of that paper was to describe an experiment we performed. The main thesis was that if we think a knowledge graph and an LLM work better together than an LLM alone, we should do a bake-off: set up a problem for the LLM+Knowledge Graph to solve, and let the LLM+Relational Database work on the same problem. Collect the data from a few hundred runs, and see which does better, and by how much. The bottom line was that the LLM+KG worked roughly three times as well as the LLM+RDB. The details are a lot more nuanced, and they are available in the blog post I mentioned earlier, and in the paper it describes.

When we put together this experiment, it was our ambition that it could be the start of a systematic treatment of this question. Juan and I have our own idea of what it means to be a knowledge graph, but there are a lot of others. What actually works, and how well? Our intuition told us that graph data was an important aspect. But is it? Maybe some other approach does as well or better. In some conversations, fans of various RDB normal forms told us about enhanced logical models that could do well at this task. I’d love to see these approaches and compare them. Then we, as a research community, could learn what works and what doesn’t.

Toward this end, we made the data for the benchmark open source, and published it on GitHub. Juan even mentioned specifically that he would be thrilled if a team from Neo4j would pick up our benchmark and try it in Neo, whatever that means. We can’t do it ourselves; we’re not experts in the effective ways to use Neo.

So imagine my excitement when one of my customers recommended a YouTube link, in which Jesus Barrasa does exactly that: he takes our benchmark data, our ontology, and our mappings, creates a graph database in Neo according to the best practices that he knows, and gets GPT to generate a query. He hasn’t completed the whole benchmark, but in the video he suggests that he might. The video is a bit long, and the good stuff doesn’t start until Jesus comes on screen, about 9:30 into the clip. He speaks clearly; you can watch it slightly accelerated.

The GitHub Repository

You can check out the repository on your own, but I want to outline some of the important things that are in it.

ontology/ We included the ontology that we used to drive our “chat with the data” application; this ontology describes, in business terms, what kinds of entities and relationships there are in the domain. This is based on the P & C Data Model for Property and Casualty Insurance from the OMG. We tried to make the ontology involved enough to be interesting, but small enough to be digestible. It is published in the Web Ontology Language (OWL), and serialized as Turtle.
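To give a flavor of what’s in that file, here is a small fragment in the style of the published ontology. The namespace and the exact class and property names here are my own illustrative stand-ins, not necessarily the ones in the repository:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix pc:   <http://example.org/pc#> .   # hypothetical namespace, for illustration

pc:Policy a owl:Class ;
    rdfs:label "Policy" ;
    rdfs:comment "An insurance policy agreement." .

pc:PolicyHolder a owl:Class ;
    rdfs:label "Policy Holder" ;
    rdfs:comment "A party that holds one or more policies." .

pc:hasPolicyHolder a owl:ObjectProperty ;
    rdfs:label "has policy holder" ;
    rdfs:domain pc:Policy ;
    rdfs:range  pc:PolicyHolder .
```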

DDL/ We described the database in DDL as well.
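For example, DDL for two of the tables that come up later in this post (policy and agreement_party_role) might look something like this; the column details are my guesses for illustration, not the exact schema in the repository:

```sql
-- Illustrative sketch only; the actual DDL is in the DDL/ directory.
CREATE TABLE policy (
    policy_id     INTEGER PRIMARY KEY,
    policy_number VARCHAR(32) NOT NULL,
    start_date    DATE,
    end_date      DATE
);

CREATE TABLE agreement_party_role (
    policy_id INTEGER REFERENCES policy(policy_id),
    party_id  INTEGER NOT NULL,
    role      CHAR(2) NOT NULL  -- e.g., 'PH' for a policy holder
);
```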

investigation/ I put together a small ontology to describe an experiment; you don’t really need this to understand the specification (in fact, I see that I neglected to include this in the git repo). You can pretty much read this model out loud and understand it. There’s a thing called an investigation; it refers (in three ways) to three datasets: one specifies the model (the ontology), one the schema (i.e., the DDL), and one the data itself. An investigation pursues a bunch of probes, of which there’s just one type so far, namely an inquiry. Every probe (including an inquiry) expects some response, which is itself a query in some language (SQL or SPARQL for now).

What I did include in the git repo was the specification, according to this ontology, of the insurance benchmark. There are 44 probes, each with a question in English and two expected responses: a SQL query and a SPARQL query.
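Read out loud, a single probe in that specification comes across something like this; the term names here are stand-ins I made up for illustration, not necessarily the ones in the investigation ontology:

```turtle
@prefix inv: <http://example.org/investigation#> .  # hypothetical namespace

inv:probe01 a inv:Inquiry ;
    inv:question "How many policies does each policy holder hold?" ;
    inv:expectsResponse inv:probe01-sql , inv:probe01-sparql .

inv:probe01-sql a inv:Response ;
    inv:language "SQL" ;
    inv:queryText "SELECT ..." .      # the expected SQL query goes here

inv:probe01-sparql a inv:Response ;
    inv:language "SPARQL" ;
    inv:queryText "SELECT ..." .      # the expected SPARQL query goes here
```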

data/ The sample data. It is made up of a bunch of tables, which I publish as CSV files.
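So, for instance, a table like policy shows up as a plain CSV file, along these lines (columns and values invented for illustration):

```csv
policy_id,policy_number,start_date,end_date
1,POL-1001,2020-01-01,2021-01-01
2,POL-1002,2020-06-15,2021-06-15
```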

R2RML/ A mapping file from the ontology to the tables, expressed in R2RML. If you’re not familiar with R2RML, you can read the spec, or just know that it is a standardized language for expressing how an ontology corresponds to a relational database. You can make very simple correspondences in R2RML (e.g., “the class PC:Policy corresponds to the database table called ‘policy’”) or very elaborate ones (e.g., “the class PC:PolicyHolder corresponds to the selection of rows from the policy table for which the agreement_party_role table specifies that the role is ‘PH’”; this one is expressed in SQL). This tells you everything you need to know to find how statements described by the ontology correspond to statements in the database.
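To make that concrete, here is roughly how those two examples would be written in R2RML. As before, the namespaces, IRI templates, and column names are illustrative stand-ins; the real mapping file in the repository is the authority:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix pc: <http://example.org/pc#> .   # hypothetical ontology namespace

# Simple case: the class pc:Policy corresponds to the "policy" table.
<#PolicyMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "policy" ] ;
    rr:subjectMap [
        rr:template "http://example.org/data/policy/{policy_id}" ;
        rr:class pc:Policy
    ] .

# Elaborate case: pc:PolicyHolder is the subset of policy rows whose
# associated agreement_party_role entry has the role 'PH'.
<#PolicyHolderMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:sqlQuery """
        SELECT p.*
        FROM policy p
        JOIN agreement_party_role r ON r.policy_id = p.policy_id
        WHERE r.role = 'PH'
    """ ] ;
    rr:subjectMap [
        rr:template "http://example.org/data/policyholder/{policy_id}" ;
        rr:class pc:PolicyHolder
    ] .
```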

The Neo4J Implementation

I won’t go into detail on the implementation; Jesus did a fine job in that video. But I want to point out a few things.

He focused on the concept called “Policy Holder.” This is a great example, since it doesn’t correspond directly to a table in the database schema. In the R2RML mapping, this was defined as a SQL query, which described how to subset one of the tables to refer to policy holders. This sort of treatment is quite typical when putting a semantic layer on top of a relational database; there is some concept that has some business meaning (in this case, “Policy Holder”) that has no simple representation in the relational database. I can’t say that it has no representation at all; in this case, there is a field in the table with a particular value that means the row corresponds to a Policy Holder; so the concept’s representation in the database can be expressed in a SQL query (and usually a quite simple one).

At the finale of the demo, he generated and ran a query, which got the right answer. It was one of the easier questions, but let’s be clear: he wasn’t sandbagging. He chose it for its simplicity, so that it would be easy to follow in the context of a demo webinar. A full experiment would challenge this setup with the benchmark’s broader range of questions to really test performance.

Key Observations from Neo4j’s Solution

Even from this simple setup, we get some insights into the consequences of the different design choices in these approaches. One of them has to do with materialization vs. virtualization, a well-known trade-off that has been explored in other settings.

In a materialization solution, the data from the original source is transformed and stored in another data resource (“materialization” — the data is made “material” in the new system); in this example, the data was queried from the original relational source and stored in Neo4j. This has a lot of nice features: you can spend all the time you want materializing, and the system performs quickly at query time. You can review your transformation and redo it again and again until you like it. The disadvantage is that the original data will move on, and the materialized view will quickly fall out of sync with the original (in the case of real-time data, it is out of sync as soon as it is materialized).

“Virtualization,” on the other hand, leaves the data where it is and performs the transformation at query time. The data is never duplicated, so there is no synchronization issue, but there is a possible performance hit at query time to do the query translation. Data virtualization platforms (a long-standing and successful product category in its own right, quite apart from graph data platforms) deal with this trade-off all the time, so doubts about the fidelity of data virtualization are misplaced. In many settings, for business or security reasons, duplicating the data is simply a non-starter, which is why that product category exists in the first place. Presumably, the accuracy of the chat-with-data conversion doesn’t depend on the choice between materialization and virtualization.

Expanding our Knowledge

The point of the original experiment, and our motivation for making the data and the ontology open, was to recruit a research community to investigate what sorts of strategies do and don’t work to improve the performance of LLMs in a chat-with-your-data scenario. This webinar takes huge strides toward doing just that, and I can’t applaud Jesus and his team enthusiastically enough for taking this on.

First, it shows an alternative way to manage the data in the experiment. We presented it as tables and described it with a DDL; that’s because relational databases are still the dominant way for data to be managed in industry. But all of us graph data fans know that there’s a lot more you can do with data, once you get it out of the table. We did it one way, Jesus did it another. This is exactly the sort of exploration we want to promote.

Then he took it end-to-end; he connected his graph data representation to the LLM, used a bare-bones prompt (very similar to the one we used, and documented in the paper), and ran the resulting query against the data, getting the expected answer.
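To give a sense of what “bare-bones” means here, the prompt in this style of experiment looks roughly like the following. This is my paraphrase of the general approach, not the exact wording from our paper or from the webinar:

```
You are given an OWL ontology, written in Turtle, that describes an
insurance domain. Write a SPARQL query that answers the user's question.
Return only the SPARQL query.

Ontology:
  <the ontology, serialized as Turtle, goes here>

Question: <the English question from the probe goes here>
```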

What we didn’t see in the webinar is the follow-on analysis: how well does this approach score in comparison to other approaches, and which questions does it do well on? But that’s not what the webinar was about.

At the end of the webinar, they fielded two questions from the audience, which are exactly the right questions to be asking of any knowledge-focused approach to question answering. These are questions we’ve been asked as well; they are:

  1. If you have a big ontology, you might run into token limits. What can you do about this?
  2. How do you do quality control on the queries?

Juan and I have gotten these questions in very specific settings; in fact, our customers have already run into these issues. It turns out that the fact that an ontology is itself a bunch of RDF data makes the answers to both of these questions a lot simpler. For the first question, you can query your ontology to find a subset that is still relevant to your question; you use the same query language for the ontology that the LLM is using for the data. The second question works the same way: the ontology provides information (domains, ranges, and the like) that you can use to determine whether a generated query makes sense.
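For instance, here is a minimal sketch of the first idea, assuming the ontology annotates its terms with rdfs:label. It pulls out just the object properties whose labels mention a term from the user’s question, together with their domains and ranges:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?prop ?label ?domain ?range
WHERE {
  ?prop a owl:ObjectProperty ;
        rdfs:label ?label .
  OPTIONAL { ?prop rdfs:domain ?domain }
  OPTIONAL { ?prop rdfs:range  ?range }
  # "policy" stands in for a term extracted from the user's question
  FILTER ( CONTAINS(LCASE(STR(?label)), "policy") )
}
```

That subset, serialized back as Turtle, is what goes into the prompt in place of the full ontology. The same trick serves the second question: the domains and ranges retrieved this way tell you whether a generated query connects classes and properties in a way the ontology actually sanctions.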

Successful Sharing of Data

I can review this solution all I want, making comparisons to our solution and to others, and we’ll probably do that elsewhere. But I want to return to one of the themes of this blog in general, which is data sharing. In the very first entry in this blog, I outlined the advantages of data sharing. I’ve written about how just loading data into a database doesn’t actually help to share it. So how do we share data, and actually get someone to use it? Jesus used the data and metadata in our benchmark and completed an experiment end-to-end, and we didn’t even know he was doing it. That’s the holy grail of data sharing: collaborative value.

How did we do it? Obviously, the recipient of the data is as involved in this as the suppliers, so Jesus is as much a part of this success story as we are. But we’re the ones who made the choice to make our data as FAIR as possible (that is, Findable, Accessible, Interoperable, and Reusable) by publishing it using open data and metadata standards.

There’s a tendency, when talking about property graphs and the semantic web, to want to take sides; if we were to do this, then Jesus and I would be on “opposite sides.” But I have always felt that this side-taking is just an artifact of a desire to couch any interaction in military or sports terms; science doesn’t have to proceed that way. Jesus, someone on the “other side,” is quite fluent in RDF, RDFS, and OWL, and was able to read the ontology we published. He didn’t have to ask us what the statements meant, or even ask for any clarification at all. He even got the concept that he focused his study on, namely “policy holder,” by looking at our ontology and R2RML file. He learned everything he needed to know to replicate the experiment, without having to talk to us.

The data itself was in CSV, not a semantic web language. So did we fall down on our FAIR presentation of the data? In some sense, we did; by publishing the data as CSV, we seriously curtailed its Interoperability. It is difficult or impossible to know whether any of the agents mentioned in this data correspond to any agents described anywhere else. But this is dummy data, intended for use in experimental settings, so the “I” in FAIR is less important. We knew we were compromising interoperability when we made this choice, but CSV is pretty easy to work with.

This experience also highlights the fallacy I treated in “just load it into a graph database”; I did in fact load the data into a graph database, and it is available in a public dataset on data.world. But that doesn’t help Jesus at all; the point of the experiment was to understand different ways to represent and manage this data, as a graph, a table, or any other way someone can think of. Making the data available (even as CSV on GitHub) satisfied this need. And describing the metadata, and the correspondence between that metadata and the data itself, is what allowed Jesus to do what he did. That had nothing to do with where the (meta-)data was stored, and everything to do with how it was presented.

Call for Collaboration

Jesus, by collaborating with us in this investigation, has joined us in treating this topic as a scientific endeavor, and has contributed to the advancement of that investigation. Jesus, if you’re reading this, I really want to encourage you to do the other 43 questions and keep score; let’s see what we can learn together about making this work. I’d offer to help you out, but it sounds like you’re doing just fine on your own. I look forward to reading your paper about this.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.