LLMs Closing the KG Gap

Dean Allemang
12 min read · Mar 23


I have been doing successful knowledge graph deployments with my clients for just a bit over ten years, usually with one of the big banks, but also in pharmaceutical companies and media companies. There’s a pattern to how a lot of these projects go, and that pattern makes use of a lot of standard, out-of-the-box technologies (like Semantic Web standards and RDF databases). But there’s one aspect of the project that requires human judgement: linking an ontology to the particulars of one or more database schemas. For example, just because the ontology calls something a “Customer” doesn’t mean that a database schema will; the schema might use a similar word (like “Client” or “Guest”), or even a nonsense string (like “Concept123”). It takes a human to make the connection. Or at least it has, up until now.

It seems that every day we hear about something else that Large Language Models like GPT can do that we hadn’t thought of before. Just last week, we saw Nine ChatGPT Tricks for Knowledge Graph Workers, which shows how we can use an LLM to assist with knowledge graphs. In this blog, I’m going to do a deep dive to show just how ChatGPT can be used to close the judgement gap in knowledge graph construction and deployment, by suggesting mappings between an ontology and a data schema.

This is going to happen in three parts:

  1. We’ll look at an ontology in OWL and a database, and talk about how to map one to the other.
  2. Using this mapping and a virtualization engine like the one data.world has, we’ll show how to write a simple, business-level query that answers a deceptively simple question.
  3. We’ll show how to prompt ChatGPT to write that mapping, and the query while we’re at it.

Mapping the Ontology to a Data Schema

First off, let’s revisit a model we already know about from an earlier blog post. It describes the construction of a custom computer out of three kinds of components: motherboard, memory, and processor. That model (which was also built by ChatGPT) looks like this:

Model built by ChatGPT that describes how a computer is put together. Diagram from gra.fo
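For readers who want to see the model as triples, here is a rough Turtle sketch reconstructed from the class and property names used in the mappings and queries below; the ex: prefix IRI and the grouping class ex:Component are my guesses, not part of the original:

```turtle
@prefix ex:   <http://example.com/ontology/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:Computer    a owl:Class .
ex:Component   a owl:Class .
ex:Motherboard a owl:Class ; rdfs:subClassOf ex:Component .
ex:Memory      a owl:Class ; rdfs:subClassOf ex:Component .
ex:Processor   a owl:Class ; rdfs:subClassOf ex:Component .

ex:hasComponent a owl:ObjectProperty ;
    rdfs:domain ex:Computer ;
    rdfs:range  ex:Component .

ex:hasPrice a owl:DatatypeProperty ;
    rdfs:range xsd:decimal .

ex:hasname a owl:DatatypeProperty ;
    rdfs:domain ex:Computer .
```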

We have that model; so what? We can use it to answer questions about how computers are put together, or we could even use it to design a database that might power part of our custom computer construction business. But a much more common use case is to use it as a reference point in a data mesh. How does that work?

Let’s start small. Suppose we have a database that describes the orders we’ve got. It has four tables, previews of which are shown in the next four figures:
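To keep the table shapes in view, here is a sketch of the four schemas as SQL DDL; the table and column names are the ones used throughout this post, but the column types are my guesses:

```sql
CREATE TABLE configuration (customer TEXT, motherboard TEXT, memory TEXT, processor TEXT);
CREATE TABLE board         (sku TEXT, price DECIMAL);
CREATE TABLE memory        (partid TEXT, supplier TEXT, reseller TEXT, capacity TEXT, offer DECIMAL);
CREATE TABLE processor     (partnumber TEXT, speed TEXT, architecture TEXT, pricesheet DECIMAL);
```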

These tables already share a lot of structure with the model: they correspond one-for-one with the classes, and they include columns that cover the information described in the model. But which one is which?

You could imagine a user interface that would allow a well-informed user to provide this information: the Computer class in the model corresponds to the computing_configuration table in the database, and the identifier for that is found in the customer field; the motherboard field is a foreign key reference to the board table, which corresponds to the Motherboard class; and so on. In fact, data.world has exactly such a capability in its gra.fo tool, whereby a data manager can use the model to define mappings of this sort. The results are expressed in a standard called R2RML, a language for describing mappings like this. R2RML is expressed in RDF, so it can be serialized in Turtle.

Here’s an excerpt of a mapping in R2RML. It isn’t the friendliest representation, but it is pretty straightforward:

<#TriplesMapConfiguration>
    a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "configuration" ] ;
    rr:subjectMap [
        rr:template "http://example.com/computer/{customer}" ;
        rr:class ex:Computer
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:hasComponent ;
        rr:objectMap [
            rr:template "http://example.com/computer/motherboard/{motherboard}"
        ]
    ] ;

...

<#TriplesMapBoard>
    a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "board" ] ;
    rr:subjectMap [
        rr:template "http://example.com/computer/motherboard/{sku}" ;
        rr:class ex:Motherboard
    ] ;

...

This snippet says that we’re going to map the table called “configuration” (that’s the first table I showed, above) to the class in the ontology called Computer (at the top of the model diagram). Furthermore, we’re going to use the column called customer to create IRIs for the instances of that class. It makes a similar statement about the table called “board” and the class Motherboard.

The fun starts when we connect them together: the mapping says that we’ll use the predicate hasComponent to connect the Computer to the Motherboard. Not only does it connect to a member of the class Motherboard, it uses a template (filled in with the column motherboard) so that the reference matches the IRI created for the motherboard (from the column sku). The R2RML goes on in this way, covering all the components. It does something similar with the prices:

rr:predicateObjectMap [
    rr:predicate ex:hasPrice ;
    rr:objectMap [ rr:column "price" ; rr:datatype xsd:decimal ]
] ;

...

rr:predicateObjectMap [
    rr:predicate ex:hasPrice ;
    rr:objectMap [ rr:column "pricesheet" ; rr:datatype xsd:decimal ]
] ;

...

rr:predicateObjectMap [
    rr:predicate ex:hasPrice ;
    rr:objectMap [ rr:column "offer" ; rr:datatype xsd:decimal ]
] .

The first of these is for the motherboard; it says that the property hasPrice is mapped to the column price in the board table. There are similar mappings for the columns pricesheet and offer in the other tables. You can see the whole R2RML file for this example here.
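Putting the pieces together, a single configuration row would be virtualized into triples along these lines; the customer name, SKU, and price here are made up for illustration, and the ex: prefix IRI is a guess:

```turtle
@prefix ex:  <http://example.com/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/computer/acme>
    a ex:Computer ;
    ex:hasComponent <http://example.com/computer/motherboard/mb-100> .

<http://example.com/computer/motherboard/mb-100>
    a ex:Motherboard ;
    ex:hasPrice "129.99"^^xsd:decimal .
```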

Using the mappings

Once these mappings are done, the results are pretty impressive. Suppose you have a very simple business rule:

the price of a computer is given by the sum of the prices of all of its components.

The business rule doesn’t mention what types the components are, or how many there are, or how they are related to the computer, just that they are its components. Working against the original tables, you need to include all three joins (one to each of the component tables) and all three price columns (named price, offer, and pricesheet, respectively) in your query. That query might look something like this:

SELECT customcomputing_configuration.customer, price + offer + pricesheet
FROM customcomputing_configuration
JOIN board_board ON customcomputing_configuration.motherboard = board_board.sku
JOIN memory_memory ON customcomputing_configuration.memory = memory_memory.partid
JOIN processor_processor ON customcomputing_configuration.processor = processor_processor.partnumber

To write this query, you need to know the foreign key names from the configuration table (motherboard, memory, processor) and their corresponding primary keys in the other tables (board.sku, memory.partid, processor.partnumber), and the names of the columns with prices. All of this information was already modeled in the mapping above. How can we make use of that?

The information in the mapping lets the query be very simple:

SELECT ?customer (SUM(?much) AS ?total)
WHERE {
    ?s ex:hasComponent ?comp ;
       ex:hasname ?customer .
    ?comp ex:hasPrice ?much .
}
GROUP BY ?customer

The business rule is written almost verbatim in SPARQL; ?s has a name that is ?customer, and it has some components, which each have prices. Sum up those prices, and you have the total for each customer. The query doesn’t need to mention anything that is specific to each type of component, just as the business rule doesn’t mention those things.

This is the basis of how a knowledge graph can provide an extensible data system (I am tempted to use words like “fabric” and “mesh” for this, but those words have very specific meanings nowadays). A simple model is mapped to various tables, and the relationships between the tables are represented in their own mappings. I didn’t do it in this short example, but there’s no reason you have to map to tables in just one database; you could map the same model to lots of databases, and query them all at once. Once you do that, you can write queries in the language of the model.

I once had a customer who made the initial investment to map the tables in the database to the classes in the model, making a sort of leap of faith that when they did, something magical would happen (that model and those tables were quite a bit more elaborate than the ones in this blog). When we wrote the first model-level query, my customer commented, “When I started this project, I was dating Knowledge Graph. Now I’m falling in love with Knowledge Graph.” The investment was worth the trouble.

Trouble in Paradise

But there’s a fly in this ointment; my romantic customer was willing to put in the work up front to make the mappings, without any visible value. This violates a principle of innovative deployment that I like to call “incremental value from incremental effort”; you can’t boil the ocean, but if you can show value by boiling a cup of water, you’ll be given the resources to boil the next cup, and the next. You still might never boil the ocean, but you’ll never run out of hot water. In this case, my customer had to do a lot of mapping before the value of writing small, meaningful queries became evident. Not everyone will be willing to make this sort of commitment.

Until now, this was the best we could do. The value of being able to write a terse, business-level query would have to amortize the effort needed to make the mappings. In many cases, this happened; I have been involved in a handful of successful enterprise knowledge graph projects that follow this pattern. But in general, it is a hard sell; the up-front investment of mapping effort is too steep a barrier for entry, limiting enterprise knowledge graphs to being a rich company’s game.

But LLMs could change all that. Since the output of gra.fo’s mapping interface is a standard, ChatGPT knows about it. So I wondered if I could use ChatGPT to automate this important step. The short answer is “yes”, but the details are pretty impressive; read on.

First, I had to familiarize ChatGPT with the model. This model is small enough to fit into a GPT3 prompt, so I asked ChatGPT to summarize it back to me. The prompt was simple:

Summarize the following ontology very briefly in English.

followed by the model itself.

Then I described the tables (the same ones you saw above) and asked for an R2RML mapping:

There are four tables in a database. Their names and columns are as follows:
configuration has columns customer,motherboard,memory,processor
board has columns sku,price
memory has columns partid, supplier,reseller,capacity,offer
processor has columns partnumber,speed,architecture,pricesheet
Please map these to the ontology listed above, and provide the result in R2RML presented using TTL.

I had to make some adjustments; the original R2RML didn’t recognize the foreign keys.

That was good, but do it again, and make sure to use a predicateobectmap [sic] when you refer to a foreign key.

Looking back, I think my prompt was a bit nonsensical, but it apparently figured out what I meant. I wonder if barking “Fix the foreign keys!” would have worked just as well.

After working through the query, I found that it hadn’t figured out that the column “offer” was a price (I did that on purpose; it isn’t at all obvious), so I had to tell it. I didn’t want the whole file back, so I asked it to show just the part that changed:

That’s almost right. Can you add in a line for the offer on the memory? That’s it’s price. Just show the triplesmap for the memory, no need to repeat the rest.

The result, which shouldn’t surprise you, is the R2RML file that I described in the second section of this blog.

But does it work? I dropped that R2RML file into data.world, which has a query virtualization layer that uses R2RML. The way it works is that if there is a file with the suffix “.r2rml” in scope, it will be used to respond to queries. Here is the query again, this time against the mapped graph:

SELECT ?customer (SUM(?much) AS ?total)
FROM NAMED :mapped
WHERE {
    GRAPH :mapped {
        ?s ex:hasComponent ?comp ;
           ex:hasname ?customer .
        ?comp ex:hasPrice ?much .
    }
}
GROUP BY ?customer

The results are just what you’d hope; a list of the customers, along with the cost of their computers, determined by the sum of the costs of the component parts. You can see and run the query here.

Impact

Why am I excited about this result? Let’s review: an enterprise knowledge graph can be built (and usually is) with a variation on a simple pattern; build (or find) a reference ontology, map that ontology onto the schema(s) of one or more of your datasets, express that mapping in an executable way (best is R2RML, since it is a standard), then use a virtualization engine to respond to business-level queries. This allows a query writer to express their needs without knowing the details of any of the original data sources. This is a bit of a “holy grail” for enterprise knowledge graphs; you can express your business rules and report requirements in business language, but have them run against your existing enterprise data.

The tough part of this story has always been those mappings. They have always required a familiarity with both the ontology and the data model. Expressing them in any language is fussy, because of all the details that have to be specified.

In this experiment, ChatGPT has shown that it can fill that gap, with very little extra guidance. This was done using GPT4; there are any number of reasons to believe that GPT5 will do even better.

Okay, so this is important for Enterprise Knowledge Graphs. But what does it mean for data management in general? Currently, it is common for an enterprise to have thousands or even tens of thousands of databases, each having some impact on the business. Integrating that data is a time-consuming and error-prone task; expensive projects are funded to make it happen. The assumption in a large data setting like this is that it is difficult to bring data together, so we expect it to stay separate.

If the promise shown in this experiment can be made into a scalable technology, this will turn around. The assumption will be that any datasets in the enterprise — or even, in the world — will be integrated in a meaningful way. This is good news for market researchers, scientists, product developers, and anyone who has a Big Data need; they no longer have to include a data harmonization effort into an analytic project. This is bad news for money launderers, who rely on the fact that it is difficult to track their movements as they go from one market to the next, changing names and account numbers. With fluid data integration, there will be nowhere to hide.

Not only will this change the way data managers, regulators, and researchers experience data; even people who have no professional awareness of data will experience the difference. People who don’t have data experience assume that an enterprise works with data the way they’d like a person to work with data: you tell it something, it remembers it, and it uses it whenever it is relevant. If I tell you my husband’s name is Tim, I expect you to remember that, and to refer to him by that name, even in a completely different context. But we’ve all experienced enterprise data integration failures: packages sent to the wrong address, even after you’ve made the change in ‘the system’; names misspelled just some of the time, as if one part of the enterprise doesn’t have access to the data another part has. If the data landscape changes in the way this experiment suggests, enterprises will behave the way we already expect them to: they will have their data integrated throughout, as if by magic.


Are there risks associated with this sort of smooth semantic data integration? Of course there are; any seriously advanced technology will have risks. Will bad actors have more access to data than they did before? Yes, they will. But good actors will have the same access; just as in the money laundering example above, it will be harder to hide. Smooth integration might actually make the world worse for bad actors.

Epilogue

You’ve stuck with this blog this far — bravo! But as I was finishing this up, I wondered, why on earth did I have to write the SPARQL query myself? Can’t ChatGPT do it?

I even decided to do it in the style of this blog — I gave ChatGPT my business rule, exactly as expressed above, and told it to write a SPARQL CONSTRUCT query that would compute the price. It actually had a false start, where it made a query that was too complex (but functionally correct). The query it came up with after my prodding was this:

CONSTRUCT {
    ?computer ex:hasPrice ?totalPrice .
}
WHERE {
    {
        SELECT ?computer (SUM(?price) AS ?totalPrice)
        WHERE {
            ?computer a ex:Computer .
            ?computer ex:hasComponent ?component .
            ?component ex:hasPrice ?price .
        }
        GROUP BY ?computer
    }
}

Sure looks good to me.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.