Like Sugar and Salt

Dean Allemang
Mar 30, 2023 · 6 min read

When I started this blog, I had a bunch of topics in mind about Semantic Web, RDF/SPARQL, OWL, semantic modeling, etc. I had planned on trotting one out each week and writing about it. I have quite a backlog of topics, so I figured I’d be good for a while.

But nowadays, it seems that if I don’t talk about ChatGPT or LLMs or something about the AI revolution that is going on, then it just can’t be topical. So for this week’s blog, I have decided to do both: take one of my backlog topics, but talk about how ChatGPT changes the landscape.

I’m going to introduce this topic with a story; it was the early days of the Semantic Web (probably something like 2003). The company I was working for at the time (TopQuadrant) had scored a consulting gig with the US government to build an ontology from the Federal Enterprise Architecture Reference Model. A lot has changed in the FEA since those days, but back then, as now, there were five component models: Business, Performance, Service, Data and Technical. Our client, in the GSA, had contracted us to deliver this ontology as an RDF file. But in our work, we saw that the five models were specified differently, were governed differently, and we processed them differently. So we produced not one, but five RDF files, one for each model.

It was the early days of RDF, but we already understood that it was an easy matter to merge RDF files together, like you can see in the animations in an earlier blog. What isn’t so easy is to take a single file with a bunch of RDF triples and split it into several files. You effectively have to decide where each triple goes; in principle, one micro-decision for each triple. Hence the title of this blog: just as it is easy to mix salt and sugar together, separating them out again, grain by grain, is theoretically possible but far from easy. It’s a much better plan to store your salt and sugar separately, and mix them just as needed.

Picking salt from a pile of mixed salt and sugar is a painstaking task.

That’s what we had done with the FEA; we had five RDF files, which were governed separately, and which we could combine as needed. But our customer was adamant; he had paid for one RDF file, not five. He wanted his RDF file!

This story took place not only early in the days of RDF, but early enough in the days of Java that there was a new version of Java out every few months. It was not uncommon for software to be delivered with its own Java runtime environment, because it was tricky to know which versions your software was compatible with. So while Jena was already around in those days, it wasn’t the easy install it is today, and we didn’t all just have it on our desktops. Of course, combining these five files into one would be a simple command-line invocation in Jena, but I didn’t have Jena at my fingertips. Instead, I had a system called RDF Gateway, which really was the quickest way to Hello Semantic Web in those days (at least, on Windows).

So while my boss tried to convince the customer that he really wanted five RDF files, not just one, I wrote a little six-line script in RDF Gateway; read, read, read, read, read, write. As my boss was struggling with her plea, I interrupted her, and instructed the customer to check his email for the single RDF file he asked for. End of discussion.
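That six-line script is long gone, and RDF Gateway with it, but the whole trick amounts to reading several files into one graph and writing the result back out. Here is a minimal Python sketch of the same read, read, write pattern over N-Triples, where each non-comment line is one complete triple, so a merge is just a set union of lines (the original script, of course, was RDF Gateway, not Python):

```python
def merge_ntriples(paths, out_path):
    """Merge several N-Triples files into one, deduplicating triples.

    In N-Triples, each non-blank, non-comment line is exactly one
    triple, so merging files reduces to a set union of lines.
    (Turtle, with its prefixes and multi-line statements, would need
    a real parser.)
    """
    triples = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    triples.add(line)
    with open(out_path, "w", encoding="utf-8") as out:
        for triple in sorted(triples):
            out.write(triple + "\n")
    return len(triples)
```

The key asymmetry is that this direction needs no decisions at all: every triple from every input goes into the output.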

The point of this story is to bring home why it is common to separate an ontology into several pieces: what I have come to call an Ontology Architecture. Just as it is common practice to separate executable code into modules that are combined at build time, it is best practice to modularize ontologies. It is easier to combine them than it is to separate them out.

But sometimes you go into a project where this foresight wasn’t present from the start. You’ll find that someone has delivered a single RDF file with several ontologies in it. Maybe they were accustomed to customers like the one we had. But what you want, for future development, maintenance and governance, is an Ontology Architecture; several ontologies with dependencies expressed in the triples themselves (usually through owl:imports triples). So you are faced with the task of separating the sugar from the salt.
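For illustration, the dependencies in such an architecture might look like this in Turtle (the ontology IRIs here are invented for the example):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://example.org/ontology/business>
    a owl:Ontology ;
    owl:imports <http://example.org/ontology/performance> ,
                <http://example.org/ontology/data> .
```

Each module declares what it depends on, in triples, so any tool that loads the business ontology knows to pull in the others.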

TopQuadrant had some excellent tools for doing this; you still have to pore over all the files and effectively consider where every triple belongs, but you can do big chunks of the work wholesale: the tooling gives you a UI for sorting out the triples you need and working with them as a group. It is still quite a tedious task that takes a good deal of concentration and expertise. I’m not sure whether I am proud or embarrassed to say that I am pretty good at it.

But it seemed to me that this is exactly the sort of task that ChatGPT might be good at. Give it the single ontology file, and ask it to pick out all the triples that have to do with a certain namespace. Then do the same for each namespace. Bonus points if it recognizes that some of these ontologies are versions of standard ones. So I gave it a try.

I started off with an ontology that I had had to factor a long time ago; it wasn’t very big, about 800 lines of Turtle. I tried sending that in a prompt to ChatGPT (using GPT-4). It spat it back out at me for being too large. This is a problem that I expect will be ubiquitous with GPT applications: how to manage the model’s attention when it can only handle prompts of a limited size.

The method I tried was to turn it around. I started by describing the task I wanted to do; to filter out all the triples except the ones that pertained to a particular ontology. I was lucky in that the originator of this file had at least managed namespaces well; you could tell which ontology a resource belonged with by its IRI. Then I told it that I was going to feed it the file piece by piece.
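As a sketch of the filtering rule itself (not what ChatGPT did, just its deterministic equivalent), here is roughly what I was asking it to apply, assuming one triple per line as in N-Triples, and namespace-clean subject IRIs, both simplifications of the real Turtle file:

```python
def filter_by_namespace(ntriples_lines, namespace):
    """Keep triples whose subject IRI falls in the given namespace.

    Assumes one triple per line (N-Triples) and that the subject
    alone determines which ontology a triple belongs to -- a
    simplification, but it matches a file whose IRIs are
    namespace-clean.
    """
    kept = []
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject = line.split(None, 1)[0]  # first token, e.g. <http://...>
        if subject.startswith("<" + namespace):
            kept.append(line)
    return kept
```

Written out like this, it is clear the task needs no knowledge beyond the line at hand, which is why streaming the file to the model in pieces seemed plausible.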

Since this filtering doesn’t really require knowledge of the whole ontology to figure out which bin a particular definition belongs to, streaming like this should work fine. And indeed, it did, for the first two tranches. ChatGPT was able to produce TTL that satisfied the requirements.

But at the third tranche, it forgot the instructions. I guess they were too far back in its prompt history. So I repeated them, and tried to pick up where I left off.

Then I ran into the issue that for one of the tranches, the output was too big; it ran into a network error before finishing. So I had to make smaller tranches. Then, to get around the instruction-forgetting problem, I repeated the instructions (which I managed to make pretty terse) at the start of each prompt.
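The chunking strategy I ended up with can be sketched like so: a naive splitter that breaks the file at blank lines (so multi-line statements mostly stay together, since Turtle statements rarely contain blank lines) and repeats the terse instructions at the top of every tranche:

```python
def make_tranches(turtle_text, max_chars, instructions):
    """Split a Turtle file into prompt-sized tranches at blank lines,
    prefixing each tranche with the instructions so the model never
    loses them, no matter how far back the conversation goes."""
    blocks = turtle_text.split("\n\n")
    tranches, current = [], ""
    for block in blocks:
        candidate = (current + "\n\n" + block).strip()
        if current and len(candidate) > max_chars:
            tranches.append(instructions + "\n\n" + current)
            current = block
        else:
            current = candidate
    if current:
        tranches.append(instructions + "\n\n" + current)
    return tranches
```

The max_chars budget has to leave room for the instructions and for the model’s response, which is what bit me when one tranche’s output overran.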

This eventually got me what I needed: my first file, filtered down to a single ontology.

But there were still some flies in the ointment. Quick inspection showed that one of the resources that should have been included was missing. I asked ChatGPT to include it, which it cheerfully did. But that left me with little confidence in the completeness of my result.

To finish the ontology architecture, I would have to do this for each ontology embedded in the original file. Then I guess I could check that every triple was accounted for, and none accounted for twice. To its credit, ChatGPT was able to identify a list of the ontologies in the original file. Except it missed one.
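That bookkeeping is mechanical, and it is exactly the sort of thing I would rather do deterministically than trust to a chat transcript. A sketch of the check, treating each filtered file as a set of triples:

```python
def check_partition(original, parts):
    """Check that the filtered files form a clean partition of the
    original file: every triple lands in exactly one part, nothing
    is dropped, and nothing appears in two parts."""
    original = set(original)
    seen, duplicated = set(), set()
    for part in parts:
        part = set(part)
        duplicated |= seen & part   # triples filed twice
        seen |= part
    return {
        "missing": original - seen,     # triples no part claimed
        "duplicated": duplicated,
        "extra": seen - original,       # triples not in the original
    }
```

An empty result in all three buckets is what a successful sugar-from-salt separation looks like; the "extra" bucket would also have caught any hallucinated triples.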

I’m afraid I have to declare ChatGPT’s foray into this problem a failure. The amount of effort it takes to get it to do what I need is not so different from what I would expend doing this with a good ontology management tool like TopBraid Composer. And the confidence I would have in the result would be higher.

I guess none of this comes as a surprise; I turned ChatGPT loose on a task that requires attention to detail and concentration, and it did a great job, or at least it said it did. But closer inspection turned up the gaps. At least it didn’t hallucinate; I didn’t find any new resources that weren’t in the original at all. But the results were not ready for prime time; it left things out, it needed constant reminding, and managing the prompts was more trouble than the original task.

I guess this part of my job is still secure.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.