Linking data with ChatGPT

Dean Allemang
10 min read · Apr 14, 2023

Somewhere in my browsing today, I saw a post by Kingsley Uyi Idehen where he mentioned that the language capabilities of LLMs are making it a lot easier to produce semantic data. (Sorry, Kingsley, I think I was on the road when I read that — if you have the link, please provide it and I’ll update this.) I think this is true in more ways than he or I can imagine. Here’s one that I’ve been playing with.

I’m going to start off with a pretty cool way to form linked data from other data sources that has been around for a while, but then I’m going to show how to do it with ChatGPT.

“More like this”

I’m going to illustrate this with what is probably the most popular way to send data messages today, i.e., JSON. Here’s some JSON that I got from the FHIR standard, that describes a little tiny controlled vocabulary about “administrative gender”:

{
  "group" : [{
    "source" : "http://hl7.org/fhir/administrative-gender",
    "target" : "http://terminology.hl7.org/CodeSystem/v2-0001",
    "element" : [{
      "code" : "male",
      "target" : [{
        "code" : "M",
        "equivalence" : "equivalent"
      }]
    },
    {
      "code" : "female",
      "target" : [{
        "code" : "F",
        "equivalence" : "equivalent"
      }]
    },
    {
      "code" : "other",
      "target" : [{
        "code" : "A",
        "equivalence" : "narrower"
      },
      {
        "code" : "O",
        "equivalence" : "narrower"
      }]
    },
    {
      "code" : "unknown",
      "target" : [{
        "code" : "U",
        "equivalence" : "equivalent"
      }]
    }]
  }]
}

fig. 1. Some JSON in FHIR that describes a small controlled vocabulary for gender.

This structure says that we recognize four genders (administratively); they are called “male”, which has the equivalent form “M”; “female”, which has the equivalent form “F”; “other”, which has two narrower forms, “A” and “O”; and “unknown”, which has the equivalent form “U”. How do I know this? I made an educated guess, by looking at this structure and drawing on my background in managing controlled vocabularies. I could also look up documentation in FHIR that tells me these things.

If we want to process this data, we have a lot of options. If we keep it as JSON, we can write programs in Python, JavaScript, and lots of other programming languages. We can use GraphQL or jq or other means to process the JSON directly.

But for those of us who like graph query languages, something we sometimes do is to convert this into a graph. Let’s do this with just a tiny piece of this JSON document:

{
  "code" : "male",
  "target" : [{
    "code" : "M",
    "equivalence" : "equivalent"
  }]
}

We can show the graph representation of this in RDF (written in Turtle, a.k.a. TTL). There isn’t really a standard way to do this, but data.world has its way, which I have abbreviated here:

[ rdf:type dwj:Object ;
  j:code "male" ;
  j:target ( [ rdf:type dwj:Object ;
               j:code "M" ;
               j:equivalence "equivalent"
             ]
           )
]

This represents the JSON document with structural fidelity: there is an entity in the TTL of type Object that corresponds to the object in the JSON, and there are properties that correspond to the fields in the objects. When there are arrays in the JSON, those are rendered as RDF Lists in TTL (that’s what the parentheses are).

Here’s where “more like this” comes in. Suppose I want to find structures like the one shown here; I want to find “more like this”. That is, we want to pluck out patterns from this data of the form “‘male’ is ‘equivalent’ to ‘M’”. We see those three literals right there in the TTL above. To turn this into a “more like this” query, we just replace each of those literals with the variable we want to call that value, and put an appropriate SELECT around it:

SELECT ?mainCode ?relation ?subCode
WHERE {
  [ rdf:type dwj:Object ;
    j:code ?mainCode ;
    j:target ( [ rdf:type dwj:Object ;
                 j:code ?subCode ;
                 j:equivalence ?relation
               ]
             )
  ]
}

You can run the query here. There, I’m running it against the whole of the FHIR concept maps document, so there are hundreds of rows in the output. The rows that correspond to the input I quoted here pair “male” with “M”, “female” with “F”, and “unknown” with “U”, each with the relation “equivalent”.

You might notice that one row (or two, depending on how you count) is missing: the concept “other” and its two narrower terms, “O” and “A”. The reason is that “other” isn’t quite like “male”; it has two related terms, whereas the query is only looking for lists of length one. This is an example where just changing literals to variables isn’t quite what we mean by “like this”; what you really want is any member of that list, no matter how far into the list it is. There’s a way to adjust the query to account for this (and it’s pretty simple), but I’ll talk more about that when we get to the ChatGPT answer.
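To see why, it helps to spell out what the parentheses actually stand for. An RDF list is syntactic sugar for a chain of rdf:first/rdf:rest links, so the two-element target list for “other” looks roughly like this (my hand-expanded sketch, not output from data.world):

[ rdf:type dwj:Object ;
  j:code "other" ;
  j:target [ rdf:first [ rdf:type dwj:Object ;
                         j:code "A" ;
                         j:equivalence "narrower" ] ;
             rdf:rest  [ rdf:first [ rdf:type dwj:Object ;
                                     j:code "O" ;
                                     j:equivalence "narrower" ] ;
                         rdf:rest  rdf:nil ] ]
]

The ( [ … ] ) pattern in the query expands to exactly one rdf:first plus an rdf:rest of rdf:nil, so it only matches one-element lists, and “other” falls through. The fix is to walk the rdf:rest chain instead of matching the whole list, and we’ll see ChatGPT arrive at exactly that below.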

Listing out the concepts and their related concepts is useful for finding our way around the vocabulary, but a vocabulary like this is naturally a graph structure. We might want to represent it as such, by building it into a SKOS structure. We don’t have to change the pattern in the query, but we do have to figure out what the semantic relations are. The possible values in FHIR are “equivalent”, “narrower” and “wider”; these probably correspond to the SKOS semantic relations skos:exactMatch, skos:narrower and skos:broader respectively. We can code this into a CONSTRUCT query as follows:

CONSTRUCT {
  [] a skos:Concept ;
     skos:prefLabel ?mainCode ;
     ?rel [ a skos:Concept ;
            skos:prefLabel ?subCode
          ]
}
WHERE {
  [ rdf:type dwj:Object ;
    j:code ?mainCode ;
    j:target (
      [ rdf:type dwj:Object ;
        j:code ?subCode ;
        j:equivalence ?relation
      ]
    )
  ]
  BIND (
    IF (?relation = "equivalent", skos:exactMatch,
        IF (?relation = "narrower", skos:narrower, skos:broader)
    ) AS ?rel
  )
}
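For the “male” entry, this CONSTRUCT produces a little cluster of blank-node triples, roughly like this (my sketch of what the output should look like, not a capture from an actual run):

[] a skos:Concept ;
   skos:prefLabel "male" ;
   skos:exactMatch [ a skos:Concept ;
                     skos:prefLabel "M"
                   ] .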

I’ve chosen to use blank nodes for the SKOS concepts. That’s a pretty unusual thing to do, but I don’t want to spend this whole blog on refining this query; I’m going to let ChatGPT do that.

Using an LLM to generalize data

Writing a “more like this” query is pretty easy for a human to do in TTL/SPARQL, because the two languages are so similar in structure. So this should be an easy task for ChatGPT; but do we get any advantages when we do it that way?

I started off by giving ChatGPT the full TTL expansion of the FHIR JSON shown in fig. 1.

[
  rdf:type dwj:Object ;
  j:group (
    [
      rdf:type dwj:Object ;
      j:element (
        [
          rdf:type dwj:Object ;
          j:code "male" ;
          j:target (
            [
              rdf:type dwj:Object ;
              j:code "M" ;
              j:equivalence "equivalent"
            ]
          )
        ]
      ) ;
      j:source "http://hl7.org/fhir/administrative-gender" ;
      j:target "http://terminology.hl7.org/CodeSystem/v2-0001"
    ]
  )
]

fig. 2. The data from fig. 1 represented in TTL, according to the conversion used by data.world.

I’m just showing the part for “M”, because that’s enough to show what I mean by “more like this”. I started out by giving this structure to ChatGPT 4, and asked it to do the same generalization we did above, that is, to pick out the main code, the related code, and the name of the type of the equivalence. I used those words; there’s nothing in the input to say what I mean by “main code” or “related code”. The only word I used that shows up in the input is “equivalence”. Its first effort is interesting for a few reasons.

?mainElement rdf:type j:Object ;
    d:element ?mainCodeObject ;
    j:containedIn :file-gender.json .

?mainCodeObject rdf:type j:Object ;
    d:code ?mainCode ;
    d:target ?relatedCodeObject .

?relatedCodeObject rdf:type j:Object ;
    d:code ?relatedCode ;
    d:equivalence ?equivalenceType .

First off, it didn’t use the original structure of the input (so it didn’t just substitute variables, like I did). But second, it simply avoided the issue of lists entirely; the query won’t match any of the terms in the original input, including the example term (“male”).

I told it to take lists into account. It fixed it, including the issue of finding any element in the list:

?mainElement rdf:type j:Object ;
    d:element/rdf:rest*/rdf:first ?mainCodeObject .

?mainCodeObject rdf:type j:Object ;
    d:code ?mainCode ;
    d:target/rdf:rest*/rdf:first ?relatedCodeObject .

?relatedCodeObject rdf:type j:Object ;
    d:code ?relatedCode ;
    d:equivalence ?equivalenceType .

This is the solution that we can’t get by just substituting values. It doesn’t use the ( ) notation for lists at all; ChatGPT broke the list into rdf:first and rdf:rest. This allowed it to put a Kleene star (*) on the rdf:rest, which lets the query search into the whole list. This is the part that I said I would talk about later; it’s a bit difficult to explain why this is necessary in the substitution solution, but ChatGPT got it right, as soon as I told it about lists.
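For reference, the same trick can be folded back into the original “more like this” query as a SPARQL property path. This is my own adaptation (keeping the j: prefixes from the data.world conversion, where ChatGPT invented its own d: prefixes), not ChatGPT’s output:

SELECT ?mainCode ?relation ?subCode
WHERE {
  [ rdf:type dwj:Object ;
    j:code ?mainCode ;
    j:target/rdf:rest*/rdf:first [ rdf:type dwj:Object ;
                                   j:code ?subCode ;
                                   j:equivalence ?relation
                                 ]
  ]
}

With this version, “other” shows up too, once for each of its narrower codes.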

I thought it would be fun to tell ChatGPT to draw some pictures of this. I have heard of a program called Cytoscape for laying out graph displays. I’ve always wanted to learn how to use it, but it has never been a priority for me. I still don’t know how to use it, but ChatGPT does. So I asked it to write a SPARQL query to turn the JSON (represented in TTL) into Cytoscape.¹ After some format tweaking (I still don’t know anything about Cytoscape formatting), I got some nice diagrams that represent the clusters of terms in the original FHIR JSON. Here are some images from them:

“female” is equivalent to “F”
“other” has narrower terms “O”,“A” and “UN”. “O” is equivalent to “work”, which is equivalent to “WP” and “WPN”.

These are nice diagrams, but they don’t really make a lot of sense. Why is “work” connected to “other”, when “other” refers to an administrative gender? The problem here is that this conversion is just using the codes to identify the nodes. The code “O” is used in a number of places, and the diagram links them all together.
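To make that concrete: one common format Cytoscape can read is the Cytoscape.js “elements” JSON (I’m assuming a format here, since I’m not showing the query ChatGPT generated), in which nodes and edges are identified purely by their data fields:

{
  "elements": {
    "nodes": [
      { "data": { "id": "male" } },
      { "data": { "id": "M" } }
    ],
    "edges": [
      { "data": { "source": "male", "target": "M", "label": "equivalent" } }
    ]
  }
}

If the node id is just the code string “O”, every occurrence of “O” collapses into a single node, no matter which code system it came from.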

This is a classic case of “strings” vs “things”; the letter “O” is a string, not a thing. But if we look at the original FHIR document, we see that there is context information to disambiguate these things. Let’s use this to have ChatGPT generate a SKOS representation, and use this context to coin IRIs for the “things” that are the concepts. Here is the query that ChatGPT came up with:

CONSTRUCT {
  ?mainConcept a skos:Concept ;
      ?skosRelation ?relatedConcept ;
      skos:prefLabel ?mainCode .
  ?relatedConcept a skos:Concept ;
      skos:prefLabel ?relatedCode .
}
WHERE {
  ?object rdf:type j:Object ;
      d:group/rdf:rest*/rdf:first ?groupObject .

  ?groupObject rdf:type j:Object ;
      d:element/rdf:rest*/rdf:first ?mainCodeObject ;
      d:target ?baseURI .

  ?mainCodeObject rdf:type j:Object ;
      d:code ?mainCode ;
      d:target/rdf:rest*/rdf:first ?relatedCodeObject .

  ?relatedCodeObject rdf:type j:Object ;
      d:code ?relatedCode ;
      d:equivalence ?equivalenceType .

  BIND (IRI(CONCAT(?baseURI, "/", ?mainCode)) AS ?mainConcept)
  BIND (IRI(CONCAT(?baseURI, "/", ?relatedCode)) AS ?relatedConcept)

  FILTER(?equivalenceType != "")
  BIND (IF(?equivalenceType = "equivalent", skos:exactMatch,
           IF(?equivalenceType = "wider", skos:broader, skos:narrower)) AS ?skosRelation)
}

There are a lot of cool things going on here. I told it where to find the base URI (in the “target” field), but not what to call it. It figured out how to coin the IRIs (it got it wrong the first time; it tried to find another base IRI for the related object; I told it that they would be the same). But more interestingly, I didn’t tell it anything about the SKOS semantic relations; it figured out the correspondence on its own (I did tell it the possible values from FHIR, since it isn’t looking at the data).
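To make the “things” concrete: for the “male” entry, the constructed SKOS comes out roughly like this (my sketch of the expected output, built from the query and the data above, not a capture of an actual run):

<http://terminology.hl7.org/CodeSystem/v2-0001/male> a skos:Concept ;
    skos:prefLabel "male" ;
    skos:exactMatch <http://terminology.hl7.org/CodeSystem/v2-0001/M> .

<http://terminology.hl7.org/CodeSystem/v2-0001/M> a skos:Concept ;
    skos:prefLabel "M" .

Because each group contributes its own base URI, the “O” in the gender group and the “O” in any other group get different IRIs, which is what makes the later diagrams come out right.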

The result is a plain old SKOS file. So I asked ChatGPT to write a query to turn SKOS into Cytoscape.² Here are some images from that:

“male” is an exact match for “M”
other has two narrower terms, “A” and “O”
“work” has an exact match of “O”. Not the same “O” as for “other” in gender.

Conclusions

There are a number of things to take away from this blog.

  1. You can write some pretty complicated SPARQL queries without having to think really hard; just take some sample of the data, and generalize it by replacing constants with variables. This works for constructing new things, too; you can draw out the source pattern and the target pattern, using the same variable names, and SPARQL will convert your input to the output.
  2. You don’t have to write complicated SPARQL if you don’t want to. You can just have ChatGPT do it for you.
  3. You can have ChatGPT tell Cytoscape how to draw pictures, but it’s still up to you to understand things vs. strings; if you just use the strings to identify the nodes, you can get garbage.
  4. SKOS (and other linked data standards) can really help with things vs. strings. The IRI in the SKOS representation includes the context needed to disambiguate one entity from the other.

This last point is an example of the dynamic I described in Figuring Out vs Telling; the FHIR format gives us enough information in the context to figure out which “O” belongs where (and that they are different!), but SKOS lets us just tell our data consumers that, without making them figure it out again. This helps make the data more FAIR; it can interoperate more easily, without making someone figure it out.

In this example, we had ChatGPT figure it out, and then use SKOS to tell the next user (who might also be ChatGPT!) what we learned.

  1. I’m not going to dissect that query, but if you want to see it, it’s called JSON2Cytoscape.
  2. I’m not going to dissect that query; but it’s called Skos2Cytoscape.

Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.