Make your Knowledge Graph SPARQL with Codex

Conal McLaughlin · IOTICS
Jul 26, 2022 · 8 min read

There are many blog posts that try to answer the question: but, can GPT-3 do this?

So here is my question: Can GPT-3/Codex transform natural language into a valid SPARQL statement?

In the graph store world there exists a number of query languages: some built for specific graph databases; others more widely adopted and specification driven. SPARQL is one of the latter. Hailing from a W3C standard, it is now at version 1.1 and is the primary query language used with the likes of Apache Jena Fuseki, Oxigraph, Stardog & Blazegraph.

It is less widely used than its SQL counterpart, owing to the greater deployment of relational databases and the more specialist use cases for graph stores, though this ratio is changing.

At IOTICS, we are building a distributed knowledge graph which has its foundations in the ideas of FAIR data, digital twins, selective data sharing, and sovereign identity, using Oxigraph and Apache Jena Fuseki as our primary data stores.

GPT-3 and its offspring

Codex and GPT-3 are a couple of examples of the recent explosion in neural-network machine learning models.

“The Codex models are descendants of our GPT-3 models that can understand and generate code. Their training data contains both natural language and billions of lines of public code from GitHub.” — openai.com

GPT-3 is, more specifically, trained for language prediction, and Codex for code completion. Others, such as the more recent DALL-E and DALL-E 2, are trained for text-to-image generation.

You can sign up to join the OpenAI Codex Private Beta here.

I’ve already been accepted, so let’s crack on!

Getting started

There is a myriad of articles and blog posts doing deeper dives into the machine-learning side: the available model parameters and how they alter the outcome under the hood.

For this article I will be keeping it quite high level and investigating what the language model can enable us to do with SPARQL.

Basics

So now that you have an account and have access to OpenAI’s Codex model, a good starting point is the OpenAI documentation.

The introduction provides a good overview on the key concepts and ideas such as `tokens` and the different `models` used.

All of the functionality we’re talking about here can be accessed via the `Completions API`, which has an extensive reference covering a whole range of advanced features. An API key can be requested in your OpenAI profile.

But, thankfully, OpenAI has provided a lovely playground for us to hammer out our cursed, hard-to-compute, natural-language prompts.

Model

Within the playground, you will have fingertip access to all of the GPT-3 based models.

There are two Codex-specific models available: code-cushman-001, which is tuned for speed and low-latency applications, and code-davinci-002, the most capable. Since we don’t need low latency right now, we’ll stick with code-davinci-002.

Temperature

You will notice the abundance of tweakable settings down the right hand side of your playground.

The temperature slider ranges from 0 to 1 and controls the randomness of the generated output.

You might think: in this context, why would I want randomness in the result? Well, I don’t, so for now I’m going to set it to 0.

It could be useful in other scenarios though, for instance, getting GPT-3 to generate cocktail recipes...
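With the model and settings chosen, everything the playground does corresponds to a plain Completions request. Here is a minimal sketch in Python of what that payload might look like; the prompt text is illustrative, `max_tokens` is an assumed budget, and the live call via the `openai` library is commented out since it needs an API key:

```python
# The playground settings expressed as a Completions request payload.
# The live call (commented out) would require an OpenAI API key.

prompt = (
    "# A SPARQL query to select all electric vehicle charge points "
    "in London that are not currently in use.\n"
)

request = {
    "model": "code-davinci-002",  # the Codex model chosen above
    "prompt": prompt,
    "temperature": 0,             # no randomness in the output
    "max_tokens": 256,            # an assumed completion budget
}

# import openai
# openai.api_key = "sk-..."
# response = openai.Completion.create(**request)
# print(response["choices"][0]["text"])

print(request["model"])
```

Every knob in the playground sidebar has a counterpart parameter in this payload, which makes it straightforward to move from interactive experiments to code.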

Get to the punchline already

Yes, that was a cocktail pun…

Ok, so a graph store usually holds data that would be difficult or expensive (in terms of compute, which equates to £££) to query if it were in a traditional SQL database.

Relational databases are very good at relating data but, perversely, may not perform very well when querying those relationships at scale and having to join on a large number of tables.

“Graph solutions are focused on highly-connected data that comes with an intrinsic need for relationship analysis” — https://memgraph.com/blog/graph-database-vs-relational-database

Take the query:

“give me all the EV charge points in London that are not currently in use and that have a Starbucks within 1 kilometre with cold brew on the menu”.

In a relational database, this may require JOINs across several tables as well as some kind of geospatial plugin. As I said earlier, this is what graph stores excel at: traversing links in the data and, owing to the underlying structure of the data, avoiding a performance meltdown when unbounded relationship hops are required.

For the purposes of helping out Mr. code-davinci-002, let’s simplify it a bit though:

“give me all the EV charge points in London that are not currently in use”.

Let’s give this a go…

Prompt 1:

A SPARQL query to give me all the EV charge points in London that are not currently in use.

Generation:

I am trying to write a SPARQL query that will give me all the EV charge points in London that are not currently in use. I have tried the following query:
<code>SELECT ?s ?p ?o
WHERE {
?s ?p ?o .
?s a <http://vocab.org/transit/terms/Station> .
?s <http://vocab.org/transit/terms/stop> ?stop .
?stop <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat .
?stop <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
FILTER (?lat > 51.5 && ?lat < 51.6 && ?long > -0.2 && ?long < 0.1)
}
</code>

So we have some text output, and it’s been wrapped in `<code>` tags. I don’t think it has understood the acronym ‘EV’, either.

This doesn’t look like much use so we’ll try again…

Prompt 2:

Here I’m formatting my code prompt as a comment to feed to the model and expanding the acronym.

# A SPARQL query to select all electronic vehicle charge points in London that are not currently in use.

Generation:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dcatapit: <http://dati.gov.it/onto/dcatapit#>
PREFIX adms: <http://www.w3.org/ns/adms#>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX schema: <http://schema.org/>
PREFIX dctype: <http://purl.org/dc/dcmitype/>
PREFIX void: <http://rdfs.org/ns/void#>
...

I had to cancel the query in the playground this time as it was spitting out endless prefixes.

Prompt 3:

Ok, since it doesn’t seem like the model can produce sensible prefixes for our query, I wonder if we can provide the prefixes we want, up front, to produce a more sensible output?

I’m working in a Python codebase, so I’m going to try to help it out even further with triple quotes. I can do this via the `Inject Start Text` field in the playground.

This might come in useful later…

# Using namespace prefixes:
# PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# PREFIX owl: <http://www.w3.org/2002/07/owl#>
# PREFIX schema: <http://schema.org/>
# PREFIX foaf: <http://xmlns.com/foaf/0.1/>
#
# A SPARQL query to select all electronic vehicle charge points in London that are not currently in use.

Generation:

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?lat ?long ?status
WHERE {
?chargepoint rdf:type schema:ElectricVehicleChargingStation .
?chargepoint schema:name ?name .
?chargepoint schema:geo ?geo .
?geo schema:latitude ?lat .
?geo schema:longitude ?long .
?chargepoint schema:status ?status .
FILTER (?status = "NotInUse") .
FILTER (?lat > 51.5) .
FILTER (?lat < 51.6) .
FILTER (?long > -0.2) .
FILTER (?long < 0.2) .
}
"""
# Execute the query and convert the results to a Pandas DataFrame
df = g.query(query).to_dataframe()
# Print the DataFrame
print(df)
# Plot the DataFrame
df.plot(kind="scatter", x="long", y="lat", alpha=0.4, figsize=(10,7), c="status", cmap=plt.get_cmap("jet"), colorbar=True, sharex=False)
plt.legend()
plt.show()
# Save the DataFrame to a CSV file
df.to_csv("data/chargepoints.csv")

Nice! Although I don’t really want all the extra Python code; I’m just interested in the SPARQL query.

We might need to look at stop sequences. These are tokens which, when generated by the Codex model, will tell it to stop producing more output.
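The behaviour is easy to picture: generation halts as soon as the stop sequence would be emitted, and the stop text itself is not returned. A toy simulation of that behaviour (my own sketch, not the actual API implementation):

```python
def apply_stop(generated: str, stop: str) -> str:
    """Truncate generated text at the first stop sequence (exclusive),
    mimicking the Completions API: the stop text itself is not returned."""
    idx = generated.find(stop)
    return generated if idx == -1 else generated[:idx]

# Simulated completion: the SPARQL body, the closing triple quote,
# then trailing Python we don't want.
raw = 'SELECT ?s WHERE { ?s ?p ?o . }\n"""\nprint(query)'
print(apply_stop(raw, '"""'))
```

Everything from the first `"""` onwards is dropped, which is exactly how we will trim the extra Python off the end of the generation.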

Prompt 4:

I’m going to add a triple quote stop sequence token in the playground settings:

"""

Stop sequences are not returned in the output, so stopping on the triple quote means the closing `"""` is dropped while the query’s closing brace is retained.

# Using namespace prefixes:
# PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# PREFIX owl: <http://www.w3.org/2002/07/owl#>
# PREFIX schema: <http://schema.org/>
# PREFIX foaf: <http://xmlns.com/foaf/0.1/>
#
# A SPARQL query to select all electronic vehicle charge points in London that are not currently in use.

Generation:

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?lat ?long ?status
WHERE {
?chargepoint rdf:type schema:ElectricVehicleChargingStation .
?chargepoint schema:name ?name .
?chargepoint schema:geo ?geo .
?geo schema:latitude ?lat .
?geo schema:longitude ?long .
?chargepoint schema:status ?status .
FILTER (?status = "NotInUse") .
FILTER (?lat > 51.5) .
FILTER (?lat < 51.6) .
FILTER (?long > -0.2) .
FILTER (?long < 0.2) .
}

Much better. This looks sensible and should give us exactly what we asked for!

We’ve used the prompt to tell the model which ontology definitions we’re interested in using AND we were able to get it to stop generation of further output where we wanted it.

Using an online SPARQL validator, we can confirm the validity of the output ✅
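An online validator (or a full SPARQL parser) is the right tool for this, but for quick programmatic checks even a crude sketch catches the most common failure mode of generation, truncated output. This is my own rough heuristic, emphatically not a grammar-level validation:

```python
def looks_like_select_query(query: str) -> bool:
    """Crude sanity check for generated SPARQL. Not a real parser,
    just enough to catch obviously truncated or non-SELECT output."""
    upper = query.upper()
    braces_balanced = query.count("{") == query.count("}")
    return "SELECT" in upper and "WHERE" in upper and braces_balanced

good = "SELECT ?name WHERE { ?s schema:name ?name . }"
truncated = "SELECT ?name WHERE { ?s schema:name ?name ."
print(looks_like_select_query(good), looks_like_select_query(truncated))
```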

There is, however, a problem…☠️

Sadly, the `ElectricVehicleChargingStation` ontology definition does not exist on schema.org.

If we had a specific definition of an electric vehicle we wished to use, we could try feeding it into the model via the prompt.
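That prompt assembly is mechanical enough to script. A hypothetical helper (the function and its signature are my own, not part of any API) that produces comment-formatted prompts of the shape used throughout this article:

```python
def build_prompt(prefixes: dict, request: str, hints=()) -> str:
    """Assemble a comment-formatted Codex prompt: known namespace
    prefixes first, optional term hints, then the natural-language
    request. (A hypothetical helper, not part of the OpenAI API.)"""
    lines = ["# Using namespace prefixes:"]
    for name, iri in prefixes.items():
        lines.append(f"# PREFIX {name}: <{iri}>")
    lines.append("#")
    for hint in hints:
        lines.append(f"# {hint}")
        lines.append("#")
    lines.append(f"# {request}")
    return "\n".join(lines)

prompt = build_prompt(
    {"schema": "http://schema.org/",
     "iotics": "https://data.iotics.com/iotics/"},
    "A SPARQL query to select all vehicle charge points in London "
    "that are not currently in use.",
    hints=["Using iotics:ElectricVehicle to select vehicle"],
)
print(prompt)
```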

One more time?

Prompt 5:

# Using namespace prefixes:
# PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# PREFIX owl: <http://www.w3.org/2002/07/owl#>
# PREFIX schema: <http://schema.org/>
# PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# PREFIX iotics: <https://data.iotics.com/iotics/>
#
# Using iotics:ElectricVehicle to select vehicle
#
# A SPARQL query to select all vehicle charge points in London that are not currently in use.

Generation:

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX iotics: <https://data.iotics.com/iotics/>
SELECT ?name ?location ?status
WHERE {
?s rdf:type iotics:ElectricVehicle .
?s schema:name ?name .
?s schema:location ?location .
?s iotics:status ?status .
FILTER (?status = "available")
FILTER (?location = "London")
}

Eureka! I’ve inserted the term I wish to use, iotics:ElectricVehicle, and pointed the model at a custom IOTICS ontology definition.

Adding extra material to the prompt allows me to disambiguate the output but there is a cost to this.

What use does this ability have if I have to precompute some set of definitions for the prompt to achieve my desired output?

Maybe this is really a multi-step process?

  1. Make use of traditional natural language processing (NLP) to describe the given sentence;
  2. Search ontologies for the nouns;
  3. Insert matches into the prompt;
  4. Profit 🚀
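The steps above can be sketched as a toy pipeline. Here the “NLP” is a naive word split and `ONTOLOGY_INDEX` is a hand-rolled stand-in for real ontology search; both would be replaced by proper tooling:

```python
# Toy sketch of the multi-step idea: extract candidate nouns from the
# request, look them up in a (hypothetical) ontology index, and inject
# the matches into the prompt as comments.

ONTOLOGY_INDEX = {  # hypothetical term index
    "vehicle": "iotics:ElectricVehicle",
    "chargepoint": "schema:ElectricVehicleChargingStation",
}

def find_terms(sentence: str) -> list:
    """Naive noun lookup: a stand-in for real NLP plus ontology search."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [ONTOLOGY_INDEX[w] for w in words if w in ONTOLOGY_INDEX]

def build_codex_prompt(sentence: str) -> str:
    """Inject matched ontology terms into a comment-formatted prompt."""
    lines = [f"# Using {term}" for term in find_terms(sentence)]
    lines.append(f"# A SPARQL query: {sentence}")
    return "\n".join(lines)

print(build_codex_prompt("select every vehicle chargepoint in London"))
```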

Wrapping up

Codex is an awesome tool. Even though it’s not advertised, it seems to generate SPARQL pretty damn well.

But your mileage may vary…

How well the output fits your own dataset is a concern. Any string literal generated by the model, for instance, could make or break whether the query matches your data.

It takes a lot of massaging 💆🏽‍♀️ to strike a balance and get the desired output so, for this reason, I think it’s more of a useful learning tool if you’re new to SPARQL.

Before Codex, this task would have required developing a custom syntax and parser, which would restrict, to some extent, my ability to use natural language to describe the query, never mind the development cost.

I wanted to transform natural language into a SPARQL query and in this brave, new, artificially-intelligent world, I was able to achieve my goal (albeit using a bit of my own biological intelligence, too).
