Using OntoRefine to Transform Tabular Data into Linked Data
A Quick Introductory Tutorial to OntoRefine in GraphDB
A few months ago I wrote a tutorial on transforming tabular data into linked data using OpenRefine with an RDF plugin. For users that like interfaces I would recommend that method but, if you are comfortable with SPARQL queries, here is a more query based approach.
I have chosen a more simple dataset for this tutorial that you can download from Kaggle here. The final transformed dataset can be downloaded from my GitHub repository.
Getting Started:
OntoRefine is a data transformation tool, based on OpenRefine, that is integrated into GraphDB by OntoText. Hovering over the info button you can read a little introduction to OntoRefine.
To begin we spin up an instance of GraphDB and head to the Import > Tabular (OntoRefine) tab.
On this screen you can import your data from your computer, a URL or your clipboard. Assuming you have downloaded the CSV from Kaggle, choose the file from your computer.
If you have any experience with OpenRefine, it is instantly clear that OntoRefine is based on it and you should have no problem manipulating your data.
This dataset is separated by commas and the first row contains column headings so we do not need to change the default settings. If transforming a different dataset you can edit these settings as required before changing the project name and clicking “Create Project”.
Again, if there is any functionality that I do not cover in this article that you question then it will likely be in the OpenRefine documentation.
Cleaning the Data:
In this screen you can manipulate your data as you need before loading into your graph. If you click “SPARQL” in the top-right of the screen then you will be brought to a screen from which you can query your dataset.
From this query panel you can explore the graph representation of your tabular data. We will create a CONSTRUCT
query to design the graph we want to store in our triplestore but first we need to manipulate the data a little.
First, for ease I have renamed some columns so that none of them contain spaces.
This is not needed but I like to do it for ease in querying later.
This dataset contains dashes in place of missing string data and “-90.0” in place of missing numerical data. Investigating, this dataset was scraped from the Superhero Database and data is missing there. For example, here is Abraxas.
To prepare these gaps for our final graph, we should replace these values. I therefore replace dashes with “unknown” so that superheroes are connected to unknown entities when a piece of information is unknown.
First select a column that you wish to transform as above.
Then type in a Google Refine Expression Language (GREL) expression to edit that column. In this case we want to replace dashes with the string “unknown” so use the following GREL expression:
value.replace("-", "unknown")
With the numerical columns I have replaced -99.0 with 0.0:
value.replace("-99.0", "0.0")
Once this is done for each column, we are ready to create our CONSTRUCT
query. In case you made a mistake at any point, you can see what you have done and undo steps with the “Undo / Redo” tab on the left of the data.
Constructing our Knowledge Graph:
Switch back to the SPARQL panel by clicking “SPARQL” in the top-right of the interface.
I use a CONSTRUCT query here. If you are new to SPARQL entirely then I recommend reading my tutorial on Constructing SPARQL Queries first. I then wrote a second tutorial, which covers constructs, called Constructing More Advanced SPARQL Queries for those that need.
I can’t show the whole query in the screenshot but this is where you should put the following query:
Here is the full query from the screenshot above (I’ll go through it below):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX spif: <http://spinrdf.org/spif#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>CONSTRUCT {
?hero rdfs:label ?name ;
foaf:gender ?gender ;
dbo:eyeColor ?eyecolor ;
dbp:species ?raceiri ;
dbo:hairColor ?hair ;
dbo:height ?height ;
dbp:publisher ?publisheriri ;
dbp:skinColor ?skiniri ;
dbp:hero ?gvb ;
dbo:weight ?weight .
?raceiri rdfs:label ?race .
?publisheriri rdfs:label ?publisher .
?skiniri rdfs:label ?skin .} WHERE {
?c <urn:col:name> ?nameprep ;
<urn:col:Gender> ?genderprep ;
<urn:col:Eyecolor> ?eyecolorprep ;
<urn:col:Race> ?raceprep ;
<urn:col:Haircolor> ?hairprep ;
<urn:col:Height> ?heightprep ;
<urn:col:Publisher> ?publisherprep ;
<urn:col:Skincolor> ?skinprep ;
<urn:col:Alignment> ?gvbprep ;
<urn:col:Weight> ?weightprep .
BIND(STRLANG(?nameprep, “en”) AS ?name)
BIND(STRLANG(?genderprep, “en”) AS ?gender)
BIND(STRLANG(?eyecolorprep, “en”) AS ?eyecolor)
BIND(STRLANG(?raceprep, “en”) AS ?race)
BIND(STRLANG(?hairprep, “en”) AS ?hair)
BIND(xsd:decimal(?heightprep) AS ?height)
BIND(STRLANG(?publisherprep, “en”) AS ?publisher)
BIND(STRLANG(?skinprep, “en”) AS ?skin)
BIND(STRLANG(?gvbprep, “en”) AS ?gvb)
BIND(xsd:decimal(?weightprep) AS ?weight)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/superhero/”, spif:encodeURL(?nameprep), “/”, spif:encodeURL(?publisherprep))) AS ?hero)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/race/", spif:encodeURL(?race))) AS ?raceiri)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/publisher/", spif:encodeURL(?publisher))) AS ?publisheriri)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/skin/", spif:encodeURL(?skin))) AS ?skiniri)
}
The query of course begins by defining the prefixes used within the query and is then followed by the CONSTRUCT
clause. Within the CONSTRUCT
braces I have described the desired knowledge graph. I have kept this knowledge graph relatively simple for the purposes of this article but you may wish to add rdf:type
statements for example.
Following the CONSTRUCT
clause, we have the WHERE
clause which extracts and transforms the tabular data into the variables in the CONSTRUCT
. In other words, we want to assign values to the variable names we defined in the CONSTRUCT
.
The first ten lines of the WHERE
clause matches the initial graph representation of our tabular data (in which each of the variables contains a column of the original data). The column stored in that variable is selected with the predicate and I have named the variable appropriately with “prep” appended so that I can further manipulate the column before adding the data to my final graph. For example:
?c <urn:col:name> ?nameprep
assigns the column with the heading “name” to the variable ?nameprep
as “name” is in the predicate <urn:col:name>
. This is the same for all of the columns in the original csv.
I next BIND
manipulated ?exampleprep
variables to their final forms. For the string columns I add language tags and for the numerical columns I add datatypes.
For example with the name column, the line:
BIND(STRLANG(?nameprep, "en") AS ?name)
adds the English (en
) language tag to the strings in ?nameprep
and stores them in ?name
. For example "Ant Man”
becomes "Ant Man"@en
.
The line:
BIND(xsd:decimal(?heightprep) AS ?height)
converts the string form of each height to a decimal. For example "183.0"
becomes "183.0"^^<xsd:decimal>
.
If you turn your attention back to the CONSTRUCT
section of the query you will notice that we have most of the variables prepared. We just need the ?hero
and ?exampleiri
entities to complete the desired knowledge graph.
The final four BIND
statements get us exactly what we need. These simply encode the strings, add them to a base URI and convert them to entities. The most complex of these is the ?hero
creation:
BIND(IRI(CONCAT("wallscope.co.uk/resource/superhero/", spif:encodeURL(?nameprep), "/", spif:encodeURL(?publisherprep))) AS ?hero)
This is only the most complex as we are creating each superhero entity with two columns. Some superheroes from different publishers have the same names so to ensure each superhero entity has a unique URI, we must bring the publisher name into it.
Working from the inside out, we apply a SPIN function to both the name and publisher columns. This particular SPIN function URL encodes a string, for example:
spif:encodeURL(?nameprep)
andspif:encodeURL(?pubisherprep)
transforms “Daphne Powell” and “ABC Studios” to "Daphne+Powell"
and "ABC+Studios”
respectively.
Next we use the CONCATENATE
function to join these URL encoded strings together with the base URI so:
CONCAT("http://wallscope.co.uk/resource/superhero/, "Daphne+Powell", "/", "ABC+Studios")
returns the string:
"http://wallscope.co.uk/resource/superhero/Daphne+Powell/ABC+Studios"
Following this, the IRI
function simply converts the above string to an IRI. Continuing this example, this looks like:
<http://wallscope.co.uk/resource/superhero/Daphne+Powell/ABC+Studios>
Finally, the BIND
function assigns the column of superhero entities, like the one above, to the variable ?hero
.
We have now extracted, tweaked and constructed the tabular data into the knowledge graph we want so to summarise we have:
- Cleaned the data slightly in the REFINE view
- Extracted the strings from the columns
- Transformed these strings into language tagged and typed literals
- Used some of these columns to create entities
- Connected these entities and literals in our
CONSTRUCT
Loading the Data:
We can now click “SPARQL Endpoint” to copy it.
By clicking on the “SPARQL” tab on the left, we can now INSERT
our knowledge graph to our triplestore.
The query below (to go in this panel) is extremely similar to our previous CONSTRUCT
query as we are now wanting to save our construction.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX spif: <http://spinrdf.org/spif#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>INSERT {
?hero rdfs:label ?name ;
foaf:gender ?gender ;
dbo:eyeColor ?eyecolor ;
dbp:species ?raceiri ;
dbo:hairColor ?hair ;
dbo:height ?height ;
dbp:publisher ?publisheriri ;
dbp:skinColor ?skiniri ;
dbp:hero ?gvb ;
dbo:weight ?weight .
?raceiri rdfs:label ?race .
?publisheriri rdfs:label ?publisher .
?skiniri rdfs:label ?skin .
}
WHERE {
SERVICE <http://localhost:7200/rdf-bridge/2158527924349> {
?c <urn:col:name> ?nameprep ;
<urn:col:Gender> ?genderprep ;
<urn:col:Eyecolor> ?eyecolorprep ;
<urn:col:Race> ?raceprep ;
<urn:col:Haircolor> ?hairprep ;
<urn:col:Height> ?heightprep ;
<urn:col:Publisher> ?publisherprep ;
<urn:col:Skincolor> ?skinprep ;
<urn:col:Alignment> ?gvbprep ;
<urn:col:Weight> ?weightprep .
BIND(STRLANG(?nameprep, “en”) AS ?name)
BIND(STRLANG(?genderprep, “en”) AS ?gender)
BIND(STRLANG(?eyecolorprep, “en”) AS ?eyecolor)
BIND(STRLANG(?raceprep, “en”) AS ?race)
BIND(STRLANG(?hairprep, “en”) AS ?hair)
BIND(xsd:decimal(?heightprep) AS ?height)
BIND(STRLANG(?publisherprep, “en”) AS ?publisher)
BIND(STRLANG(?skinprep, “en”) AS ?skin)
BIND(STRLANG(?gvbprep, “en”) AS ?gvb)
BIND(xsd:decimal(?weightprep) AS ?weight)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/superhero/”, spif:encodeURL(?nameprep), “/”, spif:encodeURL(?publisherprep))) AS ?hero)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/race/”, spif:encodeURL(?race))) AS ?raceiri)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/publisher/", spif:encodeURL(?publisher))) AS ?publisheriri)
BIND(IRI(CONCAT(“http://wallscope.co.uk/resource/skin/", spif:encodeURL(?skin))) AS ?skiniri)
}
}
In this query we have simply replaced CONSTRUCT
with INSERT
and added a SERVICE
to the WHERE
clause. This is a federated query which will send the query to an endpoint. We of course want to send this query to the endpoint we just copied from OntoRefine as we have tested and know that this will return what we want to INSERT
. To be clear, copy your endpoint into the angled brackets in the line:
SERVICE <http://localhost:7200/rdf-bridge/123456789> {
Once complete, it’s finally time to click “Run”.
and that’s it! To check the data is really there, let’s explore using GraphDB’s visual exploration tool.
Our data is loaded and linked as planned.
Conclusion:
It is totally down to preference which method of transformation you use to create RDF, I have now covered both an interface heavy and query heavy approach so it is up to you to test and decide which you prefer. For huge datasets I still tend to write custom scripts but these methods are suitable for most use-cases.