Creating Linked Data
In my previous article we explored what RDF looks like and the pros and cons of the three most common RDF formats. Now that we understand RDF, we can move on to actually creating some! If you want to read along but just download the created RDF, you can download it here in RDF/XML or Turtle format.
Getting Everything You Need
To make things as clear as possible, lets start with an easy to understand CSV. On Kaggle, there is a dataset called “120 Years of Olympic History” which we can download. This dataset is about a subject that we are all familiar with and it contains entities (locations, sports, athletes, etc…) that are also contained in other linked data stores such as DBpedia.
Once downloaded, we need to get OpenRefine and its RDF Extension. With larger more complex datasets, I tend to write custom Python scripts or utilise Apache Jena to transform the data. This extension for OpenRefine is fairly intuitive however and is a good tool to use in the beginning as it ensures (almost always) that your RDF is valid.
In this article I am using this RDF Extension for OpenRefine 2.8 which you can download from this page. This is not the most up to date version of OpenRefine but the most stable RDF extension at the time of writing is not compatible with 3.0 yet.
Once you have installed both of these and followed the extension instructions to link the two, you should start Refine and see this page on port 3333:
The first thing you want to do is click “Choose Files” and select the downloaded Olympics dataset. Once you do that and click next you should see:
On this page you can parse the data you are uploading. In this case the column headings are in the first row of the CSV, so you shouldn’t need to change any of the settings. You can name your project in the top-right text-box and click “Create Project” to move into your new project:
This is all 271,116 rows of the parsed Olympics CSV. OpenRefine is a very powerful tool if you want to explore data or do some data wrangling. For example, you can transform columns of data into numbers by clicking “To number” in the dropdown shown here:
If we do this for the: Age, Height, Weight and Year columns we can see that OpenRefine is now aware that these columns contain numerical data:
This is very useful if you want to explore the distribution of the values in a numerical column or filter between ranges as OpenRefine has a brilliant set of filter and facet tools.
Our Target Knowledge Graph
Our intention is to transform this data into RDF, but before this it is important to design our knowledge graph. There is no ‘correct’ knowledge graph so feel free to create your own. This is what I came up with:
Once again (as in my previous articles) entities are circles, literals are rectangles and predicates are represented by arrows. It is good practice to give your entities names (labels) and types where appropriate as it makes it easier to search for all people in your graph for example and get human readable names for an entity. I will explain a lot of my reasoning throughout this article but do plan on writing a more detailed article explaining how to design a good knowledge graph.
To finally get started, you have to click “Edit RDF Skeleton…” in the dropdown at the very top-right corner of the interface:
This will open the following window:
This is where we start creating our RDF and you can see from the above screenshot that I have changed the Base URI.
Note: The correct Base URI is:
The trailing “/” is important!
You can see in the screenshot that I missed the trailing “/” which follows throughout the article. When compared with the text in the screenshots, you can see in each of the screenshot captions that the expressions are a lot simpler when you add the trailing “/”. If you are following along practically, copy the captions to ensure everything is correct.
This allows us to create entities without having to type this base URI every single time. We will begin by creating the athlete entity by clicking “(row index) URI” and filling the window as follows (custom expression in caption):
To create the athlete URI, we take the athlete’s name and append it to our base URI. To do this we select “Name” as the cell we wish to use content from and “as a URI” because we want to create an entity. We control how the URIs are formed by creating a custom expression which is written in Google Refine Expression Language (GREL). We do not need to prepend the expression with the base URI, as we set it earlier, but I have shown it explicitly for the first example.
This custom expression is applied to each row of the Name column to create the athlete entities.
text.replace("a","b") replaces every
text with a
b in GREL. Therefore, for the Name cell containing “M’Bairo Abakar”, the expression:
“http://wallscope.co.uk/resource/olympics/athlete/” + value.replace(“ “,””).replace(“‘“,””)
does the following:
value.replace(" ","")removes all spaces (replaces spaces with nothing)
.replace("'","")removes all apostrophes
"http://wallscope.co.uk/resource/olympics/athlete/" + valueprepends that string.
This transforms “M’Bairo Abakar” into the URI:
Unfortunately, many names in this column contain invalid URI characters, such as commas and curly brackets, so a regex replace can be created to solve this. Instead of the custom expression discussed above, the expression in the caption of the screenshot can be used:
“athlete/” + value.replace(/[^a-zA-Z0–9]/,””)
Firstly, as we have set our base URI, we can remove it from the string. Next we replace all non-alphanumeric characters with an empty string to ensure no invalid characters have to be URL encoded by Refine in the URI. This would still be valid RDF but Refine would also encode the “/” characters which we do not want.
After creating this URI, we click “OK” and need to add a predicate. To do this, click “add prefix” on the RDF Schema Alignment screen which should open the following window:
If you type “foaf” like in the image above, the URI for foaf should appear in the URI box when you click inside it. This happens for all the most common ontologies used to create linked data. Once you click OK, the foaf ontology will be loaded into Refine.
We are adding foaf as the athlete entities need the type “Person” which is contained within the foaf ontology. To add this type, click “add rdf:type” below the newly created “Name URI”, type foaf:Person and click on
foaf:Person in the list:
We next want to add a predicate to connect the athlete URI to the sex URI as in the knowledge graph design. This again requires the foaf ontology as we can use
foaf:gender for this. Click “property?” and add the predicate in the same way we added the Person type:
Now we need to create the sex entity for this predicate to aim at. To do this, click to the right of the newly created predicate to create a new RDF node:
Here we choose the “Sex” column to create the URI and write a simple custom expression. As you can see, we do not need to prepend the entire base URI when creating an entity, just the appropriate ending (don’t forget the trailing slash). As I mentioned, I had missed the trailing slash in my base URI and therefore had to add “olympics/” to my custom expression. Ensure your set base URI has a trailing slash so that you can use the custom expressions in each screenshots caption.
We have now created triples and your RDF skeleton should look like this:
We have the athlete entities of type
foaf:Person attached to their gender!
I will start skipping steps here that we have already covered but will try to make it as clear as possible. We next add the
rdfs prefix to Refine as we want to add the name literal to the sex entities we just created. Entity names are always linked using the
rdfs:label predicate like so:
This name is a literal instead of a URI so we do not want to treat the cells as URIs but text instead:
We are once again using the “Sex” column, that we used to create the entities, but with a different kind of custom expression. In GREL an if statement takes the form:
if(condition,”value if true”,”value if false”)
Therefore in our expression we check if the value in the cell is “M”. If it is then the attached literal is “male” and if not, “female”. This does create duplicate triples in our RDF but triplestores do not add triples that already exist in the system so it is not a major problem. If storage is of concern however, these triples could be added by selecting “Constant Value” in the “Use content from cell…” box (scroll down in box) and typing them both manually.
We now have the following RDF skeleton:
We can see what our RDF is beginning to look like by clicking “RDF Preview”. The preview is in Turtle format as it is the most human readable format as I discussed in my previous article. The preview should now look something like this:
This duplication of gender entity labels can be seen here but, as mentioned, is not a problem. We have athletes connected to their type and genders which each have labels.
The next interesting literals to add are age (
foaf:age), height (
dbo:height) and weight (
dbo:weight). Let’s look at adding the athletes’ heights:
Both age and height are integers so selecting “as integer number” creates typed literals (with type
xsd:integer). Some athletes do not have heights and instead the cell contains “NA”. To combat this, the custom expression only returns a value if the cell does NOT contain “NA”. The exact same custom expression is used to create the weight literals but these are not all integers. To create the weight literals we must select “as non-integer number” to give them a type of
We must next create the team entities as each athlete belongs to a team. This is very simple compared to some of the others we have done:
After adding all this information, our RDF preview has become a lot more interesting:
Our prefixes are at the top of the file and our athlete entities have a lot more attached information. We can see that the age and height entities have the attached type
int and the weight has the type
double. This contrasts with the gender label which is a plain literal and therefore does not have a type.
Our next goal is to create the country NOC entities which will link to the teams. To do this you must click “Add another root node” on the bottom-left of the interface. This is again a relatively simple custom expression:
To create the games entities, we use the “Games” column and replace the space with a “/” so that “2012 Summer” becomes:
using the following custom expression:
Note that in the caption regex expression there is a space character after the 9. This keeps that character for the next replace to work on.
So far we have only used existing prefixes for our predicates such as foaf, rdfs and dbo. Sometimes a suitable predicate does not exist so we must add our own (this is not best practice but sometimes unavoidable). To do this, click “add prefix” as usual but fill the window with our own details:
Once this is created, the prefix
walls then represents:
Notice that entities are “/resource” but predicates are “/ontology” and these should not be mixed.
This prefix allows us to link games with their season using the predicate
walls:season which you can see in the skeleton soon.
The event entities are similar to the others, using the event column as a URI with a custom expression that removes non-alphanumeric characters:
”event/" + value.replace(/[^a-zA-Z0-9]/,"")
The medal entities are a little more complex as the majority of the rows are “NA”. We therefore have to implement the same trick that we used before, with athletes’ heights and weights, using the
if in GREL:
This returns nothing if the medal cell is “NA” and the entity URI otherwise.
Continuing to build the knowledge graph, our RDF Skeleton grows to look like this:
In this image you can see what predicates, types and structure I have built. We have almost everything but the central instance entity that ties everything together is missing. This instance entity links athletes, medals, events and games so that you can find out which event an athlete got their gold medal in for example. To do this we must create instance URIs and we must make them unique. An athlete can get two silver medals in the same Olympic games so the event must also be included to ensure every URI is unique. This requires a new GREL expression as we can only select one cell to use at a time. Selecting the event cell to be used as a URI, we write the following custom expression:
“instance/” + value.replace(/[^a-zA-Z0–9]/,””) + “/” + cells[“Games”].value.replace(/[^a-zA-Z0–9 ]/,””).replace(“ “,”/”) + “/” + cells[“Name”].value.replace(/[^a-zA-Z0–9]/,””)
This custom expression looks a lot more complicated but is fairly simple. The first section
"instance/" + value.replace(/[^a-zA-Z0-9/,"") is just like the other regex expressions we have already seen. We then add a slash to break the URI with
+ "/" before grabbing the value of the “Games” column in the same row with
cells["Games"].value. We replace spaces with slashes like when we created the games entities and add another slash before adding the value in the “Name” column to the URI.
These URIs are not human readable but do not need to be, they are unique and we will never want to find these anyway, they are simply used for linking. For this reason we do not need to add an
We finally connect these instance entities to the athlete, game, event and medal entities with custom predicates. The final RDF Skeleton looks like this:
Saving Our RDF
To save your RDF click “Export” and then “RDF as RDF/XML” or “RDF as Turtle” depending on which you want to save your linked data as:
Larger files download more easily as RDF/XML (due to the way browsers handle the download) so here is our final RDF/XML opened in Sublime:
We have finally transformed the CSV data into the desired knowledge graph! The more you do this, the easier it becomes (especially as your knowledge of existing predicates expands). In my next articles I plan to go over querying RDF and compare some of the most common triplestores to display the benefits of transforming this data.
If you wish to download the created RDF, you can find it here in both RDF/XML and Turtle.