Enriching Reconciled Data with OpenRefine

Karen Li-Lun Hwang
Published in The Bytegeist Blog
May 10, 2018 · 15 min read

The landscape of available tools for working with linked open data changes very quickly. Tools certainly come and go, but as a whole they are evolving to make our work with linked open data easier. I take this as a sign that an increasing number of people working with data — or in cases like me, an increasing number of librarians, archivists, and digital humanists working with cultural heritage data — are interested in getting their local data and datasets to play nice and connect to other data around the web.

About a year ago, I wrote a blogpost on how to use OpenRefine to create linked data as part of a research fellowship with METRO, the New York state library council serving NYC and Westchester County. One year and an OpenRefine version later, while working with SemLab and preparing a workshop for Pratt’s School of Information, I explored OpenRefine’s built-in functions and discovered that it can now easily take a set of reconciled entities and enrich it with attributes and properties from LOD-enabled datasets around the web, as long as an endpoint exists that you can query via URL.

Luckily this is possible with popular platforms like Wikidata and VIAF.

What exactly does it mean to “enrich” a dataset? Say you have a list or dataset to which you would like to add specific information that was not part of your original data. An example might be automating the acquisition of gender data for a list of names. Or if you have a list of musicians, you may want to add the genre of music associated with each musician or the recordings each musician appears on.

Enriching data can be particularly useful for stewards of digital repositories who are interested in adding information to name records or work records. This can also be very useful for digital humanists who have created or compiled a dataset for their research but would like to retrieve more data points that would further inform any analysis or visualization. It can even be used as a method to double-check whether your list of reconciled names has been reconciled correctly by pulling down a value that would help you troubleshoot reconciliation errors at a glance: Can this person in a supposed list of Italian Renaissance painters really have a birthdate of 1852?

In this blogpost I will outline a simple task workflow to enrich a dataset of entities. The post will mostly focus on accomplishing this using entities that have been reconciled to Wikidata URIs, but in order to illustrate the flexibility of this task workflow, I will also discuss how to adapt the method for VIAF URIs in a separate section.

Broad outline of steps

  • Create an OpenRefine project with your existing data.
  • Use URIs reconciled as linked data to build an “http://” query string to retrieve new data from an existing endpoint.
  • Query the resource to retrieve a result set (in our example, in JSON).
  • Transform the retrieved data into the targeted value and format.
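
To preview where this is headed, the whole Wikidata workflow below boils down to three GREL expressions, sketched here and explained step by step in the sections that follow:

Extract the identifier from the URI:
replace(value,"http://www.wikidata.org/entity/", "")

Build the query URL:
"https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item="+value+"&prop=P569"

Flatten the JSON result to the date(s):
forEach(value.parseJson().values,v,v).join(";")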

What you will need:

  • An internet connection
  • OpenRefine 2.8+ [download]
  • A CSV or XLS document with name entities reconciled as Wikidata URIs. You can download a pre-made sample set I have provided, consisting mostly of contemporary artist names [download set] or your own set. If you need to reconcile your set first, you can follow the instructions from the blogpost I mentioned [instructions].

ENRICHING YOUR DATASET OF WIKIDATA ENTITIES WITH WIKIDATA DATA

Our goal in this exercise will be to retrieve date of birth data for a list composed primarily of artists that have already been reconciled to Wikidata entities. Again, you can use the sample set provided, but feel free to adapt these instructions for your own target set.

  1. Start with a dataset of reconciled names

In this example, we will work on acquiring dates of birth from Wikidata for a short list of names that have been reconciled to Wikidata entities [download].

Opened in Excel, it looks something like this:

Our sample data in Excel

The most important column here is the Wikidata_uri column. When you use your own dataset, you will need at minimum a column with the Wikidata URI, though your set might have even more columns than this one. Column names don’t matter for the tasks we’re conducting, but they will be referred to as you see here for this exercise, so you may want to use similar naming conventions for your columns.

(Note that not all names were reconcilable to a Wikidata identifier, because not every person on the list was represented by a record in Wikidata at the time. You may run into a similar problem with your own list of names, which is a massive topic of its own.¹)

2. Launch OpenRefine and create the project

Now you will create your project with this file by using the following steps (if you know how to create a project, skip to the next step).

  • Launch OpenRefine
  • Select your file: “Create Project” > “Get data from” > “This Computer”, click on “Choose Files” and navigate your hard drive to the file’s location, “Open”, “Next >>”
  • Configure your data: You will be shown a preview of your data. In the lower blue field, make sure “Parse data as” is set to “CSV / TSV / separator-based files”. Where it says character encoding, click in the blank field next to it and select UTF-8 from the pop-up window of encodings. Make sure the first row with your column headers is recognized as headers (boldfaced) and not as your data. If it is not automatically recognized, check the box for “Parse next ‘1’ line(s) as column headers”. Since our exercise file is a CSV, activate the radio button “commas (CSV)” as the separator.
  • Project name field above: Give it a good name you can recognize (I called mine “OpenRefine Enrichment Test”)
  • Click on the button at the top right “Create Project>>”
Final project configuration settings

3. Extract the Wikidata ID from the URI

In the project space, you will now need to extract the identifier from the Wikidata URI by using OpenRefine tools to match the part of the URI you want. We need this identifier in order to build a query string that will use these identifiers as one variable and the date of birth identifier on Wikidata as the second variable.

For example, in the Wikidata URI representing the artist Ai Wei Wei, "http://www.wikidata.org/entity/Q160115", the trailing portion after "entity/" ("Q160115") is his unique identifier.

  • Go to column header for Wikidata_uri. To the left there is a small triangle. Open the menu with that triangle: “Edit column” > “Add column based on this column”
  • A pop-up window should have opened. Make sure “Language” is set to “General Refine Expression Language (GREL)”. This window allows you to build what are called expressions to use the original values in a column to build new ones. The default in the expression field is “value”, which as you see in the preview area means that the original column value will simply be repeated in your new column.
“Adding column based on column” settings, expression, and preview windows. Currently showing the default setting of “value”
  • Use GREL’s replace function to pull out only the identifier. That function works like this:

replace(original_value, value_substring_to_replace, replacement_string)

  • For our purposes, we will replace everything in our original value with an empty string, except for the trailing part, which is the Wikidata identifier. You can copy and paste the following expression into the expression window (a worked example of its output appears after these steps):
replace(value,"http://www.wikidata.org/entity/", "")

In the preview, you should see that only the identifier will be the value in the new column you are creating.

Using GREL to extract the Wikidata identifier from the URI
  • In “New Column Name”, type “Wiki_id”, and click ‘OK’ to create your column.
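
As promised above, here is what the replace expression does with a single URI: it strips away the base of the URI and leaves only the Q-identifier.

replace("http://www.wikidata.org/entity/Q160115", "http://www.wikidata.org/entity/", "")
→ "Q160115"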

4. Use the acquired identifier to query the Wikimedia Foundation endpoint for Wikidata properties

Before building our query for date of birth data, we need to see which property ID date of birth has on Wikidata. All of Wikidata’s properties are assigned a property identifier. A full list can be seen here: https://www.wikidata.org/wiki/Wikidata:List_of_properties/all

Luckily, if you mouse over a property on a Wikidata page or click on the property, it will tell you the property number, which may be faster than searching through the full list. Taking Ai Wei Wei’s Wikidata record as an example, you can see the property for date of birth is P569.

Mousing over a property label in a Wikidata record shows the property’s ID

Wikidata provides an endpoint for querying data as a URL. Once you know the property you would like to retrieve, the objective is to use OpenRefine to build a query string and retrieve the data you want from that endpoint.

An example query is: https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item=Q160115&prop=P569 ²

The baseline for the URL is “https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?” and everything that follows can be considered your query. Entity identifiers begin with a “Q” (as you see in your Wiki_id column) and property identifiers begin with “P”. So your query string (the part after “fetch_values”) is saying: “Give me the values for the property P569 (date of birth) for the entity Q160115 (Ai Wei Wei).”

So how is this achieved for the entire list?

Going back to our project space in OpenRefine:

  • On the new identifier column (Wiki_id), click the triangle and select “Edit column” > “Add column by fetching URLs”
  • Again, use GREL to create a formula that uses “value”, the built-in OpenRefine variable for your cell’s value, to write the query:
"https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item="+value+"&prop=P569"
  • Call the column “date_of_birth_Wikidata” (Retaining the source of the data is a good idea, just in case. I do this by adding it somewhere in the column name as a fool-proof method)
Building the query URLs from the IDs in your new column
  • Click “OK”

Depending on your internet speed and the number of entities in your list, this can take a little while. OpenRefine provides feedback on the progress status while it queries the endpoint, which is extremely helpful.

Once completed, your results should look like this:

On the right, the newly created column by using “Add column by fetching URLs”

The data has been returned in JSON format as a dictionary of keys and values. The syntax for each key-value pair is:

"key": value

and these key-value pairs are contained in a dictionary, signaled by the curly brackets surrounding the group in each field.³

You can see that the value(s) we’re interested in can be found in the dictionary item with the literal key “values”, a key name assigned by Wikidata. It is worth noting that the values for this item are expressed not as a string (strings are usually simply enclosed by quotes, like the values for item and prop are), but as a list. You can see this is a list from the square brackets surrounding the values:

"key":[value1, value2]

And interestingly, the first result in the list (Ai Wei Wei’s Wikidata record) has two dates listed as date of birth. This can happen if different information exists on record.
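
Putting those pieces together, the returned cell for Ai Wei Wei is shaped roughly like this (the two date strings below are illustrative placeholders, not necessarily the exact strings in his record):

{"item": "Q160115", "prop": "P569", "values": ["1957-08-28", "1957-05-18"]}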

5. Transform the JSON dictionary result to give you the property value(s)

Most people will want to flatten the data to the data value(s), rather than store the entire dictionary returned. To flatten the value:

  • On the new column “date_of_birth_Wikidata” click the triangle and select “Edit Cells” > “Transform”
  • What you type in the expression field will depend on how you want your results. Unless you want to store them as a list, you may choose to loop through the list with something like a forEach statement in your GREL expression to pull each value out of the bracketed list. Here are some possible ways you might want to format the data (sample outputs follow this list):
  • All values, with multiple values separated by semi-colons:
forEach(value.parseJson().values,v,v).join(";")
  • All values, but retaining their expression as a list:
value.parseJson().values
  • Only one of our date values (for example, the first):
value.parseJson().values[0]

Note: The zero in the square brackets refers to the index number in the list. In most programming languages the index number begins with 0 for the first item. The second item in the list has an index of 1.

  • Once you’re done, click ‘OK’.
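
To make the three options concrete, here is roughly what each expression returns for a cell shaped like the example above (again with illustrative placeholder dates):

forEach(value.parseJson().values,v,v).join(";") → 1957-08-28;1957-05-18
value.parseJson().values → [ "1957-08-28", "1957-05-18" ]
value.parseJson().values[0] → 1957-08-28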

You may also choose to reformat the data further, which can be done using data cleaning methods with OpenRefine. I won’t cover data cleaning methods here, because there are many thorough and fantastic instructional videos and blogposts about this elsewhere online.

Your data is now ready to be exported! If you want to export it as a CSV, click on the “Export” button in the upper right-hand corner and select CSV.

And as you probably already suspect, the entire workflow can be followed to enrich your list of Wikidata-reconciled entities with any other property, or with several properties.
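
For example, if you wanted place of birth instead (property P19 on Wikidata), the only change would be the property identifier at the end of the fetch expression:

"https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item="+value+"&prop=P19"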

ENRICHING YOUR DATASET OF VIAF ENTITIES WITH VIAF DATA

To show the flexibility of this OpenRefine enrichment workflow, we’re now going to do something similar to how we enriched a dataset of Wikidata entities with Wikidata properties: enrich a dataset of VIAF entities with data from VIAF, the Virtual International Authority File, a very popular authority file in the library community. This section does not give step-by-step details; rather, it should be read alongside the step-by-step instructions provided in the Wikidata section above.

We will need our data reconciled as VIAF entities in order to enrich them with other properties. Going into too much detail about how to do this lies beyond the scope of this article. But briefly, you can use the built-in “Reconcile” function to accomplish this. If you would like to reconcile a list of names to VIAF URIs, you can adapt the directions for reconciling to Wikidata URIs outlined in a previous blogpost. If you follow those directions, the reconciliation service you will add when going to “Reconcile” > “Start Reconciling” is http://refine.codefork.com/⁴ and the expression you will need to build the entity’s VIAF URI is:

"http://viaf.org/viaf/"+cell.recon.match.id

The sample data we used previously, however, also has a column at the end with VIAF identifiers [download] for you to use.

Now that we have our VIAF URIs ready, let’s decide on what data we want to target from the VIAF record. If we look at one VIAF record from our sample data, for example that of the Cuban-American artist Ana Mendieta, we see that there is a field in the record called “Selected Publishers”.

Expandable field for “Selected Publishers” in Ana Mendieta’s VIAF record. We will enrich our list with this group of publishers

Let’s target the retrieval of that data to enrich our set.

At this point, you are either in your OpenRefine project space already with your VIAF entity URIs, or you should go ahead and build it as outlined above with your own data. The expression you will use to extract the VIAF identifier into a new column called “VIAF_id” is:

replace(value,"http://viaf.org/viaf/","")

Once you have a column with the VIAF identifiers, you can now build the URL to query the VIAF endpoint, similar to what we did above for Wikidata. You can retrieve the data in a variety of formats, including XML and JSON.⁵ We will query for results in JSON again.

An example of a query would be:

http://viaf.org/viaf/96607499/viaf.json

You’ll notice that there is no indication of a specific attribute, property, or field. For VIAF, we will request the entire record and then transform it down to the property we want.

In OpenRefine, go to your newly created VIAF_id column, and select with the triangle “Edit column”> “Add column by fetching URLs”, and in the expression window copy and paste the following:

"http://viaf.org/viaf/"+value+"/viaf.json"
Building the VIAF query

I gave my column the name “VIAF_record”. Again, when you click “OK”, depending on the length of your list, this could take some time.

What is retrieved are full records for the entities in your list. It looks quite horrible at first glance. If you would like to see a record in more human-digestible format, you can copy and paste one of them into an online JSON formatter like the JSON Viewer. It will add tabs and line returns.

We now want to pull out that list of publishers. One way you can find where it is in the record dictionary is to simply take one of the names you see in the online record (see screenshot above), and search for it in the person’s JSON record. So if I take something like “Hatje Cantz”, I find it is a value that belongs to a dictionary key called “ns1:text”. But “ns1:text” is a key in a dictionary that is one of many dictionaries in a list of dictionaries called “ns1:data”. And “ns1:data” is a key inside one of the highest level keys for the record “ns1:publishers”.

Roughly outlined, this is where our data is located:

"ns1:publishers": {
  "ns1:data": [
    {
      ...
      "ns1:text": "The New Museum of Contemporary Art",
      ...
    },
    {
      ...
      "ns1:text": "Hatje Cantz",
      ...
    }
  ]
},

It still takes me some trial and error to filter down to the data I want when a query result is this big and deeply structured, compared to the more straightforward result from our Wikidata query. I say this to encourage others to adapt these methods and to not get frustrated if it takes a little while to construct the right transformation expression to capture your data.

There are two hints here about what I will need in my transform expression. Because the data is JSON, I will need to use the parseJson() function, and because “ns1:data” is structured as a list of dictionaries, I will almost certainly need a forEach loop. If this means little to you now, I wouldn’t worry. As you try to adapt the expressions we’ll use here to retrieve other data from a VIAF record, this extra information on the thought process could come in handy.

I will simply copy this new column first, since it’s the entire record and I may want more info from it later. Go to the top of our new column “VIAF_record” and use the triangle to select “Edit column” > “Add column based on this column”. Name the column “VIAF_select_publishers” and click “OK”.

Now go to your new column and select “Edit cells” > “Transform”. We could type in one expression, but let’s use two transform steps to make the process clearer.

First, let’s reduce the text to the area we will need to loop over, namely, just the list of dictionaries found under “ns1:data”.

In the expression window, type:

value.parseJson()["ns1:publishers"]["ns1:data"]
Transforming the record down to just “ns1:data”

And click “OK”.

You’ll see the column fields have been reduced immensely and that each only contains a list of similarly structured dictionaries with the publisher information.
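
Based on the record outline shown earlier, each reduced cell should now look roughly like this (the ellipses stand in for the other keys in each dictionary):

[ {... "ns1:text": "The New Museum of Contemporary Art", ...}, {... "ns1:text": "Hatje Cantz", ...} ]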

Now we’ll loop over each person’s publisher data to only retrieve the publisher name in each dictionary found under the key “ns1:text”. As above, we’ll separate these values by joining them with a semi-colon.

Go again to “Edit cells” > “Transform”. Enter the following loop expression in the expression field:

forEach(value.parseJson(),v,v["ns1:text"]).join(";")
Transforming the data to just the selected publishers using a loop expression and separating them by semi-colons

Click “OK”.
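
For the Ana Mendieta record outlined earlier, the loop keeps only the "ns1:text" values, so the resulting cell would read something like:

The New Museum of Contemporary Art;Hatje Cantz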

You now have a column with the list of selected publishers for each name entity in your list, if it exists in VIAF. At this point, too, I would hide the full-record column “VIAF_record” to make it easier to look at: “View” > “Collapse this column”. You can even delete the column if you’re done with these full records by using “Edit column” > “Remove this column”.

Using the “Export” button, you can now export the data in your desired format.

Again, as with the Wikidata exercise, you can modify this workflow to enrich your dataset with any information available in the VIAF record. VIAF and Wikidata records also include identifiers in other databases. If any of those have queryable endpoints, you could add those identifiers and follow a similar process to start collecting any of the data provided there.

---

¹ It should always be kept in mind that working with Linked Open Data can only be as good as the data present in LOD sources. They are constantly growing in coverage, but the inequalities of representation that play out generally in our societies are currently present in Linked Open Data. If you’re interested in reading more on this topic, a discussion of this appears in another blogpost I wrote, “The Vision of Linked Open Data: Martin Wong and the METRO Network”.

² This URL was found on a post on GitHub: “Add column from Wikidata #1179”, https://github.com/OpenRefine/OpenRefine/issues/1179

³ For a very simple explanation of JSON and its structure, you can read this description on the Python for Beginners website: http://www.pythonforbeginners.com/json/what-is-json

⁴ Note: this is work developed by codeforkjeff. For more on this, see his conciliator GitHub repo: https://github.com/codeforkjeff/conciliator

⁵ For more on the VIAF URL requests: https://platform.worldcat.org/api-explorer/apis/VIAF
