Querying Wikidata RDF with SQL using RDF2X

David Příhoda
Oct 15 · 10 min read

Linked Data sources store a growing volume of valuable information represented in RDF format. However, relational databases still provide advantages in terms of performance, familiarity and the number of supported tools. We present RDF2X, a tool for automatic distributed conversion of huge RDF datasets to the relational model. In this tutorial, we use RDF2X to convert a music subdomain of Wikidata into the relational model and explore it using SQL queries compared with their SPARQL equivalents.


About Wikidata

Wikidata is a structured collaboratively edited knowledge base of information. Currently, Wikidata stores billions of statements about more than 63 million entities. Wikidata can be accessed through a SPARQL endpoint and through database dumps in RDF and JSON.

The current version of the RDF dataset can be downloaded from the Wikimedia Downloads site updated several times per week.


Converting Wikidata RDF to a relational database

We will submit the conversion job to our existing cluster managed by Apache YARN. Therefore, our dependencies will be installed automatically, we only need the RDF2X executable JAR file that can be downloaded from GitHub. We will use the following configuration parameters:

  • flavor selects the implemented Wikidata flavor
  • input.acceptedLanguage is set to ’en’ to allow only English texts
  • filter.type is set to IRIs of entities whose instances we want to persist: band, country, human, music genre, musical instrument and song.
  • entities.minColumnNonNullRatio is set to 0.1 so that only properties specified for at least 10% instances of a given type are stored in columns, the rest will be stored in the Entity-Attribute-Value table
  • rdf.cacheFile provides a location for caching the collected RDF schema file. This is useful if we want to run the conversion multiple times on different filtered types without having to recollect the schema.

The conversion took approximately 2 hours and 20 minutes on a Spark cluster of five executors with 8GB memory and two cores. When running again with a cached RDF schema, the conversion time was 50 minutes shorter. With each conversion, 2.3 billion triples were processed.


Relational database generated by RDF2X

The resulting database contains 3.8 million entity table rows, vast majority of them stored in the human table (3.4 million). It is not yet possible to filter properties during conversion, therefore, all relation tables were created for all 246 occurring relational properties. We removed the unwanted tables manually, reducing the schema to its final form visible on figure 14.

Schema of selected entity tables. Relations between tables are illustrative, since Predicate relation tables are not restricted to two specific entity types. Rather, each one stores all relations of a given property type.

Wikidata SQL experiments

Now, we can explore the created database. For each topic, we first provide a SPARQL query that can be executed at the Wikidata SPARQL endpoint to produce the desired result. Next, we introduce an equivalent SQL query, demonstrating how Linked Data concepts are naturally translated to the relational model.

Songs by The Beatles

First, let’s consider a simple example of gathering all songs by The Beatles, sorted by date of creation. This can be achieved with the following SPARQL query:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

A corresponding SQL query can be designed simply by selecting from the song table and joining with the band table through the performer relation:

SELECT
date_part(’year’,song.publication_date) as year,
song.label_en as title
FROM band
INNER JOIN performer ON band.id = performer.id_to
INNER JOIN song ON performer.id_from = song.id
WHERE band.label_en = ’The Beatles’
ORDER BY song.publication_date;

In a real-world application, we would reference the Beatles instance by its ID, the join could therefore be performed directly on the performer table.

Members of The Beatles

Next, we can demonstrate the aggregation functionality of SPARQL by fetching all instruments played by members of The Beatles. We will group the result by the person’s name and use the GROUP_CONCAT aggregate to concatenate names of all instruments into a single list:

SELECT
?beatleLabel
?born
(GROUP_CONCAT(?instrumentLabel; separator=’, ’) AS ?instruments)
WHERE
{
?beatle wdt:P463 wd:Q1299 . # member of The Beatles
?beatle wdt:P569 ?born . # beatle has date of birth
?beatle wdt:P1303 ?instrument . # beatle plays an instrument
?instrument rdfs:label ?instrumentLabel . # request label explicitly
FILTER((LANG(?instrumentLabel)) = "en") . # to be able to aggregate

We can achieve the same result using SQL by selecting from the human table, joining with the band table through the member_of relation and with the musical_instrument table through the instrument relation. The instruments can be concatenated using the string_agg function:

SELECT
human.name_en AS beatle,
human.date_of_birth AS born,
string_agg(musical_instrument.name_en, ’, ’) as instruments
FROM band
INNER JOIN member_of ON member_of.id_to = band.id
INNER JOIN human ON member_of.id_from = human.id
INNER JOIN instrument ON instrument.id_from = human.id
INNER JOIN musical_instrument
ON instrument.id_to = musical_instrument.id
WHERE band.name_en = ’The Beatles’
GROUP BY human.name_en, born;

Relatives of The Beatles

Next, let’s look at the Wikidata concept of property inheritance. This can be demonstrated on fetching all relatives of members of The Beatles. The Wikidata property relative (wd:P1038) has multiple subproperties such as mother, father and child, which are used for direct family members. To get our result, first, we define all subproperties using the subproperty of property chained with the * path operator that defines a path of zero or more occurrences. Then, we collect all values for the defined properties:

SELECT
?beatleLabel
?familyPropertyLabel
(GROUP_CONCAT(DISTINCT ?personLabel; separator=’, ’) AS ?relatives)
WHERE
{
# get the direct property for all subproperties of relative
?familyProperty wdt:P1647* wd:P1038 .
?familyProperty wikibase:directClaim ?familyClaim .
# member of The Beatles
?beatle wdt:P463 wd:Q1299 .
# beatle is related to person
?beatle ?familyClaim ?person .
# request labels
?person rdfs:label ?personLabel .
FILTER((LANG(?personLabel)) = "en") .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?beatleLabel ?familyPropertyLabel
ORDER BY ?beatleLabel

The concept of property inheritance is not natively supported by relational databases. However, we can achieve the same result by explicitly including all the requested relation tables, merged using the UNION operator:

SELECT
human.name_en,
rel as relative,
string_agg(person.name_en, ’, ’) as names
FROM band
INNER JOIN member_of ON member_of.id_to = band.id
INNER JOIN human ON member_of.id_from = human.id
INNER JOIN (
(SELECT ’relative’ as rel, * FROM relative) UNION
(SELECT ’father’ as rel, * FROM father) UNION
(SELECT ’mother’ as rel, * FROM mother) UNION
(SELECT ’spouse’ as rel, * FROM spouse) UNION
(SELECT ’child’ as rel, * FROM child) UNION
(SELECT ’sibling’ as rel, * FROM sibling)
) relatives
ON relatives.id_from = human.id
LEFT JOIN human person ON relatives.id_to = person.id
WHERE band.name_en = ’The Beatles’
GROUP BY human.name_en, rel
ORDER BY human.name_en, rel;

This suggests a possible improvement of our conversion tool. For relation schema created using the Predicate or Type predicate strategy, instances of a given property could be persisted in all of its parent relation tables, just like instances of a given type are added to all of its superclass entity tables.

Bands from the Czech Republic

Next, let’s combine the musical domain with provided geographical information. For example, using the following SPARQL query, we can fetch music genres of the oldest Czech bands that have their Wikidata entry:

SELECT
?bandLabel
(year(?inception) AS ?year)
(GROUP_CONCAT(?genreLabel; separator=’, ’) AS ?genres)
WHERE
{
?band wdt:P31 wd:Q215380 . # is a band
?band wdt:P495 wd:Q213 . # band is from Czech republic
?band wdt:P571 ?inception . # band has inception year
?band wdt:P136 ?genre . # band has genre
?genre rdfs:label ?genreLabel . # request labels
FILTER((LANG(?genreLabel)) = "en") .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?bandLabel ?inception
ORDER BY ?year

The same result can be achieved using SQL by selecting from the band table, joining with the country table through the country_of_origin relation and with the music_genre table through the genre relation.

SELECT
band.label_en as name,
date_part(’year’,band.inception) as inception,
string_agg(music_genre.label_en, ’, ’) as genres
FROM country
INNER JOIN country_of_origin ON country.id = country_of_origin.id_to
INNER JOIN band ON country_of_origin.id_from = band.id
INNER JOIN genre ON band.id = genre.id_from
INNER JOIN music_genre ON music_genre.id = genre.id_to
WHERE country.label_en = ’Czech Republic’
AND band.label_en IS NOT NULL
GROUP BY band.label_en, band.inception
ORDER BY band.inception;

Number of bands by country

Next, let’s focus on countries in general and generate a simple statistic of the number of bands that have their Wikidata entry per country. This can be achieved using the following SPARQL query:

SELECT ?countryLabel (COUNT(?band) AS ?bands)
WHERE
{
?band wdt:P31 wd:Q215380 . # is a band
?band wdt:P495 ?country . # band is from a country

The same result can be achieved using SQL by counting the number of country_of_origin relations between the band table and the country table:

SELECT
country.label_en as name,
COUNT(*) as bands
FROM country
INNER JOIN country_of_origin ON country.id = country_of_origin.id_to
INNER JOIN band ON country_of_origin.id_from = band.id
GROUP BY country.label_en
ORDER BY 2 DESC;

Number of bands per capita

Finally, let’s extend the previous query and calculate the number of bands per million inhabitants. The population of each country is readily available, the result can be achieved using the following SPARQL query:

SELECT
?countryLabel
?population
(COUNT(?band) AS ?bands)
(COUNT(?band) * 1000000 / ?population AS ?perMillion)
WHERE
{
?band wdt:P31 wd:Q215380 . # is a band
?band wdt:P495 ?country . # band is from a country
?country wdt:P1082 ?population . # country has population

To achieve this result in SQL, first we need to check where the population property is stored. Apparently, less than 10% of countries specify their population, the property value is therefore stored as a row in the Entity-Attribute-Value table. First, we use this SQL query to find the numeric identifier of our property:

SELECT predicate FROM _meta_predicates WHERE label = ’population’;

Finally, we can achieve the desired result by selecting from the country table, joining with the band table through the country_of_origin relation and with the _attributes EAV table directly on the country’s primary key:

SELECT
country.label_en as name,
_attributes.value::int as population,
COUNT(*) as bands,
COUNT(*)*1000000/_attributes.value::int as "per million"
FROM country
INNER JOIN country_of_origin ON country.id = country_of_origin.id_to
INNER JOIN band ON country_of_origin.id_from = band.id
INNER JOIN _attributes ON _attributes.id = country.id
AND _attributes.predicate = 4165 AND _attributes.datatype=’INTEGER’
GROUP BY country.label_en, country.description_en, _attributes.value
ORDER BY 4 DESC;

More about Wikibase data model

There are a few small but important features of RDF2X that enable automatic conversion of Wikidata to a relational database.

The Wikidata RDF dump is based on the Wikibase data model. A resource in Wikibase is either a Property or an Item. Property values of Wikidata entities can either be assigned directly or through a Statement node that stores the value as well as qualifiers that specify additional information. For example, a statement about a city’s population might include a qualifier specifying the point in time. Statement nodes are assigned ranks that determine which statement should be prioritized. In this case study, we focus only on so-called truthy statements. These statements have the best rank for a given property and are assigned directly.

An item is assigned a type using the custom instance of statement. Wikidata does not define any distinction between classes or instances, each item can be used in both contexts, as demonstrated by the Piano entity, which is a subclass while it is also referenced by another instance in a many-to-many relationship. Example of Wikidata RDF statements:

@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

As you can see, Wikidata does not follow the convention of human-readable IRIs. Rather, each resource is assigned a numeric suffix. Therefore, it is not desirable to use the IRIs for name formatting. Instead, we will use the RDF2X feature of generating entity and property names from their RDFS labels.

To handle the presence of multiple IRIs for the same entity and other specific Wikidata concepts, we implemented a custom RDF2X Flavor. Firstly, it replicates RDFS labels to all the different IRI prefixes of a property. Secondly, it selects the following default settings:

  • formatting.useLabels is enabled.
  • relations.schema is set to Predicates to create one relation table for each property.
  • rdf.subclassPredicate is set to the subclass of property wdt:P279.
  • rdf.typePredicate is set to the instance of property wdt:P31 and also to the subclass of property to include subclasses as rows in the table of their superclass. This is not semantically correct, but usually provides meaningful results. For example, it includes the piano entity in the music instruments table.

Conclusion

In this tutorial, we provided a low-level demonstration of our tool by converting a music subdomain of Wikidata into the relational model. We explored multiple SPARQL queries and their SQL equivalents in the produced schema, demonstrating the natural representation of Linked Data concepts in the relational model created automatically by our conversion tool.

Check out https://github.com/Merck/rdf2x for RDF2X usage instructions.

In our next article, we explore ClinicalTrials.gov data by converting it to a relational database and visualizing it in Tableau.

David Příhoda

Written by

Software engineer with focus on machine learning and bioinformatics

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade