Querying Wikidata RDF with SQL using RDF2X

David Příhoda
Oct 15, 2019 · 10 min read

Linked Data sources store a growing volume of valuable information represented in RDF format. However, relational databases still provide advantages in terms of performance, familiarity and the number of supported tools. We present RDF2X, a tool for automatic distributed conversion of huge RDF datasets to the relational model. In this tutorial, we use RDF2X to convert a music subdomain of Wikidata into the relational model and explore it using SQL queries compared with their SPARQL equivalents.


About Wikidata

Wikidata is a structured collaboratively edited knowledge base of information. Currently, Wikidata stores billions of statements about more than 63 million entities. Wikidata can be accessed through a SPARQL endpoint and through database dumps in RDF and JSON.

The current version of the RDF dataset can be downloaded from the Wikimedia Downloads site updated several times per week.


Converting Wikidata RDF to a relational database

We will submit the conversion job to our existing cluster managed by Apache YARN. Therefore, our dependencies will be installed automatically, we only need the RDF2X executable JAR file that can be downloaded from GitHub. We will use the following configuration parameters:

  • flavor selects the implemented Wikidata flavor
  • input.acceptedLanguage is set to ’en’ to allow only English texts
  • filter.type is set to IRIs of entities whose instances we want to persist: band, country, human, music genre, musical instrument and song.
  • entities.minColumnNonNullRatio is set to 0.1 so that only properties specified for at least 10% instances of a given type are stored in columns, the rest will be stored in the Entity-Attribute-Value table
  • rdf.cacheFile provides a location for caching the collected RDF schema file. This is useful if we want to run the conversion multiple times on different filtered types without having to recollect the schema.

The conversion took approximately 2 hours and 20 minutes on a Spark cluster of five executors with 8GB memory and two cores. When running again with a cached RDF schema, the conversion time was 50 minutes shorter. With each conversion, 2.3 billion triples were processed.


Relational database generated by RDF2X

The resulting database contains 3.8 million entity table rows, vast majority of them stored in the human table (3.4 million). It is not yet possible to filter properties during conversion, therefore, all relation tables were created for all 246 occurring relational properties. We removed the unwanted tables manually, reducing the schema to its final form visible on figure 14.

Schema of selected entity tables. Relations between tables are illustrative, since Predicate relation tables are not restricted to two specific entity types. Rather, each one stores all relations of a given property type.

Wikidata SQL experiments

Now, we can explore the created database. For each topic, we first provide a SPARQL query that can be executed at the Wikidata SPARQL endpoint to produce the desired result. Next, we introduce an equivalent SQL query, demonstrating how Linked Data concepts are naturally translated to the relational model.

Songs by The Beatles

First, let’s consider a simple example of gathering all songs by The Beatles, sorted by date of creation. This can be achieved with the following SPARQL query:

SELECT (year(?date) as ?year) ?songLabel
WHERE
{
?song wdt:P31 wd:Q7366 . # is a song
?song wdt:P175 wd:Q1299 . # song is performed by The Beatles
?song wdt:P577 ?date . # song has publication date
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY ?date

A corresponding SQL query can be designed simply by selecting from the song table and joining with the band table through the performer relation:

In a real-world application, we would reference the Beatles instance by its ID, the join could therefore be performed directly on the performer table.

Members of The Beatles

Next, we can demonstrate the aggregation functionality of SPARQL by fetching all instruments played by members of The Beatles. We will group the result by the person’s name and use the GROUP_CONCAT aggregate to concatenate names of all instruments into a single list:

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?beatleLabel ?born

We can achieve the same result using SQL by selecting from the human table, joining with the band table through the member_of relation and with the musical_instrument table through the instrument relation. The instruments can be concatenated using the string_agg function:

Relatives of The Beatles

Next, let’s look at the Wikidata concept of property inheritance. This can be demonstrated on fetching all relatives of members of The Beatles. The Wikidata property relative (wd:P1038) has multiple subproperties such as mother, father and child, which are used for direct family members. To get our result, first, we define all subproperties using the subproperty of property chained with the * path operator that defines a path of zero or more occurrences. Then, we collect all values for the defined properties:

The concept of property inheritance is not natively supported by relational databases. However, we can achieve the same result by explicitly including all the requested relation tables, merged using the UNION operator:

This suggests a possible improvement of our conversion tool. For relation schema created using the Predicate or Type predicate strategy, instances of a given property could be persisted in all of its parent relation tables, just like instances of a given type are added to all of its superclass entity tables.

Bands from the Czech Republic

Next, let’s combine the musical domain with provided geographical information. For example, using the following SPARQL query, we can fetch music genres of the oldest Czech bands that have their Wikidata entry:

The same result can be achieved using SQL by selecting from the band table, joining with the country table through the country_of_origin relation and with the music_genre table through the genre relation.

Number of bands by country

Next, let’s focus on countries in general and generate a simple statistic of the number of bands that have their Wikidata entry per country. This can be achieved using the following SPARQL query:

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?countryLabel
ORDER BY DESC(?bands)

The same result can be achieved using SQL by counting the number of country_of_origin relations between the band table and the country table:

Number of bands per capita

Finally, let’s extend the previous query and calculate the number of bands per million inhabitants. The population of each country is readily available, the result can be achieved using the following SPARQL query:

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?countryLabel ?population
ORDER BY DESC(?perMillion)

To achieve this result in SQL, first we need to check where the population property is stored. Apparently, less than 10% of countries specify their population, the property value is therefore stored as a row in the Entity-Attribute-Value table. First, we use this SQL query to find the numeric identifier of our property:

Finally, we can achieve the desired result by selecting from the country table, joining with the band table through the country_of_origin relation and with the _attributes EAV table directly on the country’s primary key:


More about Wikibase data model

There are a few small but important features of RDF2X that enable automatic conversion of Wikidata to a relational database.

The Wikidata RDF dump is based on the Wikibase data model. A resource in Wikibase is either a Property or an Item. Property values of Wikidata entities can either be assigned directly or through a Statement node that stores the value as well as qualifiers that specify additional information. For example, a statement about a city’s population might include a qualifier specifying the point in time. Statement nodes are assigned ranks that determine which statement should be prioritized. In this case study, we focus only on so-called truthy statements. These statements have the best rank for a given property and are assigned directly.

An item is assigned a type using the custom instance of statement. Wikidata does not define any distinction between classes or instances, each item can be used in both contexts, as demonstrated by the Piano entity, which is a subclass while it is also referenced by another instance in a many-to-many relationship. Example of Wikidata RDF statements:

# Assigning labels
wd:Q5 rdfs:label "human"@en .
wd:Q254 rdfs:label "Wolfgang Amadeus Mozart"@en .
wd:Q5994 rdfs:label "piano"@en .
wd:Q52954 rdfs:label "keyboard instrument"@en .
wd:P1303 rdfs:label "instrument"@en .
# Mozart is an instance of Human
wd:Q254 wdt:P31 wd:Q5 .
# Mozart plays the piano
wd:Q254 wdt:P1303 wd:Q5994 .
# The piano is a subclass of keyboard instruments
wd:Q5994 wdt:P279 wd:Q52954 .

As you can see, Wikidata does not follow the convention of human-readable IRIs. Rather, each resource is assigned a numeric suffix. Therefore, it is not desirable to use the IRIs for name formatting. Instead, we will use the RDF2X feature of generating entity and property names from their RDFS labels.

To handle the presence of multiple IRIs for the same entity and other specific Wikidata concepts, we implemented a custom RDF2X Flavor. Firstly, it replicates RDFS labels to all the different IRI prefixes of a property. Secondly, it selects the following default settings:

  • formatting.useLabels is enabled.
  • relations.schema is set to Predicates to create one relation table for each property.
  • rdf.subclassPredicate is set to the subclass of property wdt:P279.
  • rdf.typePredicate is set to the instance of property wdt:P31 and also to the subclass of property to include subclasses as rows in the table of their superclass. This is not semantically correct, but usually provides meaningful results. For example, it includes the piano entity in the music instruments table.

Conclusion

In this tutorial, we provided a low-level demonstration of our tool by converting a music subdomain of Wikidata into the relational model. We explored multiple SPARQL queries and their SQL equivalents in the produced schema, demonstrating the natural representation of Linked Data concepts in the relational model created automatically by our conversion tool.

Check out https://github.com/Merck/rdf2x for RDF2X usage instructions.

In our next article, we explore ClinicalTrials.gov data by converting it to a relational database and visualizing it in Tableau.

David Příhoda

Written by

Software engineer with focus on machine learning and bioinformatics

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade