A new R package for exploring the wealth of information stored by Wikidata: tidywikidatar

What does Wikidata know about members of the European Parliament? Let’s find out using our new R package tidywikidatar. A post by Giorgio Comai (OBC Transeuropa)

An article by Monika Sengul-Jones titled “The promise of Wikidata”, published on datajournalism.com a couple of months ago, highlighted how Wikidata (a sort of database associated with Wikipedia) could be used by data journalists in a number of ways. In recent years, using Wikidata as a source has come up in various brainstorming sessions with colleagues contributing to EDJNet, and we will soon publish new material that makes extensive use of Wikidata.

Why aren’t more data journalists using Wikidata? Even beyond issues highlighted by Monika Sengul-Jones in her piece such as (unevenly) incomplete data, we have identified two additional obstacles to wider adoption of Wikidata in this context.


Firstly, getting data out of Wikidata can be an intimidating task even for people who are familiar with coding. Besides data wrangling, one needs some familiarity with the data structure of Wikidata (this is unavoidable, but it’s not too bad) and with SPARQL database queries, a major pain for those unaccustomed to database languages (see Wikidata’s instructions). Exploration of data — a typical component of data journalism — remains complex, and iterative processes less than intuitive.

Secondly, matching Wikidata identifiers to lists of individuals or objects as found “in the wild” is error-prone, and manual checks can be extremely time consuming.

To deal with this, we have been working on an interface that facilitates matching lists of strings to the relevant Wikidata identifiers; we will release it soon and announce it in a dedicated post.

Today, we are instead presenting a new tool, or rather, a package for the R programming language — tidywikidatar — that facilitates interacting with Wikidata for the many data journalists who use R and are familiar with its established data wrangling tools. In brief, tidywikidatar makes it easier to get data from Wikidata and explore them, without having to deal with complex database queries or nested data structures.

To see it in action, in this post we will outline a basic routine for exploring information stored on Wikidata, and find out what Wikidata knows about members of the European Parliament.

Setting up the package

First, you need of course to install the package.

install.packages("tidywikidatar")
## or for the development version, with some fixes and performance improvements
# remotes::install_github("EDJNet/tidywikidatar")
## RStudio users are advised to run the following line until a recently
## introduced regression in the `RSQLite` package is fixed
## https://github.com/r-dbi/RSQLite/issues/369
# options(connectionObserver = NULL)

I would also suggest you enable local caching, ideally in a folder that can be accessed by different projects (as it just caches information from Wikidata, there is mostly no reason to keep it in a folder synced with the likes of Dropbox or include it in backups).

library("dplyr", warn.conflicts = FALSE) # data wrangling
library("tidywikidatar")
tw_enable_cache()
tw_set_cache_folder(path = fs::path(
  fs::path_home_r(),
  "R",
  "tw_data"
))
tw_create_cache_folder(ask = FALSE)

As you see, all tidywikidatar functions start with tw_ followed by a verb describing what the function does.

Some familiarity with Wikidata is useful to follow along (check out this introduction on Wikidata’s own website). At the most basic, you should know that every item in Wikidata has an id (it always starts with a Q, something like Q123456). Each item is described by properties (they always start with a P, something like P1234), and some of these properties have qualifiers. With this in mind, you can just follow along with this post, and find out more about Wikidata on their own website and about tidywikidatar on the package’s own website, which includes more examples and more detailed documentation.
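Because identifiers follow such a regular pattern, it is easy to sanity-check them before asking Wikidata anything. Here is a minimal sketch in base R; `classify_wikidata_id()` is a hypothetical helper written for illustration, not a tidywikidatar function.

```r
# Hypothetical helper (not part of tidywikidatar): classify strings as
# Wikidata items ("Q" followed by digits), properties ("P" followed by
# digits), or neither.
classify_wikidata_id <- function(x) {
  ifelse(grepl("^Q[0-9]+$", x), "item",
    ifelse(grepl("^P[0-9]+$", x), "property", "invalid")
  )
}

# "member of the European Parliament", "position held", and two non-ids
classify_wikidata_id(c("Q27169", "P39", "q27169", "Willy Brandt"))
```

The check is deliberately strict: lower-case prefixes and plain names are both flagged as invalid, which is usually what you want before running a long batch of requests.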

We are now ready to start.

Finding out more about MEPs with Wikidata

To find out more about members of the European Parliament, we must first know who they are.

tidywikidatar allows for basic queries such as this one, if it is given a table of property-value pairs. In our case, let’s ask for everybody in the Wikidata database who has “member of the European Parliament” (Q27169) as “position held” (P39).

meps_query <- tibble::tribble(
  ~p, ~q,
  "P39", "Q27169"
)
meps_query
meps_df <- tw_query(query = meps_query)

Here we end up with a list of 4579 individuals who, according to Wikidata, have been members of the European Parliament.
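For readers curious about what such a table spares them from writing, here is a rough sketch of how property-value pairs could be turned into a SPARQL query. `build_sparql()` is a hypothetical helper for illustration only; the query that tidywikidatar actually sends may well differ.

```r
# Hypothetical helper: turn a table of property/value pairs into a SPARQL
# query string. Illustration only, not tidywikidatar's internal code.
build_sparql <- function(query) {
  # one triple pattern per row: the item must have value q for property p
  conditions <- paste0("  ?item wdt:", query$p, " wd:", query$q, " .")
  paste(
    c("SELECT ?item WHERE {", conditions, "}"),
    collapse = "\n"
  )
}

meps_query_example <- data.frame(
  p = "P39",
  q = "Q27169",
  stringsAsFactors = FALSE
)
cat(build_sparql(meps_query_example))
```

Adding a second row to the table would simply add a second condition that every returned item has to satisfy.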

Check all properties

That’s a lot of MEPs! What does Wikidata know about them?

Here are the twenty MEPs about whom Wikidata has the most properties:

### the first time you run this it will take a lot of time
meps_all_properties <- tw_get(id = meps_df$id)
### if you are just trying out, you may want to just continue with a random sample
# meps_all_properties <- tw_get(
#   id = meps_df %>%
#     dplyr::slice_sample(n = 100) %>%
#     dplyr::pull(id),
#   language = "en"
# )
properties_per_mep <- meps_all_properties %>%
  dplyr::filter(stringr::str_starts(string = property, pattern = "P")) %>%
  dplyr::distinct(id, property)
total_meps <- length(unique(properties_per_mep$id))
properties_per_mep %>%
  group_by(id) %>%
  count(name = "Number of properties") %>%
  ungroup() %>%
  arrange(desc(`Number of properties`)) %>%
  head(20) %>%
  transmute(Name = tw_get_label(id), id, `Number of properties`) %>%
  knitr::kable()

Many of these may not be best known for their past as MEPs, but if they are in this list, it means they have all been members of the European Parliament at some point in their lives.

The “number of properties” column shows high figures, but many of these properties are simply identifiers in other archives.

What matters most, probably, is how complete these data are.

properties_per_mep %>%
  group_by(property) %>%
  count(name = "Number of MEPs") %>%
  ungroup() %>%
  arrange(desc(`Number of MEPs`)) %>%
  head(20) %>%
  mutate(`Share of MEPs` = scales::percent(`Number of MEPs` / total_meps)) %>%
  mutate(Label = tw_get_property_label(property = property, language = "en")) %>%
  transmute(
    Property = property,
    Label,
    `Share of MEPs`,
    `Number of MEPs`
  ) %>%
  knitr::kable()

We really have rather complete data for only a handful of properties, but this is already a start. We can for example quickly find out the gender balance: about 26 per cent of MEPs were women.
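The gender balance comes from the “sex or gender” property (P21). As a minimal sketch of the calculation, here it is on a toy table with the same `id`, `property` and `value` columns as `meps_all_properties`; the person ids are placeholders, while Q6581072 and Q6581097 are the Wikidata items for “female” and “male”.

```r
library("dplyr", warn.conflicts = FALSE)

# Toy stand-in for `meps_all_properties` (placeholder person ids).
toy_properties <- tibble::tribble(
  ~id,     ~property, ~value,
  "mep_a", "P21",     "Q6581072", # female
  "mep_b", "P21",     "Q6581097", # male
  "mep_c", "P21",     "Q6581097",
  "mep_d", "P21",     "Q6581097",
  "mep_d", "P106",    "Q82955"    # occupation: politician (ignored below)
)

gender_share <- toy_properties %>%
  filter(property == "P21") %>% # keep only "sex or gender"
  distinct(id, value) %>%       # one row per person and value
  count(value, name = "n") %>%
  mutate(share = n / sum(n))

gender_share
```

With real data, the share is computed only over MEPs for whom Wikidata records this property, so it is worth checking how many are missing before quoting a figure.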

Let’s check out another property about which we apparently have rather complete information: their job. We would expect all MEPs to be politicians, but many were likely not only politicians. What were their other occupations (P106)?

meps_all_properties %>%
  dplyr::filter(property == "P106") %>%
  group_by(value) %>%
  count(sort = TRUE, name = "Number of MEPs") %>%
  ungroup() %>% # drop grouping so that `value` is not carried into the table
  head(20) %>%
  transmute(
    Occupation = tw_get_label(id = value, language = "en"),
    `Share of MEPs` = scales::percent(`Number of MEPs` / total_meps),
    `Number of MEPs`
  ) %>%
  knitr::kable()

Some interesting hints, but surely also a testament to the fact that the data are not really complete. This is perhaps not surprising, keeping in mind that this list includes all MEPs starting from 1958, and Wikidata may not have much information about politicians who had a brief public career half a century ago. Shall we focus on those who were MEPs in the latest legislature?

Who’s been an MEP in which legislature?

This is where Wikidata starts to get a bit more complex, as this kind of information is stored as qualifiers of properties.

If we take, for example, Willy Brandt (Q2514), we can see that he held many positions (P39):

tw_get_property(id = "Q2514", p = "P39") %>%
  tw_label() %>%
  knitr::kable()

For each of these, we have some qualifiers. Let’s look at what Wikidata knows about Brandt’s stint as an MEP (Q27169).

tw_get_qualifiers(id = "Q2514", p = "P39") %>% # get qualifiers for "position held" of Willy Brandt
  filter(qualifier_id == "Q27169") %>% # keep only information about his position as MEP
  select(-set) %>% # remove redundant qualifier identifier
  tw_label() %>%
  knitr::kable()

So now we know that Willy Brandt was a member of the first directly elected European Parliament (Q17315702).

If we are interested only in “parliamentary terms” (P2937) in the European Parliament (Q27169) for all the people who, according to Wikidata, have been members of the EP, we just have to ask.

mep_terms <- tw_get_qualifiers(id = meps_df$id, p = "P39") %>% # get qualifiers for "position held" of all MEPs
  filter(qualifier_id == "Q27169", qualifier_property == "P2937") %>% # keep only details about their position as MEP, and only about their parliamentary term
  select(-set) # remove qualifier identifier

Let’s take only MEPs who at any point in time have been members of the Ninth European Parliament (Q64038205) (due to Brexit, many already had to abandon their seat).

mep_9th <- mep_terms %>%
  filter(value == "Q64038205") %>%
  distinct(id) %>%
  pull(id)

Having more complete data about them, we can ask some more interesting questions. For example, where have they been “educated at” (P69)?

meps_with_p69 <- mep_9th %>%
  tw_get_property(p = "P69")

We have this information for 593 out of 795: not bad!

So here are the top 20 institutions where MEPs have studied, at least for some time:

meps_with_p69 %>%
  group_by(value) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  head(20) %>%
  tw_label() %>%
  knitr::kable()

Or… how many of them were born in a capital city? We just have to ask for their place of birth (P19), and then ask Wikidata about that place.

Out of 795 MEPs that have been members of the Ninth European Parliament, we know the place of birth for 787 of them.

Which is the city where most MEPs have been born? [I feel this could be one of those clickbaity posts such as “You would never guess number 1”]

meps_with_p19 <- mep_9th %>%
  tw_get_property(p = "P19")
meps_with_p19 %>%
  pull(value) %>%
  tibble::enframe(name = NULL, value = "Place of birth") %>%
  group_by(`Place of birth`) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  head(20) %>%
  mutate(`Place of birth` = tw_get_label(`Place of birth`)) %>%
  knitr::kable()

Honestly, I somehow expected more concentration. If we ask Wikidata about those places of birth, it will tell us that almost half of these MEPs were born in a “big city” (Q1549591), but given that Wikidata counts as such any town with more than 100,000 residents, this is not surprising at all.

meps_with_p19 %>%
  pull(value) %>%
  tw_get_property(p = "P31") %>%
  group_by(value) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  head(10) %>%
  tw_label() %>%
  knitr::kable()

148 were born in a capital city. Is it a lot? Not so much?
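A count like this can be obtained by combining the two steps above: take each MEP’s place of birth, then keep only the places that Wikidata describes as an instance of “capital city” (Q5119). Here is a minimal sketch on toy data, assuming tables with `id` and `value` columns like those used above; the person ids are placeholders, and the type assignments in the toy table are illustrative rather than copied from Wikidata.

```r
library("dplyr", warn.conflicts = FALSE)

# Toy stand-in for `meps_with_p19`: place of birth (P19) for each MEP.
toy_birthplaces <- tibble::tribble(
  ~id,     ~value,
  "mep_a", "Q90",   # Paris
  "mep_b", "Q90",
  "mep_c", "Q220",  # Rome
  "mep_d", "Q2044"  # Florence
)

# Toy stand-in for the "instance of" (P31) values of each place.
toy_place_types <- tibble::tribble(
  ~id,     ~value,
  "Q90",   "Q5119", # capital city
  "Q90",   "Q515",  # city
  "Q220",  "Q5119",
  "Q2044", "Q515"
)

# Places that are an instance of "capital city"
capitals <- toy_place_types %>%
  filter(value == "Q5119") %>%
  distinct(id)

# MEPs whose place of birth is among those capitals
born_in_capital <- toy_birthplaces %>%
  distinct(id, value) %>%
  semi_join(capitals, by = c("value" = "id")) %>%
  distinct(id)

nrow(born_in_capital)
```

Using `semi_join()` keeps each MEP at most once per birthplace even when a place has several “instance of” values, so people are not double counted.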

And now that I think about it, which regions are over-represented at the European Parliament if we take the place of birth of MEPs as a starting point?

Mmmm… do you start to feel the power of Wikidata?

The fun thing with Wikidata is that everything is connected, every piece of information added to it is available to everybody, and you can get this kind of information about lots of different subjects, objects, events, abstract concepts… all of this released with a CC0/public domain/“it’s all yours to enjoy” license.

Check out tidywikidatar’s website for more details about its functions and more examples.

If you feel like exploring the data presented in this post further, here’s a csv file with the Wikidata ids of all MEPs of the 9th legislature, generated with this script.

European Data Journalism Network

Europe explained through data by a transnational consortium of media #EDJNet #ddj