Wikibase for Research Infrastructure — Part 1

Wikibase is software developed by Wikimedia Deutschland that extends MediaWiki, the software behind Wikipedia, to work with structured data. Wikidata is powered by Wikibase and is the principal user of the software, but Wikibase can also be run outside of the Wikimedia ecosystem and used for other purposes. This article looks at getting started with Wikibase for your own research project: the preliminary steps of getting the software running with a basic configuration and loading data.

But Why?

For our research projects at Semantic Lab @ Pratt we have been thinking about infrastructure to power our investigations. We’ve generated a lot of data from our Linked Jazz project and are constantly producing new data from various research efforts and projects. We wanted a tool that can support arbitrary data but also supports linked data practices and methodologies. For the past few years we have stored our data in a combination of MySQL and Apache Marmotta. This worked fine, though we never really utilized Marmotta’s LDP capabilities. However, Wikibase is appealing for a number of reasons:

  • Statement level provenance (what Wikibase calls references)
  • Revision tracking and history
  • A nice user interface for manual editing and curation
  • An API to do bulk data work
  • SPARQL endpoint

The user interface is a huge deal for me. Not everyone is going to be comfortable running a SPARQL update query, and just having an avenue to manually look up a resource and edit it is very empowering. It humanizes the data and abstracts away a lot of the baggage associated with linked data.

Why not just use Wikidata?

This is a good question to ask about your project. For us, we are going to have a lot of esoteric data, for example modeling oral history transcripts down to the statement level. We will end up storing a lot of data that powers our tools and research but is not really appropriate to put into Wikidata. We are thinking of our Wikibase installation as our datastore and sandbox. We will maintain mappings to Wikidata properties and items and publish data back into that system. If anything, using the same software will make it easier to contribute valuable data back to Wikidata.

Installation and Configuration

The Wikibase developers have been kind enough to make Docker images for the Wikibase stack. We will be using these images to get Wikibase running locally. The first step is to install Docker if you do not have it already. I will be doing this on OSX, but it should be similar on Linux.

One thing to check in your Docker configuration is that you have enough RAM allocated. The docs recommend 4GB, so adjust your settings to at least that amount. The telltale sign of too little memory is a SPARQL endpoint that does not work; if you see that, increase the RAM allocation in your Docker settings.
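A quick way to check the allocation without opening the settings panel is to ask the Docker daemon directly. The sketch below assumes the `docker` CLI is on your path and uses the `MemTotal` field of `docker info`; the function names and the threshold constant are my own and purely illustrative.

```python
import subprocess

# The 4GB the docs recommend, expressed in bytes.
REQUIRED_BYTES = 4 * 1024**3


def has_enough_memory(mem_total_bytes, required=REQUIRED_BYTES):
    """Return True if the reported memory meets the requirement."""
    return int(mem_total_bytes) >= required


def docker_memory_bytes():
    """Ask the Docker daemon how much memory it can use, in bytes."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Usage (requires a running Docker daemon):
#   total = docker_memory_bytes()
#   print(f"{total / 1024**3:.1f} GiB", has_enough_memory(total))
```

The figure Docker reports can come out slightly below the value set in the settings panel, so treat a near miss as a hint rather than a hard failure.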

We now want to modify the configuration and settings for the software. Docker builds an image by following the instructions in a Dockerfile, while Docker Compose is a tool, driven by its own configuration file, for organizing multiple Docker images. To run, Wikibase needs the Wikibase software, a Blazegraph database, a MySQL database, and so on, each of which is its own Docker image. The docker-compose file ties all of this together for you. We need to download these Docker instructions. In a terminal/command line:

git clone https://github.com/wmde/wikibase-docker.git

We need to modify the LocalSettings.php file, which contains all the configuration instructions for how Wikibase should work. For now I’m just going to disable public edits and account creation, change the name of the site, and maybe add our logo, but the number of things you can configure is very large. The Wikibase folks have made this very easy for us: we can modify the wikibase/1.30/LocalSettings.php.template file. Here is what my file looks like:

The only things I added are lines 28, 29, and 31; I also changed the name on line 21.
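As a rough sketch of what settings of this kind can look like, the helper below prints PHP lines using the standard MediaWiki configuration variables `$wgSitename`, `$wgGroupPermissions`, and `$wgLogo`. The helper names, the site name, and the logo path are placeholders, and the output is an illustration rather than a copy of the file described above, so the line numbers will not match.

```python
def php_string(value):
    """Quote a Python string as a single-quoted PHP string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def local_settings_overrides(site_name, logo_path=None):
    """Build PHP lines that could be placed in LocalSettings.php.template."""
    lines = [
        f"$wgSitename = {php_string(site_name)};",
        "# Anonymous visitors may read but not edit",
        "$wgGroupPermissions['*']['edit'] = false;",
        "# Nobody can create an account for themselves",
        "$wgGroupPermissions['*']['createaccount'] = false;",
    ]
    if logo_path:
        # Placeholder path: point this at wherever the logo ends up in the image
        lines.append(f"$wgLogo = {php_string(logo_path)};")
    return "\n".join(lines) + "\n"


print(local_settings_overrides("Semlab", "/w/images/logo.png"))
```

Generating the lines from a script is optional; pasting the equivalent PHP into the template by hand has the same effect.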

If you want to add new files to the Docker image (in this case I’m adding a new logo), you have to tell Docker to do that as well. To do this I modified the Dockerfile for the Wikibase image ( ./wikibase/1.30/Dockerfile ) to include my logo file. You can see how to do that here.

Here I’m choosing to use the 1.30 version (at the moment 1.29 and 1.30 are available in this distribution). If you choose to use the 1.30 version, make sure you change the ./docker-compose.yml file to reflect that (see this example).

We should now be all set to build our Wikibase stack. In the directory where the docker-compose.yml file is, we run:

docker-compose build

You should see “Successfully built 606ac7f9900b” or something similar.

Now we start the stack:

docker-compose up

This should start the applications. You should be able to visit http://localhost:8181/ and see the main page. We’ve disabled account creation, but you can log in as the admin user: username “admin”, password “adminpass”.
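If you would rather confirm from a script that the wiki is answering, one option is to ask the MediaWiki API for its site information. This is a minimal sketch that assumes the API is served at `/w/api.php` under the same host and port; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_url(base_url="http://localhost:8181"):
    """Build the MediaWiki API URL that returns general site information."""
    query = urllib.parse.urlencode(
        {"action": "query", "meta": "siteinfo", "format": "json"}
    )
    return f"{base_url.rstrip('/')}/w/api.php?{query}"


def fetch_site_name(base_url="http://localhost:8181", timeout=10):
    """Return the configured site name; raises if the wiki is unreachable."""
    with urllib.request.urlopen(siteinfo_url(base_url), timeout=timeout) as resp:
        data = json.load(resp)
    return data["query"]["general"]["sitename"]


# Usage (requires the stack to be running):
#   print(fetch_site_name())
```

If the site name you set in LocalSettings.php comes back, both the web server and your configuration changes are in place.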

Our own Wikibase!

You can now start creating items or properties using the links under Special pages in the left sidebar, in the Wikibase section. But we probably want to load data in bulk rather than do everything manually.

Bootstrapping Data

This fresh install of Wikibase has no items and no properties. You’re free to model the world(!), but let’s make it easier on ourselves and load some items and properties via our favorite information management tool: spreadsheets.

We are going to use Python and CSV files to load items and properties. You will need to have Python installed, along with two modules, WikidataIntegrator and pywikibot, so visit those pages and make sure you install them (via pip or otherwise).

I’ve put together some scripts and example CSV files to demonstrate loading data into Wikibase: https://github.com/SemanticLab/data-2-wikibase

The first step is to make a bot account on our Wikibase. This bot account will be used to make bulk edits to our data. We do that through the Special pages link:

Add bot 🤖 account

We are going to want to copy the password which will look something like “bot@4vvaepj4quu1ahbmporh1ujk9qh0pqd4”

We now need to configure our scripts to use these credentials. Let’s make sure we have cloned the demo repo, then edit the files:

git clone https://github.com/SemanticLab/data-2-wikibase.git
cd data-2-wikibase

pywikibot reads its credentials from a file called password, which we need to edit to use the new bot password. You can also edit the file user-config.py to change your Wikibase site name.

WikidataIntegrator is a little more complicated. It is a tool that was developed specifically to add gene data to Wikidata, so it comes preloaded with some configuration. At a minimum, for the moment, we just need to change the bot password it uses. That is found in add_items.py on line 30.

We are now ready to load data: add_items.py adds items and add_properties.py adds properties. The first thing we are going to do is add some properties. These are very Linked Jazz specific, but we can use them as examples. The file add_properties.csv is a CSV of the properties we want to add. We can run the script:

python add_properties.py add_properties.csv

You should see results like:

Logging in to semlab:semlab as Admin
P2 instance of
P3 project
P4 wikidata ID
P5 dbpedia ID
P6 LJ Slug ID
P7 LJ square image

If we go to http://localhost:8181/wiki/Property:P6 on our local site, we can see that the property was added, in this case a unique legacy identifier we use in the Linked Jazz project.
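The demo scripts take care of talking to the wiki, but it can help to see roughly what a property looks like on the wire. The sketch below builds the JSON `data` value that the Wikibase API's `wbeditentity` action accepts when creating a new property. The CSV column names, descriptions, and datatypes in the sample are hypothetical and may not match the real add_properties.csv.

```python
import csv
import io
import json


def property_payload(label, description, datatype, lang="en"):
    """Build the JSON `data` value for a wbeditentity call with new=property."""
    data = {
        "labels": {lang: {"language": lang, "value": label}},
        "datatype": datatype,
    }
    if description:
        data["descriptions"] = {lang: {"language": lang, "value": description}}
    return json.dumps(data)


# Hypothetical CSV layout, for illustration only.
SAMPLE = """label,description,datatype
instance of,What kind of thing this item is,wikibase-item
LJ Slug ID,Legacy Linked Jazz identifier,string
"""

for row in csv.DictReader(io.StringIO(SAMPLE)):
    print(property_payload(row["label"], row["description"], row["datatype"]))
```

Each printed line is what would be sent as the `data` parameter, alongside `action=wbeditentity`, `new=property`, and an edit token from the logged-in bot session.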

Now that we have some properties we can add core items. These are the basic entities (you could call them classes) that our data is going to be built on, and they are found in add_core_items.csv:

agent                    Core class for things
person                   Class for people specifically
oral history transcript  Class for transcripts
project                  What Semlab project does this belong to?
Linked Jazz              A specific Semlab project

So now we could add some data, like an individual, and say:

  • This is a person
  • They are part of the Linked Jazz research project
  • Here are some identifiers and an image for them

So let’s add the core items

python add_items.py add_core_items.csv

Once each script runs, it adds a new file to the directory telling you which property or item IDs were created. It also creates an error file if something went wrong while creating one of them.

And once that is done we can add some individuals. It is worth taking a look at how this sheet is structured. Using the IDs returned by the previous steps, we can now define which statements should be built for each item:

The header row tells the script what property to use and what datatype it is. For example, column E says: assign a P2 statement (instance of) pointing to Q3 (person), and column F says: assign a P3 statement (project) pointing to Q6 (Linked Jazz).

Label and Description will go into the item label and description if populated.
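To make the idea of a header-driven sheet concrete, here is a minimal sketch of how such a file could be read into statements. The `P2|item` header convention, the sample row, and the function names are all hypothetical; the real add_jazz_people.csv and add_items.py may encode the property and datatype differently.

```python
import csv
import io

# Hypothetical header convention "<property>|<datatype>", with made-up data.
SAMPLE = """Label,Description,P2|item,P3|item,P6|string
Jane Doe,Example person,Q3,Q6,jane_doe
"""


def parse_header(cell):
    """Split 'P2|item' into ('P2', 'item'); return None for plain columns."""
    if "|" not in cell:
        return None
    prop, datatype = cell.split("|", 1)
    return prop.strip(), datatype.strip()


def rows_to_items(text):
    """Turn CSV text into dicts holding label, description, and statements."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    specs = [parse_header(cell) for cell in header]
    items = []
    for row in reader:
        item = {"label": "", "description": "", "statements": []}
        for cell, name, spec in zip(row, header, specs):
            value = cell.strip()
            if not value:
                continue  # empty cells produce no label, description, or statement
            if spec is None:
                key = name.strip().lower()
                if key in ("label", "description"):
                    item[key] = value
            else:
                prop, datatype = spec
                item["statements"].append((prop, datatype, value))
        items.append(item)
    return items


for item in rows_to_items(SAMPLE):
    print(item)
```

Keeping the property and datatype in the header means a new kind of statement only needs a new column, not a change to the script.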

python add_items.py add_jazz_people.csv
It’s happening

The script will then populate the items. When it’s done we can then look at one of the individuals:

Display of the populated Item page

We can use this workflow to model and import more research data.
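With items loaded, the SPARQL endpoint becomes useful for checking the import. The sketch below builds a query for everything that is an instance of (P2) person (Q3). The endpoint URL and the `wikibase.svc` URI prefixes are assumptions about the defaults of a local wikibase-docker stack, so adjust them to whatever your installation actually uses; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed defaults for a local wikibase-docker stack; adjust to your setup.
ENDPOINT = "http://localhost:8282/proxy/wdqs/bigdata/namespace/wdq/sparql"
ENTITY_BASE = "http://wikibase.svc/entity/"
DIRECT_PROP_BASE = "http://wikibase.svc/prop/direct/"


def instances_query(prop="P2", cls="Q3", limit=10):
    """Build a SPARQL query listing items whose `prop` statement points at `cls`."""
    return (
        f"PREFIX wd: <{ENTITY_BASE}>\n"
        f"PREFIX wdt: <{DIRECT_PROP_BASE}>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?item ?label WHERE {\n"
        f"  ?item wdt:{prop} wd:{cls} .\n"
        "  OPTIONAL { ?item rdfs:label ?label }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the result bindings as a list of dicts."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


print(instances_query())

# Usage (requires the stack to be running):
#   for binding in run_query(instances_query()):
#       print(binding["item"]["value"])
```

The query service is updated separately from the wiki itself, so newly added items may take a moment to show up in results.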

If you want to start over, you can simply stop the running Docker containers and remove the containers and data volumes:

docker rm $(docker ps -aqf name=wikibase)
docker volume rm wikibasedocker_mediawiki-images-data wikibasedocker_mediawiki-mysql-data wikibasedocker_query-service-data

This will erase the data. You can then run docker-compose up again and have a clean slate. You can also dump and restore the data at will, as described here.

What’s Next?

This is the most basic setup and data added to a Wikibase instance. We will be trying this approach out over the coming months to see if it is a viable platform to power our research. I’m particularly interested in:

  • Adding detailed data provenance information via References
  • Creating more complex modeling
  • Maintaining mappings from Wikibase properties to more conventional LOD predicates (“instance of” == “rdf:type” for example)
  • Using Wikibase to power our other tools like our custom dereferencing endpoints
  • Maintaining mappings to Wikidata and harvesting data
  • Scale and speed: does our server perform as needed, how long does it take to modify X number of items, and so on.