My vision of a possible strategy. How could we change Wikidata’s core infrastructure to be able to scale better in the future?


Given the recent growth in the number of items in Wikidata, the idea has been voiced that we might need better tools for dealing with more data. One effect the recent FactGrid workshop had on me was the realization that different people have quite different ideas of how Wikidata works, despite spending serious amounts of time with it.

I personally believe that there’s often huge synergy that comes from hosting multiple kinds of data together in a single database. I would be happy if Wikidata managed to double its item count every year for the next five years.

Increasing the amount of data that we host means increased synergy between different datasets. Olaf Simons pointed out in [https://blog.factgrid.de/archives/492 Kopfzerbrechen Nr. 3: Genealogien] that given the kind of research he does as a historian, even data that comes from genealogy would be very valuable for the kind of research questions that he asks. Beyond the pure data, genealogy is also interesting as a topic because it draws a sizable community of people who care about this kind of structured data. If they came to Wikidata to enter genealogical information, there’s a good chance that at least some of them would also take up other tasks on Wikidata. Integrating them into our community is a valuable opportunity for membership growth.

When it comes to data, I think it’s worthwhile to uphold quality standards. It’s okay to only allow items about people who can be described with references. On the other hand, I see no need to consider certain people too insignificant to be listed.

If we had told Ewan McAndrew, the Wikimedian in Residence at the University of Edinburgh, that we didn’t want the items for alleged British witches about whom very little data is available, exclusionism like this would have held our project back.

What kind of problems do we get when we scale? Jura mentioned repeatedly that an increased number of low-quality GeoNames items makes the work of integrating new datasets via Mix and Match harder. Having duplicate items increases the work it takes to use Mix and Match. I consider this a valid concern that raises the need to get better at matching duplicates.

On the data side, this means that we want items that contain enough information that we can reasonably find the items that are supposed to be matched. On the software side, this underlines that better tools for merging are valuable.

What’s the problem with our current merging? It’s hacky. MediaWiki by default doesn’t have a concept of merging. As a result, “create a new item” suggested for a long time “make sure the item does not already exist! (If you make a mistake, you can request your item’s deletion here.)” even though the way we deal with items that are mistakenly created as duplicates is to merge them, not delete them. At the time that page was written, the concept of merging wasn’t yet developed.

Later we created a combination of a gadget and a bot. The gadget produces multiple edits in the two items in question. Afterwards the bot changes all links to the obsolete item into links to the item into which both get merged. In certain cases this can mean hundreds of edits that are done in a fashion that’s not easy to trace back.

Given this status quo, undoing merges is a lot more complicated than just pressing undo as you would for any normal edit. In cases where a bot has already redirected thousands of links, it might currently even be impossible to cleanly undo a merge.

This makes us uneasy about giving new members easy access to the merge tool. We hide it behind a gadget that has to be activated in the preferences, and we also don’t tell users about it in the “create a new item” dialog.

Other people who set up their own Wikibase installations won’t have our relinking bot by default, so they will have a worse experience with merging. This is an additional reason why it might be useful to fix the underlying process.

I have already hinted that I think an ideal solution to the merging problem would introduce a new concept: a batch of edits that can span multiple items and be undone with one click. Given how page-centered MediaWiki is, there might be some resistance to introducing such a concept of a batch that can show up on watchlists.

Merging isn’t the only place where this new concept of batches comes in useful. It would also be very useful when dealing with QuickStatements batches. Like merging, QuickStatements allows one user to do a lot of edits that aren’t easy to undo. We currently have discussions about whether we should require large QuickStatements imports to receive bot approval because of the difficulty of dealing with bad QuickStatements batches. If we had one-click undoing of those batches, we would have less need to write policies that put up barriers against adding large datasets via QuickStatements.

Currently, we have no good place to discuss specific QuickStatements imports. With a concept of a batch, each batch could also get its own discussion page. Besides the needs of Wikidata, I could also expect Wikimedia Commons to find one-click undoing of a batch of edits made with a tool like PattyPan useful. Maybe it could even be pitched to the Wikipedias as a way to deal better with bots that do a large number of edits. If a bot did 2000 edits in one batch, the watchlist of a person who watches several of the affected items could show just one entry. Easier tools to undo bot edits might even be welcomed by EnWiki administrators.

Batches would generally be useful for moderating edits made by tools that touch multiple items. Having an easy way to undo such actions would allow us to extend the general Wikipedia principle of being bold to merging, QuickStatements and other tools, where currently the huge work of undoing batch edits makes us more careful.
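As a rough illustration of what such a batch concept might look like on the data side, here is a minimal Python sketch. The `Edit`, `Batch` and item-store structures are hypothetical, not part of MediaWiki; the point is only that a group of edits spanning multiple items can be reverted as one unit:

```python
from dataclasses import dataclass, field

@dataclass
class Edit:
    """One change to one item: which property changed from what to what."""
    item_id: str
    prop: str
    old: object
    new: object

@dataclass
class Batch:
    """A group of edits spanning multiple items, undoable as a unit."""
    batch_id: int
    edits: list = field(default_factory=list)

def apply_batch(store, batch):
    """Apply every edit in the batch to the item store."""
    for e in batch.edits:
        store[e.item_id][e.prop] = e.new

def undo_batch(store, batch):
    """One-click undo: revert all edits in reverse order."""
    for e in reversed(batch.edits):
        store[e.item_id][e.prop] = e.old
```

A merge would then be recorded as one batch of link-retargeting edits, and undoing it would be a single `undo_batch` call instead of hundreds of individual reverts.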

To get back to the topic of merging, the next problem is knowing which items should be merged. To me this seems like a task where AI can help a lot. It’s a clear use case where good AI can produce a lot of value for Wikidata, in contrast to building an AI that’s supposed to rate quality without thinking about use cases at the beginning of the design, only to find out at a later stage that the resulting AI doesn’t deal well with its possible use cases.

Pasleim’s https://www.wikidata.org/wiki/User:Pasleim/projectmerge currently provides some candidates, but someone who puts more effort into the algorithms can likely improve on those results. Given its scope, it might be a project for a bachelor’s or master’s thesis.
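To give an idea of how simply such candidate generation can start out, here is a sketch that ranks pairs of items purely by label similarity. A serious implementation would also compare statements, descriptions and sitelinks; the 0.9 threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher
from itertools import combinations

def merge_candidates(labels, threshold=0.9):
    """Rank item pairs whose labels are nearly identical.

    `labels` maps item IDs to label strings. Returns (score, id_a, id_b)
    tuples sorted from most to least similar.
    """
    pairs = []
    for (a, la), (b, lb) in combinations(labels.items(), 2):
        score = SequenceMatcher(None, la.lower(), lb.lower()).ratio()
        if score >= threshold:
            pairs.append((score, a, b))
    return sorted(pairs, reverse=True)
```

Even this naive approach surfaces obvious duplicates; the interesting research question is how to rank the non-obvious ones.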

What do we do with our list of merge candidates once we have it? We currently don’t want decisions like this to be made without human input. When it comes to having humans verify tool-generated suggestions, we currently have three ways of doing that:

  1. The primary sources tool
  2. The Mix-and-Match tool
  3. Wikidata Games

They all have their own advantages and disadvantages. To discuss them, I would like to look at the case of WikiFactMine. When I read the proposal for WikiFactMine I was enthusiastic, because I dreamed of a future where a researcher who wants to know something about a protein wouldn’t, as a first step, put the name of the protein into Google Scholar, but would put the name into Wikidata to get a first overview.

For many items I would expect that the researcher would be greeted with some verified data and a few claims derived by WikiFactMine that he could approve or reject. Given that the researcher is at that moment actually interested in the protein, some researchers would be happy to look into those claims and the research behind them.

Unfortunately, doing good automatic text extraction is still hard. As a result, the results that WikiFactMine produced didn’t encourage putting the data into the Primary Sources tool. As far as I understand, the concern is that the Primary Sources tool doesn’t offer the user the justification for a statement but just gives him the choice to approve or reject. From their perspective, a Wikidata Game that can actually show the user the full justification is better than the Primary Sources tool, which doesn’t.

The problem with the Wikidata Game is that it doesn’t alert our scientist who’s interested in protein XY, while he is browsing protein XY, that the game contains statements relating to protein XY.

The same problem exists with the merge game. When I browse an item that’s on a merge list, that’s the ideal moment for me to review whether the merge makes sense but the Wikidata Game can’t tell me.

I could also imagine an interface where I specify with a SPARQL query a list of items for which I would be interested in solving any Wikidata Game that wants to add information to them. While studying bioinformatics, I once had playing a round of Fold.it as a homework assignment. I could imagine a university professor handing out a homework assignment to go through a list of items of particular interest with the WikiFactMine game and decide which claims are true and which aren’t.

The ability to select which items are of interest with specific SPARQL queries would make it a lot easier to customize the task to the interests of a particular user who wants to review claims. That’s true for merge candidates in the same way it’s true for WikiFactMine or other interesting data that needs human verification.
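For example, a query along these lines could define the review list for our protein researcher (on Wikidata, Q8054 is “protein” and P31 is “instance of”):

```sparql
# Candidate review list: all items that are instances of protein
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```

A game interface could then restrict its question queue to whatever items such a query returns.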

Most of what I wrote above was about dealing better with data. In addition, dealing with people is also important. If we get a lot of new people who make some edits but don’t feel like part of the Wikidata community, we will lack the kind of decision-making that requires multiple people to cooperate.

Wikidata’s lower edits-per-page ratio compared to Wikipedia reduces the likelihood that people come into contact through editing the same page and discussing on its talk page how the item should be handled.

One solution that I find promising is to make our Wikiprojects more prominent. To that end, I worked to add participant lists to Wikiprojects that didn’t have them, and to make the creation of a participant list a default part of a new Wikiproject.

I created the property “Wikidata project” (P4570) to link to our Wikiprojects from more places and make it easier for users to discover them. I’m also working on more user boxes as yet another way to make our Wikiprojects visible.

Another way to get more people to interact on discussion pages is to ensure that discussion pages of central items and properties show up on more watchlists. One way to encourage that is to add to the watchlist not only the item on which an edit is done, but also the property used and the item towards which the statement points, as I describe in my wishlist request: https://meta.wikimedia.org/w/index.php?title=2017_Community_Wishlist_Survey/Wikidata/Automatically_follow_the_target_items_of_statements_and_the_property_that_gets_used&action=view

This also has the added benefit of increasing our ability to fight vandalism on those important pages.

With growing data, there’s another task besides finding merge candidates that could use AI help: scoring item notability.

On the one extreme, having a way to list all items with zero notability, which we might want to delete, can be useful.

On the other extreme, it’s important to know which items are the most notable, for example to sort through big piles of data when multiple people share the same name. There are SPARQL queries where it would be useful to only list people above a certain notability. The ArticlePlaceholder tool would be a further beneficiary of a notability score.
Having a Wikidata-derived notability score for academic papers could have interesting effects once academics are able to refer to those numbers.
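As a sketch of where such a score could start, here is a crude hand-weighted heuristic over counts that Wikidata already has for each item. The field names and weights are invented for illustration; a real score would be learned from data rather than hand-tuned:

```python
def notability_score(item):
    """Crude notability heuristic over per-item counts.

    `item` is a dict with (hypothetical) precomputed counts of
    sitelinks, statements and incoming links. The weights are
    arbitrary placeholders, not a tuned model.
    """
    return (3 * item.get("sitelinks", 0)
            + 1 * item.get("statements", 0)
            + 2 * item.get("backlinks", 0))
```

Even a heuristic this crude would already let a SPARQL-driven list filter out empty stub items.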

I don’t consider developing such a score a top priority at the moment, but if someone needs an AI research topic, it’s an open problem that seems more useful to solve than rating item quality automatically.

Lydia recently spoke about wanting to develop functionality that allows institutions to sign data they contribute. I approve of that idea as a valuable target for development resources, but I don’t want to go into it further in this document.

For the record, I also consider development efforts that target better integration with Wikipedia, Wikimedia Commons and Wiktionary valuable targets for development resources, and I don’t want my suggestions about what great things could be done inside Wikidata to be understood as discouragement of those efforts.

Another obstacle to scaling is that currently a lot of our policy is implicit and rests on agreements from individual discussions that can’t easily be understood by newcomers. Lack of policy also makes it hard to end certain discussions because there are no clear criteria. Quite often, our structures are at the moment too unclear.
I intend to spend significant energy in the next year on writing a few policy pages and Requests for Comment to get them accepted as general Wikidata policy.
I’m also hoping that with Wikiproject Welcome we can find automatic ways to engage new users and easily integrate them into our community.