Game for improving Wikipedia quality

WikiBest screenshot

A beta version of the online game WikiBest was recently released. The service is part of research on data quality in Wikipedia. The game lets players compare the quality of data across different language versions of Wikipedia.

Despite its popularity, Wikipedia is often criticized for its poor quality. In the scientific world, there are various approaches to automatically assessing the quality of articles in this free encyclopedia. However, a large number of problems remain to be solved. For example, how can we automatically evaluate or compare the quality of individual facts about the same topic in different language versions?

In Wikipedia, each article can exist in several language versions (sometimes more than 200). On the one hand, this makes information easier to access for individual language communities. On the other hand, it can make it harder to tell which version holds the better information, because each version can be created and edited independently of the others. For example, readers and editors of the English article on Yekaterinburg do not need to know what the Russian version of Wikipedia says about the city, although one might expect the information there to be of better quality (of course, this rule does not hold in every case ;) ).

The WikiBest game is designed to help build algorithms that automatically compare data quality between language versions of articles, based on the decisions of users (players). To build such models, machine learning and artificial intelligence techniques with additional measures will be used. This can help select more complete, credible and timely information that could enrich other language versions of Wikipedia.
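How player decisions might become training labels can be sketched roughly as follows. This is only an illustration under my own assumptions, not the project's actual pipeline: the function name `majority_label` and the simple majority rule are hypothetical, and each vote is assumed to be a language code chosen by one player for one topic and one criterion.

```python
from collections import Counter


def majority_label(votes):
    """Return (language, share_of_votes) for the most-voted language.

    `votes` is a list of language codes, one per player decision.
    Returns None when there are no votes or when the top two languages tie,
    since a tie gives no clear label.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()  # sorted by count, highest first
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    language, count = ranked[0]
    return language, count / len(votes)
```

Labels with a low share of votes could be treated as less reliable, or left out, when training a model.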

Website of the game: WikiBest.net

The first short video lecture on how WikiBest works:


Main Features

Currently, the minimum requirement for a player is basic knowledge of 4 languages (Russian, Ukrainian, Polish, English), enough to compare the contents of infoboxes (put simply, tables with data) in Wikipedia articles. Knowledge of Belarusian is also recommended, since it makes it possible to compare quality across all 5 available language versions. To take part in the game you need to register. Once you receive the activation code by email, you can start to “struggle” for quality in Wikipedia! ;)

The screen shows infoboxes on the same topic in 5 (4) language versions; the topic could be, for example, a city, computer game, university, company or other object. The windows with infoboxes can be moved around for convenience. For each language version, four options can be marked regarding the data it contains: better quality, better completeness, better credibility, better timeliness.

Ideally, each of the available options should be marked only once across the 5 (4) languages. That is, we must determine which version is the best in each of the four “nominations”. However, there are exceptional cases in which two language versions are the best at once. In that case the game asks the player to add a comment explaining why they think so.
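The marking rule above can be sketched in a few lines of Python. This is a minimal illustration, not the game's real code: the names `CRITERIA`, `winners_per_criterion` and `needs_comment` are hypothetical, and a round is assumed to be a dictionary mapping each language code to the set of options marked for it.

```python
# The four "nominations" a player can mark for a language version.
CRITERIA = ("quality", "completeness", "credibility", "timeliness")


def winners_per_criterion(marks):
    """Map each criterion to the sorted list of languages marked for it.

    `marks` maps a language code to the set of criteria the player marked
    for that language version.
    """
    return {
        criterion: sorted(lang for lang, chosen in marks.items() if criterion in chosen)
        for criterion in CRITERIA
    }


def needs_comment(marks):
    """True when some criterion was marked for more than one language."""
    return any(len(langs) > 1 for langs in winners_per_criterion(marks).values())
```

For example, marking timeliness for both the Polish and the English version would make `needs_comment` return `True`, which corresponds to the case where the player is asked to explain the choice.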

To go to the next five (four) cards, click “Next” and repeat the steps described above.

For the work done, the player “earns” experience, which leads to higher levels.

Because the research is carried out mainly by specialists in machine learning and data analysis, the gaming side is not a strong point of this project ;) We still have a lot to learn, and I would be glad to be pointed to useful materials on the subject. Generally speaking, the project is non-commercial. Any support is welcome :)


A bit of theory

What is data quality? The question is not simple, and the scientific community has no single definition: it all depends on the context ;) To begin with, quality assessment is subjective and depends on the individual, their knowledge and experience, and the demand for the information at a given time. Simply put, data quality can be defined as fitness for use.

To evaluate data quality, it is also necessary to take into account its various dimensions, such as completeness, timeliness and credibility.

In the WikiBest game, completeness means how fully the object is described. That is, you need to look at which characteristics are filled in in the infobox and whether all the basic parameters for this object are available to the reader. For example, if the object is a city, the most important parameters may include population, area, mayor, etc.
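One simple way to put a number on completeness is the share of expected parameters that are actually filled in. The sketch below is only an illustration under my own assumptions: the function name `completeness` is hypothetical, the list of expected parameters is something you would have to choose per type of object, and an infobox is assumed to be a plain dictionary of parameter names and values.

```python
def completeness(infobox, expected_params):
    """Share of expected parameters that have a non-empty value (0.0 to 1.0).

    `infobox` maps parameter names to values; missing keys, None and
    blank strings all count as not filled in.
    """
    if not expected_params:
        return 0.0
    filled = 0
    for param in expected_params:
        value = infobox.get(param)
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(expected_params)
```

For a city with `population` and `area` filled in but an empty `mayor` field, this gives two out of three expected parameters.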

Timeliness relates to the difference between the parameter values entered for the object and the actual state of affairs. For example, an infobox whose population value is given as of 2018 is more up to date than an infobox where the same parameter has a value from 2016.
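The population example can be expressed as a small comparison of "as of" years. Again, this is just a sketch with assumed names: `most_timely` is hypothetical, and the year each value refers to is assumed to have been extracted already (or to be `None` when the infobox gives no date).

```python
def most_timely(as_of_years):
    """Return the language versions with the most recent "as of" year.

    `as_of_years` maps a language code to the year the value refers to,
    or None when no date is given. Undated values are ignored.
    The result is a sorted list, so ties return several languages.
    """
    dated = {lang: year for lang, year in as_of_years.items() if year is not None}
    if not dated:
        return []
    newest = max(dated.values())
    return sorted(lang for lang, year in dated.items() if year == newest)
```

With the values from the example above, a version dated 2018 wins over a version dated 2016.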

Credibility, in the context of the game, shows how much of the information is backed up by reliable sources, so that the reader can check the correctness of the value entered for a particular parameter.
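A rough proxy for credibility in this sense is the share of filled-in parameters that cite at least one source. The sketch below rests on my own assumptions: the function name `credibility` is hypothetical, the sources for each parameter are assumed to be collected into a list already, and it does not judge whether a source is actually reliable.

```python
def credibility(sources_by_param):
    """Share of parameters backed by at least one source (0.0 to 1.0).

    `sources_by_param` maps each filled-in parameter name to the list of
    sources cited for its value; an empty list means no source is given.
    """
    if not sources_by_param:
        return 0.0
    backed = sum(1 for sources in sources_by_param.values() if sources)
    return backed / len(sources_by_param)
```

Deciding which sources count as reliable is a separate and harder question that this sketch leaves open.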


Why 5 languages?

As mentioned above, the game is part of scientific research in which I take a direct part. I have at least basic knowledge of these languages, so I can conduct research on the data.

Belarusian is optional because of the size of the Belarusian edition of Wikipedia, which currently has approx. 150 thousand articles. For comparison, the Ukrainian Wikipedia already contains more than 800 thousand, and the Russian one almost 1.5 million (source).

The main goal of the research is to enrich the less developed language versions of Wikipedia. In this sense, the Belarusian edition has great potential: a lot of data can be transferred to it from the other language versions studied. However, we already know that data quality depends on the topic and the language version, so first we need to determine the “candidate” for “copying” (in fact, this data still needs to be translated, but that is not a problem when using semantics).