Scribe: Our project is kicking off

We are happy to announce that development of Scribe has officially started! Scribe is a project funded by the Wikimedia Foundation during 2019/2020 to build a tool that helps new editors of under-resourced languages start writing high-quality articles that conform to their language's standards. Reach out to us if you are interested in a research collaboration on any of the topics described below.

Wikipedia, one of the most visited and read websites in the world, is a widely used resource for information on a variety of topics. However, while a few language versions cover a wide range of topics, other languages are barely supported, and a large part of the world's population only has limited access to information. This information poverty strongly affects readers of Wikipedia, as they cannot find information on many topics. To create a more diverse encyclopedia, Wikipedia has to retain new editors and support existing editors in creating articles.

The project we are developing, Scribe, aims at supporting new editor retention. We tackle some of the challenges newcomers face while creating their first articles. From an editor's perspective, our tools work in the following way: an editor decides to create a new article on a topic of their choice. When starting to write the article, they can choose Scribe as a supporting tool. Scribe then suggests a structure for the article, i.e., section headings, as well as references.

For example, suppose Marta, your favorite athlete, has recently become notable, and it is time to write a Wikipedia article about her in your language. There is no English Wikipedia article about her yet that you could translate, as she might not be notable enough for the English-speaking Wikipedia. So you don't know where to start.

Scribe would first suggest a document structure, such as:
(1) Club career, (2) International career, (3) Honours, (4) Style of play, and (5) Personal life.
For each of these sections, the user would be provided with a list of possible references to cite. For instance, the following would be suggested as a reference:

“Marta first gained widespread notice during her time with Umeå, which she led to the 2004 Union of European Football Associations (UEFA) Women’s Cup title and helped to reach the finals...” Adam Augustyn, Britannica, Jul 26, 2019.

Each reference will be displayed as a list of summarized key points, so a user gets a quick overview of whether they would like to check the reference more in depth.

In this post, we want to give you an overview of the research topics we will tackle. We are looking forward to collaborating with other researchers, software developers and of course the Wikipedia communities themselves.

Active Research topics within Scribe

We split the work into four major challenges and will guide you through them in the following:

  • Document planning
  • Reference retrieval
  • Extractive summarization
  • Computer-supported cooperative work

Document planning

A large part of Wikipedia articles already have headings for each section, such as “Early Life”, “Career”, etc., for a biography.

By exploiting Wikidata’s structured data, we can retrieve articles similar to the one under creation and suggest a structure for the new article based on the existing articles’ section headings.
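As a minimal sketch of this idea, the following toy function ranks section headings by how often they occur in a set of similar articles. The article names and headings are hypothetical examples, not real data from the project:

```python
from collections import Counter

def suggest_sections(similar_articles, top_k=5):
    """Rank section headings by how often they appear across similar articles."""
    counts = Counter(
        heading
        for sections in similar_articles.values()
        for heading in sections
    )
    return [heading for heading, _ in counts.most_common(top_k)]

# Hypothetical section headings from footballer biographies similar to the draft.
similar = {
    "Player A": ["Club career", "International career", "Honours"],
    "Player B": ["Early life", "Club career", "Honours", "Personal life"],
    "Player C": ["Club career", "International career", "Style of play"],
}
suggest_sections(similar, top_k=3)
# -> ['Club career', 'International career', 'Honours']
```

In a real system, the set of similar articles would come from Wikidata queries rather than a hand-written dictionary, and the ranking could take article quality into account.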

Creating Wikipedia article structures is discussed by Sauper et al. [1] and Piccardi et al. [2]. Calculating similarity scores between knowledge base entities is a well-studied area of research, mainly used for knowledge base completion [3][4][5]. While those techniques work well for the most notable, well-connected knowledge base entities, they have not yet been tested on the long tail of entities with very few triples. We believe this might be the case for Wikidata entities with no Wikipedia articles yet. This poses another interesting research challenge to work on during the course of this project.
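For intuition, a very simple baseline for entity similarity (far simpler than the embedding methods of [3][4][5], but applicable even to sparse entities) is set overlap over an entity's statements. The property IDs and values below are illustrative only:

```python
def entity_similarity(statements_a, statements_b):
    """Jaccard similarity over (property, value) statement pairs of two entities."""
    a, b = set(statements_a), set(statements_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical Wikidata-style statements as (property, value) pairs.
marta = {("P106", "association football player"), ("P27", "Brazil"), ("P21", "female")}
other = {("P106", "association football player"), ("P27", "Sweden"), ("P21", "female")}
entity_similarity(marta, other)  # 2 shared of 4 distinct pairs -> 0.5
```

Because it needs only the triples an entity already has, such a baseline degrades gracefully for long-tail entities, whereas learned embeddings need enough connectivity to train on.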

Reference retrieval

This section contains multiple challenges at once: (1) How do we understand which concept the user wants to write about? (2) Where do we get references from? and (3) How can we select references that are appropriate and of high quality?

How do we understand which concept the user wants to write about?

When a user wants to write about “apple”, do they mean the fruit or the company? Using Wikidata, we disambiguate which concept is meant and exploit the existing statements and their references.
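One naive way to sketch such disambiguation is to score each candidate's description against the words of the user's context; the Wikidata IDs and descriptions below are illustrative, and a real system would use the full statement graph rather than word overlap:

```python
def disambiguate(user_context, candidates):
    """Pick the candidate entity whose description shares the most words
    with the user's context (a deliberately naive overlap heuristic)."""
    context_words = set(user_context.lower().split())

    def overlap(item):
        _, description = item
        return len(context_words & set(description.lower().split()))

    return max(candidates.items(), key=overlap)[0]

# Hypothetical candidate entities for the ambiguous term "apple".
candidates = {
    "Q89": "fruit of the apple tree",
    "Q312": "American technology company",
}
disambiguate("article about the technology company founded by Steve Jobs", candidates)
# -> "Q312"
```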

Where do we get references from?

To select references, we need a source from which to select them. This means we need to reuse or create an index over web pages that are potential candidates for a reference. There are multiple challenges to consider: (1) should we use an existing web index or build our own? (2) should the index be online or offline? As Wikipedia’s page request numbers are high, we have to consider the load on any web index. A solution could be an offline index: our tool might not react as fast in updating to recent events (which is arguably secondary for an encyclopaedia), but it would be faster in responding to users’ requests.
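At its core, an offline index of candidate pages can be as simple as an inverted index from words to page URLs. The sketch below uses hypothetical example URLs and is only meant to illustrate the data structure, not a production search stack:

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: word -> set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word of the query."""
    hit_sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hit_sets) if hit_sets else set()

# Hypothetical crawled pages that might serve as reference candidates.
pages = {
    "https://example.org/marta": "marta brazilian footballer umea uefa",
    "https://example.org/umea": "umea swedish football club",
}
search(build_index(pages), "umea uefa")  # {"https://example.org/marta"}
```

Because the index is precomputed offline, answering an editor's query is a cheap set intersection and puts no load on external web indexes at request time.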

How can we select references that are appropriate and of high quality?

This is a challenging task by itself; however, as a starting point, one can rely on URLs that have been used on Wikipedia before and assign them a trust score. We give higher weights to resources already used in the target language, so that we don’t create a bias towards the languages of the big Wikipedia communities (English, German and others). However, we believe that there is a lot of room for interesting research in the field of selecting trustworthy references that we do not yet cover.
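A minimal sketch of such a trust score, assuming we have per-language citation counts for each domain (the domains, counts, and the weighting factor below are all hypothetical):

```python
def trust_scores(url_usage, target_lang, lang_weight=2.0):
    """Score each domain by its prior use as a Wikipedia reference,
    weighting citations in the target language more heavily."""
    scores = {}
    for domain, usage_by_lang in url_usage.items():
        score = 0.0
        for lang, count in usage_by_lang.items():
            weight = lang_weight if lang == target_lang else 1.0
            score += weight * count
        scores[domain] = score
    return scores

# Hypothetical per-language citation counts for two domains.
usage = {
    "britannica.com": {"en": 900, "ar": 40},
    "aljazeera.net": {"en": 120, "ar": 300},
}
trust_scores(usage, target_lang="ar")
# britannica.com: 900 + 2*40 = 980; aljazeera.net: 120 + 2*300 = 720
```

The `lang_weight` knob is where the debiasing happens: raising it lets sources already trusted by the target-language community overtake globally popular but less locally relevant domains.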

Extractive summarization

Finally, to display the references to a user, we need to give them an overview of each reference that we suggest with the previous mechanisms. We decided to summarize the references, both to give the user an easy way to decide whether the content makes the reference appropriate for the article they are writing and to avoid, as much as possible, that users accidentally copy existing text. This opens up the research question of deciding which parts of a source are important for the Wikipedia article’s text and how to summarize them in the most appropriate way. In natural language processing research, there is a long line of work on summarization techniques. During the course of this project, we are most interested in exploring unsupervised extractive summarization techniques [6], as we believe they are more suitable for under-served language scenarios while still keeping a high standard of factual accuracy in the generated summaries.
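To illustrate the extractive flavor of such methods, here is a toy summarizer that scores each sentence by its word overlap with the rest of the text and keeps the top-scoring ones. This is a degree-centrality simplification of graph-based approaches like TextRank [6], not the project's actual algorithm:

```python
import re

def summarize(text, k=2):
    """Extract the k sentences with the highest total word overlap
    with the other sentences (a crude stand-in for graph-based ranking)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_sets = [set(s.lower().split()) for s in sentences]

    def score(i):
        return sum(len(word_sets[i] & word_sets[j])
                   for j in range(len(sentences)) if j != i)

    ranked = sorted(range(len(sentences)), key=lambda i: -score(i))[:k]
    return [sentences[i] for i in sorted(ranked)]  # keep original sentence order

text = ("Marta plays football. Marta won awards for football. "
        "The weather is nice.")
summarize(text, k=2)  # keeps the two football sentences, drops the off-topic one
```

Full TextRank would run PageRank over a weighted sentence-similarity graph instead of summing raw overlaps, but the extractive principle is the same: output verbatim sentences, which helps preserve factual accuracy.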

Computer-supported cooperative work

We take the integration of the community very seriously. We have started working on a series of interviews about how editors work now, to get insight into their processes and the tools they use. Further, we will run community studies for each iteration of our project, to ensure that we stay as close to the communities’ needs as possible. This has been our focus from the initial discussions of the project through the design of each implementation step, and we believe this is the way to achieve the best results. Therefore, we are also interested in collaborations in the area of HCI, especially with a focus on computer-supported cooperative work.

Finally, we believe that Scribe can make an impact on the lives of many people. It creates diversity in Wikipedia by enabling a wider audience to contribute to the project, aiming to close the information gap produced by the lack of editors from large parts of the world. At large, this will create a more balanced internet that enables speakers of any language to access information easily. With Scribe, we are taking the first step: lowering the threshold for contributing to one of the most used websites in the world.


1 — Sauper, Christina; Barzilay, Regina (2009). “Automatically Generating Wikipedia Articles: A Structure-Aware Approach”. Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP.

2 — Piccardi, Tiziano; Catasta, Michele (2018). “Structuring Wikipedia Articles with Section Recommendations”. SIGIR ’18 The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.

3 — Bordes, Antoine (2013). “Translating Embeddings for Modeling Multi-relational Data”. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013.

4 — Lin, Yankai (2015). “Learning Entity and Relation Embeddings for Knowledge Graph Completion”. Proceedings of the Twenty-Ninth (AAAI) Conference on Artificial Intelligence.

5 — Nickel, Maximilian (2016). “A Review of Relational Machine Learning for Knowledge Graphs”. Proceedings of the IEEE. doi:10.1109/JPROC.2015.2483592.

6 — Mihalcea, Rada; Tarau, Paul (2004). “TextRank: Bringing Order into Text”. Proceedings of EMNLP 2004: 404–411.
