Open data and COVID-19: Language diversity on Wikipedia
Authors: Sun Geng @tosungen, Meeyoung Cha (Institute for Basic Science, South Korea & KAIST), Inho Hong (Center for Humans & Machines, Max Planck Institute for Human Development, Germany), Diego Saez-Trumper (Wikimedia Foundation).
During the COVID-19 pandemic, people from around the world continue to turn to Wikipedia for vital information about the public health crisis. To meet this demand, Wikipedia volunteers have contributed hundreds of thousands of edits dedicated to explaining the novel coronavirus, tracking regional cases, linking the people involved, and discussing the socio-economic impact. Our three-part blog series analyzes readership trends and offers insights that might help us understand how Wikipedia readers navigate the wealth of COVID-related information on the site.
In our first post, we analyzed how readers navigated through different COVID-19-related topics on English Wikipedia pages. We also explored how readers’ attention changed over time as the pandemic spread across the globe.
In this second installment, we look beyond English Wikipedia to analyze Chinese, Korean, and Italian language articles, since COVID-19 outbreaks occurred successively in China, South Korea, and Italy. Chinese Wikipedia is blocked in mainland China; however, access to it from outside the mainland is still meaningful. We hope that by taking a deep dive into readership trends across different language Wikipedias, we can form a better understanding of the topics that are most important in these languages.
Executive Summary:
- Generally, the more the virus spread in a given location, the more people turned to COVID-19-related Wikipedia pages in the location’s respective language (e.g., as the Italian cases went up, so did pageviews to Italian-language Wikipedia articles about COVID).
- The most popular categories of COVID-19-related Wikipedia pages differ by language. For instance, celebrity biographies of people who tested positive for the virus seem to be popular among English and Italian readers, but less so with Chinese and Korean readers.
- At the start of community spread in places where these languages are spoken, the pages about the virus itself generated a lot of traffic in those languages. As the number of cases increased, readers sought information specific to those regions.
- Compared to information about the people and regions affected by COVID-19 on Wikipedia, there is a substantive gap in the quantity of information specifically about the virus on the most-read Wikipedia pages by language. This gap calls for a more rapid translation of content to better support Wikipedia readers.
Daily views by language
We first examined the aggregate demand for Wikipedia pages. Figures 1a-d show the total view counts of Wikipedia pages in a specific language, along with the growth rate of COVID-19 cases in the United States, China, South Korea, and Italy. While language alone does not determine a reader’s geographic location, we used language as a secondary indicator to understand traffic trends to Wikipedia.
We found that the daily growth rate of COVID-19 infections fluctuates during the initial phase of the pandemic (marked as “Initial Growth” in Figures 1a-d) because the number of infections is still relatively small. After some time, typically once 20–25 cases of COVID-19 have been confirmed, the transient fluctuations disappear and a steadier trend emerges (marked as “Growth rate”). The figure for Chinese-language articles shows a steady growth rate early in the epidemic because of the large number of infections tracked in late January.
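As a rough illustration of the growth-rate series described above, the sketch below computes a day-over-day growth rate from cumulative case counts and skips the early days with few cases. The formula (new cases divided by the previous day's total), the `min_cases` cutoff of 20, and the sample numbers are all assumptions made for illustration; the post does not state the exact definition used for the figures.

```python
def daily_growth_rate(cumulative_cases, min_cases=20):
    """Return (day_index, growth_rate) pairs for days whose previous-day
    cumulative total has reached min_cases.

    Growth rate is assumed to be (today - yesterday) / yesterday.
    """
    rates = []
    for day in range(1, len(cumulative_cases)):
        previous = cumulative_cases[day - 1]
        current = cumulative_cases[day]
        if previous < min_cases:
            # Early phase: too few cases, the rate would fluctuate wildly.
            continue
        rates.append((day, (current - previous) / previous))
    return rates


# Hypothetical cumulative case counts, not real data.
example_cases = [1, 2, 5, 20, 30, 45, 54]
example_rates = daily_growth_rate(example_cases)
for day, rate in example_rates:
    print(f"day {day}: growth rate {rate:.2f}")
```

Raising or lowering `min_cases` moves the boundary between the noisy initial phase and the steadier trend.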
These figures show a high correlation between the growth rate of the virus and the aggregate daily views on COVID-19 Wikipedia pages. A sharp rise in the growth rate is closely followed by high pageviews on Wikipedia. Sometimes the response is immediate (e.g., Figure 1c); at other times it lags by several days (e.g., Figure 1d). As the infection growth rate rises or falls, so do Wikipedia view counts. However, the aggregate views on Wikipedia pages remain relatively high throughout the pandemic, reaching tens of thousands to even millions of daily views, and decrease only gradually over time.
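One way to quantify the delayed association described above is to correlate the growth-rate series with the pageview series shifted by a few days and see which shift fits best. The sketch below does this with a plain Pearson correlation. It is not necessarily the method behind the figures, and the function names, the `max_lag` parameter, and the sample series are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_lag(growth, views, max_lag=7):
    """Compare growth on day t with views on day t + lag for each lag in
    0..max_lag and return the (lag, correlation) pair with the highest
    correlation. Both series must have the same length."""
    results = []
    for lag in range(max_lag + 1):
        x = growth[:len(growth) - lag]  # drop the last `lag` days
        y = views[lag:]                 # drop the first `lag` days
        results.append((lag, pearson(x, y)))
    return max(results, key=lambda pair: pair[1])


# Hypothetical series: views follow the growth rate two days later.
growth = [0.1, 0.5, 0.9, 0.4, 0.2, 0.1, 0.1, 0.05]
views = [80, 90, 100, 500, 900, 400, 200, 100]
lag, r = best_lag(growth, views, max_lag=3)
print(f"best lag: {lag} days, correlation {r:.2f}")
```

A best lag of zero would correspond to the immediate response, and a larger value to the delayed one.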
Despite the unstable access to Wikipedia in mainland China, the Chinese-language views are still notable: we use language as a proxy for nationality, and in the early reporting on the epidemic most of the discussion, whether at home or abroad, was concentrated on the Chinese context.
The aggregate traffic to Chinese COVID-19 Wikipedia pages remained rather steady throughout the period, despite the prominent decrease in the growth rate of infections. In the graph of China’s growth rate and Chinese Wikipedia views, the growth rate first increases on January 17th, and the Wikipedia views rise suddenly three days later, on January 20th. The black vertical line marks the date of Wuhan’s lockdown, January 23rd.
Zh: https://public.flourish.studio/visualisation/2175414/
Content types by language
Previously, we found that COVID-19-related Wikipedia pages could be grouped into four major topics:
- Virus: Wikipedia pages that directly cover topics on the virus itself (such as “Coronavirus disease 2019”), developments on tests and vaccines (e.g., “COVID-19 vaccine”), and symptoms (e.g., “Severe acute respiratory syndrome coronavirus 2”) belong to this category.
- Region: Tracking pages dedicated to specific regions were quickly created as outbreaks spread globally (e.g., “2020 coronavirus pandemic in New York (state)”).
- People: Celebrities and public figures who are related to COVID-19 as spokespersons, doctors, or infected patients were grouped into the People category.
- Others: The remaining Wikipedia pages were grouped as Other topics, which included the discussion on the socio-economic impact of COVID-19 (such as the “2020 stock market crash”).
Maintaining the same topical categories (i.e., Virus, Region, People, and Others), we classified the Wikipedia pages in English, Chinese, Korean, and Italian and examined the aggregate share of view counts that each topic received in a given language. We found that the most popular category in English, Korean, and Italian Wikipedia pages was the Region category, whereas it was the Virus category in Chinese Wikipedia pages. We may attribute this difference to the fact that Wikipedia is blocked in China and hence accessible only through proxy servers, so Chinese-speaking users may visit Wikipedia for different reasons than others do. For Chinese Wikipedia pages, traffic to the Virus category of articles is far higher than traffic to the Region pages. For all other languages, the Virus pages were initially in highest demand but were quickly overtaken by the Region pages, indicating that people’s attention shifts quickly to the pandemic situation in the corresponding country.
Another notable finding is the proportion of the People pages. Wikipedia pages on public figures who are associated with COVID-19 were in significant demand in English and Italian, accounting for 16.8% and 13.8% of the total views, respectively. However, the People pages attracted far less attention in Chinese and Korean, accounting for 7.9% and 4.4% of total views, respectively. Later, we will examine which public figures make up the top viewed list.
https://public.flourish.studio/visualisation/2108402/
https://public.flourish.studio/visualisation/2105982/
https://public.flourish.studio/visualisation/2105985/
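The category shares reported above amount to summing pageviews per topic and dividing by the total for a language. A minimal sketch follows, assuming a hand-labelled mapping from page title to category. The view counts are made up for illustration, and treating unlabelled pages as “Others” is an assumption rather than the documented procedure.

```python
from collections import defaultdict


def category_shares(page_views, page_category):
    """Return each category's percentage share of total views.

    page_views: dict of page title -> view count
    page_category: dict of page title -> category label
    Pages without a label fall into "Others" (an assumption).
    """
    totals = defaultdict(int)
    for page, views in page_views.items():
        totals[page_category.get(page, "Others")] += views
    grand_total = sum(totals.values())
    return {cat: 100 * count / grand_total for cat, count in totals.items()}


# Hypothetical view counts for one language edition, not real data.
views = {
    "Coronavirus disease 2019": 500,
    "2020 coronavirus pandemic in New York (state)": 300,
    "Tom Hanks": 150,
    "2020 stock market crash": 50,
}
labels = {
    "Coronavirus disease 2019": "Virus",
    "2020 coronavirus pandemic in New York (state)": "Region",
    "Tom Hanks": "People",
}
shares = category_shares(views, labels)
for category, share in shares.items():
    print(f"{category}: {share:.1f}%")
```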
Virus pages by language
Even within the same topical category, we observe commonalities and differences in people’s attention. Table 1 lists the rank of all Virus pages across the four languages. The number of Wikipedia pages about the virus differs by language. We could identify ten Wikipedia pages in English, nine in Chinese, eight in Korean, and six in Italian, at the time of analysis.
The most viewed English Wikipedia page was “COVID-19 pandemic” (hereafter “Pandemic”). The versions of this Pandemic page in other languages (which share the same Wikidata ID) also ranked at the top in Chinese and Korean, although their page titles differed. In total, there are 124 different language versions. Measured by byte size, the English version of the Pandemic page contains the most information; the corresponding Korean page, for example, is one-tenth the size of the English version. Likewise, the English version had the largest number of edits and editors compared with the same article in other languages. This difference in content size highlights the possible information gap across languages and the need for rapid translation into many of the languages supported by Wikipedia.
The second most-viewed Wikipedia page in English, “Coronavirus disease 2019,” was 3rd, 2nd, and 2nd in Chinese, Korean, and Italian, respectively. Other Virus pages also showed slight differences in rankings, yet the lists largely overlapped. The third-ranked page in English, “Severe acute respiratory syndrome coronavirus 2,” is about the virus itself that causes COVID-19, and the fourth-ranked is a more general page on “Severe acute respiratory syndrome.” These rankings highlight that Wikipedia users are interested not only in the pandemic as a phenomenon, but also in learning the latest scientific details about the virus, testing, and vaccines that these top-ranked pages offer.
Among the top pages is “Misinformation related to the 2019–20 coronavirus pandemic,” which exists in Chinese and Korean, but not in Italian. In contrast, topics like “COVID-19 in pregnancy” exist in Italian, but not in Chinese or Korean. Such discrepancies may be due to several reasons, including the availability of bilingual editors as well as gaps in people’s attention toward the various aspects and impacts of COVID-19.
Popular people pages by language
Finally, we turn our attention to the People category, which can be compared across languages. Table 1 lists the top 10 individual pages that are most viewed in the People category for English, Chinese, Korean, and Italian. In stark contrast to the Virus pages, the rank of People pages varies dramatically by language: only two individuals (Tom Hanks and Boris Johnson) are included in the top-10 list for all four languages.
The individuals on the top lists have varying occupations, including medical doctors, actors, singers, journalists, and politicians. Many of these individuals have engaged in handling the virus as medical doctors or politicians, and/or have contracted the virus themselves. Pages of individuals who lost their lives to COVID-19 (such as Lucia Bosè and Luis Sepúlveda) were also viewed frequently. The nationality of the individuals was also diverse. For example, two medical doctors in China, Zhong Nanshan and Li Wenliang, were the first- and fourth-ranked people in English Wikipedia pages. This suggests that Wikipedia viewers are interested in the global narrative of COVID-19, beyond the national stories. Except for the medical doctors, people who are ranked in the English list work in the United States or Britain. All the Chinese top 10-ranked people are medical doctors from various parts of the world. No one in the Korean Wikipedia list works in or is from South Korea. The Italian Wikipedia list consists of five people originally from or working in Italy and five people from other parts of the world. All the medical doctors on the list are Chinese.
Each language list has a different distribution of occupations and nationalities. For example, celebrities appear 18 times (12 individuals), politicians 10 times (6 individuals), and medical doctors 6 times (3 individuals).
Looking ahead…
This post showed how Wikipedia serves a diverse audience by comparing view logs from English, Chinese, Korean, and Italian pages. These logs show commonalities as well as differences. During the COVID-19 pandemic, people are seeking high-quality information about the virus, regional updates, relevant people, and other socio-economic issues via Wikipedia. This was common across all languages, and the ranking of top Virus pages was similar across languages. However, some topics were more widely available in one language than in others.
While the English Wikipedia pages are the most comprehensive in terms of describing the virus information, the same Wikipedia pages in Chinese, Korean, and Italian contained far less content (e.g., sometimes an order of magnitude fewer bytes).
This discrepancy may be due to the lack of an available pool of Wikipedia editors on the topic as well as potential differences in topical interests by language. Nonetheless, our analysis finds that for the top-ranked Virus pages, there is a need to make critical information more readily and rapidly available in multiple languages. In fact, a study has confirmed that Wikipedia use cases appear more common in countries with certain socio-economic characteristics (e.g., in-depth reading of articles is substantially more common in countries with a low Human Development Index).
In our next post, we will look in detail at how many editors have contributed to the COVID-19-related Wikipedia pages. This process is dynamic, often involving changes in the title, fact-checks, cross-linking, and (sometimes) edit wars. We will also cover the dynamics involved in the editing process of COVID-19 Wikipedia pages.
Acknowledgements
We thank Chantal de Soto and Samantha Lien from the Wikimedia Foundation and Keunwoo Kim and Eunhee Jung from IBS for their contributions in editing the article.