Authors: Sungkyu Shaun Park, Sungwon Han, Jeongwook Kim, Meeyoung Cha (Institute for Basic Science, South Korea & KAIST), Mir Majid Molaie, Jiyoung Han, Wonjae Lee (KAIST), Hoang Dieu Vu (HUST, Vietnam), Karandeep Singh (Institute for Basic Science).
The novel coronavirus pandemic (COVID-19) has affected global health and the economy, and it has become a crucial topic on online platforms. The public discourse logged on these platforms helps us understand the concerns, risks, and strategies of people coping with the virus. Regarding the quality of information, there have been numerous reports on the spread of misinformation (i.e., false by mistake) as well as disinformation (i.e., deliberately false). The abundance of such misleading content can overwhelm individuals trying to determine what information and guidelines to follow. The WHO has even called this phenomenon an infodemic. (Reference: Coronavirus Disease 2019 (COVID-19) Situation Report)
This article focuses on the COVID-19 public discourse in four Asian countries: South Korea, Iran, Vietnam, and India. These countries have suffered from the virus at different times and scales, which allows us to compare the temporal evolution of topics in a systematic manner. This report gives a high-level summary of our findings; the full report has been released on arXiv.
- This research characterizes risk communication patterns by analyzing the public discourse on the novel coronavirus from four Asian countries: South Korea, Iran, Vietnam, and India, which suffered the outbreak to different degrees.
- The temporal analysis shows that the official epidemic phases issued by governments do not match well with the online attention on COVID-19. This finding calls for a need to analyze the public discourse by new measures, such as topical dynamics.
- Here, we propose an automatic method to detect topical phase transitions and compare similarities in major topics across these countries over time. We examine the time lag difference between social media attention and confirmed patient counts. For dynamics, we find an inverse relationship between the tweet count and topical diversity.
We aim to discern what people say in the wild. For instance, if we could identify a particular type of misinformation while it is prevalent in only a handful of countries, we could inform people in other countries before that misinformation becomes a dominant topic and poses a serious public health problem there. In this light, we have set up the following research questions.
- Can official epidemic phases issued by governments reflect the online interaction patterns?
- How can topical phases be divided automatically with a bottom-up approach?
- What are the major topics corresponding to each topical phase?
- What are the unique traits of the topical trends by country, and are there any notable online communicative characteristics that can be shared among those countries?
We crawled the Twitter dataset using the existing Twint Python library. In particular, we focused on South Korea, Iran, Vietnam, and India in this research. These countries are all located in Asia, so we can control for covariates such as major differences between Western and Asian countries. At the same time, each country has unique characteristics in how it has dealt with the current outbreak.
We used two general keywords, “Corona” and “Wuhan pneumonia,” to crawl tweets (the exact keywords used for crawling differ by country) and collected tweets for the three-month period from January to March 2020.
Pipeline for Detecting Topical Phases then Extracting Topics
Our pipeline includes the following four modules to eventually extract and label major topics for certain phases as shown in Figure 1. Please check the pre-print version of the paper to find the details of each module. We have repeated the process for the aforementioned four languages.
Basic Daily Trends
We depict the daily trends by plotting the daily number of tweets and the daily number of confirmed COVID-19 cases together. In addition to these two trends, we mark the official epidemic phases announced by each government as vertical lines (see Figure 2). Viewing the tweet and confirmed case trends together, we could confirm that the tweet trends are somewhat associated with the confirmed case trends. However, the official epidemic phases do not explain the tweet trends well.
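The daily tweet series used in these plots can be built with a simple count per calendar day. The following is a minimal sketch, assuming each crawled tweet has already been reduced to its posting date; the function name `daily_counts` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(tweet_dates, start, end):
    """Count tweets per calendar day, filling days without tweets with 0."""
    counts = Counter(tweet_dates)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return {day: counts.get(day, 0) for day in days}


# Hypothetical toy data: one posting date per crawled tweet.
tweet_dates = [date(2020, 1, 20), date(2020, 1, 20), date(2020, 1, 22)]
series = daily_counts(tweet_dates, date(2020, 1, 20), date(2020, 1, 22))
```

Filling empty days with zero keeps the tweet series aligned day by day with the confirmed case series, so both can be drawn on the same time axis.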
In South Korea, the first confirmed case was identified on January 20, 2020. From early January until January 20, the daily numbers of tweets were relatively small, whereas the number increased sharply on January 25, as depicted in Figure 3. January 25 was the date when the Korean government raised the travel warning level for Wuhan city and Hubei province and advised people to evacuate, and this signal may have affected the communication on Twitter. On February 18, the number of tweets rose to a level not seen before, which may be due to the 31st confirmed case, linked to a cult religious group in Daegu city. After the 31st confirmed case was found, the quarantine authority carried out rigorous testing focused on Daegu, and the number of confirmed cases increased drastically until mid-March. The tweet trends follow the same pattern. However, the official epidemic phases announced by the government, separated by the vertical dashed lines in the figure, seem to lag behind the increase in the number of tweets. We could therefore say that the epidemic phases may not explain the online communication trends in Korea well enough.
In Iran, two people tested positive for SARS-CoV-2 in the city of Qom on February 19. After this date, we see a significant surge in the number of tweets, which reaches a peak within a few days. On February 23, the government changed the alert from white to yellow. Although the number of confirmed cases keeps increasing, the number of tweets starts to decrease gradually with little fluctuation, as shown in Figure 4. The trends of these two numbers therefore show different patterns, in contrast to the Korean tweets. In the meantime, the government gradually increased preventive measures, and a number of cities with the highest rates of infection were declared hot spots or red zones. Overall, the government did not place the whole country under the red alert. However, it announced new guidance and banned all trips on March 25. On March 28, the president said that 20 percent of the country’s annual budget would be allocated to fight the virus, which might implicitly be a sign of the red alert.
On January 23, 2020, Vietnam officially confirmed its first two COVID-19 patients, who came from Wuhan, China. After that, the number of tweets increased sharply and reached a peak in early February, as shown in Figure 5. Although a few new cases were detected, the number of tweets then tended to decrease and remained stable. In the second half of February, there were no new cases; however, the number of tweets increased rapidly and created a new peak, which did not last long. This trend can be explained by two possible reasons. The first is that the pandemic had spread around the world. The second is that the last cases in Vietnam were treated successfully. After a long time with no new cases, Vietnam continually confirmed new cases in Hanoi and many other cities from March 6. The number of tweets in this phase increased again and remained stable at a relatively higher level than in the initial phase.
In India, the first case of COVID-19 was confirmed on January 30, 2020. The number of cases quickly rose to three on account of students returning from the city of Wuhan, China. Throughout February, no new cases were reported, and the first weeks of March also saw a relatively low number of cases. The number of cases, however, picked up from the fourth week of March; notable were the 14 confirmed cases among Italian tourists in the state of Rajasthan. This eventually led the government of India to declare a complete lock-down of the country. The daily number of tweets followed a trend similar to that of the number of cases, as depicted in Figure 6. The first confirmed cases around January 30, 2020, caused a sudden spike in the number of tweets, which subsided in February. The first COVID-19 fatality on March 12 and some other local COVID-19 events led to an exponential increase in the number of tweets. The tweets peaked on March 22, when the government declared a lock-down of areas with infected cases, and started trending downwards after that. Notably, the declaration of the nationwide lock-down by the government on March 24 caused only a small spike in the number of tweets, and the trend continued downwards. However, March 31 saw a large spike in the number of tweets owing to the confirmation of mass infections at a religious gathering. Overall, the tweet trends seem to move in step with the release of official information by the government (e.g., the number of confirmed cases and COVID-19 fatalities).
Extracted Topical Trends
We summed the theme labels acquired from the ‘Label Topics’ module on a daily basis and analyzed the topic changes over time with three types of plots for each target country below. The first plot shows the daily topical trends based on proportions; the second shows the trends based on the number of tweets; the third shows the trends based on the number of tweets in which country names such as the U.S. were explicitly mentioned. Overall, as people talk more about the COVID-19 outbreak (i.e., the daily number of tweets increases), the topics people talk about become less diverse.
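One way to make "less diverse" concrete is to score each day's distribution of theme labels. The sketch below uses Shannon entropy as the diversity score; this is an assumption for illustration, since this article does not state which diversity measure was used, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def topic_entropy(labels):
    """Shannon entropy (in bits) of the theme labels of one day's tweets."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: a quiet day with evenly spread themes,
# and a busy day dominated by a single theme.
quiet_day = ["local news", "global news", "personal", "politics"]
busy_day = ["local news"] * 6 + ["politics"] * 2

quiet_score = topic_entropy(quiet_day)
busy_score = topic_entropy(busy_day)
```

Under this measure, a day on which tweets are spread evenly over many themes scores higher than a day dominated by one theme, so an inverse relationship between tweet count and diversity would show up as lower entropy on high-volume days.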
For South Korea, we derived a total of four topical phases and plotted daily topical proportions as well as daily topical frequencies (see Figure 7-top and -mid). At first, there was no related topic in Phase 0. Then, from Phase 1 to Phase 3, the number of topics varied as 8, 5, and 11. In Phase 1, people talked a lot about personal thoughts and opinions linked to the current outbreak, and they also cheered each other up. In Phase 2, as the crisis approached its peak, people talked less about personal issues and mainly about political and celebrity issues. The political issues concerned shutting down the borders of South Korea towards China and of other countries towards Korea. In Phase 3, as the daily number of tweets became smaller than in Phase 2, people tended to talk about more diverse topics, including local and global news. In particular, people worried about hate crimes against Asians in Western countries. People might have become interested in different subjects as they felt the crisis had passed its peak.
Figure 8-top and -mid illustrate two topical phases, their proportions, and daily topical frequencies in Farsi tweets. Phase 0 includes global news about China as well as unconfirmed local news that reflects the fear of the virus spreading in the country. Political issues form a remarkable portion of tweets in this phase, as the country has been struggling with various internal and external conflicts in recent years and there was also a congressional election in Iran. In Phase 1, a significant increase in the number of tweets takes place, with local news regarding the virus outbreak constituting the majority. An intriguing finding is that informational tweets about preventive measures overshadow global news, which can be explained by the sociology of disaster: when people in a less developed country are at risk, they naturally tend to share more information. However, political tweets are still widespread because of the aforementioned reasons and public dissatisfaction with the government response to the epidemic. This is also highlighted in Figure 8-bottom, where the US is the most mentioned name after Iran and China. One possible explanation is that the outbreak puts another strain on the frail relationship between Iran and the US.
Six topical phases were identified for Vietnam, visualized in Figure 9-top and -mid. Phase 0 relates entirely to global news because Vietnam did not have any cases in this period. From Phase 1 to Phase 5, the topics diverged, but they focused on local news except in Phase 3, the phase in which no new cases were detected in Vietnam. Phase 0 and Phase 3 thus have in common that there were no new cases in Vietnam (i.e., no local news), so tweets tended to talk more about global news. Notably, Phase 3 shows an increase in personal topics that is mostly absent from the other phases. This was because a conflict involving Korean visitors generated a large number of personal tweets.
Next, we show the number of tweets that mentioned countries in Figure 9-bottom. The three most mentioned countries are Vietnam, Korea, and China. Vietnam and China were mentioned frequently across phases because Vietnam is the home country and China is where the pandemic originated. Korea was also mentioned in a large number of tweets, but these were concentrated in Phase 3. This matches the topic changes caused by the Korean visitor event in Vietnam.
We established three topical phases for tweets in Hindi in India (Figure 10-top and -mid). In the starting phase, the tweets focus on sharing information about COVID-19 and global news about COVID-19 in China. People want to share the news about COVID-19 and information on how to stay safe. Thereafter, in Phase 1, the topics become more diverse. Although a large portion of the topics is concerned with information about the virus and global news, especially about China, a major portion is formed by rumors or misinformation. The number of tweets spikes on January 30, 2020, when the first case was confirmed in India. Towards the end of Phase 1, there is a further spike in the number of tweets, primarily due to the first announcements of government measures to contain the virus (such as halting the issuance of new visas to India). Lastly, in Phase 3, a huge spike in the number of tweets is witnessed. The proportion of informational tweets decreases, whereas local news tweets confirming new cases increase. Regrettably, a marked portion of the tweets still consists of hateful content and misinformation. Interestingly enough, although the situation continued to worsen, tweets expressing dissatisfaction with the government are negligible. Phase 3 also witnesses mentions of other countries, especially Brazil and Europe, in addition to China and, understandably, India. This could be attributed to the growing number of cases in Italy, Spain, and Brazil, as well as the news surrounding the use of hydroxychloroquine in Brazil. The U.S. also finds considerable mention for the same reasons.
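Country-mention counts of the kind plotted in the bottom panels can be approximated by matching country names in the tweet text. The following is a minimal sketch, assuming case-insensitive substring matching against a hand-made alias list; the alias list, function name, and toy tweets are hypothetical, and the matching rule used in the paper is not described in this article.

```python
from collections import Counter


def count_country_mentions(tweets, aliases):
    """Count tweets that mention each country at least once.

    Matching is a case-insensitive substring test, so every alias must be lowercase.
    """
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        for country, names in aliases.items():
            if any(name in lowered for name in names):
                counts[country] += 1
    return counts


# Hypothetical alias list and toy tweets.
aliases = {"Vietnam": ["vietnam"], "Korea": ["korea"], "China": ["china"]}
tweets = [
    "New case confirmed in Vietnam",
    "Flights from Korea and China suspended",
    "Stay safe everyone",
]
mentions = count_country_mentions(tweets, aliases)
```

Substring matching is crude: it can miss local-language spellings and can match unrelated words, so a real analysis would need per-language alias lists and some manual checking.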
We analyzed tweets in order to understand what people are actually talking about in relation to the COVID-19 pandemic. In South Korea, the daily numbers of tweets tend to peak in sync with sudden offline events. In Iran and Vietnam, however, the daily numbers of tweets tend not to be well synced with offline events: in Iran, the government strongly controls online and offline media outlets, and in Vietnam, people do not use Twitter much, so the tweet trends may not reflect the actual flow of public opinion. For all countries, we conclude that the epidemic phases or national disaster stages announced by the governments did not match the actual flow of public opinion on social media well. We therefore explored topical phases that follow the flow of public opinion with a bottom-up approach.
After extracting the topical phases (4 in South Korea, 2 in Iran, and 6 in Vietnam), we used LDA to find the optimum number of topics for each topical phase and then labeled the corresponding theme for each derived topic. In general, as people talk more about COVID-19, the topics they refer to tend to concentrate on a small number. This observation becomes clearer if we consider the tweet depth value by phase. Tweet depth is defined as the number of retweets per day divided by the number of tweets per day. It can be regarded as a standardized cascading depth, so a larger value means that a single tweet spreads further. For South Korea and Vietnam, we could verify the observation, as tweet depth also tends to get larger when people communicate more about COVID-19. However, for Iran and India, the number of phases was too small to observe the general characteristics of the topical trends.
Moreover, once the daily tweet volume reaches its highest peak, the subsequent trend tends to go down in every country, as shown in Figures 3–6. In this light, we observe that for some countries, the peak of the daily tweet trend precedes the peak of the daily confirmed cases by up to a few weeks, whereas for other countries, the two peaks are close to each other. In no country did the peak of the daily tweet trend follow that of the daily confirmed cases.
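The tweet depth definition above translates directly into a per-day ratio. Here is a minimal sketch, assuming two aligned lists of daily retweet and tweet counts; the function name, the handling of days with no tweets, and the toy numbers are assumptions made for illustration.

```python
def tweet_depth(retweets_per_day, tweets_per_day):
    """Daily tweet depth: retweets divided by tweets, or None on days with no tweets."""
    return [
        r / t if t else None
        for r, t in zip(retweets_per_day, tweets_per_day)
    ]


# Hypothetical toy data for three consecutive days.
retweets = [30, 0, 200]
tweets = [10, 0, 40]
depth = tweet_depth(retweets, tweets)
```

Averaging these daily values within each topical phase would give one depth value per phase, which is the level at which the comparison above is made.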
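The lead of the tweet peak over the confirmed case peak can be read off two daily series. The sketch below assumes both series are dictionaries keyed by date; the function names and the toy numbers are hypothetical, and ties are resolved towards the earliest date as an arbitrary choice.

```python
from datetime import date, timedelta


def peak_day(series):
    """Return the date with the largest value (the earliest one on ties)."""
    return max(sorted(series), key=lambda day: series[day])


def peak_lag_days(tweets, cases):
    """Days from the tweet peak to the case peak (positive: tweets peak first)."""
    return (peak_day(cases) - peak_day(tweets)).days


# Hypothetical toy data over five consecutive days.
days = [date(2020, 2, 1) + timedelta(days=i) for i in range(5)]
tweets = dict(zip(days, [5, 40, 25, 10, 8]))
cases = dict(zip(days, [0, 1, 3, 9, 4]))
lag = peak_lag_days(tweets, cases)
```

With this sign convention, the observation above corresponds to a lag that is zero or positive for every country.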
There are several limitations to be considered. First, we analyzed tweets solely from the four countries, and therefore, we need to be cautious about addressing explanations and insights that can be applied in general. We plan to extend the current study by including more countries. Second, there could be other ways to decide the topical phases. However, our approach can be considered as computing and using unique communication traits (i.e., velocity and acceleration by country) that would be relatively constant by issue, which is the COVID-19 outbreak issue in our case.
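The "velocity and acceleration" traits mentioned above can be read as the first and second differences of the daily tweet counts. The following is a minimal sketch under that reading; the thresholding rule, the function names, and the toy numbers are assumptions for illustration, and the actual phase detection rule is given in the pre-print rather than in this article.

```python
def velocity(counts):
    """First difference of the daily tweet counts."""
    return [b - a for a, b in zip(counts, counts[1:])]


def acceleration(counts):
    """Second difference of the daily tweet counts."""
    return velocity(velocity(counts))


def candidate_transitions(counts, v_min, a_min):
    """Indices of days where both velocity and acceleration reach the thresholds."""
    v = velocity(counts)
    a = acceleration(counts)
    # v[i + 1] is the change in count into day i + 2;
    # a[i] is the change in velocity into day i + 2.
    return [i + 2 for i in range(len(a)) if v[i + 1] >= v_min and a[i] >= a_min]


# Hypothetical toy series of daily tweet counts and hypothetical thresholds.
counts = [10, 12, 11, 50, 120, 125]
transitions = candidate_transitions(counts, v_min=30, a_min=30)
```

In this sketch the thresholds would have to be set per country, which is one way to interpret the idea that communication traits are specific to a country but relatively constant for a given issue.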
Despite these limitations, the current research has an important implication for the fight against the infodemic. We find several topics that were uniquely manifested in the recent pandemic crisis by country. For instance, we could discover the emergence of misinformation in Hindi tweets. Our findings shed light on public concerns and misconceptions during the crisis and can therefore be helpful in determining which misinformation should be discredited. This attempt may eventually help defeat the disease.