Covid-19 Outbreak: Tweet Analysis on Face Masks
‘STOP BUYING MASKS!’ One day after the U.S. reported its second Covid-19 death, health officials continued to plead with Americans to stop panic buying. While the Centers for Disease Control and Prevention (C.D.C.) doesn’t recommend that healthy people wear face masks to avoid infection, Chinese authorities encourage people to do so in the face of Covid-19. The contrast stems from the different traditions and habits of westerners and easterners. Given the recent mask price gouging in the U.S., ordinary people’s attitudes towards wearing a mask made me curious. What do they talk about online, and what is the overall emotion? Is there any network effect on their remarks? In this post, I study the recent mask shortage by analyzing tweets.
Objectives
There are two main objectives for this project:
- Understand people’s attitudes towards the mask, an icon of the epidemic
- See whether there is a network effect on people’s tweets and identify the hot topics
In terms of methodology, I will use data mining techniques, sentiment analysis, and network analysis to achieve the above two objectives.
Data Understanding
With a Twitter developer account, I extracted 1,200 tweets containing the keyword ‘mask’ from Twitter. As shown below, the raw data has 16 columns and 1,200 rows. The first column contains the content of each tweet; the other columns hold information such as engagement, time, and user ID.
Data Cleaning
As shown in Pic 1, it’s hard to analyze the original messy corpus. Using the NLP package in R, I removed punctuation, numbers, stopwords, URLs, and extra white space from the corpus. Additionally, as the tweets were retrieved using the keyword ‘mask’, I am more interested in the words that appear together with it than in the keyword itself. Thus, I created a dictionary of unnecessary words such as ‘facemask’ and ‘masks’ and removed them from the text.
It’s also important to keep the complexity of text mining in mind. One word can have multiple meanings. Masks can also relate to topics like skincare, which is unrelated to my study. I will solve this problem in the next step.
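The cleaning steps described in this section can be sketched roughly as follows. This is an illustrative Python sketch, not the R code used for the post: the function name `clean_tweet`, the tiny stopword list, and the keyword dictionary are all assumptions made for the example.

```python
import re

# Tiny illustrative stopword list (an assumption; real stopword lists are much longer)
STOPWORDS = {"a", "an", "the", "to", "is", "are", "of", "and",
             "in", "for", "on", "i", "my", "it"}

# Variants of the search keyword, dropped because every tweet contains one of them
KEYWORD_TERMS = {"mask", "masks", "facemask", "facemasks"}


def clean_tweet(text):
    """Return the lowercase tokens of a tweet after basic cleaning."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    # Drop punctuation, numbers, and any other non-letter symbol.
    # Note: this also strips non-ASCII letters, which is a simplification.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # splitting also collapses repeated white space
    return [t for t in tokens
            if t not in STOPWORDS and t not in KEYWORD_TERMS]


print(clean_tweet("I need 2 masks for my trip!! https://t.co/abc123"))
# prints ['need', 'trip']
```

Removing the keyword variants keeps the later frequency counts and word clouds from being dominated by the search term itself.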
Data Exploration
The data visualization allows us to get some preliminary understanding.
Unsurprisingly, people’s tweets centered on the Covid-19 epidemic. Words like ‘need,’ ‘important,’ and ‘planning’ give us a glimpse of their attitude. They also talk about the imbalance between supply and demand in the current market. Noticeably, there are two odd terms on the right-hand side, ‘Bremner’ and ‘Movie.’ They refer to an upcoming movie called M.A.S.K., written by Chris Bremner. In the word cloud below, I saw more unnecessary words like ‘allnighter.’
The unrelated words are now easy to spot. I went back, deleted the tweets containing them, and proceeded with the remaining 1,127 tweets.
Sentiment Analysis
Every word has emotion behind it. The ‘syuzhet’ package in R helps to capture the emotions people express in text.
The above plot shows that the tweets are mostly associated with anticipation and trust. People do express some negative feelings on Twitter: the bar chart shows that around 30% of people expressed fear online, and about 50% showed negative emotions. But overall, people on Twitter are optimistic about the mask as an icon of the epidemic.
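To illustrate how lexicon-based emotion scoring of this kind works in principle, here is a minimal Python sketch. It is not the ‘syuzhet’ implementation: the toy lexicon, its entries, and the function names are hypothetical, and treating each percentage as the share of tweets containing at least one word tagged with an emotion is an assumption about how such figures can be read.

```python
from collections import Counter

# Toy emotion lexicon: hypothetical entries for illustration only,
# not the lexicon used by the 'syuzhet' package
TOY_LEXICON = {
    "afraid": {"fear", "negative"},
    "panic": {"fear", "negative"},
    "shortage": {"negative"},
    "protect": {"trust", "positive"},
    "hope": {"anticipation", "positive"},
}


def emotions_in(tokens):
    """Return the set of emotions triggered by any word in one tweet."""
    found = set()
    for token in tokens:
        found |= TOY_LEXICON.get(token, set())
    return found


def emotion_shares(token_lists):
    """Share of tweets that contain at least one word for each emotion."""
    if not token_lists:
        return {}
    counts = Counter()
    for tokens in token_lists:
        counts.update(emotions_in(tokens))
    return {emotion: n / len(token_lists) for emotion, n in counts.items()}


tweets = [
    ["panic", "shortage"],
    ["hope", "protect"],
    ["not", "afraid"],   # negated, but still counted as fear
    ["travel", "plan"],  # no lexicon words, so no emotion
]
shares = emotion_shares(tweets)
print(sorted(shares.items()))
```

A word-level lookup like this one scores ‘not afraid’ as fear, because it never looks at the word before ‘afraid’.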
Network Analysis
While sentiment analysis helps to learn individuals’ attitudes, network analysis identifies relationships on social platforms. In tweet analysis, each word is a vertex, and the degree of a vertex shows how many other words it is connected to. For example, we can tell from the output below that ‘coronavirus’ usually appears together with ‘get.’
In the histogram below, the right skew indicates that most terms have small degree values. There are also some extreme values in the right tail, meaning that a few terms are closely connected with many others.
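The idea of terms as vertices and degree as the number of connections can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the R code behind the plots: two terms are linked if they appear in the same tweet, and the function names and sample tweets are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(token_lists):
    """Link every pair of distinct terms that appear in the same tweet."""
    adjacency = defaultdict(set)
    for tokens in token_lists:
        for a, b in combinations(sorted(set(tokens)), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def degrees(adjacency):
    """Degree of a term = number of distinct terms it co-occurs with."""
    return {term: len(neighbors) for term, neighbors in adjacency.items()}


# Made-up cleaned tweets, for illustration only
tweets = [
    ["coronavirus", "get", "need"],
    ["coronavirus", "get"],
    ["coronavirus", "shortage"],
    ["travel"],  # a single-term tweet creates no edges
]
graph = cooccurrence_graph(tweets)
term_degrees = degrees(graph)
print(sorted(term_degrees.items(), key=lambda item: -item[1]))
```

In this toy data ‘coronavirus’ has the highest degree because it co-occurs with every other linked term; a histogram of such degree values is what reveals the skew.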
What are those popular terms? The network graphs below provide a cleaner look. To avoid a cluttered display, I only included terms with a frequency above 30.
The connected terms are those that appear together on Twitter. The word ‘coronavirus’ is at the center of the network graph, related to all the other terms. Then I clustered all the words based on edge betweenness.
Betweenness measures how often a node (or, for edge betweenness, an edge) lies on the geodesic (shortest) paths between other nodes. The three clusters are mainly about the ongoing epidemic, mask importance, and mask usage.
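To make edge betweenness concrete, here is a small Python sketch that computes it for an unweighted graph by running a breadth-first search from every node and accumulating shortest-path counts. The toy graph and the function name are assumptions for illustration; this is not the clustering routine used for the plots. It only shows the quantity that edge-betweenness clustering relies on, where the edges with the highest scores are the ones separating clusters.

```python
from collections import defaultdict, deque


def edge_betweenness(adjacency):
    """Number of shortest paths passing through each undirected edge.

    When several shortest paths join a pair of nodes, each path
    contributes a fractional share.
    """
    score = defaultdict(float)
    for source in adjacency:
        dist = {source: 0}
        paths = defaultdict(float)  # number of shortest paths from source
        paths[source] = 1.0
        preds = defaultdict(list)   # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path shares onto edges
        dependency = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + dependency[w])
                score[frozenset((v, w))] += share
                dependency[v] += share
    # Every pair of nodes was counted once from each endpoint
    return {edge: value / 2.0 for edge, value in score.items()}


# Toy graph: two triangles (a-b-c and d-e-f) joined by the single edge c-d
toy_graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
scores = edge_betweenness(toy_graph)
bridge = max(scores, key=scores.get)
print(sorted(bridge), scores[bridge])
```

The edge c-d scores highest because all nine shortest paths between the two triangles must cross it; removing such high-betweenness edges splits the graph into clusters.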
After seeing the relationship between terms, I moved on to the network impact on tweets.
The above plot shows the distribution of tweets. We can see that many tweets have no connections (the isolated points in the sparse area). As the tweets with high engagement are of more interest, I removed the less connected ones and obtained the more detailed network graph below.
The numbers above represent the IDs of tweets in the raw data. Tweets in the two dense areas are the most frequently liked, reposted, and commented on. I then randomly picked some of the tweets from the circled areas to see what people are saying about masks on Twitter.
As these tweets triggered most discussions online, we now know people’s primary concerns:
- They are not sure who needs to wear a mask to minimize infection risk.
- They are hesitant about their original travel plans.
- They pay attention to the policy changes in Covid-19 prevention.
- They remind people who show symptoms to take action.
- The panic shopping mode for masks is still going on.
Deployment
The critical question is “so what?” Although text mining is a relatively new area of computer science, it has gradually been applied to fields like risk management and customer care. The Covid-19 outbreak has just started in the U.S.; at this point, there are two main applications for the text mining technique.
- Retailers and suppliers can track people’s changing attitudes over time and adjust inventory and production plans accordingly.
- Authorities can learn about people’s concerns and uncertainties, give clear guidance, and enact new policies that benefit people.
Finally, there are two areas for improvement in my study. Firstly, one major problem with the ‘syuzhet’ package is that it does not properly handle negation, which may affect the sentiment analysis. Secondly, a larger sample of tweets would capture a fuller picture of the conversation online.