Covid-19 Analysis of Twitter Content

Tao Yao
Social Media: Theories, Ethics, and Analytics
9 min read · Dec 15, 2020


Background and Introduction

2020 has been unlike any other year because the COVID-19 pandemic is spreading worldwide. I use this pandemic as the backdrop for my project analyzing social media data.

I collect tweet data from Twitter, mainly using Python, gathering tweets related to COVID-19 or the pandemic for further analysis. Tweets containing these keywords or hashtags are collected into a dataset.
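The collection step itself is not shown in the original project, but the keyword/hashtag filter it describes might look like the following minimal sketch, assuming the tweets have already been fetched (for example via the Twitter API). The term list and function name here are illustrative, not from the original code.

```python
# Hypothetical keyword filter: keep only tweets that mention a
# COVID-related term or hashtag. Fetching the tweets themselves
# (e.g. through the Twitter API) is assumed to happen elsewhere.
COVID_TERMS = {"covid19", "covid-19", "pandemic", "#covid19", "#pandemic"}

def matches_covid(tweet_text):
    """Return True if the tweet mentions any COVID-related term."""
    words = tweet_text.lower().split()
    return any(w.strip(".,!?") in COVID_TERMS for w in words)

tweets = [
    "Staying home during the pandemic",
    "Lovely weather today",
    "#Covid19 cases are rising",
]
covid_tweets = [t for t in tweets if matches_covid(t)]
```

A real pipeline would apply the same filter to a stream of tweets from the API rather than a hard-coded list.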

I apply content analysis and statistical tools to these data to track the emotional changes of people on Twitter. Then I use machine learning algorithms to look for a correlation between COVID-19 content on social media and depressive emotion. Eventually, I hope to derive some useful suggestions.

Depression is a complex, multi-dimensional, heterogeneous mental disorder. Almost 1 million people take their own lives each year. According to the World Health Organization, 1 in 13 people globally suffers from anxiety. Depression shows emotional symptoms and can also produce physical and cognitive symptoms, such as deficits in attention, executive function, memory, and reaction speed.

Dataset preparation

In this project, I mainly use two different datasets: one for content analysis and one for the COVID-19 depression analysis.

For the first dataset, I downloaded 2,079 tweets using "Covid19" as the keyword. In the data modeling process, we model the tweet text to analyze the dataset. This requires text vectorization, the process of converting text into a numerical representation.
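As a concrete illustration of text vectorization, here is a minimal bag-of-words sketch in pure Python: each tweet becomes a vector of word counts over a shared vocabulary. A real pipeline would more likely use a library tool such as scikit-learn's CountVectorizer or TfidfVectorizer.

```python
from collections import Counter

def vectorize(texts):
    """Turn each text into a vector of word counts over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vectors = vectorize(["covid cases rise", "cases fall"])
# vocab   -> ["cases", "covid", "fall", "rise"]
# vectors -> [[1, 1, 0, 1], [1, 0, 1, 0]]
```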

Twitter download

The results from this small dataset seemed too vague. Thinking and searching more in depth, I wanted to find the correlation between COVID-19 content on social media and the depression phenomenon.

The majority of the tweets are in English and Spanish.
The distribution of tweets, retweets, and replies.
The connection between words and five possible topics in this dataset, found using an LDA model.

After some research, I found this dataset. The Sentiment140 dataset contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and they can be used to detect sentiment. It contains the following six fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: The id of the tweet

date: the date of the tweet

flag: The query. If there is no query, then this value is NO_QUERY.

user: the user that tweeted

text: the text of the tweet

To generate a new dataset, I combined part of Sentiment140 (8,000 positive tweets) with another dataset of depressive tweets (2,314 tweets), for a total of 10,314 tweets.
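The combining step might look like the sketch below, assuming the positive and depressive tweets are already loaded as lists of strings. The variable names and stand-in data are illustrative, not from the original code.

```python
# Stand-in data: in practice these would be loaded from the Sentiment140
# CSV and the depressive-tweet dataset respectively.
positive_tweets = ["feeling great today"] * 8000
depressive_tweets = ["i feel hopeless"] * 2314

# Concatenate the texts and build matching labels:
# 0 = non-depressive, 1 = depressive.
texts = positive_tweets + depressive_tweets
labels = [0] * len(positive_tweets) + [1] * len(depressive_tweets)
```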

https://www.kaggle.com/kazanova/sentiment140

Sentiment140 Dataset data frame

Experiment and Discovery Process

Basic content analysis was not enough to produce useful suggestions, so, following feedback from the instructors, I tried to find the correlation between this content and depression. COVID-19 has kept many people at home or socially distanced for a very long time, which might influence depression. So I try to find the difference between what people post on social media about the pandemic and ordinary content. To achieve this, I run a machine learning comparison between tweets containing COVID-19-related keywords and tweets with ordinary content.

I use a dataset of 10,314 tweets divided into depressive tweets (labeled 1) and non-depressive tweets (labeled 0). This is a fairly large and imbalanced dataset.

First, I create a model to detect depression in tweets using Bayes' theorem, a powerful result from probability theory, to apply sentiment analysis. The model tells whether a given tweet is depressive or not.
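The post does not show the classifier code, so here is a minimal from-scratch sketch of the idea: a Naive Bayes classifier with add-one smoothing, which applies Bayes' theorem to per-class word frequencies. The function names and toy data are illustrative only.

```python
import math
from collections import Counter

def train(texts, labels):
    """Count words per class (0 = non-depressive, 1 = depressive)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """Pick the class with the larger log posterior under Naive Bayes."""
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in (0, 1):
        score = math.log(class_counts[c] / total)      # log prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[c][w] + 1) / denom)  # smoothed likelihood
        if score > best_score:
            best, best_score = c, score
    return best

texts = ["i feel hopeless and sad", "what a lovely sunny day",
         "so sad and tired", "happy lovely weekend"]
labels = [1, 0, 1, 0]
model = train(texts, labels)
```

On this toy data, `predict("sad and hopeless", *model)` returns 1 (depressive) and `predict("lovely day", *model)` returns 0.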

Then I store the text of the tweets in an array called text, and the corresponding labels in a separate array called labels.

Implementation Process

Once the network has been trained, I will use it to test tweets crawled from Twitter. To establish the connection between COVID-19 and depression, I will obtain two different datasets. The first dataset will contain tweets with keywords related to the coronavirus, such as "COVID-19", "quarantine", "pandemic" and "virus". The second dataset will contain random tweets found with neutral keywords such as "and", "I", "the". The second dataset serves as a control for the percentage of depressive tweets in a random sample, which lets us measure the difference in the percentage of depressive tweets between the random sample and the COVID-19-specific tweets.

Before I can start training the neural networks, I need to collect and clean the data. These tweets contain many so-called "stopwords" such as "a", "the", "and", etc. These words are not crucial for classifying a tweet as depressive or non-depressive, so we delete them. We also remove punctuation because it is unnecessary and would only reduce the neural network's performance.
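The cleaning step described above can be sketched as follows; the stopword list here is a tiny illustrative subset (a real run would use a full list such as NLTK's).

```python
import re

# Tiny illustrative stopword list; a full list would be much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "is", "i"}

def clean(tweet):
    """Lowercase, strip punctuation, and drop stopwords."""
    tweet = re.sub(r"[^\w\s]", "", tweet.lower())
    return " ".join(w for w in tweet.split() if w not in STOPWORDS)

cleaned = clean("I am sad, and the days feel long!")
# cleaned -> "am sad days feel long"
```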

How I test this correlation: I compare a non-COVID dataset against a dataset built from COVID-19-related keywords and look for a difference in the share of depressive tweets.

First, tokenization of the data. Neural networks do not understand raw text the way humans do, so to make the text palatable to our network we convert it into sequences of numbers.

Then, to tokenize text in Keras, we import the Tokenizer class. This class builds a dictionary lookup for a set number of unique words in the overall text; using that lookup, Keras lets us create vectors by replacing each word with its index value in the dictionary.
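To make the mechanics concrete without requiring Keras, here is a pure-Python sketch of what the Tokenizer does: build a word-to-index dictionary ordered by frequency, then map each text to a sequence of indices (index 0 is reserved, as in Keras).

```python
from collections import Counter

def fit_tokenizer(texts, num_words=None):
    """Build a frequency-ranked word -> index dictionary (indices start at 1)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in counts.most_common(num_words)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its dictionary index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

word_index = fit_tokenizer(["sad sad day", "happy day"])
sequences = texts_to_sequences(["sad day", "happy sad"], word_index)
# word_index -> {"sad": 1, "day": 2, "happy": 3}
# sequences  -> [[1, 2], [3, 1]]
```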

The shape of data and label tensor

The next step is to shuffle the dataset so that random samples of tweets go into the training, validation, and test sets. Then I split the data into those three sets.
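The shuffle-and-split step might look like this sketch; the 80/10/10 ratio is an assumption, since the post does not state the split sizes.

```python
import random

def shuffle_split(texts, labels, seed=42):
    """Shuffle (text, label) pairs, then split 80/10/10 into train/val/test."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # seeded for reproducibility
    n = len(pairs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

data = [f"tweet {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]
train, val, test = shuffle_split(data, labels)
```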

splitting into train and validate directories

Then, I use a neural network to do the analysis.

Making a neural network
The function of the embeddings layer

The embeddings provide 1,193,514 word vectors in total.

Form the model for prediction

The architecture consists of a word embeddings layer followed by two dense layers.
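To make the data flow of this architecture concrete without depending on Keras, here is a toy forward pass: an embedding lookup averaged over the sequence, then two dense layers ending in a sigmoid that yields a depression probability. The shapes and weights are made up purely for illustration.

```python
import math

# Toy word-index -> embedding-vector table (2-dimensional embeddings).
EMBEDDINGS = {1: [0.1, 0.2], 2: [0.3, 0.1], 3: [0.0, 0.4]}

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forward(sequence, w1, w2):
    """Embedding lookup -> average -> dense (ReLU) -> dense (sigmoid)."""
    vecs = [EMBEDDINGS[i] for i in sequence]
    avg = [sum(col) / len(vecs) for col in zip(*vecs)]
    hidden = [relu(sum(a * w for a, w in zip(avg, row))) for row in w1]
    return sigmoid(sum(h * w for h, w in zip(hidden, w2)))

prob = forward([1, 3], w1=[[1.0, -1.0], [0.5, 0.5]], w2=[2.0, -1.0])
```

In the real model, the embedding table comes from the pretrained word vectors and the dense-layer weights are learned during training.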

Define the model
Two layers of model information for vectors

The model reaches an accuracy of 0.97112 on the test set with a loss of 0.0971, which suggests it is not overfitting and is suitable for predicting whether a tweet is depressive or not.

Results of the depression prediction

I prepared two different datasets, each with 1,000 tweets. The first consists of tweets containing coronavirus-related keywords (such as "COVID-19", "quarantine" and "pandemic"). To obtain a control sample for comparison, I searched for tweets containing neutral keywords; the 1,000 tweets in that sample form the second, control dataset.

Then I feed each dataset to my neural network to predict the percentage of depressive tweets. I ran this many times and took the average of the results.

My model predicts, on average, 33% depressive and 67% non-depressive tweets in the dataset obtained with neutral keywords. The share of depressive tweets among those with COVID-related keywords is much higher: 58% depressive versus 42% non-depressive. That is an increase of 25 percentage points, or roughly a 76% relative increase.

Using SPSS to calculate the p-value shows a significant relationship between COVID-19 and depression in the tweets on Twitter. In other words, the COVID-19 pandemic did increase depressive content among people who share their opinions on this social media platform.
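As a rough cross-check of the SPSS result, a two-proportion z-test on the reported shares (58% vs. 33% depressive, assuming 1,000 tweets per sample) gives the same verdict. This approximates, rather than reproduces, the SPSS calculation.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

z, p_value = two_proportion_z(0.58, 1000, 0.33, 1000)
# z is about 11.2, so the p-value is far below any usual threshold.
```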

As I mentioned before, we must correctly recognize, identify, and cope with common mental disorders and mental behavior problems, especially depression and anxiety. During and after the pandemic, common mental and psychological issues such as anxiety and depression deserve close attention.

I learned several things from this project.

  1. Visualizing data is essential. Social media data are usually large, and people without specialized knowledge may be confused by what seems clear to researchers; visualization helps both groups.
  2. How you understand and use data tells a different story. Different people can use the same dataset to demonstrate totally different perspectives, often drawing on different disciplines.
  3. Social media analysis combines computer science, social science, and data science, and in this way it can help people from a broader perspective.
  4. Because of this interdisciplinary nature, social media data will always raise ethical issues, especially now that deep learning plays the main role in data analysis. Neural networks can predict deeper and more precise information than ever before: who you are, your political preference, your occupation, or even very private information.

Conclusion and Limitations

In this project, I used machine learning to analyze social media data on COVID-19 and tried to find the correlation between the content and depression.

There may be some bias in this project because the dataset is not very large; compared with datasets of hundreds of thousands or even millions of tweets, it certainly cannot represent all circumstances. Also, other algorithms exist beyond the one I designed and used, such as decision trees, Bayesian classifiers, support vector machines, and neural networks; running several different algorithms and comparing them would be more convincing.

Also, because this research uses Twitter data for a public health pandemic analysis, I need to consider the following ethical issues:

a. Privacy, including confidentiality, data loss, and Twitter's privacy policy.

b. Regulation, including data protection and codes of conduct when doing research.

c. Geographical location information: since we can locate the coordinates and altitude of specific tweets, tracking users' locations raises an ethical issue.

d. The power of tweet-data research also deserves attention. When a researcher can access so many users and so much data in a short time, that power and ability itself may raise ethical and regulatory problems.

In the future, I want to add a time-series comparison of depression during the pandemic versus before COVID-19 began spreading.

Note: This project only contains one group member.

References

  1. Data statistics from https://coronavirus.1point3acres.com/
  2. Queenie Wong, Jon Skillings, "Twitter's user growth soars amid coronavirus, but uncertainty remains." https://www.cnet.com/news/twitters-user-growth-soars-amid-coronavirus-but-uncertainty-remains/
  3. Swayambhu Chatterjee, Shuyuan Deng, Jun Liu, Ronghua Shan & Wu Jiao, "Classifying facts and opinions in Twitter messages: a deep learning-based approach," Aug 2018.
  4. David Heckerman, "A Tutorial on Learning with Bayesian Networks," Innovations in Bayesian Networks, pp. 33–82.
  5. Maximum likelihood estimation. https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
  6. Jerzy W. Grzymala-Busse, Witold J. Grzymala-Busse, "Handling Missing Attribute Values," Data Mining and Knowledge Discovery Handbook, pp. 37–57.
