Global Covid-19 “Anti-Vaxxers” Detection

Nofar Herman · Published in The Startup · 10 min read · Feb 25, 2021

Noa Ehrenhalt, Neta Geva, Ron Levy, Yaniv Weiss, and Nofar Herman

The authors mentioned above are all equal contributors to this article and to the described project.

credit: NBC universal / Anuj Shrestha

The coronavirus is affecting 219 countries and territories, significantly impacting the health care systems and economies of almost every country in the world. Hospitals are overflowing, unemployment rates are increasing, and most countries are currently in recession. As of late February 2021, Covid-19 has caused 2.5 million deaths and potential long-term health impacts for the 113 million individuals who have had the disease. A number of vaccines appear to be effective at preventing death and serious illness. However, in order for the vaccines to be effective in controlling this world pandemic, populations need to reach herd immunity, thereby reducing the likelihood of serious illness for individuals who lack immunity or are at high risk. In short, the pharmaceutical companies and global science community have offered us the prospect of an exit from the pandemic, but it requires receptivity to taking the vaccine. Opposition to vaccines in general ("Anti-Vaxxers") or to the Covid-19 vaccines in particular threatens efforts to bring an end to this pandemic. The goal of this project is to identify clusters of anti-vaxxers so governments can proactively raise awareness, educate, and intervene to encourage take-up of the available vaccines.

We took the following steps to create and select a model to identify clusters of anti-vaxxers in a country:

  1. Identify potential data sources and select dataset
  2. Exploratory data analysis (EDA)
  3. Build predictive model set
  4. Select best model

If you would like to follow this article step-by-step for yourself, you can get all the code and notebooks from our GitHub repository.

1. Identify potential data sources and select dataset

The goal of this step was to find or create a dataset in which individuals from around the world post their personal opinions about the Covid vaccine. Twitter, the global social networking platform where users post their opinions via "tweets", is a good fit for the task at hand.

The data was obtained by scraping "tweets" with Twitter's API. The "tweets" that were selected all contain hashtags related to the Covid vaccine. The hashtags were then grouped into two categories: anti-vaxxers and non-anti-vaxxers. A couple of examples of anti-vaxxer hashtags are #CovidHoax and #IWillNOTComply. On the other hand, some hashtags express less negative opinions of the Covid vaccine, for example #IGotTheShot and #VaccinesSaveLives.
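The hashtag-based labeling described above can be sketched roughly as follows. The hashtag lists here are illustrative (only the four examples from the text, not the project's full set), and `label_tweet` is a hypothetical helper, not the project's actual code:

```python
# Illustrative hashtag groups -- the project used a larger, pre-selected set.
ANTI_VAX_TAGS = {"#covidhoax", "#iwillnotcomply"}
NON_ANTI_VAX_TAGS = {"#igottheshot", "#vaccinessavelives"}

def label_tweet(text):
    """Return 1 for anti-vax hashtags, 0 for non-anti-vax, None if neither."""
    tags = {token.lower() for token in text.split() if token.startswith("#")}
    if tags & ANTI_VAX_TAGS:
        return 1
    if tags & NON_ANTI_VAX_TAGS:
        return 0
    return None

print(label_tweet("Never! #IWillNOTComply"))   # -> 1
print(label_tweet("Got my jab #IGotTheShot"))  # -> 0
```

A tweet matching neither group would simply be left out of the labeled dataset.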

Pros and cons of using Twitter as a data source:

Pros:

  • Twitter is a globally used social network.
  • Twitter’s API is easy to use for scraping data.
  • Users “tweet” personal opinions.

Cons:

  • Twitter is more commonly used in specific countries.
  • The dataset created is limited to the hashtags that were pre-selected.

2. Exploratory data analysis

The dataset consists of "tweets" from January 15, 2021 through February 18, 2021, and contains several data types: dates, strings, booleans, and integers. Features that contained mostly NaN values were dropped.

The graph below illustrates that the distribution of "tweets" per day differs by the day of the week.
On Sundays, there are more anti-vax tweets than non-anti-vax tweets, while on the remaining days non-anti-vax tweets are more common.

Twitter's user location feature is a free-text field that is not mandatory and often contains 'junk' data (not related to countries), making it challenging to infer the user's location from the data. The following Python packages were used to overcome this issue: GeoPy and country-converter.
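To illustrate the idea without network calls, here is a simplified, stdlib-only stand-in for that step: mapping free-text locations to country names via a small lookup table. The real pipeline geocodes the string with GeoPy and standardizes the country name with country-converter; the table entries and `to_country` helper below are purely hypothetical:

```python
# Illustrative entries only -- the real step uses GeoPy + country-converter.
LOCATION_TO_COUNTRY = {
    "nyc": "United States",
    "new york, ny": "United States",
    "london, uk": "United Kingdom",
    "tel aviv": "Israel",
}

def to_country(raw_location):
    """Return a country name, or None for empty or 'junk' locations."""
    if not raw_location:
        return None
    return LOCATION_TO_COUNTRY.get(raw_location.strip().lower())

print(to_country("NYC"))          # -> United States
print(to_country("the moon"))     # -> None ('junk' entry)
```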

**For more information on how to convert this data please see “Additional Credits” at the end of the article.**

The graph below illustrates the percentage of anti-vaxxers in each country: yellow indicates that the majority of a country's tweets are anti-vax, and blue indicates a non-anti-vax majority.

3. Build predictive models

Logistic Regression Classifier

We decided to use a simple logistic regression as our baseline model, which served as the starting point for all the other models we tested.

Preprocessing and features:

  • Numeric features were normalized with sklearn's StandardScaler. The numeric features analyzed by this model were user followers, favorites, and retweets.
  • Three additional features were created from the raw "tweet" text, giving a sentiment score for the text. This model used NLTK's SentimentIntensityAnalyzer module to analyze the text and detect polarity (positive, negative, or neutral). The sentiment analysis produces a score between zero and one for each of the following categories: Positive, Neutral, and Negative.

Results:

Logistic Regression Model Classification Report

The success of the model is measured by its ability to identify most anti-vaxxer tweets. Our business question involves finding all possible anti-Covid-vaccination suspects, so the recall score carries more weight in choosing the model. The recall of this model is 0.42, meaning the model was only able to identify 42% of the anti-vax tweets.
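To make the recall metric concrete, here is a toy example (hypothetical labels, not the project's data) of how it is computed with sklearn:

```python
from sklearn.metrics import recall_score

# Toy example: 10 true labels (1 = anti-vax) and one model's predictions.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Recall = true positives / all actual positives. Here only 2 of the
# 5 anti-vax tweets were caught, so recall is 0.4.
print(recall_score(y_true, y_pred))  # -> 0.4
```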

We started with a logistic regression model to understand our baseline metrics and what needed improvement. Based on the results, it appears that the data is not linearly separable; therefore, a more complex model is needed to achieve a better recall score.

Random Forest Classifier

The uniqueness of this model is that it does not analyze the words in the tweet; rather, it focuses on the characteristics of the tweet to make a prediction.

Preprocessing and features:

  • Tweet length: the number of characters in the tweet.
  • Special characters: the number of special characters in the tweet, for example exclamation marks.
  • Day of week: a categorical feature with 7 values, one for each day of the week.
  • Year user created: the year the user joined Twitter.
  • Special characters in description: the number of special characters in the user's description, for example exclamation marks.

The model ran with its default parameters.
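A minimal sketch of this feature extraction plus a default-parameter Random Forest follows. The `tweet_features` helper and the two example tweets are illustrative, not the project's code, and the special-character regex is one plausible definition among several:

```python
import re
from datetime import datetime
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tweet_features(text, created_at, user_year, description):
    """Build the hand-crafted feature vector described above."""
    specials = re.compile(r"[^\w\s]")  # anything not alphanumeric/whitespace
    return [
        len(text),                           # tweet length
        len(specials.findall(text)),         # special characters in tweet
        created_at.weekday(),                # day of week (0-6)
        user_year,                           # year the user joined Twitter
        len(specials.findall(description)),  # special chars in description
    ]

X = np.array([
    tweet_features("No way!!! #hoax", datetime(2021, 2, 14), 2020, "free thinker!!"),
    tweet_features("Got my shot today", datetime(2021, 2, 15), 2012, "nurse"),
])
y = np.array([1, 0])

clf = RandomForestClassifier()  # default parameters, as in the article
clf.fit(X, y)
print(clf.predict(X))
```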

Results:

Random Forest Model Classification Report

The accuracy of the model improved from the baseline's 64% to 90%. In addition, the recall improved from 42% to 78%, meaning that the bagging model was able to predict 78% of the anti-vaxxer tweets.

Still, 78% recall seems to be quite low, so we might get a better outcome using a more complex ensemble model.

XGboost Classifier

The Random Forest Classification model is a type of bagging ensemble model which yielded better results than the Logistic Regression model. However, we believe we can achieve better results with other ensemble models. Therefore, we tried XGboost, which is a boosting model that improves the output based on prior weak learning models.

Preprocessing and features:

  • Reformatted the text by removing all unnecessary characters, tokenizing the text, and lemmatizing each token. Additionally, the 100 most common words across all tweets were selected as one-hot-encoded features for the model.
  • Additional features that the model took into account: number of words in the tweet, number of user followers, number of favorites, number of retweets, and whether the tweet is a retweet (a boolean feature).

Results:

XGboost Model Classification Report

The boosting model achieved better results than the previous ensemble model, reaching an 81% recall. The precision stayed the same, but the accuracy increased by 2%.

These results suggest that ensemble models can perform well on this type of data, given the right features to train and predict on. That being the case, we decided to try a third kind of ensemble model and see how it performs.

Stacking Classifier

The third and final kind of ensemble model used was a stacking classifier model. The first step of the model is to create numerous models that are trained in parallel. The second step is to combine the output of these models into a meta-model. The meta-model is trained based on the predictions of the first steps.

Preprocessing and features:

  • Followed the same preprocessing steps as the XGboost and Random Forest classifiers.
  • Additionally took into account the user location along with the tweet's sentiment. The sentiment analysis was calculated by running transformers on the tokenized text data.

Model layout:

  1. Random Forest Classifier: 1st phase model.
  2. Support Vector classifier (linear kernel): 1st phase model.
  3. Logistic Regression: 2nd and final phase model.
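The two-phase layout above maps directly onto sklearn's `StackingClassifier`. This sketch uses synthetic data (`make_classification`) rather than the tweet features, so only the model structure matches the article:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the tweet feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# First-phase models are trained in parallel; their outputs feed the
# second-phase meta-model (logistic regression).
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svc", SVC(kernel="linear", random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))
```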

Results:

Stacking Model Classification Report

Here we obtained the best scores all around: precision and accuracy increased to 95%, while recall increased by 11% to a total of 92%, meaning the model is able to identify 92% of the anti-vaxxers in the dataset.

Neural networks often perform well on textual data, so we wanted to test a deep learning model on the task at hand.

Neural Network

The model is a neural network that applies a transfer-learning layer. We used a pre-trained text embedding model that was trained on an English Google News corpus of 7 billion words.

The advantages of this pre-trained layer are:

  • No need to worry about text preprocessing; the model does it for you.
  • No need to create and train a neural network from scratch.

Preprocessing and features:

  • The only feature used in this model is the text of the tweets themselves, making it a purely NLP prediction algorithm.

Results:

DL (NN) Model Classification Report

This model was less successful than the Stacking Classifier. It obtained 91% precision, 94% accuracy, and, most importantly, 87% recall. The recall score is the most important metric for correctly identifying all anti-vaxxers; therefore, the optimal model needs a high recall score.

The primary assumption, that a more complex model could attain a better recall score, did not hold for this neural network. For this reason we decided not to continue with more complex models and to stay with the ensemble models above.

Another reason we did not attempt to improve the neural network model was the challenge of using it in deployment, and our project is deployment oriented.

4. Model Selection

In order to compare the models and visualize the differences in their performance, we plotted their ROC curves. The AUC score is the area under the ROC curve, a metric that measures the model's separability, i.e., how effective the model is at distinguishing between the different classes. A higher AUC score means a better classifier; a perfect classifier's AUC score is 1. The largest area under the ROC curve (aka AUC) is obtained by the stacking model (0.988).
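As a small illustration of why AUC separates the models, here is a toy comparison of an informative classifier against a random one, on made-up labels and scores (not the project's predictions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)

# Hypothetical predicted probabilities from two models: one whose scores
# track the true label (plus noise), and one that guesses at random.
good_scores = np.clip(y_true + rng.normal(0, 0.4, 500), 0, 1)
random_scores = rng.random(500)

# A better classifier separates the classes -> higher area under the ROC curve.
print("informative model AUC:", roc_auc_score(y_true, good_scores))
print("random model AUC:     ", roc_auc_score(y_true, random_scores))
```

A random classifier hovers around an AUC of 0.5, which is why the baseline's 0.701 sits well below the ensemble models' scores.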

From this we can gather that the stacking model is the best at distinguishing between "anti-vaxxers" and "non-anti-vaxxers", with a 0.988 AUC score. It is followed by the XGboost, Deep Learning, and Random Forest models (0.976, 0.971, and 0.948 respectively), and finally the baseline model (Logistic Regression, 0.701).

The results are surprising: given the right features, the ensemble models perform just as well as, and in some cases even outperform, the deep-learning-based model on the vaccination-tweets data. Not using a neural-network-based model also decreases the complexity and prediction time per sample (and makes deployment easier).

In conclusion, the stacking model was chosen for deployment.

Let’s Give It A Try…

For the next step, we should look for specific tweets that could be interesting to test our model with. For example, this tweet:

Anti-Vax tweet example

After scraping that user's latest tweets about the Covid vaccine, we exported them to a CSV file and loaded them as a pandas DataFrame:

Now we can test our model on that example. The results are as follows:

By the looks of it, the model predicts that those tweets should be categorized as 1, i.e., as anti-Covid-vaccination tweets. That prediction fits the general tone of the tweet shown above.

If you would like to try the model for yourself, you can enter our website here, and don’t forget to check our global “Anti-Covid-Vaccination” tweets graph.

Additional Credits:

  • Country preprocessing medium article:
Twitter's user location feature is a free-text field that is not mandatory and contains 'junk' data (not related to countries), making it challenging to infer the user's location from the data. Our business problem depends heavily on the ability to identify clusters of anti-vaxxers in a specific region or country.
    The article above addresses that issue and was found to be very helpful in our case.
  • Building the NN text classifier:
    The Neural Network model was built as a sequential model, that was based on the model that was described in the link above.
