Sentiment Analysis on COVID-19 tweets in NCR (Part 1)

Cole Torres
Published in DailyDataDosage
5 min read · Jan 21, 2021

There have been pandemics in the past, but the COVID-19 pandemic seems unprecedented, unlike anything the human race has experienced before. The world has been suffering and reeling from the onslaught of COVID-19 for almost a year now. Fortunately, vaccines are already in distribution, although there is not yet an accurate timeline for the vaccines' effectiveness.

It was more than a century ago when the "1918 Spanish Flu" pandemic took millions of lives worldwide, the most severe pandemic in history prior to COVID-19. There was also the SARS outbreak in 2002, but the intensity of the current pandemic is unparalleled in recent history.

It is therefore beneficial to know the sentiments of Filipinos regarding the COVID-19 pandemic. This can potentially serve as a call to action for various government entities on how to deal with public issues amid the pandemic and how to stop the spread of the COVID-19 virus.

Methodology

These are the steps (in chronological order) taken in this supervised sentiment analysis.

Scraping COVID tweets

A location-based scraper is used to get the NCR tweets.

  • The scraper covers the entire Metro Manila area.
  • Only the past 7 days are considered; the scraped tweets cover the whole second week of January 2021.
  • The tweets are scraped with the keyword ‘covid’.

(More details regarding the scraper are found in the notebook. The link for the scraper can be found below.)

In total, there are 7000+ scraped tweets in NCR. However, many of the scraped tweets are duplicates, mainly because of retweets. After keeping only the unique tweets, 182 tweets remain for NCR over the past seven days. The tweets are also ‘Taglish’, meaning they are a mix of Tagalog and English words.
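The post does not name the scraping library, so here is a hypothetical sketch of a location-based scrape using snscrape; the place name, radius, and date range below are assumptions for illustration, not the project's actual parameters.

```python
from itertools import islice

def build_query(keyword: str, place: str, radius_km: int, since: str, until: str) -> str:
    """Build a Twitter advanced-search query string for a location-based scrape."""
    return f'{keyword} near:"{place}" within:{radius_km}km since:{since} until:{until}'

# Hypothetical parameters: keyword 'covid', Metro Manila, 2nd week of January 2021.
query = build_query("covid", "Metro Manila", 30, "2021-01-08", "2021-01-15")

def scrape(query: str, limit: int = 10000) -> list[str]:
    # Requires `pip install snscrape` and network access at run time.
    import snscrape.modules.twitter as sntwitter
    return [t.content for t in islice(sntwitter.TwitterSearchScraper(query).get_items(), limit)]
```

Deduplicating retweets afterwards is then a one-liner on the resulting DataFrame, e.g. `df.drop_duplicates(subset="tweet")`.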

Labeling the scraped tweets

The 182 unique tweets are labelled with the following target values: -1 (negative), 0 (neutral), and 1 (positive).

Data Preprocessing

The data preprocessing is inspired by the method of Rafał Wójcik (https://gist.github.com/rafaljanwojcik/f00dfae9843dadc0220eba3d36694e27)

The steps in the data preprocessing are:

  • rows with missing (NaN) values are dropped,
  • duplicated rows are dropped,
  • all URLs are dropped,
  • English stopwords are dropped,
  • all non-alphanumeric and punctuation characters are removed, and runs of white space are collapsed into a single space,
  • only rows whose sentences contain at least 2 words are retained,
  • emojis and emoticons are converted into words.
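The text-cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's exact code: the stopword list here is a tiny stand-in for a full English stopword list (e.g. NLTK's), NaN/duplicate dropping would be pandas' `df.dropna().drop_duplicates()`, and emoji conversion would use something like the `emoji` package's `demojize`.

```python
import re

# Tiny illustrative stopword set; the real pipeline uses a full English list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # drop punctuation / non-alphanumerics
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return " ".join(w for w in text.split() if w not in STOPWORDS)

cleaned = preprocess("Grabe ang COVID cases sa NCR!! https://example.com :(")
```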

Vectorization

Usually, splitting the data is done before vectorization (i.e. the vectorizer shouldn’t be fitted on the test set of tweets). However, since we have a small dataset, words in the test set are likely to be absent from the train set (which will cause inaccuracies in the model). So in this case, let’s just vectorize the full dataset before splitting (the dataset is split into 80% training data and 20% testing data). TF-IDF is used to score each word in each tweet: it takes into consideration how unique every word is to each tweet, and increases the negative/neutral/positive signal of words that are particular to a given tweet relative to the text corpus.
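A minimal scikit-learn sketch of this step, using a toy three-tweet corpus (the tweets and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus with -1/0/1 sentiment labels (illustrative only).
corpus = ["covid cases rising ncr", "vaccine rollout good news", "quarantine extended again ncr"]
labels = [-1, 1, -1]

# Vectorize the FULL corpus first (accepted here because the dataset is small),
# then split 80/20 as in the post.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
```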

Models

The table shows the accuracy scores of the machine learning models. XGBoost has the highest test accuracy score. XGBoost uses a gradient boosting framework: gradient boosting is an algorithm that combines weaker models to accurately predict a target variable. Specifically, XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). Source: AWS Documentation
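To illustrate the gradient-boosting idea without assuming the project's exact setup, here is a toy sketch using scikit-learn's GradientBoostingClassifier as a stand-in (XGBoost's `XGBClassifier` exposes a near-identical `fit`/`predict` interface); the feature vectors and labels are made up.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Toy TF-IDF-like feature vectors with -1/1 sentiment labels (illustrative only).
X = [[0.9, 0.0], [0.0, 0.8], [0.7, 0.1], [0.1, 0.9]]
y = [-1, 1, -1, 1]

# Each boosting round fits a small tree to the residual errors of the
# ensemble so far, combining weak learners into a strong predictor.
model = GradientBoostingClassifier(n_estimators=10, random_state=42)
model.fit(X, y)
preds = model.predict([[0.8, 0.05]])
```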

Moreover, the models are tuned to their best parameters over the specific hyperparameters that are searched. The Proportional Chance Criterion (PCC) should also be checked when model accuracies are considered, to assess whether those accuracies are remarkable or not. PCC is essentially the sum of the squared proportions of each class, and in classification, model accuracy is always judged relative to the PCC score. For instance, if we have 5 points in our dataset, say [‘A’, ‘B’, ‘A’, ‘C’, ‘B’], then PCC = (A / 5)² + (B / 5)² + (C / 5)², where A, B, and C are the counts of each class. Hence, PCC = (2/5)² + (2/5)² + (1/5)² = 0.36. To interpret the test accuracy / PCC values in the table: with two equally populated classes, PCC = 0.5, so a model with 50% accuracy has test accuracy / PCC = 1, i.e. it does no better than chance. The further the ratio exceeds 1, the better.
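The PCC calculation above is a few lines of Python; this sketch reproduces the worked example from the text:

```python
from collections import Counter

def pcc(labels) -> float:
    """Proportional Chance Criterion: sum of squared class proportions."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

# The 5-point example from the text: (2/5)^2 + (2/5)^2 + (1/5)^2 = 0.36
score = pcc(["A", "B", "A", "C", "B"])
```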

Evaluation

F1 Micro Scores of the Models

The micro F1 score is used to evaluate the accuracy of the models. Micro F1 is an appropriate metric here because there is class imbalance in the dataset: the numbers of positive, negative, and neutral labelled tweets are unequal. Based on the results, XGBoost has the highest micro F1 score at 0.675676. Therefore, the XGBoost model performed the best out of all the models.
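Computing micro F1 with scikit-learn looks like this; the label arrays below are made up for illustration:

```python
from sklearn.metrics import f1_score

# Toy true/predicted labels over the three sentiment classes (-1, 0, 1).
y_true = [-1, -1, 0, 1, -1, 0]
y_pred = [-1, 0, 0, 1, -1, -1]

# Micro-averaging pools true/false positives across all classes;
# for single-label multiclass problems it equals plain accuracy (here 4/6).
micro_f1 = f1_score(y_true, y_pred, average="micro")
```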

Predictions

Based on the pie chart, we can conclude that most of the COVID-19 NCR tweets carry negative sentiment.

Comments

The model scores might still be improved by scraping more tweets, tuning additional model parameters, and trying other vectorization methods such as Word2vec or Doc2vec. There are definitely other ways to improve the model scores, so feel free to comment your suggestions beyond the methods mentioned.

Code can be found on this link: https://github.com/coltranetorres/Sentinellium-Sentiment-Analysis

Huge thanks to my teammates in the Sentinellium DS team, with whom I enthusiastically worked on this project (Harlee Quizzagan, Elijah Medina, Jared de Guzman, and Dion Bautista), and to Noel Victorino for his suggestions on doing the supervised sentiment analysis.

Feel free to connect with me on LinkedIn, would love to connect:) https://www.linkedin.com/in/cole-torres-185000/
