Sentiment Analysis on COVID-19 tweets in NCR (Part 2)

Jared Matthew P. de Guzman
Published in DailyDataDosage
6 min read · Mar 13, 2021

Written by Cole Torres and Jared Matthew P. de Guzman

NOTE: Some of the terms used in Part 2 are already explained in Part 1. Feel free to go back to Part 1 should any terms be unclear. :)

Part 1: https://medium.com/dailydatadosage/sentiment-analysis-on-covid-19-tweets-in-ncr-part-1-2c49bc8838cf

Evaluating the F1 scores we measured in Part 1 of this series (obtained using the TF-IDF vectorization method), we wanted to see if we could increase the F1 scores of the models used to predict the sentiment of the tweets. This second part of the series explores a different kind of vectorization, Word2vec, in the hope of improving the model evaluation scores.

Word2vec is a natural language processing technique. In layman’s terms, Word2vec is used to predict the words surrounding a particular word. For instance, take the sentence “I had fun watching the movie. I did enjoy it. It was joyful, but the last part of the movie was sad.” Let’s say the word “enjoy” is used to predict its surrounding words. Most likely, the words “fun” and “joyful” would appear, since their meanings are similar to “enjoy”; whereas the word “sad” would be less likely to appear. Word2vec finds associations between words with similar meanings.

To know more about Word2vec, feel free to read this article by Chris McCormick.

Methodology

The steps for using Word2vec embeddings as features in supervised learning are inspired by the method of Vlad Kisin.

Scraping more COVID-related tweets

We decided to scrape more COVID-related tweets to add to the existing data. We were able to scrape 129 more tweets, which were then labelled (-1: negative, 0: neutral, and 1: positive). In total, there are 311 unique tweets. We scraped more data to better understand how the model scores would change given more data.

Data Preprocessing

The data preprocessing steps used in Part 1 are the same for Part 2, except that tokenization is added in Part 2. Sentences are converted into tokens before being transformed into vectors; this process is called tokenization. During tokenization, raw text is broken down into words or sentences, called tokens. By analyzing the sequence of words, tokenization helps in understanding the meaning of the text.

  • rows with missing (NaN) values are dropped,
  • duplicated rows are dropped,
  • all URLs are dropped,
  • English stopwords are dropped,
  • all non-alphanumeric characters and punctuation signs are removed, and runs of duplicated white space are replaced with a single white space,
  • only rows whose sentences contain at least 2 words are retained,
  • emojis and emoticons are converted into words.
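The steps above can be sketched as follows (a minimal sketch with pandas; the column name `tweet`, the tiny stopword set, and the sample rows are assumptions, and the emoji-to-word conversion step is omitted for brevity):

```python
import re

import pandas as pd

# Stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop non-alphanumeric signs
    text = re.sub(r"\s+", " ", text).strip()     # collapse duplicated white space
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

df = pd.DataFrame({"tweet": ["Stay safe! https://t.co/abc",
                             "Stay safe! https://t.co/abc",  # duplicate row
                             None,                           # missing value
                             "ok"]})                         # only 1 word
df = df.dropna().drop_duplicates()               # drop NaN and duplicated rows
df["tweet"] = df["tweet"].apply(clean_text)
df = df[df["tweet"].str.split().str.len() >= 2]  # keep sentences with at least 2 words
print(df["tweet"].tolist())  # ['stay safe']
```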

Tokenized sentences are then constructed with the help of the nltk package. Empty token lists are removed to achieve a more accurate model.

A tokenized sentence
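A tokenization step like this can be sketched as follows (the article uses the nltk package, whose `word_tokenize` additionally requires downloading the `punkt` data, so a simple regex tokenizer stands in here for illustration):

```python
import re

# Simple stand-in tokenizer; nltk.word_tokenize handles punctuation more carefully.
def tokenize(sentence: str):
    return re.findall(r"[a-z0-9]+", sentence.lower())

sentences = ["Stay safe everyone", ""]
tokens = [tokenize(s) for s in sentences]
tokens = [t for t in tokens if t]  # remove empty lists
print(tokens)  # [['stay', 'safe', 'everyone']]
```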

The data is then split into 80% training data and 20% testing data.
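The split can be done with scikit-learn's `train_test_split` (the tiny token lists and labels below are placeholders for the real labelled tweets):

```python
from sklearn.model_selection import train_test_split

# Placeholder tokenized tweets with -1/1 sentiment labels.
tweets = [["stay", "safe"], ["very", "sad", "news"], ["good", "vibes"],
          ["lockdown", "again"], ["we", "will", "recover"]]
labels = [1, -1, 1, -1, 1]

# 80% training data / 20% testing data, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 4 1
```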

After splitting the data, a vocabulary is constructed with all of the tokens extracted from the training data.

Word2vec implementation

The Word2vec model is then initialized and trained with the following values for some of its key parameters.

  • size: the dimensionality of the word vectors (large values take longer to compute)
  • min_count: the minimum frequency count of words
  • window: how many of the closest words are used as context
  • workers: the number of threads

In order to transform sentences into feature vectors, the vectors of the words in each sentence are averaged. Simply summing the word vectors would not give comparable features, since sentences have different word counts; averaging normalizes for sentence length.

Furthermore, Word2vec maps words to a vector space, and such mappings are also called embeddings. Word embeddings created by Word2vec can be visualized, in our case via T-SNE. T-Distributed Stochastic Neighbor Embedding (T-SNE) is mainly used to visualize high-dimensional data in a low-dimensional space. T-SNE computes similarity measures between pairs of instances. Essentially, T-SNE allows us to understand how the high-dimensional data is arranged by embedding it in a low-dimensional space. To know more about T-SNE, feel free to read this article by Andre Violante.

A dimension of 2 is used for the embedded space of the T-SNE.
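With scikit-learn, the 2-dimensional projection can be sketched as follows (random vectors stand in for the real word embeddings; note that `perplexity` must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random vectors standing in for 30-dimensional word embeddings.
rng = np.random.default_rng(0)
words = ["stay", "safe", "home", "sad", "news", "good", "vibes", "lockdown"]
embeddings = rng.normal(size=(len(words), 30))

tsne = TSNE(n_components=2, perplexity=3, random_state=0)
coords = tsne.fit_transform(embeddings)  # one (x, y) point per word
print(coords.shape)  # (8, 2)
```

Each row of `coords` can then be scattered and annotated with its word to produce a plot like the one below.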

T-SNE plot of word embeddings. The closer the words, the more associated the words are.

Next, the individual dimensions of the sentence vectors are extracted from the list of sentence vectors to form the feature matrix.

The steps are repeated for the test data.

The results can then be evaluated.

After evaluation, cross-validation is conducted with a vocabulary built from all the texts.
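Cross-validation of this kind can be sketched with scikit-learn's `cross_val_score` (random features and alternating labels stand in for the real averaged-Word2vec features and sentiment labels; the classifier here is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in features and labels (real ones come from the averaged word vectors).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
y = np.tile([0, 1], 20)  # balanced binary labels

# 5-fold cross-validation, scored with the F1 micro metric used in this series.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="f1_micro")
print(len(scores), round(float(scores.mean()), 2))
```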

A CV mean score of 0.72 is achieved. We expected a score lower than 0.72 since the tweets are in Filipino, but it turns out our Filipino Word2vec implementation works better than we thought it would.

Models and Evaluation

F1 Micro Scores of the Models

Based on the results, Logistic Regression (L2) has the highest F1 micro score at 0.837209. The L2 penalty is also known as Ridge or L2 regularization. By penalizing large weights in the cost function, regularization helps solve overfitting in a machine learning algorithm. To know more about L2 regularization, feel free to read this article by Tulrose Deori.
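In scikit-learn, `l2` is the default penalty for `LogisticRegression`; a minimal sketch with placeholder data looks like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder training data; C controls the strength of the L2 penalty
# (smaller C means a stronger penalty on large coefficients).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 20))
y_train = np.tile([0, 1], 25)

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:3]).shape)  # (3,)
```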

Predictions

We counted the number of positive samples in our dataset to be 13, while the negative samples amount to 30.

Our L2 Logistic Regression Model predicted 7 tweets to be positive and 36 tweets to be negative.
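The counts above can be reproduced with a sketch like the following (the label arrays are constructed directly from the reported counts; the real arrays come from the test set and the model's predictions):

```python
import numpy as np

# 13 positive / 30 negative true labels; 7 positive / 36 negative predictions.
y_test = np.array([1] * 13 + [-1] * 30)
y_pred = np.array([1] * 7 + [-1] * 36)

print(int((y_test == 1).sum()), int((y_test == -1).sum()))  # 13 30
print(int((y_pred == 1).sum()), int((y_pred == -1).sum()))  # 7 36
```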

The bar graph outlines these statistics.

Comments

Overall, the F1 Score of the best performing model (Logistic Regression L2) outperformed the best model (XGBoost) found in Part 1.

Although the Word2vec word embeddings increased the scores, other factors may have come into play, such as having more labelled data to work with. Lastly, we removed neutral tweets from model training, as they significantly decreased the F1 scores of each model.

Based on the proportions of positive and negative tweets from our dataset, we can still infer that NCR has an overall negative sentiment towards COVID-19 for January 2021.

We’ll be looking into this in our next article, where we will use time-series analysis to analyze specific trends in the sentiments of the tweets. Stay tuned for this!

Code can be found on this link: https://github.com/coltranetorres/Sentinellium-Sentiment-Analysis

Thanks for reading this article. And don’t forget to follow DailyDataDosage!
