Predicting the Political Alignment of Twitter Users

An exploratory analysis of traditional machine learning methods and a transfer learning method in Natural Language Processing, “OpenAI GPT”, for a binary classification task on custom political Twitter data

Rohit Arora
The Startup
8 min read · May 19, 2020


“NLP is an attitude and a methodology, not the trail of techniques it leaves behind.” — Richard Bandler

It has been a hugely exciting time in the field of Natural Language Processing (NLP), in particular, for transfer learning — a technique where instead of training a model from scratch, we use models pre-trained on a large dataset and then fine-tune them for specific natural language tasks.

In recent times, methods such as OpenAI GPT and GPT-2 and Google AI’s BERT have revolutionized transfer learning in NLP by using language modeling during pre-training, which has significantly improved on the state of the art for a variety of natural language understanding tasks. One interesting aspect of these methods is that their language models are pre-trained on massive, well-formed, curated datasets made up of full sentences with clear syntax (such as Wikipedia articles and the 1 billion word benchmark). The natural question that arises is: how well can such pre-trained language models generalize to natural language tasks from a different distribution, such as Tweets?

The goal of this post

In this post, we’ll mainly discuss how we prepared our own political Twitter dataset and tested its reliability, how various machine learning algorithms performed on it, and finally how transfer learning with a pre-trained language model fared on the binary classification task of identifying users’ political alignment with respect to the BJP (Bharatiya Janata Party).

The main aim of this post is to help answer the following questions:

  • How do we build a custom Twitter dataset that is fair and not biased towards particular hashtags and keywords, and how can we test that the dataset is general and reliable?
  • How do the classification results vary based on the use of traditional machine learning models and transfer learning-based language models?
  • Does our fine-tuned language model (and classifier) generalize to the unstructured and messy language syntax of Tweets?

Dataset Preparation

For data curation, Tweets were scraped over a period of 12 months, i.e. Feb ’19 to Feb ’20, using GetOldTweets3, a Python 3 utility for retrieving old tweets. We did not use the Tweepy client for the standard Twitter API, as it only provides access to the past 7 days of data. Since the focus of this work was political tweets pertaining to the BJP, we defined two lists, “pro-party” and “anti-party”, containing the most trending BJP-related hashtags during that year. A tweet was labeled as political if it contained any hashtag from either list. The “pro-party” list consists of hashtags in support of the party, and the “anti-party” list consists of hashtags in opposition to it.
We first scraped the data for a certain period using these hashtags and identified the users who actively post political tweets by their tweet frequency. We then retrieved the tweets of these top users for a period of 1 year.
In addition, we manually screened pro/anti users based on their profiles and extracted their political tweets.
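The hashtag matching and the ranking of users by tweet frequency can be sketched in a few lines. This is a minimal illustration and not our actual pipeline: the hashtags are hypothetical placeholders, the `(user, text)` tuple layout is assumed, and treating a tweet that matches both lists as ambiguous is a choice made only for this example.

```python
from collections import Counter

# Hypothetical placeholder hashtags; the real lists are not reproduced here.
PRO_PARTY = {"#examplepro1", "#examplepro2"}
ANTI_PARTY = {"#exampleanti1", "#exampleanti2"}


def hashtag_side(text):
    """Return 'pro', 'anti', or None depending on which list's hashtags appear."""
    tags = {tok.lower().rstrip(".,!?") for tok in text.split() if tok.startswith("#")}
    has_pro = bool(tags & PRO_PARTY)
    has_anti = bool(tags & ANTI_PARTY)
    if has_pro == has_anti:  # neither list matched, or both did (assumed ambiguous)
        return None
    return "pro" if has_pro else "anti"


def top_users(tweets, side, n):
    """Rank users by how many of their tweets match the given side's hashtags."""
    counts = Counter(user for user, text in tweets if hashtag_side(text) == side)
    return [user for user, _ in counts.most_common(n)]


tweets = [
    ("alice", "Great rally today #ExamplePro1"),
    ("alice", "Proud moment #examplepro2!"),
    ("bob", "Not convinced #exampleanti1"),
    ("carol", "Lunch was nice"),
    ("dave", "One more #examplepro1"),
]
print(top_users(tweets, "pro", 1))
```

Once the most active users on each side are known, their full timelines for the year can be fetched in a second pass.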

Some of the Pro-Party and Anti-Party hashtags

To annotate the tweets, we preprocessed the two sets and then ran sentiment analysis on them using TextBlob, which has been reported to work with good accuracy. Only tweets selected on the basis of a certain sentiment threshold were kept in the dataset.
The same procedure was followed to prepare the labeled test set.
The train set consists of 2000 pro-BJP tweets and 1800 anti-BJP tweets, and the test set consists of 24 pro-BJP users and 23 anti-BJP users, with 30 tweets per user.
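To give a feel for threshold-based selection, here is a small sketch. It is one possible reading of the step rather than our exact procedure: the rule (keep a tweet when its absolute polarity reaches the threshold), the threshold value of 0.5, and the toy lexicon scorer are all assumptions for illustration. With TextBlob installed, the scorer could be swapped for `lambda t: TextBlob(t).sentiment.polarity`, which also returns a value in [-1, 1].

```python
def toy_polarity(text):
    """Stand-in sentiment scorer in [-1, 1]; not a real sentiment model."""
    positive = {"great", "proud", "good"}
    negative = {"bad", "failed", "worst"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = sum((w in positive) - (w in negative) for w in words)
    return max(-1.0, min(1.0, hits / 2))


def select_confident(tweets, polarity_fn, threshold):
    """Keep only tweets whose absolute polarity reaches the threshold."""
    return [t for t in tweets if abs(polarity_fn(t)) >= threshold]


sample = ["Great proud day", "ok day", "bad day"]
print(select_confident(sample, toy_polarity, 0.5))
```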

Word Cloud Visualization

Feature Generation

Focusing on the tweet text for now, we generated features using three different methods:

TF-IDF: A refinement of the Bag of Words technique that weights each word by how rare it is across the corpus. We created a TF-IDF vector for each tweet. The main drawback of this approach is that it ignores word order and therefore context, so we also used feature generation techniques that preserve contextual information.
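A tiny standard-library sketch makes both the weighting and the word-order drawback concrete. It assumes whitespace tokenization and the variant tf × (ln(N/df) + 1); library implementations such as sklearn's use a smoothed and normalized form, so the exact numbers will differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        vectors.append([counts[w] / length * idf[w] for w in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["vote for change", "vote for party", "party wins"])
# "change" occurs in one document, "vote" in two, so "change" gets more weight.
print(vectors[0][vocab.index("change")] > vectors[0][vocab.index("vote")])

# Word order is lost: both orderings map to the same vector.
_, swapped = tfidf_vectors(["party wins", "wins party"])
print(swapped[0] == swapped[1])
```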

Word Embeddings: A word embedding is a learned representation of text where words with similar contextual behavior have similar representations. These representations are dense and distributed.

  • Word2Vec: Developed by Tomas Mikolov et al. at Google, it is a shallow neural network-based technique that takes the words of a corpus as input and learns a vector representation for each of them. Words that share common contexts in the corpus end up close to one another in vector space. There are two architectures: Continuous Bag of Words (CBOW) and skip-gram.
Word Visualization of Word2Vec Words
  • GloVe: Global Vectors for Word Representation builds on the ideas behind Word2Vec. Instead of relying only on separate local context windows, it uses a word-word co-occurrence matrix that captures co-occurrence probabilities and their ratios, which has been shown to capture more meaningful semantic relationships in the word embeddings. We used a pre-trained GloVe model trained on 2 billion tweets, with a 1.2M-word vocabulary and 200-dimensional output vectors. We used the open-source library gensim to implement Word2Vec and GloVe. To generate a vector representation of a tweet from the word embeddings, we took a weighted average of its word vectors.
    The figure below shows a visualization of the word vectors obtained with Word2Vec.
Visualization of word vectors from word2Vec
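The weighted average that turns word vectors into a single tweet vector can be sketched as follows. The weighting scheme is left open here: the `weights` dictionary is a hypothetical input (per-word IDF scores would be one option), and skipping out-of-vocabulary words is an assumption made for this example.

```python
def tweet_vector(tokens, embeddings, weights, dim):
    """Weighted average of the word vectors of the tokens found in `embeddings`."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue  # skip out-of-vocabulary words (assumed behavior)
        w = weights.get(tok, 1.0)  # default weight of 1.0 when none is given
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known words: fall back to the zero vector
    return [x / weight_sum for x in total]


# Toy 2-d embeddings; real GloVe vectors would have 200 dimensions.
embeddings = {"vote": [1.0, 0.0], "party": [0.0, 1.0]}
weights = {"vote": 3.0, "party": 1.0}
print(tweet_vector(["vote", "party", "unknown"], embeddings, weights, dim=2))
```

The resulting fixed-length vector can then be fed to any of the classifiers in the next section.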

Classification Methods

We performed binary classification using traditional machine learning models, and also applied a deep learning model based on transfer learning to the raw tweets. A user is classified as Pro-BJP when at least 50% of their tweets are predicted as Pro-BJP. We applied the following approaches:

Traditional Machine learning models:

  • Naive Bayes (MNB): Naive Bayes classifiers are based on Bayes’ theorem. It is a generative model that assumes the features are independent of one another given the class. Naive Bayes is a simple and fast classification algorithm.
  • KNN: k-nearest-neighbor classification is a vector space classification method that assigns the majority class of the k nearest neighbors to the test document, where k is a parameter [1]. We used sklearn’s grid search to choose the best value of k, which came out to be 3.
  • SVM: A large-margin classifier whose goal is to find a decision boundary between two classes that is maximally far from any point in the training data. For non-linear boundaries, SVM works by embedding the data in a high-dimensional space and attempting to find the hyperplane that best separates it into two classes [1]. We used linear support vector classifiers for our classification task.
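To tie the per-tweet classifiers to the user-level decision, here is a compact sketch of a k-nearest-neighbor vote with k = 3, followed by the "at least 50% of tweets" rule. It is an illustration only: the Euclidean distance, the toy 2-d vectors, and the label strings are assumptions, and a real run would use a library implementation on the actual tweet features.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a tweet vector by majority vote among its k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance (assumed)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify_user(tweet_labels):
    """Call a user Pro-BJP when at least 50% of their tweets are predicted 'pro'."""
    if not tweet_labels:
        raise ValueError("user has no tweets to classify")
    pro_share = sum(label == "pro" for label in tweet_labels) / len(tweet_labels)
    return "pro" if pro_share >= 0.5 else "anti"


# Toy 2-d feature vectors standing in for real tweet features.
train_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_labels = ["anti", "anti", "anti", "pro", "pro", "pro"]

user_tweets = [[5.5, 5.2], [0.2, 0.1]]  # one tweet near each cluster
labels = [knn_predict(train_vectors, train_labels, v) for v in user_tweets]
print(labels, classify_user(labels))
```

Note that a user split exactly evenly is labeled Pro-BJP, because the 50% threshold is inclusive.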

Transfer Learning using GPT:

GPT task-specific heads

OpenAI’s GPT model is a large-scale, task-agnostic, unsupervised language model. Its methodology rests on two main ideas: transformers and unsupervised pre-training. By pairing supervised learning with the broad knowledge gained during unsupervised pre-training, we can fine-tune the model to reach strong, often state-of-the-art, performance on a specific text-related task.

GPT can generalize to a range of natural language tasks. To accomplish this, it allows the definition of custom “task-specific heads”, as illustrated in the figure. The task-specific head acts on top of the base transformer language model. For our task of tweet polarity prediction, we use the classification head: every text (each Tweet, in our case) is prefixed with a start symbol and tokenized for input to the encoder layer. During training, the language model and the classifier are fine-tuned simultaneously, helped by the parallelized architecture built on multi-headed attention.
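The input formatting for the classification head can be pictured with a short sketch. It is a simplification under stated assumptions: the special token names are hypothetical, whitespace splitting stands in for GPT's byte-pair encoding, and the closing classification token, the truncation, and the padding mask are added here for illustration.

```python
# Hypothetical special tokens; the real vocabulary entries may be named differently.
START, CLASSIFY, PAD = "<start>", "<classify>", "<pad>"


def format_for_classifier(text, max_len):
    """Wrap a tweet in start/classify tokens, then truncate and pad to max_len."""
    if max_len < 2:
        raise ValueError("max_len must leave room for the two special tokens")
    tokens = text.lower().split()  # stand-in for byte-pair encoding
    body = tokens[: max_len - 2]   # leave room for the two special tokens
    seq = [START] + body + [CLASSIFY]
    n_pad = max_len - len(seq)
    mask = [1] * len(seq) + [0] * n_pad  # 1 = real token, 0 = padding
    return seq + [PAD] * n_pad, mask


seq, mask = format_for_classifier("Vote for change", max_len=6)
print(seq)
print(mask)
```

The classification head would then read the transformer's output at the final real token to predict the tweet's polarity.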

So what did we observe from all this experimentation?

  1. The best-performing models were, as expected, the ones that build on transfer learning: a) an SVM with pre-trained GloVe embeddings, which had the highest weighted F1-score of 0.89 and an accuracy of 89.3%; b) the pre-trained GPT model fine-tuned on our data.
  2. TF-IDF is not able to provide important features, as it is a count-based model that does not capture contextual relationships.
  3. The better performance of the GPT and GloVe models is due to the large-scale unsupervised learning previously done for both.
  4. The validity of the dataset is important. To show that it is general enough, we compared the GPT model fine-tuned on our data with a GPT model fine-tuned on the SemEval-2016 Task 6 data for tweet stance prediction (Mohammad et al., 2016). We observed that the model overfits quickly on both datasets and performs similarly (73.4 on our data, 71.69 on SemEval).
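Since the weighted F1-score is the headline metric above, a short sketch of how it is computed may help: the per-class F1 is averaged with each class weighted by its number of true examples. The labels below are toy values, not our results.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n  # n = tp + fn, the number of true examples of this class
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n
    return total / len(y_true)


y_true = ["pro", "pro", "pro", "anti"]
y_pred = ["pro", "pro", "anti", "anti"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support matters when the classes are not perfectly balanced, as with our 2000 pro-BJP and 1800 anti-BJP training tweets.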

Results

Results from TF-IDF
Results from Word2Vec
Results from GloVe
Results from GPT

Conclusion

In this project, we predicted the alignment of a Twitter user with respect to a political party, in our case the BJP, based on the user's political tweets. Our work primarily revolves around curating a dataset that is not biased towards a particular set of keywords, so as to give a fair means of judging a user's polarity. The main analysis was done using traditional machine learning models and transfer learning models. We also gave a brief analysis supporting the reliability of our custom dataset, which makes it usable for other future analyses.

Contributions

This project was the product of a collaborative effort between Akanksha Malhotra (MT18062) and Rohit Arora (MT18115).

All the tasks from dataset creation to feature extraction to the training of the traditional machine learning models were completed by Akanksha Malhotra, with moderate contributions and suggestions from other team members.

Training of the transfer learning model was done by Rohit Arora, with moderate contributions and suggestions from Akanksha Malhotra and other team members.

link for Poster and Report.

Acknowledgment

For any queries feel free to contact us:

Rohit Arora: Twitter

Akanksha Malhotra: Gmail

None of this would have been possible without our mentor Dr. Tanmoy Chakraborty, who taught the course “Information Retrieval” at IIITD. Please feel free to contact him and our wonderful TAs:

Dr. Tanmoy Chakraborty: Linkedin.

Anubhav Shrimal: Linkedin

Vrutti Daxeshbhai Patel: Linkedin

Abhinav Gupta: Gmail

Hridoy Sankar Dutta: Linkedin

Jasmeet Kaur: Gmail
