Classifying Political Social Media Posts with Natural Language Processing

Avonlea Fisher · Published in Analytics Vidhya · Oct 30, 2020

Introduction

The analysis of social media posts may be able to tell us just as much about a user’s political views as voting records or traditional polling. In this article, I’ll explain how, with the help of natural language processing, I built classifiers to predict the partisan bias and message of political posts. The data used for this project is available on Kaggle and contains 5,000 Facebook and Twitter posts from politicians, collected in 2015. The full code for this project is available on GitHub.

Exploring the Data

After loading the data, I plotted value counts for the two dependent variables of interest:
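
The original plots aren’t reproduced here; below is a minimal sketch of how they can be generated with pandas and matplotlib, assuming a hypothetical filename and the column names ‘bias’ and ‘message’:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('political_social_media.csv')  # hypothetical filename

# Side-by-side bar plots of label frequencies for the two targets
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['bias'].value_counts().plot(kind='bar', ax=axes[0], title='Partisan bias')
df['message'].value_counts().plot(kind='bar', ax=axes[1], title='Message category')
plt.tight_layout()
plt.show()
```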

The majority of posts, as these plots show, were labeled as neutral. The most frequent message categories were ‘policy’, ‘personal’, ‘support’ and ‘information.’ The word clouds below offer an informative visual of the most common words in neutral and partisan posts.
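
The clouds themselves aren’t reproduced here; as an illustration, the partisan cloud could be built with the wordcloud package along these lines (the ‘bias’ and ‘text’ column names are assumptions):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all partisan posts into one string and build the cloud from it
partisan_text = ' '.join(df.loc[df['bias'] == 'partisan', 'text'])
cloud = WordCloud(background_color='white', max_words=100).generate(partisan_text)

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```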

Partisan Posts

Partisan posts prominently feature words that discuss policy or legislative issues, such as ‘Obamacare’, ‘bill’, ‘law’ and ‘congress.’ Here is a printout of a sample of the partisan posts:

Neutral Posts

Neutral posts, on the other hand, tend to feature words like ‘veteran’, ‘family’, and ‘community.’ These terms arguably appeal to bipartisan values and are less divisive than policy-related terms. Here’s a sample of neutral posts:

Classifying Partisan Bias

The classification models required the posts to be cleaned and numerically encoded. After being formatted as a list of strings, the posts were passed through a cleaning function:
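
The original function isn’t shown here; what follows is a minimal sketch, assuming typical cleaning steps (lowercasing and removing URLs, mentions, punctuation, and digits):

```python
import re
import string

def clean_posts(posts):
    """Lowercase each post and strip URLs, @mentions, punctuation, and digits."""
    cleaned = []
    for post in posts:
        post = post.lower()
        post = re.sub(r'http\S+|www\.\S+', '', post)  # remove links
        post = re.sub(r'@\w+', '', post)              # remove @mentions
        post = post.translate(str.maketrans('', '', string.punctuation))
        post = re.sub(r'\d+', '', post)               # remove digits
        cleaned.append(post.strip())
    return cleaned
```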

The posts were converted to numeric data with sklearn’s TF-IDF Vectorizer (a vectorization sketch follows the list below). TF-IDF is the product of the following two weights:

  • Term Frequency: (number of times a word appears in a document) / (total number of words in that document)
  • Inverse Document Frequency: log of (total number of documents / number of documents that contain the word)
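
A minimal vectorization sketch (the stop-word setting is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the cleaned posts,
# then produce the sparse TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(cleaned_posts)
```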

Given that the ‘bias’ variable is binary, it was encoded simply by using pd.get_dummies, dropping the neutral column, and using ravel() to convert the column values to an array:
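
A sketch of that encoding step (the DataFrame and column names are assumptions):

```python
import pandas as pd

# One-hot encode the binary label, drop the 'neutral' column, and
# flatten the remaining 'partisan' indicator into a 1-D array of 0s and 1s
y = pd.get_dummies(df['bias']).drop('neutral', axis=1).astype(int).values.ravel()
```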

This resulted in an array ‘y’ that consisted of 0 (neutral) and 1 (partisan) values.

As we saw above, the bias classes are heavily imbalanced, with mostly neutral labels. This imbalance resulted in poor model performance in the initial test. I used Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class in the training data:
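
A sketch of the oversampling step with imbalanced-learn, applied only to the training split so the test set keeps its natural class balance:

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Synthesize new minority-class (partisan) samples in the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```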

Then, using RandomizedSearchCV, I created a random forest model with the best parameters found by the search.
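
A sketch of the search; the parameter distributions shown here are assumptions, not the grid used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 30, 50],
    'min_samples_split': [2, 5, 10],
}

# Sample a handful of parameter combinations rather than the full grid
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=param_dist,
                            n_iter=10, cv=3, random_state=42)
search.fit(X_train_res, y_train_res)
rf_model = search.best_estimator_
```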

The confusion matrices below show how the model performed on the test and training data.
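
One way such matrices can be rendered with scikit-learn (variable names carried over from the sketches above):

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Row-normalized confusion matrices for the training and test splits
ConfusionMatrixDisplay.from_estimator(rf_model, X_train_res, y_train_res,
                                      normalize='true')
ConfusionMatrixDisplay.from_estimator(rf_model, X_test, y_test,
                                      normalize='true')
```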

The model performed well on both the training and test data, with 91% overall accuracy.

Classifying the Message of Posts

I used the same preprocessing steps, oversampling and numerical encoding, to classify the message of the posts. Instead of dummy-encoding, label encoding was used to assign an integer to each unique value in the message column. Because an exhaustive search of the parameter grid is not as computationally expensive for the KNN model, GridSearchCV was used to find the optimal parameters:
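
A sketch of the encoding and search; the message column name, the parameter grid, and the resampled training variables are assumptions:

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Assign an integer to each unique message category
le = LabelEncoder()
y_msg = le.fit_transform(df['message'])

# Exhaustively search a small KNN parameter grid
param_grid = {'n_neighbors': list(range(1, 21)),
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_msg, y_train_msg)  # SMOTE-resampled message training split
knn_model = grid.best_estimator_
```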

The model had near-perfect performance on the training data and 87% accuracy on the test data. This is not particularly impressive on its own, but given the large number of classes, it considerably outperforms random guessing: on a dataset with balanced classes, random guessing would be expected to correctly classify just 11% of posts.

In each row of the confusion matrix, the largest share of true labels fell on the matching predicted label. However, the model misclassified about half of personal posts, and 27% of support posts, as policy posts.

Making Predictions on New Data

Using Tweepy and Twitter’s API, I wrote a function to collect more recent posts from Twitter on which to test the models. The function was used to retrieve Tweets from the 2020 presidential election candidates.
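
The original collection function isn’t reproduced here; below is a minimal sketch using the Tweepy 3.x API current at the time, with placeholder credentials:

```python
import tweepy

API_KEY = 'your-api-key'        # placeholder credentials
API_SECRET = 'your-api-secret'

def get_tweets(screen_name, n=200):
    """Return the text of a user's n most recent Tweets."""
    auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
    api = tweepy.API(auth)
    timeline = api.user_timeline(screen_name=screen_name, count=n,
                                 tweet_mode='extended')
    return [t.full_text for t in timeline]

biden_tweets = get_tweets('JoeBiden')
trump_tweets = get_tweets('realDonaldTrump')
```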

After repeating the preprocessing for each list of Tweets, I used the random forest model to classify the posts’ bias, and printed 5 posts that were classified as partisan:
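
A sketch of that step, reusing the fitted vectorizer and cleaning function from the sketches above:

```python
# Transform with the already-fitted vectorizer so the new Tweets
# share the training vocabulary
new_X = vectorizer.transform(clean_posts(biden_tweets))
preds = rf_model.predict(new_X)  # 1 = partisan, 0 = neutral

partisan_tweets = [t for t, p in zip(biden_tweets, preds) if p == 1]
for tweet in partisan_tweets[:5]:
    print(tweet)
```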

In both sets of Tweets classified as partisan, there is definitely non-neutral language (‘worst’, ‘far better’) and criticism directed at political opponents.

Next, the KNN model was used to classify the substance of the Tweets’ message. Some examples have been printed below.

In this Tweet, Joe Biden is reassuring Americans that votes will be counted, and encouraging them to vote. If ‘support’ is the right label here, it is support for American voters who have concerns about potential interference in the democratic process.

These first three Tweets defend the Trump administration and its policies, referring specifically to its handling of the economy and the COVID-19 pandemic. The bottom two Tweets instead express support for US soldiers, and would perhaps be more appropriately classified as ‘support’.

Conclusion

The random forest model can be used to predict whether a social media post is partisan or neutral with high accuracy. After collecting Tweets or other social media posts for more recent data, partisan posts can be identified and extracted for review by constituents or organizations interested in learning more about a particular politician’s views.

The KNN model can classify the message of a post with significantly higher accuracy than random guessing, and performs particularly well on the following classes: attack, constituency, media, mobilization, and ‘other’.

The exploratory analysis suggested that posts perceived as partisan tend to contain words/word pairings related to specific policy or legislative issues, like ‘law’, ‘health care,’ and ‘immigration reform.’ These terms can be used in queries when attempting to collect partisan posts.

Given the age of this data, and the dynamic nature of political discourse, more recent labeled data of a similar type could support the development of classifiers with higher accuracy on new posts.
