Using Machine Learning to Identify Political Affiliations in Reddit Posts

Arnold Yeung
5 min read · Jun 20, 2020


Image from https://www.flickr.com/people/mikemacmarketing/ (via www.vpnsrus.com)

Code for this project is available here.

Since 2015, biased news has dominated the media, with different publishers known for promoting specific political ideals. The political divide has grown rapidly, with politically associated forums gaining massive followings online. Prominent examples of this surge are subreddits, or individual forums on Reddit, focused on discussing and promoting certain political values.

Rumours of politically motivated bots and trolls appearing in Internet discussions to shift popular opinion have become increasingly common, especially with the upcoming 2020 USA federal election and the expected China-USA cold war. It is therefore imperative to identify the political affiliations of posts and to detect any potential biases.

This project is a modification of the original project from CSC401/2511.

Text Data

We use text data obtained from 23 subreddits known for their political affiliations:

  • Left: twoXChromosomes, occupyWallStreet, lateStageCapitalism, progressive, socialism, demsocialist, liberal
  • Center: news, politics, energy, canada, worldnews, law
  • Right: theNewRight, whiteRights, Libertarian, askTrumpSupporters, the_donald, new_right, conservative, tea_party
  • Alt: conspiracy, 911truth

We randomly select 10,000 instances (i.e., posts) from each of these categories to create a balanced dataset.

Pre-Processing

Prior to extracting features for classification, the text segments in each category are pre-processed. The following steps are taken (see the code sketch after this list):

  • Tokenization: Punctuation and words are split into individual tokens
  • Part-of-Speech Tagging: Each token is tagged with the expected part-of-speech using spaCy
  • Stop-Words Removal: Words which are overly common in the English language and not expected to carry relevant meaning are removed (e.g., the, a, he, she)
  • Lemmatization: Words are converted to their base forms (e.g., are, is, was -> be) using spaCy
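
A minimal sketch of this pipeline is below. The post does not name the spaCy model used, so en_core_web_sm is an assumption:

```python
import spacy

# The specific spaCy model is an assumption; the post only says spaCy is used.
nlp = spacy.load("en_core_web_sm")

def preprocess(post: str) -> list[tuple[str, str]]:
    """Tokenize, POS-tag, remove stop words, and lemmatize a post."""
    doc = nlp(post)                # tokenization + part-of-speech tagging
    return [
        (token.lemma_, token.pos_) # lemma paired with its POS tag
        for token in doc
        if not token.is_stop       # stop-word removal (the, a, he, she, ...)
    ]

print(preprocess("She was posting links to the subreddits."))
# -> e.g. [('post', 'VERB'), ('link', 'NOUN'), ('subreddit', 'NOUN'), ('.', 'PUNCT')]
```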

Feature Extraction

173 features are then extracted from the pre-processed text (a few are sketched in code after this list). Features include:

  • Counts of certain part-of-speech tags (e.g., adverbs, second-person pronouns, common nouns)
  • Counts of certain tokens/words, such as slang, question words, fully-uppercased words, and sequential punctuation marks (e.g., "!!!")
  • Average sentence length, average token length, and number of sentences per post
  • Bristol, Gilhooly, and Logie norms, Warriner norms, and psychological LIWC features (available pre-computed)
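
For illustration, here is a sketch of a few of the counting features, consuming the (lemma, POS) pairs from the pre-processing sketch plus the raw text. The feature names and exact definitions are assumptions, not the project's code:

```python
import re

def extract_features(tagged: list[tuple[str, str]], raw_text: str) -> dict[str, float]:
    """Compute a handful of the low-level counting features (illustrative only)."""
    words = raw_text.split()
    sentences = [s for s in re.split(r"[.!?]+", raw_text) if s.strip()]
    return {
        "num_adverbs": sum(1 for _, pos in tagged if pos == "ADV"),
        "num_common_nouns": sum(1 for _, pos in tagged if pos == "NOUN"),
        # fully upper-cased words of length >= 3, to skip "I" and "A"
        "num_uppercase": sum(1 for w in words if w.isupper() and len(w) >= 3),
        # runs of two or more punctuation marks, e.g. "!!!" or "?!"
        "num_multi_punct": len(re.findall(r"[!?.,;:]{2,}", raw_text)),
        "num_sentences": len(sentences),
        "avg_token_length": sum(len(w) for w in words) / max(len(words), 1),
    }
```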

Note that the majority of these features are deliberately low-level (e.g., grammatical features and writing style) and independent of specific assumptions (e.g., that certain words are associated with certain political ideals). While higher-level features may assist with classification, we focus on low-level features that are not intuitively associated with politics. This may provide insight into the political affiliation of individuals even when observing text that is not limited to political discussions.

Machine Learning Classification

We feed the extracted features into 8 machine learning models to predict the political affiliation of the text (see the code sketch after this list):

  • Linear Support Vector Machine
  • Radial Support Vector Machine
  • Random Forest
  • Multi-Layer Perceptron (1 layer — 100 units)
  • Bagged Multi-Layer Perceptron (1 layer — 100 units; 10 estimators)
  • Multi-Layer Perceptron (3 layers — 100, 100, 50 units)
  • AdaBoost Classifier
  • Gaussian Naive-Bayes
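
One way these eight models could be instantiated is sketched below. The post does not name the library, so scikit-learn, and any hyper-parameter not stated in the list above, are assumptions:

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC

# One instance per model in the list above; unstated hyper-parameters
# are left at scikit-learn defaults (an assumption).
models = {
    "Linear SVM": LinearSVC(),
    "Radial SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(),
    "MLP (100)": MLPClassifier(hidden_layer_sizes=(100,)),
    "Bagged MLP (100)": BaggingClassifier(
        MLPClassifier(hidden_layer_sizes=(100,)), n_estimators=10
    ),
    "MLP (100, 100, 50)": MLPClassifier(hidden_layer_sizes=(100, 100, 50)),
    "AdaBoost": AdaBoostClassifier(),
    "Gaussian NB": GaussianNB(),
}
```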

Classification Accuracies

We split the data 80–20 into training and test sets. For each ML model, we calculate the overall accuracy and the per-class recall and precision.
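
A rough sketch of the split and evaluation, reusing the models dict from the previous sketch. Synthetic data stands in for the real 173-feature matrix so the snippet runs end to end:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (an assumption).
X, y = make_classification(
    n_samples=2_000, n_features=173, n_informative=30, n_classes=4, random_state=0
)

# 80-20 train-test split, stratified so all four classes stay balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

for name, model in models.items():  # `models` from the previous sketch
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.3f}")
    print("  per-class recall:   ", recall_score(y_test, y_pred, average=None))
    print("  per-class precision:", precision_score(y_test, y_pred, average=None))
```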

Test accuracy for each ML model

We observe that most ML models perform better than random guessing (~0.25 accuracy over 4 classes). The Bagged Multi-Layer Perceptron and the 3-Layer Multi-Layer Perceptron perform best, with accuracies of 0.54 and 0.52, respectively.

Recall of each class for each ML model
Precision of each class for each ML model

We also observe that while recall is generally higher, precision is more consistent across the 4 classes.

Training and Test Dataset Ratio

To monitor how much data is necessary for optimal performance, we change the number of training instances and observe the test accuracy of the Bagged Multi-Layer Perceptron.
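
A sketch of this experiment, reusing the split from the evaluation sketch. The post names only the 1K and 32K endpoints, so the doubling schedule between them is a guess:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

# Train the bagged MLP on growing subsets of the training data and
# track test accuracy (sizes other than 1K and 32K are assumptions).
for n in [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]:
    bagged_mlp = BaggingClassifier(
        MLPClassifier(hidden_layer_sizes=(100,)), n_estimators=10
    )
    bagged_mlp.fit(X_train[:n], y_train[:n])
    print(n, round(bagged_mlp.score(X_test, y_test), 3))
```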

Out of the 40,000 available instances, we evaluate 6 training dataset sizes. Interestingly, accuracy improves by only a factor of 1.27 when the training set grows from 1K instances (2.5%) to 32K instances (80%). This suggests that the data within each class is fairly uniform, so providing more instances of each class does not greatly increase performance. It also suggests that the machine learning models may reach their performance peak with small amounts of data, and that additional data will not proportionally increase accuracy.

Change in test accuracy for different training dataset sizes

Statistical Comparison of Models

We run 5-fold cross-validation for all ML models to better estimate their performance when all data is available for training. Across the 5 folds, we take the accuracy distribution of each model and compare it against that of the observed “best classifier” from the 80–20 split (i.e., the Bagged Multi-Layer Perceptron).
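
The post does not name the statistical test; the sketch below assumes a paired t-test over the five per-fold accuracies, reusing models, X, and y from the earlier sketches:

```python
from scipy.stats import ttest_rel
from sklearn.model_selection import cross_val_score

# Accuracy distribution of the reference model across the 5 folds.
best_scores = cross_val_score(models["Bagged MLP (100)"], X, y, cv=5)

for name, model in models.items():
    if name == "Bagged MLP (100)":
        continue
    scores = cross_val_score(model, X, y, cv=5)
    _, p_value = ttest_rel(best_scores, scores)  # paired t-test (assumed)
    print(f"{name}: p = {p_value:.4f}")
```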

We observe statistically significant p-values (p < 0.05) for all ML models aside from the 3-Layer Multi-Layer Perceptron.

p-values of each ML model accuracy relative to the Bagged Multi-Layer Perceptron accuracy, across 5-fold cross-validation

From these results, we can be confident that the Bagged Multi-Layer Perceptron will perform more accurately than any other classifier, aside from the 3-Layer Multi-Layer Perceptron; there appears to be no statistically significant difference between the Bagged MLP and 3-Layer MLP classifier accuracies.

Conclusion

We trained 8 machine learning models to classify text into 4 political affiliations, using data aggregated from 23 subreddits. We observe that low-level features can classify text better than random guessing. Of the 8 machine learning models, the Bagged Multi-Layer Perceptron and the 3-Layer Multi-Layer Perceptron classifiers perform best, with statistically significant differences from all other classifiers.


Arnold Yeung

Researcher in machine learning and data science. www.arnoldyeung.com