Building a “Fake News” Classifier (pt. 2/3)

Brennan Borlaug
7 min read · Apr 26, 2017


This is the second post in a series where I document my progress on developing a “fake news” classifier. If you haven’t read the first post yet, check it out! This project was completed as the capstone by me (@BrennanBorlaug), Sashi Gandavarapu (@Sashihere), Talieh Hajzargarbashi, and Umber Singh (@Umby) in UC Berkeley’s Master of Information and Data Science (MIDS) program. In this series, I hope to cover many of the challenges faced and decisions made in developing and deploying a text classifier from start to finish. This post in particular covers model selection and performance, as well as the steps we took to ensure generalization. You can try the classifier yourself at http://www.classify.news.

In part 1 of this series, I described our procedure for building our own labeled training corpus for developing a “fake news” classifier. Next, I’ll discuss the text preprocessing routine used to prepare articles for classification, but first, I’d like to briefly explain why we chose to limit our feature space to one which can be derived from only an article’s text and title.

In researching this problem, we found many solutions which used article metadata (specifically, the source name) to make classifications. These solutions perform lookups against blacklist and whitelist tables and output the corresponding label. We wanted to develop a solution that was capable of making article-level predictions. Article-level solutions don’t take the source into account, making it possible for a single source to publish articles that are classified as both credible and non-credible. We believe such a solution is far more valuable. A second approach that we considered was examining networks for online news sharing. We thought that by researching the ways in which credible and non-credible articles were shared online, we might be able to distinguish between them based on propagation patterns. We moved away from this approach, however, because we recognized that in order for a classifier to have any sort of real impact, it would have to be able to identify and label non-credible articles immediately. Waiting until an article had a sufficient propagation network would defeat the purpose of developing a classifier in the first place (i.e., it wouldn’t stop the spread of these articles).

“Micro-propaganda” network of 117 “fake news”, viral, anti-science, hoax, and misinformation websites — Jonathan Albright

Okay, now let’s talk about text processing.

Preprocessing

Preprocessing unstructured text is typically a good idea before attempting text classification. There are many techniques to choose from, and most of them improve classifier accuracy (by removing noise) or generalizability (by removing features that a model is likely to overfit on). In our case, the following preprocessing routine provided good results:

  1. Remove capitalization and punctuation.
  2. Remove overfit words and phrases, including source names, format-specific words (e.g., one source listed the day of the week in the first line of every article, causing our models to overfit on it), and phrases contained in every article (usually a header or footer).
  3. Remove short words (fewer than 3 characters).
  4. Remove stop words.

The table below shows how our vocabulary shrank after each preprocessing step.
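For illustration, a minimal Python sketch of that routine might look like the following. The OVERFIT_TERMS set here is a hypothetical stand-in for the list we built by hand from the corpus, and the NLTK stop word list is just one reasonable choice:

```python
import string
from nltk.corpus import stopwords  # assumes the NLTK stop word list has been downloaded

STOP_WORDS = set(stopwords.words("english"))

# Hypothetical examples only; in practice this list was built by inspecting the corpus
# for source names, boilerplate headers/footers, and other overfit-prone terms.
OVERFIT_TERMS = {"reuters", "breitbart", "monday"}

def preprocess(text):
    """Apply the four preprocessing steps described above to a raw article string."""
    # 1. Remove capitalization and punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # 2. Remove words/phrases the models tend to overfit on
    tokens = [t for t in tokens if t not in OVERFIT_TERMS]
    # 3. Remove short words (fewer than 3 characters)
    tokens = [t for t in tokens if len(t) >= 3]
    # 4. Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)
```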

With this more concise vocabulary, we converted each article to a sparse word-count vector and then applied a term frequency-inverse document frequency (TF-IDF) transformation. TF-IDF values increase proportionally with the number of times a word appears in a document but are offset by the word’s frequency across the corpus. These values aim to measure a word’s importance to a document within a corpus, and our classifiers performed best when using them.
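In scikit-learn, that two-step conversion looks roughly like this (preprocessed_articles is an assumed variable holding the cleaned article strings):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Sparse word-count vectors, one row per article
count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(preprocessed_articles)

# TF-IDF-weighted version of the same matrix
tfidf = TfidfTransformer().fit_transform(word_counts)
```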

Aware of the limitations of bag of words models, we wanted to engineer several features to capture the tone or context of an article. To achieve this, we performed sentiment analysis on the article titles and texts (using Python’s VADER Sentiment package), and established metrics to quantify punctuation (“?” and “!”) usage and capitalization patterns.
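For illustration, these context features might be computed along the following lines with the VADER package. The specific metrics and names here are assumptions, not our exact feature set:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def context_features(title, text):
    """Sentiment scores plus simple punctuation/capitalization metrics (illustrative only)."""
    n = max(len(text), 1)  # avoid division by zero on empty articles
    return {
        "title_sentiment": analyzer.polarity_scores(title)["compound"],
        "text_sentiment": analyzer.polarity_scores(text)["compound"],
        "exclamation_rate": text.count("!") / n,
        "question_rate": text.count("?") / n,
        "capitalization_rate": sum(c.isupper() for c in text) / n,
    }
```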

Choosing Our Classifiers

We framed the problem as binary classification (credible vs. non-credible) and avoided topic modeling or finer-grained sub-classifications (e.g., non-credible: clickbait, non-credible: propaganda, etc.) for our first iteration. We wanted to build an ensemble classifier based on a two-part approach:

1) Content-based approach — Bag of words model

2) Context-based approach — Sentiment analysis, capitalization and punctuation usage

We hoped that the addition of a “context-only” model would offset some of the limitations of our “content-only” approach (bag of words).

For the “content-only” classifier, we tested a number of classification algorithms in scikit-learn: multinomial naïve Bayes (MNB), linear support vector machines (SVMs), random forests, and multinomial logistic regression. The linear SVM classifier consistently outperformed the others (by ~4% prediction accuracy) in 5-fold cross-validation testing but provided no justification for its predictions. We had reservations about employing a “black box” prediction model, especially since our labels relied on some shaky assumptions to begin with. The MNB classifier, on the other hand, performed second best and exposes a feature_log_prob_ attribute in scikit-learn that returns the log probability of every feature (word) given each class. By multiplying the difference of a word’s log probabilities under the two classes by its TF-IDF value, we obtained a weighted log-probability score representing that word’s impact on the overall classification.

score(word_i) = TF-IDF(word_i) * [log P(word_i | 1) - log P(word_i | 0)], where 1 = “non-credible” and 0 = “credible”
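In scikit-learn terms, that scoring can be sketched roughly as follows (tfidf_train, labels, and the helper function are assumed names, not code from our pipeline):

```python
from sklearn.naive_bayes import MultinomialNB

# tfidf_train: TF-IDF matrix from the preprocessing steps above
# labels: 0 = credible, 1 = non-credible, matching the formula above
mnb = MultinomialNB().fit(tfidf_train, labels)

# Per-word difference in log P(word | class) over the whole vocabulary
log_prob_diff = mnb.feature_log_prob_[1] - mnb.feature_log_prob_[0]

def word_impact_scores(article_tfidf_row):
    """Weighted log-probability score for every vocabulary word in a single article."""
    return article_tfidf_row.toarray().ravel() * log_prob_diff
```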

In our case, we decided that the value gained by providing justification for our model’s predictions outweighed the ~4% accuracy advantage on an unreliable test set, so we went with the MNB classifier for our “content-only” approach.

For the “context-only” classifier, we had far fewer features, so tree-based methods became a more attractive option. We tried logistic regression, random forests, and boosted decision trees (both AdaBoost and XGBoost) and settled on an AdaBoost classifier due to its high accuracy (though still ~10% lower than MNB with TF-IDF vectors). AdaBoost works by first fitting a shallow decision tree on the training set and then fitting additional trees on the same data, re-weighted so that each subsequent learner focuses on correcting the errors of its predecessor. Learners are added until the training set is predicted perfectly or a maximum number of estimators is reached (in our case, n = 100).
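A minimal scikit-learn sketch of that setup, with assumed variable names for the context feature matrices and labels:

```python
from sklearn.ensemble import AdaBoostClassifier

# context_train holds the sentiment/punctuation/capitalization features described above;
# the default base learner is a depth-1 decision tree ("stump").
ada = AdaBoostClassifier(n_estimators=100)  # stop after 100 boosting rounds
ada.fit(context_train, labels)

# Class probabilities on held-out articles, used later by the weighted ensemble
context_probs = ada.predict_proba(context_test)
```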

Simple example of adaptive boosting’s self-correcting iterative behavior (source) — the gray data points represent misclassifications.

We figured that a combination of the two models would give us the best performance in terms of accuracy on a holdout set and generalization to new articles and sources. We didn’t want to give the two models equal weighting, however, because in cases where they disagreed we would be stuck in a virtual standoff. Additionally, the “content-only” model had repeatedly outperformed the “context-only” model, so we wanted it to have a greater influence on the final classification. We also wanted to account for each model’s uncertainty in its own prediction (represented as a predicted probability). Thus we developed an ensemble that weighted the two classifiers’ predictions by their confidence and by their accuracy in the most recent round of cross-validation testing. In the table below, we show the results from 5-fold cross-validation testing for each of the three models.

The weighted ensemble outperforms the individual classifiers that constitute it

The weighted ensemble outperformed the “content-only” model by ~4%, showing that by incorporating an additional model built from features aimed at capturing an article’s context (or tone), we were able to improve upon our baseline bag of words approach.
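As an illustration, one plausible way to implement this kind of confidence- and accuracy-weighted combination looks like the sketch below. This is not necessarily the exact scheme we shipped; the function name, arguments, and 0.5 decision threshold are assumptions:

```python
def ensemble_predict(p_content, p_context, acc_content, acc_context):
    """Combine each model's predicted probability of "non-credible",
    weighted by that model's accuracy from the latest cross-validation round."""
    total = acc_content + acc_context
    score = (acc_content / total) * p_content + (acc_context / total) * p_context
    return ("non-credible" if score >= 0.5 else "credible", score)
```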

Generalization

Our ensemble classifier was performing well in cross-validation, but this only showed that it generalized to unseen articles published by the same sources it was trained on. What we really wanted to know was how it would perform on new, never-before-seen sources. To test this, we developed a generalization test suite (a rough sketch in code follows the steps below). The steps were as follows:

  1. Randomly split the five credible sources and the nine non-credible sources into five groups each (one credible source per group; for the non-credible sources, four groups of two and one group of one).
  2. Train on an equal number of credible and non-credible articles while holding out every combination of credible and non-credible source groupings for testing (25 total tests).
  3. Compare the average accuracy from all 25 tests with the accuracy from cross validation testing.
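A minimal sketch of this leave-sources-out procedure is shown below. The variable names and the train_and_score helper are assumptions, and the balancing of credible vs. non-credible training articles is omitted for brevity:

```python
from itertools import product
import numpy as np

def leave_source_groups_out(df, credible_groups, noncredible_groups, train_and_score):
    """Run the 5 x 5 = 25 hold-out tests described above and return the mean accuracy.

    df: a DataFrame with a "source" column plus feature and label columns.
    train_and_score(train_df, test_df): user-supplied function that fits the
    ensemble on train_df and returns its accuracy on test_df.
    """
    accuracies = []
    for cred_group, noncred_group in product(credible_groups, noncredible_groups):
        held_out = set(cred_group) | set(noncred_group)   # sources reserved for testing
        test_mask = df["source"].isin(held_out)
        accuracies.append(train_and_score(df[~test_mask], df[test_mask]))
    return float(np.mean(accuracies))
```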

We found that the average accuracy during generalization testing was ~5% lower than when performing cross validation with articles from every source in our data set. While this suggests that our model was slightly overfitting, the accuracy remained high enough that we were pleased with the results and were confident in our model’s ability to generalize to new articles (from unseen sources).

Thanks for reading! In the final post in the series, I will discuss our data pipeline and deployment. Then, I’ll give my thoughts on the best path forward for eradicating this problem. I’m happy to receive your feedback or suggestions in the comments. ✌️

