Building a “Fake News” Classifier (pt. 3/3)

This is the third and final post in a series where I document my progress on developing a “fake news” classifier. If you haven’t read the first or second posts in the series yet, check them out! This was the capstone project that I (@BrennanBorlaug) completed with Sashi Gandavarapu (@Sashihere), Talieh Hajzargarbashi, and Umber Singh (@Umby) in UC Berkeley’s Master of Information and Data Science (MIDS) program. In this series, I hope to cover many of the challenges we faced and decisions we made in developing and deploying a text classifier from start to finish. This post in particular discusses our data pipeline and deployment. I’ll wrap up the series with my thoughts on the best path forward for tackling this problem. You can try the classifier yourself at http://www.classify.news.


Data Pipeline

Our goal was to construct a pipeline that allowed us to frequently retrain and deploy our model as new articles were received over time. While we never actually implemented this on an automated basis, it would have been simple for us to do so nightly (after downloading the day’s batch of new articles). Before we get too deep into the details, here’s a flowchart I put together to help visualize our data pipeline for this project:

Data pipeline for the “fake news” project

As mentioned in a previous post, we wrote web scraping scripts in Python to extract news articles published daily by each of the sources on our source list (five credible and nine non-credible). We stored these articles on an AWS server that also ran a Jupyter notebook server (for EDA and prototyping). We wrapped our model retraining code in a Python function and put it in a module so that other Python scripts on the server could import it freely. By adding the training function to our scheduled article extraction script, we could have automated the entire process. Had we planned to support this project into the distant future, we would have done this. However, we will run out of our allocated AWS and Google Cloud funds by mid-May and will likely archive the project at that time. Currently, the model is retrained and re-weighted weekly.
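To make the idea of folding retraining into the scheduled extraction script more concrete, here is a minimal sketch of what that could look like. It is not our actual code: the names retrain_model and nightly_job are hypothetical, and the "model" is a toy table of per-label word counts standing in for the real classifier so that the example runs with only the standard library.

```python
import pickle
from collections import Counter
from pathlib import Path


def train(texts, labels):
    """Toy stand-in for real training: count the words seen under each label."""
    model = {}
    for text, label in zip(texts, labels):
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model, text):
    """Return the label whose training articles used these words most often."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


def retrain_model(texts, labels, model_path):
    """Fit a fresh model on the full corpus and pickle it for deployment."""
    model = train(texts, labels)
    Path(model_path).write_bytes(pickle.dumps(model))
    return model


def nightly_job(scrape, corpus, model_path):
    """Scheduled entry point: add the day's articles, then retrain on everything."""
    corpus.extend(scrape())  # scrape() is assumed to return (text, label) pairs
    texts, labels = zip(*corpus)
    return retrain_model(texts, labels, model_path)
```

A scheduler such as cron could then call nightly_job once a day, right after the scrapers finish, passing in whatever scraping function and corpus store the project already uses.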

We pickle the fully trained models and send them and the updated weights to a Google Cloud server hosting a Flask REST API. This API receives web requests containing the URL of a news article and extracts the article content, processes its text, classifies it, and returns the classification. We use GitHub Pages to host our web application. This application serves as our front-end for introducing people to the project and for making API requests.
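The request flow described above (receive a URL, extract the article, process the text, classify it, return the result) can be sketched as a single framework-independent function. This is an illustration under assumptions rather than our deployed code: handle_classify_request, fetch_article, and classify are hypothetical names, the JSON shape is made up, and the one-line text cleanup is only a placeholder for real preprocessing.

```python
from urllib.parse import urlparse


def handle_classify_request(payload, fetch_article, classify):
    """Validate a request body, classify the article, and return (body, status)."""
    url = payload.get("url") if isinstance(payload, dict) else None
    parts = urlparse(url) if isinstance(url, str) else None
    if parts is None or parts.scheme not in ("http", "https") or not parts.netloc:
        return {"error": "expected a JSON body with an http(s) 'url' field"}, 400

    try:
        text = fetch_article(url)  # download the page and extract the article text
    except Exception as exc:  # extraction can fail on paywalls, timeouts, etc.
        return {"error": f"could not extract article: {exc}"}, 502

    cleaned = " ".join(text.lower().split())  # placeholder for real preprocessing
    return {"url": url, "label": classify(cleaned)}, 200
```

In a Flask app, a view function could pass request.get_json(silent=True) as the payload and return the resulting tuple directly, with classify wrapping the unpickled model. Keeping the logic outside the view makes it easy to test without starting a server.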

Web Application

Our web application is currently up and running! Check it out here. You can classify news articles for yourself or check out a number of visualizations that we’ve generated to examine model performance and corpus-level characteristics.

Visualizations page of our web application

Summary

If you’ve read this far, thank you for showing interest in our project! It was very educational for us and gave us the opportunity to build a data application from the ground up. We even had to build our own training corpus! I wish we could say that we solved the problem of “fake news,” but we can’t. This problem is far more controversial and multi-faceted than the spam problem that inspired our attempt. Everyone has a different view of what is true and what is not, and if people can’t even agree on this, it will be difficult to develop an algorithmic way of deciding it for them. However, our work does suggest that traditional natural language processing techniques can be used to derive predictive features from news article text. This is a start.

We’re convinced that a large ensemble of models, each specializing in detecting one or more sub-categories of “fake news,” will be the ultimate solution to this problem. In the meantime, simpler classifiers can be used to mark the worst offenders on news-sharing websites (like Facebook and Twitter) and flag others for human fact-checkers to review (ultimately easing their load). To conclude, we think it’s important to democratize the determination of what is true and what isn’t. While some may be more qualified than others to make this determination, by creating a subcommittee to serve as the authority on truth you are destined to lose the trust of a large percentage of the public.


Thanks for reading! Check out our web application if you haven’t already. I’m always happy to receive any feedback or suggestions in the comments. ✌️
