Using NLP to Track Conspiracies on Reddit during the 2016 Election

Jared Delora-Ellefson
Published in Analytics Vidhya · 3 min read · Jul 14, 2020

Notice: This article assumes some knowledge of Natural Language Processing (NLP).

It’s no secret that during the 2016 US Presidential Election Americans were awash in conspiracies. Much of this conspiratorial talk took place on Reddit at r/the_donald. Other studies have examined misinformation during this period, but I wanted to dig a bit deeper. I decided to create an Entity Recognition Model using spaCy to track four of the most well-known conspiracies that were popular during the run-up to the 2016 election. If you would like more details on how to build an Entity Recognition Model in spaCy, you can read this post I wrote on the subject.

The Conspiracies:

  • Seth Rich Murder
  • Pizzagate
  • Hillary Clinton’s Email Investigation
  • BlueLivesMatter/BlackLivesMatter (which, according to the Robert Mueller investigation, was heavily pushed by Russian bots in an effort to sow discord among the American population)

The process was as follows:

  • Collect every comment made on r/the_donald during 2016
  • Use this data to train a spaCy model to track our conspiracies
  • Use the model to identify comments from r/the_donald that contain the conspiratorial language we are interested in tracking
  • Analyze the conspiratorial trends
Figure 1: Data Processing

Figure 1 shows the overall dataflow for the project. At over 20 million comments, the data took quite some time to run through the spaCy model. Amazon Web Services was used to speed up the processing.
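To make the collection step more concrete, here is a minimal sketch of filtering a comment dump down to one subreddit and one year. The article does not say how the comments were gathered, so the newline-delimited JSON layout and the field names `subreddit`, `created_utc`, and `body` are assumptions, and `iter_2016_comments` is a hypothetical helper, not the code used for the project.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator


def iter_2016_comments(lines: Iterable[str], subreddit: str = "The_Donald") -> Iterator[str]:
    """Yield comment bodies posted to `subreddit` during 2016 (UTC).

    Assumes one JSON object per line with `subreddit`, `created_utc`
    (Unix seconds), and `body` fields. This layout is an assumption.
    """
    start = datetime(2016, 1, 1, tzinfo=timezone.utc).timestamp()
    end = datetime(2017, 1, 1, tzinfo=timezone.utc).timestamp()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        comment = json.loads(line)
        if comment.get("subreddit", "").lower() != subreddit.lower():
            continue
        if start <= int(comment["created_utc"]) < end:
            yield comment["body"]


# Tiny in-memory stand-in for a dump file (made-up comments).
sample_lines = [
    json.dumps({"subreddit": "The_Donald", "created_utc": 1470787200, "body": "first"}),
    json.dumps({"subreddit": "politics", "created_utc": 1470787200, "body": "second"}),
    json.dumps({"subreddit": "The_Donald", "created_utc": 1500000000, "body": "third"}),
]
kept = list(iter_2016_comments(sample_lines))
print(kept)
```

Because the function is a generator, it can be pointed at an open file handle and will stream through millions of lines without loading them all into memory.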

Figure 2: A demonstration of the spaCy model in use

Figure 2 demonstrates the model in use. A string is constructed containing several of the terms we are looking to locate. spaCy does an excellent job of identifying our terms.
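For readers who cannot see the figure, the following sketch shows the same kind of demonstration. It is not the trained model from this project: it swaps in spaCy's rule-based `entity_ruler` on a blank English pipeline so that it runs without any training data. The labels and patterns are hypothetical and chosen only for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, so no pretrained model is required.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical labels and token patterns, matched case-insensitively.
ruler.add_patterns([
    {"label": "SETH_RICH", "pattern": [{"LOWER": "seth"}, {"LOWER": "rich"}]},
    {"label": "PIZZAGATE", "pattern": [{"LOWER": "pizzagate"}]},
    {"label": "BLM", "pattern": [{"LOWER": "blacklivesmatter"}]},
])

# A test string containing several of the terms we want to locate.
text = "They keep tying Seth Rich to Pizzagate, and now BlackLivesMatter too."
doc = nlp(text)
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

A statistical entity recognizer trained on labeled comments can generalize to spellings and phrasings that fixed patterns like these would miss, which is the reason to train a model rather than rely on rules alone.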

Overall, the model performs very well, with low bias and low variance. This is due to the large number of examples the model was trained on.
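One common way to back up a claim like this is to compute span-level precision, recall, and F1 on both the training data and a held-out set: similar, high scores on both would be consistent with low bias and low variance. The article does not report its metrics, so the sketch below uses made-up spans, and `span_prf` is a hypothetical helper rather than the evaluation code used here.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def span_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for one comment: two spans agree, one differs.
gold = {(0, 9, "SETH_RICH"), (20, 29, "PIZZAGATE"), (40, 46, "EMAILS")}
predicted = {(0, 9, "SETH_RICH"), (20, 29, "PIZZAGATE"), (50, 53, "BLM")}

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```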

Figure 3: Mentions of Seth Rich related terms at r/the_donald

One of the conspiracies tracked was the murder of Seth Rich. More about this story can be found here. As the graph shows, terms related to Seth Rich spiked on 8/10/2016. Two days before this, Rod Wheeler, a Washington PI working for the Rich family, made comments suggesting he had hard evidence that Seth Rich was working with WikiLeaks. This turned out to be a lie, but you can see on the graph that this one lie turned a relatively minor conspiracy into something nationally recognized. Julian Assange later made comments suggesting Seth Rich was working with him; this too was shown to be a lie, but it drove comments on r/the_donald nonetheless.
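A trend line like the one in Figure 3 can be built by counting flagged comments per day and then looking for the peak. The sketch below assumes each flagged comment carries a Unix timestamp; the timestamps are made up, and `daily_mentions` and `peak_day` are hypothetical helpers rather than the project's actual analysis code.

```python
from collections import Counter
from datetime import date, datetime, timezone
from typing import Iterable, Tuple


def daily_mentions(timestamps: Iterable[int]) -> Counter:
    """Count flagged comments per calendar day (UTC)."""
    counts: Counter = Counter()
    for ts in timestamps:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        counts[day] += 1
    return counts


def peak_day(counts: Counter) -> Tuple[date, int]:
    """Return the day with the most mentions and its count."""
    return counts.most_common(1)[0]


# Made-up timestamps for comments the model flagged as Seth Rich related.
flagged = [1470700800, 1470787200, 1470790800, 1470830400, 1470873600]

counts = daily_mentions(flagged)
spike_date, spike_count = peak_day(counts)
print(spike_date, spike_count)
```

Plotting `counts` against the date gives the kind of time series that can then be lined up with news events.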

This is the first in a series of articles on tracking conspiracies on Reddit during the 2016 US Presidential Election. In the articles that follow, I will discuss trends in the data related to the other conspiracies and how those trends correlate with the news.
