New Analysis Uses Reddit Data to Accurately Predict Popular Vote Margin of the 2016 Election

Ashvin Govil · Published in Melting Glass · 7 min read · Feb 6, 2017


One of the holes revealed by the shocking result of the 2016 presidential election is the failure of mass media to accurately portray the state of the election. Despite having decades to refine their techniques, pollsters produced swing state polls with huge errors during the election cycle, and those polls contributed to the false perception that Hillary Clinton was an unbeatable front-runner. The media's failure to accurately portray the likely result of the election has sown even more distrust of the mainstream media in the hearts of many Americans, making it clear that analysts and pundits will need more robust and comprehensive prediction methods for future elections.

Over the last six months, I have been working on an analysis tool that could provide a blueprint for how media outlets and campaigns should adapt their data-based prediction strategies using social media. The tool analyzes data collected from the social media website Reddit to quantify public sentiment in a predictable and consistent manner. To prove its effectiveness, I built a regression model correlating the Reddit data to the RealClearPolitics (RCP) polling average with an accuracy upwards of 75%. Using the regression model, I could predict the eventual popular vote outcome to within 0.05% of the result using data collected two weeks prior to the election. That is a more accurate prediction of the actual election result than any public polling average or prediction model I could find!

In addition to its accuracy, social media data has several advantages over traditional polling: collecting it is much cheaper and faster, and the raw data can reveal which campaign issues people care about at a level of detail unmatched by any poll question. I also used the tool to discover key insights about the nature of the presidential election that even the managers of a billion-dollar campaign did not fully understand. For example, the Clinton campaign believed it had an insurmountable lead in mid-October after national polls showed her up by as much as 10 points, and used that information to move millions of dollars into campaigning in harder-to-win states such as Arizona. However, the trends in my data show that this poll bump was doomed to be temporary. Without a data-based approach to analyzing why people hold certain political opinions, the media, politicians, and citizens themselves will have trouble deciphering the changes happening in the political world.

Choosing a Source

I decided to use Reddit as my source for a few simple reasons. The API is free and easy to use, and the website itself is organized in a tree structure that makes it uniquely easy to collect and categorize data. The subreddit I initially collected data from, /r/politics, also had very strict rules on what could be posted, which kept the data from getting too messy.
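To give a flavor of how simple the collection is, here is a minimal sketch using the PRAW library. The credentials and filename are placeholders, and my actual script handled more details (such as re-fetching scores after they settle):

```python
# pip install praw
import csv
import praw

# Placeholder credentials; register a "script" app at reddit.com/prefs/apps
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="election-sentiment-collector/0.1 (placeholder)",
)

# Stream new comments from /r/politics and append them to a CSV.
# Note: a comment's score matures over hours or days, so a real pipeline
# would re-fetch scores later rather than trust the value at stream time.
with open("politics_comments.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for comment in reddit.subreddit("politics").stream.comments():
        writer.writerow([comment.created_utc, comment.score, comment.body])
```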

However, Reddit still has several drawbacks. The people who browse Reddit are disproportionately young, male, white, and liberal compared to the general population. They also make up a smaller portion of the population than the audiences of giants like Facebook or Twitter: while 44% of American adults used Facebook to get news, only 4% used Reddit. These drawbacks are not insurmountable; polling itself has always had to counter the effects of small and unrepresentative samples as well.

Quantifying the Source Data

The next step was to collect and quantify data from Reddit. I created a simple script to collect and store months of data from a subreddit (specifically, /r/politics) in a single CSV file. Then, using pandas and matplotlib on Python 3, I started experimenting with ways to analyze the data. I settled on a simple method with only a few steps (sketched in code after the list):

1. Using several groups of keywords, such as “Hillary” and “Clinton,” or “Donald” and “Trump,” gather the total score of all comments that mention any of a group’s keywords for each day, then calculate the difference between the daily scores of the keyword groups. Weighting comments by their total score dramatically increases the number of people who have their voices heard by the algorithm: every day, Redditors cast 25 million votes on posts and comments, but post only 2 million comments in the same period.

2. Apply a simple rolling average to smooth the data.

3. Graph these data points over time to analyze trends in what Redditors are talking about.
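The three steps above boil down to a short pandas script. Here is a minimal sketch, assuming a CSV like the one from the collection sketch; the keyword lists and the seven-day window are illustrative, not the exact values I used:

```python
import pandas as pd
import matplotlib.pyplot as plt

KEYWORDS = {
    "Clinton": ["hillary", "clinton"],
    "Trump": ["donald", "trump"],
}

# Columns assumed: created_utc (seconds), score, body
df = pd.read_csv("politics_comments.csv",
                 names=["created_utc", "score", "body"])
df["date"] = pd.to_datetime(df["created_utc"], unit="s").dt.date
df["body"] = df["body"].astype(str).str.lower()

daily = pd.DataFrame()
for candidate, words in KEYWORDS.items():
    # Step 1: total score of comments mentioning any keyword, per day
    mask = df["body"].str.contains("|".join(words), na=False)
    daily[candidate] = df[mask].groupby("date")["score"].sum()

daily["Difference"] = daily["Clinton"] - daily["Trump"]

# Step 2: rolling average to smooth out day-to-day noise
smoothed = daily.rolling(window=7, min_periods=1).mean()

# Step 3: plot the trends over time
smoothed.plot()
plt.ylabel("Total comment score")
plt.show()
```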

At this point, I had a graph like the one below. As you might notice, the swings in the Reddit data line up with every major news event or scandal during the election.

Notes: The golden “Difference” line is simply the data obtained by subtracting Trump’s score from Clinton’s score. The data in this graph does not have a moving average applied to it.

It’s not hard to see how useful this data could be. Unlike polls, it is available in near-real-time and draws on a very large sample size. However, the data means little by itself quite yet; I need to show how it relates to the real world to prove its efficacy.

Matching the Baseline

To show that the data tracks the real world, I created a regression model that correlated the data in the graph above to the RCP polling average, with an offset of 15 days. The offset accounts for the fact that it usually takes a week or two after a news event before the RCP average consists only of polls conducted entirely after that event. It also means the model can effectively predict future poll averages (or the election outcome) two weeks in advance. As shown in the graph below, the model has a hard time predicting the raw popular vote share of each candidate, possibly because the actual polling average didn’t predict those figures well either. However, the prediction for the popular vote margin is very accurate: within 0.05% of the actual result, an astonishingly small margin. Of course it’s easier to predict the future once it has already happened, but this prediction could have been made on October 23rd, 2016, when the consensus media opinion still held that Clinton had a very solid lead.

Notes: The dashed lines are from data the model has already seen (‘training’), while the dotted-dashed lines on the far right are from predictions on new data (‘testing’). This graph uses moving averages for both the Reddit data and the polling average to smooth out noise. I also had to make a slight adjustment to the raw data to account for a rule change on the subreddit I collected data from.
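For the curious, here is a minimal sketch of this kind of lagged regression with scikit-learn. The file names, column names, and train/test split are placeholders; my actual model has more moving parts:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

OFFSET_DAYS = 15  # Reddit sentiment leads the polling average by ~two weeks

# Placeholder inputs: the smoothed daily scores from the earlier sketch
# and the RCP average, both with one row per calendar day
reddit = pd.read_csv("daily_scores.csv", index_col="date", parse_dates=True)
rcp = pd.read_csv("rcp_average.csv", index_col="date", parse_dates=True)

# Pair each day's Reddit features with the RCP margin observed
# OFFSET_DAYS later, so the model learns to forecast ahead
target = rcp["clinton_minus_trump"].shift(-OFFSET_DAYS)
data = reddit.join(target.rename("future_margin")).dropna()

# Train on the earlier part of the series; hold out the end for testing
split = int(len(data) * 0.8)
train, test = data.iloc[:split], data.iloc[split:]

features = ["Clinton", "Trump", "Difference"]
model = LinearRegression().fit(train[features], train["future_margin"])

# Each prediction is a forecast of the polling margin two weeks out
predictions = model.predict(test[features])
print(predictions)
```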

This correlation isn’t a one-off. Using the same method on Colorado polls, the model estimated a 5.1% win for Hillary Clinton, compared to her actual margin of 4.9%. The polling average for Colorado was much less accurate, predicting a 3.0% margin.

Running the model on the polling averages of other states did not always produce great predictions, but its output always correlated with the polls. I was even able to find correlations with the polls of Senate races!

With the baseline proven, I can explain several insights the data provides us.

Insights

Analyzing the data from both of the graphs above yields quite a few interesting insights about the fundamentals of the election that are not well known.

  1. The results of polls during the election were largely dependent on what was currently showing on the news. Negative press for Clinton meant her lead shrunk, and vice-versa. This also likely means that campaign activities such as rallies or get-out-the-vote efforts had a negligible impact on national polls.
  2. Despite the volatile nature of poll results throughout the election cycle, the long-term trends of my model show that the race always hovered near a popular vote margin at which Clinton was unlikely to win the Electoral College.
  3. Although news events did tend to affect polls, once a scandal or topic leaves the current news cycle, the poll bump or dip it caused tends to fade away.
  4. The Access Hollywood tapes specifically may have contributed to a relatively high amount of ‘non-response bias’ in national polling, as my model’s predictions during the aftermath of the scandal were consistently lower than the polling average.
  5. The effect of the Comey letter on the election outcome may have been overstated. Although the news event did cause a notable bump in the first graph, the entire affair happened after my model’s two-week prediction window. And as noted in point 3, the Comey letter scandal had time to leave the news cycle, and thus shed its effect on the polls, before election day.

This list is likely still incomplete, and I will keep working on finding new insights and angles in the near future.

What’s Next?

Although it has already been a long time since I started working on this project, I still have only touched the surface of what I can explore. Potential areas for future analysis of this data include:

  1. Applying more sophisticated techniques to the comment data, such as sentiment analysis (see the sketch after this list).
  2. Analyzing news sources for bias and accuracy, as well as their popularity relative to other news sources.
  3. Using the data to predict other types of polls, such as presidential approval ratings.
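On the first point: an off-the-shelf sentiment scorer such as NLTK’s VADER could separate comments that mention a candidate favorably from those that mention them in anger, which my current keyword-score method cannot do. A minimal sketch (VADER is just one example here, not something already wired into the tool):

```python
# pip install nltk
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()
comment = "I can't believe how badly that debate went."

# Returns negative/neutral/positive proportions and a compound score
# in [-1, 1], which could weight each comment's contribution by tone
print(analyzer.polarity_scores(comment))
```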

Aside from further analysis of the data, I also plan to release more of the work I have already done, including the full data sets (over 1 GB in size) and more details on how and why the model works.

Another thing I am working on is an online tool that lets anyone analyze the popularity of any group of words over time on a political subreddit; here’s a sneak peek!

If you think my work is cool and interesting, please like and share it to give it more exposure! If you have any questions, please leave a comment or contact me through the contact page in the navigation bar.
