Can Wikipedia predict the 2016 US Presidential Election?

Zareen Farooqui
Becoming a Data Analyst
Jul 27, 2016

Here’s what I learned while exploring Wikipedia Clickstream Data:

  1. Wikipedia could have predicted the Republican nominee many months ago.
  2. Bernie Sanders is much more researched than Hillary Clinton.
  3. People are more curious about the 2016 presidential candidates’ personal lives than their stances.
  4. Google > Facebook > Twitter for politics.
  5. Neither vice presidential running mate is very well known.

The Wikimedia Foundation intermittently releases Wikipedia Clickstream Data. This data shows where the traffic to Wikipedia pages comes from (like Google, Facebook, Twitter or other specific Wikipedia pages). This helps to understand what people are curious about.

I followed this article by Ellery Wulczyn, a data scientist at the Wikimedia Foundation, to explore the February, March and April 2016 Clickstream Data.

What is the data?

  • prev: where the hit came from (another Wikipedia article, Google, etc.)
  • curr: title of current article
  • type: “link” if prev and curr are both Wikipedia articles, “external” if prev is a non-Wikipedia source (Google, Facebook, etc.), “other” when prev and curr are both Wikipedia articles but the curr article was searched for or the prev article was spoofed
  • n: number of times that link was visited that month
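To make the four columns concrete, here is a minimal sketch of loading a clickstream file with pandas. It is not the notebook code from this post: the helper name `load_clickstream` is made up for illustration, the sample rows are invented, and the referrer label `other-google` is an assumption about how the 2016 dumps mark Google traffic.

```python
import io
import pandas as pd

# Tiny made-up sample in the same tab-separated layout (not real counts).
SAMPLE_TSV = (
    "prev\tcurr\ttype\tn\n"
    "other-google\tBernie_Sanders\texternal\t120\n"
    "Hillary_Clinton\tBill_Clinton\tlink\t45\n"
    "Donald_Trump\tMelania_Trump\tlink\t60\n"
)

def load_clickstream(source):
    """Read a clickstream TSV (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        usecols=["prev", "curr", "type", "n"],
        dtype={"prev": str, "curr": str, "type": str, "n": "int64"},
        keep_default_na=False,  # keep titles such as "NaN" or "NA" as plain strings
        quoting=3,              # csv.QUOTE_NONE: article titles may contain quote characters
    )

df = load_clickstream(io.StringIO(SAMPLE_TSV))
print(df)
```

The same function should accept a path to one of the downloaded monthly files in place of the in-memory sample.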

Candidate Popularity Analysis

Below I explore a simple question: can Wikipedia page visits predict the success of the 2016 presidential candidates? To answer it, I look at who dropped out of the Republican and Democratic nomination races over three months (February, March and April 2016).

Here is a graph showing the number of visits to the Democratic and Republican candidates’ Wikipedia pages in February 2016:

It’s no secret that part of Donald Trump’s campaign strategy is to dominate the media, but holy shit. His Wikipedia page got more than twice as many hits as the next most researched candidate, Bernie Sanders. Knowing how the Democratic nomination played out, it’s also interesting that Sanders got over twice as many hits as Hillary Clinton.

Of the 10 least popular candidates on Wikipedia, 8 dropped out of the election in February 2016.
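Visit counts like these can be computed by summing `n` over every row whose `curr` is a candidate's page. The sketch below shows one way to do that with pandas; the function name `visits_per_page`, the candidate list and all of the numbers are illustrative assumptions, not the real February figures.

```python
import pandas as pd

# Made-up rows for illustration only; real counts come from the monthly dumps.
df = pd.DataFrame(
    {
        "prev": ["other-google", "other-facebook", "other-google",
                 "Bill_Clinton", "other-google"],
        "curr": ["Donald_Trump", "Donald_Trump", "Bernie_Sanders",
                 "Hillary_Clinton", "Python_(programming_language)"],
        "type": ["external", "external", "external", "link", "external"],
        "n": [500, 100, 250, 120, 999],
    }
)

# Hypothetical shortlist; the full analysis would list every candidate's article title.
CANDIDATES = ["Donald_Trump", "Bernie_Sanders", "Hillary_Clinton"]

def visits_per_page(clicks, pages):
    """Total recorded visits per page, summed over all referrers, largest first."""
    subset = clicks[clicks["curr"].isin(pages)]
    return subset.groupby("curr")["n"].sum().sort_values(ascending=False)

totals = visits_per_page(df, CANDIDATES)
print(totals)
```

Running the same function on each month's data gives the month-by-month comparison.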

This is the March 2016 data with candidates still running:

Even crazier. In March, Trump’s Wikipedia page got more than ten times as many visits as the next leading candidate. His page views also more than doubled from February. Meanwhile, the gap between the Democratic candidates narrows, but Sanders still leads by over half a million visits.

In March 2016, Marco Rubio and Ben Carson, the two least researched candidates, dropped out of the race.

Let’s look at April.

All candidates got significantly fewer views than in the previous months, yet Trump still leads on page visits by a landslide. Clinton trails Sanders by more than 400,000 views.

No candidates dropped out of the race in April 2016. However, in May 2016 Ted Cruz and John Kasich called it quits.

Wikipedia data successfully predicted that Donald Trump would become the Republican nominee, as less popular candidates dropped out of the race. It wasn’t as accurate with the Democratic nominee. Sanders’ popularity on Wikipedia may make sense, though, since Hillary Clinton has been a household name for many years. Unless you’re really into politics or live in New England, you might not have known who Sanders was until he started running for president.

Also, popularity on Wikipedia doesn’t mean support or endorsements. It means people are interested in learning more about that candidate.

Personal Lives > Everything Else

This sankey diagram shows incoming and outgoing traffic to Trump, Sanders, and Clinton’s pages. The thickness of the gray lines (called links) illustrates the volume of traffic flow between Wikipedia pages (called nodes).

The 2016 presidential primaries and election pages drive only a small share of the total incoming traffic to these candidates’ pages. Interestingly, of the people who continued on to another Wikipedia article, more went to the candidate’s spouse’s page than to any other link in the article.
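A Sankey chart of this kind is built from (from, to, weight) rows, and the clickstream already has that shape. Here is a rough sketch of pulling the heaviest incoming and outgoing article-to-article links for one page. The function name `sankey_rows` and the sample counts are invented for illustration, and this is not necessarily how the diagram above was produced.

```python
import pandas as pd

# Made-up counts for illustration only.
df = pd.DataFrame(
    [
        ("Hillary_Clinton", "Bill_Clinton", "link", 300),
        ("Hillary_Clinton", "Chelsea_Clinton", "link", 150),
        ("United_States_presidential_election,_2016", "Hillary_Clinton", "link", 80),
        ("other-google", "Hillary_Clinton", "external", 900),
        ("Bernie_Sanders", "Jane_O'Meara_Sanders", "link", 200),
    ],
    columns=["prev", "curr", "type", "n"],
)

def sankey_rows(clicks, page, k=3):
    """Return [from, to, weight] rows for the top-k incoming and outgoing links of a page."""
    links = clicks[clicks["type"] == "link"]  # article-to-article traffic only
    incoming = links[links["curr"] == page].nlargest(k, "n")
    outgoing = links[links["prev"] == page].nlargest(k, "n")
    both = pd.concat([incoming, outgoing])[["prev", "curr", "n"]]
    return [[r.prev, r.curr, int(r.n)] for r in both.itertuples(index=False)]

rows = sankey_rows(df, "Hillary_Clinton")
for row in rows:
    print(row)
```

Each row maps onto one gray link in the diagram: the first two entries are the nodes and the third sets the thickness.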

There is an inherent bias to click on links at the top of a page compared to links which appear later on Wikipedia pages.

I quickly scanned the candidates’ pages to see if that bias might explain it. Not quite. On Trump’s and Sanders’ pages, spouse names don’t appear in the initial full-screen view of the page.

Jane O’Meara doesn’t show up in this initial view of the article

They first show up in the right-hand sidebar, after you scroll down a little, and not again until at least halfway down the page.

1st link to Jane O’Meara
2nd link to Jane O’Meara is halfway down Sanders’ article

However, on Clinton’s page, her spouse’s name appears twice in the first paragraph, so it makes more sense that people click through to his page.

2 links to Bill Clinton in the first paragraph

Social platforms and politics

March 2016 FB, Google and Twitter traffic to candidate pages

Again. Trump dominates social platforms. This diagram visualizes March 2016 traffic flow from Facebook, Google and Twitter to candidates’ pages. It’s nearly impossible to see traffic from Facebook or Twitter to Clinton’s page because it’s such a small value compared to the other traffic (657 clicks from Twitter, 1638 clicks from Facebook).

It’s surprising how little traffic Twitter drives. My guess is that Twitter drives much more political traffic and buzz than this visualization displays, just not to the candidates’ Wikipedia pages. Perhaps tweets typically link to sources other than Wikipedia, like the NY Times, YouTube and blogs.
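For anyone who wants to reproduce a referrer breakdown like this, here is a hedged sketch. It assumes external sources appear in `prev` under labels such as `other-google`, `other-facebook` and `other-twitter`; check the labels in the release you download, since they may differ. The function name `referrer_table` and all counts are made up.

```python
import pandas as pd

# Assumed referrer labels mapped to display names (verify against your data).
REFERRERS = {
    "other-google": "Google",
    "other-facebook": "Facebook",
    "other-twitter": "Twitter",
}

# Made-up counts for illustration only.
df = pd.DataFrame(
    [
        ("other-google", "Donald_Trump", "external", 900),
        ("other-facebook", "Donald_Trump", "external", 60),
        ("other-twitter", "Donald_Trump", "external", 20),
        ("other-google", "Hillary_Clinton", "external", 400),
        ("other-facebook", "Hillary_Clinton", "external", 9),
        ("other-twitter", "Hillary_Clinton", "external", 7),
        ("Bill_Clinton", "Hillary_Clinton", "link", 50),
    ],
    columns=["prev", "curr", "type", "n"],
)

def referrer_table(clicks, pages):
    """Pivot visits into one row per page and one column per external source."""
    mask = clicks["prev"].isin(list(REFERRERS)) & clicks["curr"].isin(pages)
    subset = clicks[mask].assign(source=lambda d: d["prev"].map(REFERRERS))
    return subset.pivot_table(
        index="curr", columns="source", values="n", aggfunc="sum", fill_value=0
    )

table = referrer_table(df, ["Donald_Trump", "Hillary_Clinton"])
print(table)
```

Dividing each row by its total would show the share of traffic per source, which is easier to compare when one source dwarfs the others.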

Vice Presidents?

In the past two weeks, Trump and Clinton announced their vice presidential running mates. Unfortunately, the Wikimedia Foundation hasn’t released data since April 2016 (if they do, I’ll update this), so my data is slightly outdated.

In April, Tim Kaine and Mike Pence were not heavily researched on Wikipedia.

number of visits to Wikipedia pages

Here are some pages which received more visits than either of the VP nominees:

Here are some pages which received more visits than both the VPs combined:

How I did this

Here’s a link to the Python code I wrote in a Jupyter Notebook. I used the pandas library for analysis and Google Charts for visualization.

These are the datasets I used:

  • February 2016 Wikipedia Clickstream (1.25 GB)
  • March 2016 Wikipedia Clickstream (1.18 GB)
  • April 2016 Wikipedia Clickstream (1 GB)
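Files of this size can be awkward to load in one go. One option, sketched below, is to read a file in chunks with pandas and keep only the rows that touch the pages of interest. The function name `filter_clickstream` is hypothetical and the sample rows are invented; this is an illustration of the technique, not the notebook code.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for a multi-gigabyte monthly file.
SAMPLE_TSV = (
    "prev\tcurr\ttype\tn\n"
    "other-google\tTim_Kaine\texternal\t40\n"
    "other-google\tCat\texternal\t70\n"
    "Mike_Pence\tIndiana\tlink\t15\n"
    "Dog\tCat\tlink\t12\n"
    "other-twitter\tMike_Pence\texternal\t11\n"
)

def filter_clickstream(source, pages, chunksize=1_000_000):
    """Stream a clickstream TSV and keep rows where prev or curr is in `pages`."""
    wanted = set(pages)
    kept = []
    reader = pd.read_csv(
        source, sep="\t", quoting=3, keep_default_na=False, chunksize=chunksize
    )
    for chunk in reader:  # each chunk is an ordinary DataFrame
        hit = chunk["prev"].isin(wanted) | chunk["curr"].isin(wanted)
        kept.append(chunk[hit])
    return pd.concat(kept, ignore_index=True)

# chunksize=2 only to exercise the chunking on the small sample.
small = filter_clickstream(io.StringIO(SAMPLE_TSV), ["Tim_Kaine", "Mike_Pence"], chunksize=2)
print(small)
```

With a real download, pass the path to the file instead of the in-memory sample and leave `chunksize` at a larger value.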
