Socially generated data is all the rage these days. In marketing and political analysis, data from Twitter, Facebook, Google, Instagram, and even Reddit has been fed into models that aim to explain and predict behavior. However, data generated via social media is inherently biased, and often too narrow in scope to model the behavior of large segments of the population.
Recently, I got in touch with my friend Yosef (a pseudonym; he values his privacy), who is studying for his Master’s degree in Data Science at Oxford. He introduced me to a paper by Dr. Taha Yasseri that uses Wikipedia traffic data to predict electoral outcomes. The approach could reshape the field of electoral modeling, and perhaps provide some insight for data scientists in other fields.
The Problem with Social Media
The landscape of social media users is heavily skewed towards specific demographics. Take Twitter: a study conducted on behalf of the Association for the Advancement of Artificial Intelligence (AAAI) found that Twitter’s user base consists mainly of Caucasian males living in populous (urban) areas. Any analysis of Twitter data is therefore disproportionately influenced by this demographic, resulting in a biased snapshot of the whole.
The “silent majority” problem is another difficulty in using socially generated data. Social media platforms are often dominated by opinionated, active users, drowning out the vast majority who either stay quiet because of perceived social pressure or are merely content with lurking. Thus, taking a sample from the vast ocean of socially generated data is like scooping up the top layer of an anthill: by no means indicative of the whole.
Of course, one can control for these interferences, but deriving a working model from a flawed dataset is not ideal. A better solution is needed.
Wikipedia to the Rescue
Most voters, like people in general, seek out authoritative information to make informed decisions. For most web users, Wikipedia has grown from “that source your high school teacher distrusts” to an authoritative voice on all things in life. We use Wikipedia for research, validation, or simple boredom browsing, trusting its team of detail-oriented editors to present accurate information in today’s world of fake news and opinion pieces.
Thus, as reasoned by Dr. Yasseri, Wikipedia page views should provide valuable insights for electoral modeling. Unlike Twitter, Facebook, and Instagram, Wikipedia is a neutral, passive platform that does not suffer from the demographic skew or the silent majority problem that plague other social media platforms.
There are still problems, however. People are far more likely to seek information if they are considering changing their vote, meaning that swing voters are potentially the primary driver of page views on Wikipedia. Furthermore, a large segment of the populace gets its information from news media rather than from online sources such as Wikipedia.
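For readers who want to look at page view numbers themselves, here is a minimal sketch using the public Wikimedia Pageviews REST API. It is not the data pipeline from the paper. The helper names `pageviews_url` and `total_views` are my own, and the article title and dates in the usage note are placeholders.

```python
from urllib.parse import quote

# Per-article endpoint of the Wikimedia Pageviews REST API.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build the URL for daily views of one article (dates as YYYYMMDD)."""
    # Wikipedia titles use underscores instead of spaces; escape the rest.
    title = quote(article.replace(" ", "_"), safe="")
    # "all-access" covers desktop and mobile; "user" excludes known bots.
    return f"{BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a decoded JSON response."""
    return sum(item["views"] for item in payload.get("items", []))
```

To use it, request `pageviews_url("Some Party", "20190501", "20190526")` with any HTTP client (Wikimedia asks clients to send a descriptive User-Agent header), decode the JSON body, and pass the result to `total_views`.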
For those interested in the methodologies employed by Dr. Yasseri, please check out his paper (shout out to co-author Dr. Jonathan Bright). Below is a summary of the findings:
- Wikipedia is a powerful tool for predicting swing votes but is insufficient for predicting overall electoral outcomes. News mentions, basic information about political parties, and other variables must be included to build a well-informed model of electoral prediction.
- New political parties and candidates tend to attract a disproportionate amount of Wikipedia page views. Keep this in mind if you are interested in modeling with Wikipedia data.
- Media mentions do not drive Wikipedia page views. Media mentions are overly biased towards incumbent parties and candidates, whereas Wikipedia traffic is driven by “novelty,” i.e., the newness of a political party/candidate.
- The study was conducted in the EU, not the US. Differences in internet access, demographic composition, and underlying biases could affect how well the findings carry over. My friend Yosef is currently attempting to apply this model to US regional elections; the results are not yet known.
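To make the findings above a bit more concrete, here is a small sketch of how page views could be turned into model inputs: each party's share of total views, plus a flag for parties that are new since the previous election, so that the novelty effect can be accounted for. This is my own illustration, not the model from the paper. The function names, the feature names, and all numbers are made up.

```python
def view_shares(views):
    """Convert raw page view counts per party into fractions of the total."""
    total = sum(views.values())
    if total == 0:
        return {party: 0.0 for party in views}
    return {party: count / total for party, count in views.items()}


def build_features(views, previous_parties):
    """Pair each party's view share with a flag for parties new since the last election."""
    shares = view_shares(views)
    return {
        party: {"view_share": share, "is_new": party not in previous_parties}
        for party, share in shares.items()
    }


# Made-up numbers, purely for illustration.
views = {"Party A": 6000, "Party B": 3000, "Party C": 1000}
features = build_features(views, previous_parties={"Party A", "Party B"})
```

In a fuller model, these features would sit alongside news mentions and basic party information rather than replace them.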
Wikipedia traffic data could offer a new angle for predicting electoral outcomes. It could also be applicable in other areas, such as modeling consumer behavior (the likelihood of switching from an established brand to a new one), or just human behavior in general (use your imagination). It also suggests that media biases may not have a significant impact on swing voters, which has interesting implications for analyzing the Bernie and Yang campaigns of the 2020 Democratic Primaries. However, as always, more validation is needed.