Toys, Tools, and the Difference

By the time that 1st century Celtic druids claimed to see fortunes in a crystal ball, astrology was already 2,000 years old. For as long as humans have held hopes and fears, eager knowledge seekers across the world have sought to glimpse the future through diverse and imaginative sorceries. None have performed well (citation needed).
Undeterred by the failures of our determined ancestors, we still seek clairvoyance — and we still seek it from curious sources. Take Paul, for instance. Paul predicted the outcomes of Germany’s 2010 World Cup matches with astounding accuracy and was thereafter hailed as an oracle and clairvoyant.
Paul also receives 40% of his body’s oxygen intake through his skin. Paul has 8 tentacles. Paul is an octopus.
But the most intriguing modern fortunetellers are a surprising bunch: data scientists doing cutting-edge research.
Specifically, those who seek to improve predictions about the results of political elections by using “big data.”
At once sexy and infectious, big data is the reigning captain of all catchphrases. “Blockchain” and “A.I.” are currently plotting mutiny.
Technically, the term “big data” refers to any massive datasets and the advanced statistical methods used to analyze them. Colloquially, it connotes the idea that the mountains of digital footprints that we leave behind us are collected, organized, analyzed, and used by others to build incredible tools.
The particular credit card that resides in your wallet is the result of a big data analysis that predicted that people like you are statistically likely to be profitable to the company under the terms of that particular contract. You may be reading this story because big data predicted that you would be interested in it. Maybe you matched with your one true love because big data predicted that it was meant to be.
For some, this induces terror and tinfoil hats. But to scientists, this represents a treasure trove of answers to diverse questions about past, present, and future human behavior.
The prescience of big data is only just beginning, as a flurry of recent research demonstrates. Google’s Flu Trends project famously identified impending flu outbreaks by analyzing search trends (e.g., “flu symptoms” or “best way to tell your boss that you’re sick”) ahead of the Centers for Disease Control’s official surveillance reports, albeit with mixed success.
Other researchers have accurately predicted the Amazon sales ranking of books by tracking the number of times they were mentioned on diverse blogs across the web.
Similarly, the ticket sales of new movies can be accurately estimated by the quantity of views on their respective Wikipedia pages, months before their box office opening.
The stock market dances just behind the movements of the collective emotional sentiment of the Twittersphere.
The underlying concept is simple enough: The tiny, individual actions of internet users can be gathered up into an estimate of current public opinion or behavior trends, which help to predict future trends in public opinion and behavior.
Applying big data to forecasting political elections is a similar concept. Researchers mine massive amounts of data that could — in the aggregate — indicate a specific candidate’s popularity. Several recent, peer-reviewed studies have demonstrated statistically significant correlations between a candidate’s number of Twitter mentions, Facebook likes, or Wikipedia pageviews (relative to their opponent) and the votes that they receive (relative to their opponent). More buzz equals more votes.
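To make that concrete, here is a minimal sketch of the kind of test those studies run, written in Python. The races and numbers below are invented for illustration; the real studies use hundreds of actual candidate pairs.

```python
# A toy correlation test: does a candidate's share of the online buzz
# track their share of the vote? All numbers here are hypothetical.
from scipy.stats import pearsonr

# Each pair: candidate A's share of Wikipedia pageviews (relative to
# their opponent) and A's share of the two-candidate vote.
buzz_share = [0.62, 0.48, 0.55, 0.71, 0.44, 0.58, 0.39, 0.66]
vote_share = [0.54, 0.47, 0.51, 0.60, 0.49, 0.53, 0.45, 0.57]

r, p = pearsonr(buzz_share, vote_share)
print(f"r = {r:.2f}, p = {p:.4f}")  # "more buzz equals more votes"
```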
This has led to the sexy, infectious assertion that big data from social media is the shiny new way to predict election results.
This notion lands somewhere between “misleading” and “dead wrong.”
Strong and weak predictors
The math involved in making predictions is pretty simple. Allow me to use an analogy:
Let’s say you are painting your house, and your goal is to reach the top — but you need a ladder. Some ladders are tall and can almost get you the entire way up. Others are short, and — although they are clearly better than nothing — do not get very far.
When striving toward the lofty goal of a perfectly accurate prediction, any piece of information can be some kind of ladder. Some pieces are highly accurate predictors (tall ladders); some are far less accurate (short ladders).
For example, if we were trying to predict how hungry I would be at 11 a.m., a strong, but not perfect, predictor would be the calorie count of my breakfast. This would be a tall ladder that does not quite reach the top. Or, in statistics terms, a strong “significant correlation.”
A weaker predictor would be the utensil I used to eat my breakfast (I tend to eat light breakfasts with a spoon and heavier breakfasts with a fork). This is a short ladder — so although it does get me off the ground, it is only slightly better than a random guess. This is still a “significant correlation,” but it is a weak one.
Not only is the utensil information a weak predictor, but it also has no practical benefit. The whole point of the utensil data was to tell me about the kind of food I ate. If I already know the exact calorie count of my breakfast, then knowing what utensil I used does not add any new information. The utensil data would only be helpful if I didn’t already know my calorie count, because it would give me a (much less accurate) approximation of what that calorie count was (spoon/yogurt vs fork/pancakes). In this case, the shorter ladder is of no help because I already have a taller ladder. Any usefulness of the shorter ladder is already encompassed in the taller ladder.
Mathematically, when we do predictions, these “shorter ladder” variables do not make predictions any better.
In sum, if you want to reach higher when painting your house, it doesn’t help you if someone gives you a second ladder that is of equal or shorter size.
There are only two things that can help you: Either find a second ladder that is taller than the first ladder, or find a way to extend your ladder (like a raised platform to set the ladder on, or an attachment to the top).
Similarly, when trying to improve my hunger prediction, there are only two types of predictor information that will improve it. I could try to find a predictor that is better all by itself (a taller ladder). Or I could find additional predictors that (unlike the utensil data) would give me information that I do not already have. These would extend my ladder and make my prediction even better.
Some days, I eat breakfast at 6:30. Other days, I eat breakfast at 8:30. The elapsed time between my breakfast and 11 a.m. certainly would be a good predictor of my 11 a.m. hunger, and, importantly, this information is not already included in my information about calorie count. Even though the elapsed time might be — on its own — inferior to calorie count, it is still helpful because it contributes new information. So the time that I ate breakfast is definitely an extension, not a useless shorter ladder.
If I use elapsed time alongside calorie count to make a prediction, the accuracy of that prediction will be significantly greater than if I used calorie count alone.
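Here is a toy version of that logic in Python, using invented breakfast data. Note how the redundant predictor (utensil) barely moves the model’s accuracy, while the genuinely new one (elapsed time) does:

```python
# Invented breakfast data: 'utensil' is just a coarse proxy for calories
# (a shorter, redundant ladder), while 'hours_since' carries information
# that calorie count does not (an extension).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
calories = rng.uniform(150, 700, n)
utensil = (calories > 400).astype(float)      # fork for heavy breakfasts
hours_since = rng.uniform(2.5, 4.5, n)        # time from breakfast to 11 a.m.
hunger = 100 - 0.1 * calories + 8 * hours_since + rng.normal(0, 5, n)

def r2(*predictors):
    X = np.column_stack(predictors)
    return LinearRegression().fit(X, hunger).score(X, hunger)

print(f"calories alone:  R2 = {r2(calories):.3f}")
print(f"+ utensil:       R2 = {r2(calories, utensil):.3f}")      # ~no gain
print(f"+ elapsed time:  R2 = {r2(calories, hours_since):.3f}")  # real gain
```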
Easy peasy.
Back to election predictions
The very best way to find out who the public is going to vote for is to just ask them. Polls are by far the best predictor of how votes are cast on election day. This ladder alone gets us more than 85% up the side of the house toward a perfect prediction.
So, how can we make it better? To chip away at that remaining 15%, we need to understand what information might be missing from poll-based predictions. What flaws in poll-based predictions cause that 15% of error? Are there other predictors that circumvent those flaws and add new information?
If we want to reach higher toward greater accuracy, we either need a new variable that is better than the polls by itself (that is, a whole new, taller ladder) or we need one that — despite being inferior by itself — contributes new information that the taller ladder does not.
Enter big data.
Remember, any predictor can be either the tallest ladder, a shorter ladder, or an extension. The inventive methods and data sources that people use to try to improve upon election predictions fall into one of these three categories.
So the question we must answer is which category big data falls into.
Is it a new, taller ladder that surpasses the polls in its ability to predict elections?
Is it a shorter ladder that is decidedly better than nothing, but decidedly inferior to the polls?
Is it an extension that we can use to make our existing ladder even better?
Together with a colleague at UC Santa Barbara and my brother at MIT, I set out to answer these questions.
Finding the right extension
One of the largest sources of error in poll-based predictions is that poll respondents often exaggerate their likelihood of voting and even their support for — or knowledge of — certain candidates and issues. People don’t want to admit being uninformed or apathetic, and often won’t admit their support for a candidate that they think is unpopular (although this likely is not the main cause of the misses by pollsters regarding Trump, specifically). Essentially, any time we measure something through self-report, the data is going to have a notable amount of error.
The beauty of building a barometer of public opinion from Google searches, Wikipedia views, Facebook likes, or Twitter mentions is that it sidesteps these barriers.
It doesn’t ask people what they’re going to do. Instead, it observes what they’re already doing. It doesn’t rely on people responding honestly to a stranger on the phone. Instead, it quietly captures the self-motivated behaviors done in the privacy of one’s own web browser. So while there are many ways that big data measures are inferior to traditional polling (such as sampling and construct validity), there are also ways that they provide new, valuable information that polls do not.
Testing the predictive abilities of Wikipedia
Wikipedia is a wonderful thing. Not only does it rival the Encyclopedia Britannica in its accuracy, but (unlike Google and Facebook) it makes its usage data freely available to the public.
Of these four digital barometers of public opinion, we chose Wikipedia browsing data for our analysis, using it as an indicator of public interest in and engagement with specific candidates. The theory is that the more traffic a candidate’s page is getting, the more public support they will have.
Specifically, we tested the ability of Wikipedia pageviews to contribute new information that would improve upon a poll-based model that was already rigorous.
To do this, we looked at the two leading candidates from every U.S. senatorial and gubernatorial election between 2008 and 2014. We gathered a robust assortment of polling data and other common variables used to build modern election forecasts, resulting in a prediction model that — in our terms — got more than 90% up the wall.
In fact, our poll-based prediction model outperformed even the strongest models used during those elections.
Wikipedia data had to pull a lot of weight if it was to extend this already-tall ladder.
Then we extracted the web traffic data for each of these candidates’ pages on each of the 200 days leading up to their election.
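For readers who want to try something similar, here is a sketch of how one can pull that kind of daily traffic data today, using the public Wikimedia Pageviews API. Note that this REST API only reaches back to mid-2015, so data for 2008–2014 races has to come from Wikipedia’s older raw pagecount archives; the article title and dates below are hypothetical.

```python
# A sketch of pulling daily pageview counts from the public Wikimedia
# Pageviews REST API (coverage starts mid-2015; earlier periods require
# the raw pagecount dumps instead).
import requests

def daily_pageviews(article, start, end, project="en.wikipedia"):
    """Return a list of daily view counts for one Wikipedia article."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/user/{article}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers={"User-Agent": "election-demo script"})
    resp.raise_for_status()
    return [item["views"] for item in resp.json()["items"]]

# Hypothetical example: the 200 days before an election on 2022-11-08.
views = daily_pageviews("Example_Candidate", "20220422", "20221108")
```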
In the end, we found that each candidate’s Wikipedia activity correlated significantly with their respective polls and with their election results. This is precisely what the similar “big data prediction” studies had also found: digital trends are indeed a ladder that gets you off the ground.
However, it doesn’t get us very far. Just like in the other studies, the correlation is significant, but predictions generated from Wikipedia usage alone would be far weaker than our traditional poll-based prediction. It is a short ladder — only about half the predictive ability of our poll-based model. So, used by itself, it is of no practical benefit, because it is better to just use the polls.
The real question, remember, is whether Wikipedia data can improve the prediction above and beyond the polls when the two are combined. Is it utensil data (redundant) or elapsed time (new information)?
The real moment of truth came with a hierarchical regression test, which showed — without a doubt — that Wikipedia does contribute new information, making our already-robust prediction model significantly stronger when added on top.
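For the statistically curious, here is a minimal sketch of what such a hierarchical regression test looks like in Python (statsmodels). The data and column names are synthetic stand-ins invented for illustration, not our actual dataset:

```python
# Hierarchical regression sketch: fit a poll-only model, then test whether
# adding a Wikipedia pageview measure significantly improves it.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n = 300
races = pd.DataFrame({
    "poll_average": rng.normal(50, 8, n),  # candidate's polling average
    "wiki_share": rng.normal(0, 1, n),     # standardized pageview share
})
# Synthetic truth: votes driven mostly by polls, plus a small wiki signal.
races["vote_share"] = (0.9 * races["poll_average"]
                       + 1.5 * races["wiki_share"]
                       + rng.normal(0, 3, n))

base = smf.ols("vote_share ~ poll_average", data=races).fit()
full = smf.ols("vote_share ~ poll_average + wiki_share", data=races).fit()

# F-test on the nested models: a significant result means the pageview
# measure adds information that the polls do not already contain.
print(anova_lm(base, full))
print(f"R-squared change: {full.rsquared - base.rsquared:.4f}")
```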
The takeaway is that clever, curious digital indicators of public opinion should be seen as a strategic complement to the polls, not as a replacement. In the world of election predictions, the polls are still the tallest ladder, and creative big data sources that reveal social sentiment are but a small extension — albeit a useful one.
To many, the rise of big data heralds the twilight of the days in which academics studied public opinion by cold-calling random telephone numbers from a lumpy chair in a stuffy office. Unfortunately, this is not (yet) the case.
Big data is still pretty cool, though. Basking in the glorious sunrise of the digital age, any inquisitive person with an internet connection and some basic coding chops can unleash an automated search application that returns vast mountains of data about how, when, and where the world’s internet users consume information and interact with each other — complete with breakdowns by location, time, age, gender, social status, and personal interests.
But we need to be able to identify when someone’s use of big data is just a toy — novel and amusing but not practical. And we need to be able to identify when it is a genuine tool.
Without the acuity to understand how to apply and interpret our new magical powers, we can “predict the future” no better than a crystal ball or Paul the Octopus.
Abel Gustafson is a PhD candidate at the University of California at Santa Barbara. His research develops ways to improve the strategic communication of science.
