“It’s Trump”: The Biggest Analytical Error in the Era of Big Data.
I will always remember the moment I found out, 4:16 AM on Wednesday, November 9th, the way I will always remember the moment I found out the Twin Towers were brought down. Both pieces of news were equally shocking, equally mind-numbing.
I went to bed the previous night with no doubt whatsoever that I would wake up to an America that had just elected its first female President. There were signs all around me here in New York City that many were sure of the same outcome. Earlier in the day I picked up a copy of New York magazine’s Oct 31-Nov 13 issue, which featured, prematurely, a close-up portrait of Mr. Trump with the word “Loser” plastered across his face in a bold red and white design. It was pretty unambiguous what their prediction of the outcome was. That same day, I forwarded my undergraduate students a link to an article by Andrew Gelman, the director of the Applied Statistics Center at Columbia University. In that piece, which was posted on Columbia University’s website, Dr. Gelman declared that Mr. Trump had only a 1 in 10 chance of winning the election. That election night, the last thing I said to my 5-year-old son before we fell asleep was how happy I was that he was here to share this moment in history with me.
So imagine how I questioned reality when I woke up at 4:16 AM, looked at my smartphone and saw, in another set of bold red and white letters, this time on the CNN website, the following: “It’s Trump.” It jolted me awake, yet left me wondering if this was a dream. It was the very definition of surreal. After frantically pressing buttons and viewing a few more websites, I realized that this was not a dream or a hallucination. The outcome was very real, even if our data had led us to expect a different one.
At this point, I must mention that I did not vote for Hillary Rodham Clinton, even though I was looking forward to our country electing its first female President. My disbelief was not at failing to get what I chose, but at how thoroughly both external data and my own cognitive processes had deceived me, and millions of others, into thinking that a Clinton win was a certainty.
So how did we make what I would argue is the biggest analytical error since the beginning of the big data phenomenon?
While no single explanation will suffice, three lines of reasoning from scientific research explain a lot of what went wrong. The first is data quality. A New York Times article titled “Automated Pro-Trump Bots Overwhelmed Pro-Clinton Messages, Researchers Say” cited research suggesting that 18 percent of Twitter’s election-related traffic came from automated accounts, i.e., fake user accounts, aka chatbots. And while Twitter executives deny that the social media platform influenced the election in any way, we can’t ignore the fact that, according to the same research, the top 20 accounts on Twitter, most of them bots and automated accounts, sent out 1,300 tweets per day, which generated an additional 234,000 tweets. The study also noted that most of these automated accounts tend to propagate negative news much more than positive news. This wellspring of fake data, which does not represent the views of any particular individual(s), will now be analyzed to predict the behavior of actual people. And from the perspective of the voter, the negative news these bots are programmed to disseminate will only serve to isolate voters even more from each other and from the real candidates. Without ever questioning its origins or veracity, we placed too much emphasis on social-media-generated data when we thought about the election.
The second explanation for our inability to accurately predict the election is what we refer to in statistics as a biased sample. All the data in the world are worthless if they do not come from a sample truly representative of the population they are meant to make a prediction about, a feat much harder to accomplish than it sounds! The predictions that were being made were based on polls that sampled their participants in a biased way. The sample should have been representative of the people who actually turned out to vote on November 8th, not simply of people who shared the pollsters’ opinions, those whom a possibly biased sampling method happened to reach, or those who chose to respond to the poll (BTW: who actually participates in these polls?). No statistical test can rescue a biased sample. We know that the various polls and predictions were based on a biased sample and not a truly representative one because they got the outcome wrong (they haven’t been able to accurately predict the election for quite some time now). They failed to predict who the actual voters voted for on November 8th.
The election proved that, despite the unprecedented ability we currently have to collect data, more isn’t always better when it comes to using that data to predict behavior. Most big data are collected in ways that are biased in many respects, the first being that they sample a certain demographic, i.e., those who choose to have a presence on social media such as Facebook, LinkedIn, Twitter, Snapchat, Instagram, etc., and then, among them, those who further choose to participate in the data collection effort. These modes of communication are usually limited to the tech-savvy, coastal elites who have the time to provide specific data. Datasets can only predict behavior for precisely the population they represent, not for anyone outside that strictly defined population. That so many people predicted the wrong outcome for the election suggests they were all using the same biased data. And there is no statistical way to rescue data collected from a sample of limited representativeness. Such data remain useless for many purposes.
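To make the "more isn't always better" point concrete, here is a minimal simulation sketch. Everything in it is hypothetical and chosen only for illustration: the two voter groups, their 70 and 30 percent support rates, and the 80 percent over-representation of one group are made-up numbers, not election data. It polls an imaginary electorate twice at each sample size, once representatively and once with a sample that over-reaches one group.

```python
import random

# Hypothetical electorate (illustrative numbers, not real data):
# two equally sized groups with different support for a candidate.
GROUP_SHARE = {"A": 0.5, "B": 0.5}
SUPPORT = {"A": 0.70, "B": 0.30}
TRUE_SUPPORT = sum(GROUP_SHARE[g] * SUPPORT[g] for g in GROUP_SHARE)


def poll(sample_size, share_a_in_sample, rng):
    """Estimate support when a fraction `share_a_in_sample`
    of the respondents happens to come from group A."""
    yes = 0
    for _ in range(sample_size):
        group = "A" if rng.random() < share_a_in_sample else "B"
        if rng.random() < SUPPORT[group]:
            yes += 1
    return yes / sample_size


def compare(sizes=(1_000, 100_000), seed=0):
    """Run a representative and a biased poll at each sample size."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        results[n] = {
            "representative": poll(n, 0.5, rng),  # matches the electorate
            "biased": poll(n, 0.8, rng),          # over-reaches group A
        }
    return results


if __name__ == "__main__":
    print("true support:", TRUE_SUPPORT)
    for n, estimates in compare().items():
        print(n, estimates)
```

Under these assumptions the representative poll settles near the true value as the sample grows, while the biased poll settles just as confidently near 0.8 × 0.70 + 0.2 × 0.30 = 0.62. Collecting a hundred times more responses only makes the wrong answer more precise.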
The other major explanation for why the “It’s Trump” declaration came as a shock to many of us lies within our own cognitive processes. It is a phenomenon called confirmation bias. Confirmation bias, as the term is typically used in the psychological literature, connotes the seeking or interpreting of evidence in ways that are partial to existing beliefs, expectations, or a hypothesis in hand. In other words, we tend to seek out, and place more emphasis on, evidence that supports our pre-existing beliefs while ignoring or downplaying the evidence that refutes those beliefs. Cognitive psychologist Jonathan Evans suggests that “Confirmation bias is perhaps the best known and most widely accepted notion of inferential error to come out of the literature on human reasoning.” Confirmation bias is the main reason for the existence of the scientific process as a means of discovering reliable knowledge about the world around us. With respect to the presidential election, the idea of a Trump presidency was so unpalatable that we couldn’t bring ourselves to believe it could happen. We had to believe the alternative: a Clinton presidency. And we evaluated the evidence in a way that bolstered that belief.
What is the point that I’m trying to make here? “Big data” has its limitations, perhaps more limitations than strengths if not handled properly. Exactly what are all these data being collected good for? And is the handling of big data an art or a science? Given the points I have discussed in this article, I suggest we approach the collection and analysis of big data more as a science than as an art.