Data Isn’t Dead — Traditional Polling Is

By: Ramez Karkar, Director, Data Architecture, Mediavest | Spark

Political critics declared big data dead after nearly every major media poll had predicted Hillary Clinton to be the clear winner of the 2016 presidential election.

Put bluntly by Hollywood Reporter journalist Michael Wolff[i]: “It was the day the data died. All of the money poured by a financially challenged media industry into polls and polling analysis was for naught. It profoundly misinformed. It created a compelling and powerful narrative that was the opposite of what was actually happening.”

Wall Street Journal columnist Peggy Noonan wrote[ii], “In America now only normal people can see the obvious. Everyone else is lost in a data-filled fog.”

I’m not buying the sudden death of data. Call me biased since I work in data-driven advertising, but it’s clear that something or someone else was to blame. Today, data is everywhere and we use it every day in our lives and in our jobs. At home we use it to pick a local Thai restaurant on Yelp or to find the quickest route home on Waze. Marketers use data to acquire new customers based on behavioral or demographic data signals.

For the most part, data works really well across all aspects of our lives and jobs. What we really need to do is examine why traditional polling methods failed us in the election and how data advertising strategies can be applied to the next election.

Let’s use one of the first tenets of data advertising: garbage in, garbage out. When audience data or lookalike modeling fails in advertising, we usually know what went wrong: the data going in was a flawed representation of the desired outcome. A common misconception is that more data is always better. What we are really looking for when building a predictive model is a highly accurate sample of the most representative users.

In advertising, this might mean only the highest-spending customers, excluding casual website visitors. For political models, accurate predictions would need to be based on a sample pool that represents all the major demographics in the United States: age, gender, ethnicity, income, region and job class.

Yet one of the biggest challenges for pollsters in 2016 was acquiring the data in a climate where communication has shifted from landlines to cell phones. Survey response rates have been in steady decline since the 1970s: once at 80% in the ’70s, they were only around 5% in 2016[iii]. Household landlines have become less common, and cell phones have made it much easier for people to screen out and ignore survey calls. Text messaging and email have also made speaking to a stranger over the phone a less welcome experience than before. And even when people do respond, those who respond are likely to represent a highly skewed view of the overall population. Or they might not admit that they are voting for Trump, as one poll found[iv].

These factors made it harder for pollsters to build a meaningful and accurate sample using traditional methods. The term big data was a misnomer, as the pollsters only had access to small data.

In advertising we always ask how the data was collected and make sure that we understand how data vendors classify a purchase intent or store visit. Every little nuance affects the performance outcome. Is their purchase data based on a survey panel and then modeled out, or does it represent actual credit card transactions? The pollsters took their sample data as the truth despite blatant gaps, and this in turn affected the models.

It needs to be pointed out that one poll did show consistent success, the USC/LA Times Daybreak Poll. This poll consistently had Trump ahead and showed him favored by 3 percentage points in its final forecast before the election. Why was it successful and what was unique about it?

First, the Daybreak Poll was an online poll, so it was able to avoid the traps of the traditional cold-call method. The survey was less susceptible to forced, on-the-spot answering with a stranger on the other end of the line. Users could also respond to the online survey when it was convenient for them.

Second, it was a months-long survey that each week asked a panel of 3,000 randomly selected citizens the same questions about their likelihood to vote for each candidate. Collecting multiple responses from the same people over time allowed the pollsters to assess how respondents reacted to the news and to measure the impact, if any, that scandals or debates may have had. Responses to traditional polls, meanwhile, were static.

Third, the respondents were asked to rate their likelihood on a scale of 0 to 100 rather than give just a yes/no or Clinton/Trump response. The Daybreak Poll captured much more depth in the respondents’ answers, which made the model’s input more powerful.
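To see why a 0-to-100 answer carries more information than a forced choice, here is a minimal Python sketch. The three respondents and their numbers are invented for illustration, and the function names are my own; this is not the Daybreak Poll’s actual model, only the general idea of averaging stated likelihoods instead of counting top picks.

```python
# Hypothetical respondents: self-reported chance (0-100) of voting for each candidate.
responses = [
    {"clinton": 55, "trump": 45},  # leans Clinton, but only slightly
    {"clinton": 60, "trump": 40},  # leans Clinton, but only slightly
    {"clinton": 10, "trump": 90},  # strongly Trump
]

def binary_share(responses, candidate):
    """Forced-choice view: fraction of respondents whose top pick is `candidate`."""
    picks = [max(r, key=r.get) for r in responses]
    return picks.count(candidate) / len(picks)

def probabilistic_share(responses, candidate):
    """Likelihood view: average stated chance of voting for `candidate`, as a fraction."""
    return sum(r[candidate] for r in responses) / (100 * len(responses))

for name in ("clinton", "trump"):
    print(name,
          round(binary_share(responses, name), 3),
          round(probabilistic_share(responses, name), 3))
```

In this made-up sample the forced-choice count puts Clinton ahead two to one, while the averaged likelihoods put Trump ahead, because the two Clinton leaners are lukewarm and the Trump supporter is nearly certain.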

Finally, the Daybreak Poll had a more complicated weighting system that took into account the gaps in the respondents’ demographic makeup and weighted them to actually represent the diversity of the US population.[v]
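The general idea behind this kind of demographic weighting can be sketched in a few lines of Python. This is a simple post-stratification example under invented numbers: the group names, sample counts, population shares and support levels are all hypothetical, and the Daybreak Poll’s real weighting scheme was more elaborate than this.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight each group by (population share) / (sample share)."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }

def weighted_support(group_support, sample_counts, population_shares):
    """Candidate support after reweighting the sample to match the population."""
    weights = poststratification_weights(sample_counts, population_shares)
    numerator = sum(
        weights[g] * sample_counts[g] * group_support[g] for g in sample_counts
    )
    denominator = sum(weights[g] * sample_counts[g] for g in sample_counts)
    return numerator / denominator

# Hypothetical poll: college graduates are over-represented among respondents.
sample_counts = {"college": 700, "non_college": 300}
population_shares = {"college": 0.4, "non_college": 0.6}  # assumed, for illustration
group_support = {"college": 0.40, "non_college": 0.60}    # support for one candidate

unweighted = sum(
    sample_counts[g] * group_support[g] for g in sample_counts
) / sum(sample_counts.values())

print(round(unweighted, 2))
print(round(weighted_support(group_support, sample_counts, population_shares), 2))
```

With these invented figures the raw sample shows the candidate below 50 percent, while the reweighted estimate shows the candidate above it. The same responses tell a different story once the under-represented group counts for its true share of the population.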

A key tenet of the data advertising industry is letting the data tell the story. The insights we learn from customer data should drive how we purchase media, and not the other way around. Ironically, the LA Times reported that “the poll’s findings caused dismay — even outrage — among some readers, especially Democrats, who have denounced it and often criticized The Times for running it.”

This is a large clue that there was immense pressure on many major media sources to coerce the data into representing the opinions of their readers. There is all this talk about data being dead, but no one really cared about the quality of the data in the first place. All the readers cared about was that the data supported their beliefs. When biases get in the way, humans are to blame, not data.

Data isn’t dead after this election — the data spoke out loud and clear. People were clamoring for a change and the traditional polls were not representing the national zeitgeist. Hindsight is 20/20, but in the next election, pollsters will need to switch to more modern methods of data collection to increase scale and accuracy.

The LA Times poll should be replicated, and pollsters should also look to incorporate data from social feeds, such as Twitter and Facebook. The polls will also need to be kept separate from the opinions and biases of the media to maintain integrity, and so that the public has a better view of what is really going on in the country.

With the proliferation of devices, screens, cable options, TV channels, social feeds and news sources, accurate data is now more important than ever in connecting us. Gone are the days when families sat in front of a TV watching the same live programming at the same moment in time. The polls are needed to show us what is really going on. Early reports put voter turnout in the 2016 election at the lowest in 20 years despite the TV ratings. If the on-the-fence voters had had access to more representative data leading up to the election and had known how close the race really was, I wonder whether they would have been more motivated to get off the couch and vote. One thing’s for sure: better data will be much needed in 2020.