Predicting the Election

Rebecca Rosen
7 min read · Nov 8, 2019


The 2016 election took a lot of people by surprise, partly because of miscalculations in the voting predictions made around that time. Many have since expressed doubt about the validity of polling at all; however, surveying citizens is a useful tool for understanding voter tendencies and organizing a campaign around real needs. So how can this public skepticism be met with clear information to help citizens — and candidates — make more informed decisions about the political landscape?

In an effort to be more transparent about polling limitations, a major news outlet has decided to change the way citizen surveys are done. Follow along below to learn exactly how the New York Times partnered with the Siena College Research Institute to create a new polling strategy, one that includes live updates for upcoming elections and may even be more reliable than the others.

NYT Website Live Polling Front Page

Going through the New York Times (NYT) and Siena College methodology for the poll, there are four main sections that jump out: Obtaining Voter Data, Adjusting Data, Polling Practicals and Making Predictions.

Obtaining Voter Data:

The Times writes: “Telephone numbers were selected from an L2 voter file stratified by age, region, gender, party, race and turnout in 2014.” So, why do they need these files? And what is L2?

Voter Files

Their poll started with the “voter file”, a data set of nearly every registered voter in the country. They called voters from this list, which makes their poll a “registration-based sample.” This is in contrast to a “random sample” (calling numbers from the phone book at random), which has its pros and cons but is basically impossible in this instance: calling at random gives no guarantee of an even distribution across demographics and political preferences. And because there’s apparently no other database of phone numbers structured by congressional district, they work with voter files that include location, age, name, and other demographic information.
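To make the idea concrete, here is a minimal sketch (not the Times’ actual code) of how a registration-based sample might be drawn from a voter file, stratified on the fields the methodology mentions. The file path and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical voter-file table; the real L2 file has many more fields.
voter_file = pd.read_csv("l2_voter_file.csv")  # assumed path

# Strata named in the methodology: age, region, gender, party, race, 2014 turnout.
strata_cols = ["age_bracket", "region", "gender", "party", "race", "voted_2014"]

def draw_stratified_sample(df, strata, n_per_stratum=500, seed=42):
    """Sample phone numbers within each demographic cell, so every cell is
    represented instead of relying on purely random dialing."""
    return (
        df.dropna(subset=["phone_number"])
          .groupby(strata, group_keys=False)
          .apply(lambda g: g.sample(min(len(g), n_per_stratum), random_state=seed))
    )

sample = draw_stratified_sample(voter_file, strata_cols)
print(sample[["phone_number"] + strata_cols].head())
```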

Voter File Database

L2 provides the voter files for this project. L2 is a private, independent business that “pools the most up-to-date information from public and commercial sources; we screen all of our data for inaccuracies using complex and proprietary algorithms and the most powerful processing tools available today. This way we ensure your mail, digital ads, texts, emails, canvassers, and analytics are based on the best possible data intelligence.” They supposedly ensure effective data matching, hygiene, appendability, and privacy and security. Learn more about their work and their products (shown below) on their website.

A list of products offered at l2political.com

Adjusting Data

Although the term ‘adjusting’ can make it seem like something nefarious is happening behind the scenes, data cleaning and feature engineering are normal processes in a data scientist’s toolkit. In this instance, they’ve made nearly 3 million calls and are lucky to get a response to between 0.69 and 2.8 percent of them. Clearly this is not the whole country responding, so making assumptions is tough. They write that “no poll can really claim something near theoretical purity when 90 percent of people decline to take a poll”.

In this project, the adjusting can be broken down into two categories — weighting and estimating.

Weighting:
This is necessary because citizens of one demographic may be overrepresented in the poll compared with the voters who actually turn up on Election Day. Weighting helps balance this out by giving a higher (you guessed it) weight to the more sparsely collected demographics.
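A minimal sketch of that balancing, assuming simple post-stratification: each group’s weight is its estimated share of the electorate divided by its share of the respondents. The shares below are invented for illustration, not the poll’s figures.

```python
# Illustrative post-stratification: weight = target share / sample share.
target_share = {"college": 0.55, "non_college": 0.45}   # estimated electorate (made up)
sample_share = {"college": 0.70, "non_college": 0.30}   # who actually answered (made up)

weights = {group: target_share[group] / sample_share[group] for group in target_share}
print(weights)  # {'college': ~0.79, 'non_college': 1.5} — underrepresented group counts more
```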

In this particular poll, the NYT was responding to 2016’s underrepresentation of voters with less traditional educational backgrounds. One option would be to get more responses from across the education spectrum, but the US census doesn’t actually report education among registered voters. Seeing as many people questioned the validity of polling because of the 2016 polls, this is something they decided to take into serious consideration.

So this is one of the reasons these pollsters decided to make their poll “response-rate adjusted,” that is, to call more of the kinds of people who are unlikely to respond to polls. By definition, they will get fewer responses from people who are less likely to respond, and therefore need to dial even more numbers to reach them. These folks also tend to be younger and are considered “low-turnout” voters, and these traits tend to correlate with the number of years of education obtained. So by calling more of them at the beginning of the poll, there is less weighting that needs to happen in the analysis phase to get the numbers even.
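One way to picture a “response-rate adjusted” sample: divide the number of completed interviews you want from each group by that group’s expected response rate, and dial accordingly. The rates below are hypothetical, chosen only to sit inside the 0.69–2.8 percent range mentioned above.

```python
# Dials needed per group to hit a target number of completed interviews,
# given (hypothetical) expected response rates for each group.
target_completes = {"age_18_29": 100, "age_30_49": 100, "age_50_plus": 100}
expected_response_rate = {"age_18_29": 0.007, "age_30_49": 0.012, "age_50_plus": 0.028}

dials_needed = {
    group: round(target_completes[group] / expected_response_rate[group])
    for group in target_completes
}
print(dials_needed)  # younger, lower-response groups require far more calls
```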

“No poll can really claim something near theoretical purity when 90 percent of people decline to take a poll”

Estimating:
Yet another familiar tool for a data scientist or statistician, estimation was integral to getting all the necessary components together for forecasting. Far from guessing, this is more akin to feature engineering: voter file data, regional composition and historical voter turnout information provided the foundational predictors. They needed to figure out the composition of eligible and likely voters across demographics in order to get the appropriate makeup of responses to their poll.

To do this, they used a Bayesian updating equation to weight their model, in part because they don’t have exact information about voter turnout by demographic. They write: “We derived estimates for the composition of the likely electorate by age, race, turnout, party, gender and region from a vote-history-based model of turnout in the 2014 midterm election.” This was also adjusted by the “partisan and geographic turnout patterns of 2017 and 2018 special and general elections in Arizona 8, New Jersey, Virginia, Georgia 6, Pennsylvania 18 and Ohio 12.” On top of these demographic estimations, district turnout was estimated with a “model of off-year, regularly scheduled election turnout from 2005 to 2017”.
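The Times doesn’t publish the exact equation, but the flavor of Bayesian updating can be shown with a toy beta-binomial example: start from a prior turnout rate for a group based on the 2014 vote-history model, then update it with evidence from the 2017–18 special elections. Every number below is invented.

```python
# Toy beta-binomial update of a group's turnout rate (illustrative only).
# Prior: pseudo-counts summarizing a 2014 vote-history-based estimate (~35% turnout).
prior_voted, prior_stayed_home = 35, 65

# "Evidence": hypothetical counts from recent special and general elections.
observed_voted, observed_stayed_home = 48, 52

posterior_voted = prior_voted + observed_voted
posterior_stayed_home = prior_stayed_home + observed_stayed_home
posterior_turnout = posterior_voted / (posterior_voted + posterior_stayed_home)
print(f"updated turnout estimate: {posterior_turnout:.1%}")  # ~41.5%
```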

Although they made efforts to counterbalance education with the response-rate adjustments, they still needed to find a way to confidently estimate education levels for the districts. As such, the estimates for education were “based on a model of turnout in the November 2014 voting and registration supplement to the census Current Population Survey, adjusted for changes in turnout by education in the Virginia and Ohio 12 elections. The adjustment is based on a model of validated turnout among Upshot/Siena poll respondents that controls for the variables in the C.P.S.-based model.”
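A rough sketch of what an education estimate like that could look like, under heavy assumptions: fit a turnout model on a CPS-style table that includes education, then use each respondent’s predicted probability of voting to estimate the education mix of the likely electorate. The file, column names and model choice here are all hypothetical stand-ins, not the Upshot/Siena model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical CPS-style table: one row per respondent, with a turnout flag
# and an education field (path and column names are assumptions).
cps = pd.read_csv("cps_nov_2014_supplement.csv")

X = pd.get_dummies(cps[["education", "age_bracket", "region"]])
y = cps["voted_2014"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of voting, used to weight each respondent when
# estimating the education composition of the likely electorate.
cps["p_vote"] = model.predict_proba(X)[:, 1]
education_mix = cps.groupby("education")["p_vote"].sum() / cps["p_vote"].sum()
print(education_mix)
```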

Polling Practicals

All polls were conducted over the phone by a group of volunteers at Siena College. Voters were contacted on cellular and landline telephones, and interviewers asked for a specific person named on the voter file. If the intended respondent was not available, the interview ended. Interviews were conducted in English, and also in Spanish in districts where at least 10 percent of registered voters were Hispanic, per L2 data. You can read more about the details of the call center experience here, or on this week’s episode of “The Daily.”

Making Predictions

Predictions for Michigan’s Congressional race on NYT Website

Finally, there are the predictions themselves, the tense 49/51 splits. In short, the weights applied to the data throughout the polling — from making the calls to processing the data — provide enough statistical certainty to yield a number on their web page. However, this does not come without a fair warning of possible error; under each election prediction it reads: “But remember: It’s just one poll, and we talked to only [total # of respondents] people. Each candidate’s total could easily be five points different if we polled everyone in the district. And having a small sample is only one possible source of error.”
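That “five points” warning is roughly what simple sampling math gives you: with a few hundred respondents, the 95 percent margin of error on a single candidate’s share is already around ±4–5 points, before any of the other error sources. A back-of-the-envelope check, assuming a sample size of 500:

```python
import math

# 95% margin of error for a proportion, ignoring design effects and weighting.
n = 500          # rough size of a single district poll (assumed)
p = 0.5          # worst case for the variance of a proportion
moe = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"±{moe:.1%}")  # about ±4.4 points
```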

Along with the weighting mentioned above, their final model also had weights based on what they thought the voter turnout rates would be, updated with self-reported turnout. As they write: “The final survey weight is equal to the likely electorate weight, divided by the initial probability of voting and multiplied by the final probability of voting, which incorporates self-reported turnout.”
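That sentence translates directly into one line of arithmetic; the values below are made up purely to show the mechanics.

```python
# Final weight per the methodology's description (illustrative numbers).
likely_electorate_weight = 1.3   # from the demographic weighting above
p_vote_initial = 0.40            # modeled probability of voting before the interview
p_vote_final = 0.55              # updated with the respondent's self-reported intent

final_weight = likely_electorate_weight / p_vote_initial * p_vote_final
print(final_weight)  # 1.7875
```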

A Final Note:

The methodology for the polling can be found here, along with the aspects of weighting that took place with the live polling results. All of this transparency is an effort to convey how difficult and imprecise polling really is, and to encourage the public not to take it as absolute truth. They wanted to publish poll results in real time to “help readers understand the limitations of the process.” These results are more trustworthy because they come with healthy margins of error, so each reader can interpret and update their understanding as the days go by. Hopefully we can learn, in real time, that polls are just one of many imperfect tools to help us understand the world around us.



Rebecca Rosen

Graduate of Flatiron School’s Data Science Immersive currently living in New York City by way of Detroit, MI. Curious about systems, people & effective cohesion.