Mar 26, 2018

Project Description

We have published eight models to predict elections in Spain, Mexico, Colombia, the Netherlands, France and the UK. Our latest forecasting models were for the Catalan elections in December 2017 and for Colombia and Mexico in 2018.

Other media outlets have published forecasting models around the world, but I believe that no one has built so many in the last two years. These are the models we have published at EL PAÍS:

Colombia 2018 (in progress)
México 2018 (in progress)
Catalunya (here in English) 2017
France (round 2) 2017
United Kingdom 2017
Netherlands 2017
Pais Vasco 2016
Galicia 2016
USA 2016 (poll of polls only)
Italy 2018 (poll of polls only)

All our models share three characteristics: 1) they are based on polls, 2) they account for the imprecision of polls, and 3) they are probabilistic: their predictions are expressed as probabilities. We make forecasts such as “Macron has a 76% chance of being in the 2nd round”, “Le Pen has a 2% chance of winning” or “there is a 54% chance of a pro-independence majority in Catalonia”.

All our models start with a ‘poll of polls’, visualized as follows:

Two examples of poll of polls for Catalan elections, 2017

These charts are designed to convey the notion of uncertainty by showing how different pollsters make different predictions. Polls of polls are also good for smoothing trends and making our election coverage less noisy. And, of course, we also know that averaging polls makes them more accurate (see the work by the statistician Andrew Gelman et al., here in pdf).

Why build a statistical model?
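As a rough illustration of what a poll of polls computes, here is a minimal Python sketch. It assumes a plain average weighted by sample size, which is only one possible choice. The pollster names, sample sizes and vote shares are invented, and `poll_of_polls` is a hypothetical helper, not the code behind the charts above.

```python
# Hypothetical polls: pollster names, sample sizes and vote shares (%) are invented.
polls = [
    {"pollster": "Pollster A", "sample": 1000, "shares": {"Party X": 30.0, "Party Y": 25.0}},
    {"pollster": "Pollster B", "sample": 1000, "shares": {"Party X": 32.0, "Party Y": 23.0}},
    {"pollster": "Pollster C", "sample": 2000, "shares": {"Party X": 28.0, "Party Y": 26.0}},
]


def poll_of_polls(polls):
    """Average each party's share across polls, weighted by sample size (an assumption)."""
    total_sample = sum(p["sample"] for p in polls)
    parties = polls[0]["shares"]
    return {
        party: sum(p["shares"][party] * p["sample"] for p in polls) / total_sample
        for party in parties
    }


average = poll_of_polls(polls)
print(average)
```

A real poll of polls would probably also weight recent polls more heavily; that refinement is left out here to keep the idea visible.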

Polls of polls are simple and useful. But if we want to know who will (likely) win an election, we need mathematical models. One reason is that ballots are not directly translated into presidents and governments, so we need models that take into account things like delegates (USA) or districts (UK, Spain).

But the main reason to build models is uncertainty.

We build probabilistic models to measure (and communicate) our degree of ignorance about each election outcome. Uncertainty varies, and it can be estimated if one is careful. Sometimes polls point to a clear winner, as they did in France, where we gave Macron a 98% chance against Le Pen. And sometimes they do not: in Catalonia we gave a pro-independence majority a 54% chance, because it really was ‘too close to call’.

The model predictions are computed with a four-step procedure. First, we average the polls at the national level. Second, we project those votes into each district or state, if needed. Then we do the most important step: we incorporate uncertainty based on the historical record of poll hits and misses in similar elections. To do that, we have analyzed more than 3,000 polls, from hundreds of elections in dozens of countries. Finally, we simulate the election 10,000 or 15,000 times to estimate the probability of each possible outcome.
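The last two steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual model (which is built in R): it skips the district projection, draws an independent normal error for each party, and uses a placeholder `error_sd` of 2 points instead of a value estimated from the historical record. The party names and shares are invented, and `simulate_election` is a hypothetical helper.

```python
import random


def simulate_election(avg_shares, error_sd=2.0, n_sims=10_000, seed=0):
    """Estimate each party's probability of finishing first.

    avg_shares: poll average per party, in percent.
    error_sd:   assumed standard deviation of the polling error, in points
                (a placeholder; a real model would estimate it from past polls).
    """
    rng = random.Random(seed)
    wins = {party: 0 for party in avg_shares}
    for _ in range(n_sims):
        # Perturb every share with a random polling error.
        simulated = {
            party: share + rng.gauss(0.0, error_sd)
            for party, share in avg_shares.items()
        }
        winner = max(simulated, key=simulated.get)
        wins[winner] += 1
    # The share of simulations won is the estimated probability.
    return {party: count / n_sims for party, count in wins.items()}


probabilities = simulate_election({"Party X": 29.5, "Party Y": 25.0, "Party Z": 27.0})
print(probabilities)
```

The output is a probability for each party rather than a single predicted winner, which is the kind of statement quoted earlier in this article.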

Possible majorities in Catalan elections (2017)

As Nate Silver has written, the media has a probability problem: we demand too much certainty for a complex world.

There are many examples of overconfidence in public debate. For instance, we tend to give excessive weight to consensus. Think about Brexit. Polls at the time showed that Britons could easily vote to leave the EU. However, most pundits and pollsters were saying that a ‘remain’ win was likely. That kind of consensus produces a (false) impression of certainty. It’s a cascade of confidence. But we know that polls and pundits are often all wrong in the same direction. That is why we feed our model with uncertainty: we use historical records of poll hits and misses to let the model know how big and how likely polling errors can be.

This way our models convey the real uncertainty.

Communicating it is still a challenge. After Donald Trump’s election, Amanda Cox and Josh Katz, from The New York Times, wrote that they had failed to explain that “an 85 percent chance is not a 100 percent chance”.

We must find effective ways to communicate the uncertainty in the world to our readers. I believe that a good way is to learn how to interpret probabilities better. And to learn we need practice. Take football, for instance: most people understand perfectly well that the best team does not always win the match. They know it from their own experience! The same can be true of electoral forecasts. And that’s why it makes sense to build many probabilistic models: so that we all become experts at reading probabilities.

OTHER DETAILS

Are our models accurate?

A probabilistic model is difficult to assess with a single prediction. But I can evaluate the reliability of the nine models I have built since 2015. The following chart shows the accuracy of 173 predictions.

The data for the evaluation is provided as a CSV file.

The chart shows that the models' predictions come true with the expected frequency. Predictions with a probability of 20%-40% occurred 25% of the time; those with a probability between 90% and 100% occurred every time. The models seem relatively well calibrated. If anything, they have been slightly conservative with high-probability predictions, which tend to occur more often than expected, and that seems sensible.
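A calibration check of this kind can be sketched as follows, in Python for illustration. The bin edges and the six example predictions are invented (they are not the 173 predictions from the CSV), and `calibration_table` is a hypothetical helper: it groups predictions by stated probability and reports how often the predicted event happened in each group.

```python
def calibration_table(predictions, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return (low, high, count, observed_frequency) for each non-empty bin.

    predictions: list of (probability, happened) pairs.
    edges:       bin boundaries (an assumption; the chart may use other bins).
    """
    rows = []
    for low, high in zip(edges, edges[1:]):
        # Bins are [low, high), except the last one, which also includes 1.0.
        in_bin = [
            happened
            for prob, happened in predictions
            if low <= prob < high or (high == edges[-1] and prob == high)
        ]
        if in_bin:
            rows.append((low, high, len(in_bin), sum(in_bin) / len(in_bin)))
    return rows


# Invented example: four predictions around 30%, two near 100%.
example = [(0.30, True), (0.25, False), (0.35, False), (0.30, False),
           (0.95, True), (1.00, True)]
table = calibration_table(example)
for low, high, count, freq in table:
    print(f"{low:.0%}-{high:.0%}: {count} predictions, occurred {freq:.0%} of the time")
```

A model is well calibrated when the observed frequency in each row falls close to the probabilities stated in that bin.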

The accuracy of election polls has also been studied in a recent academic paper by Jennings & Wlezien (Nature Human Behaviour, 2018). They analysed thousands of polls conducted during the last week of election campaigns in 220 national elections in 32 countries between 1942 and 2017. The mean absolute error was 2.1%, which is quite small. They also found that, “contrary to conventional wisdom, the recent performance of polls has not been outside the ordinary”.
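For readers unfamiliar with the metric, a mean absolute error can be computed as in the Python sketch below. The shares are invented, `mean_absolute_error` is a hypothetical helper, and the paper's exact definition (which parties and which polls it averages over) may differ from this simple version.

```python
def mean_absolute_error(poll_shares, result_shares):
    """Average absolute gap, in points, between polled and actual vote shares."""
    errors = [abs(poll_shares[party] - result_shares[party]) for party in result_shares]
    return sum(errors) / len(errors)


# Invented example: the poll misses Party X by 2 points and Party Y by 1 point.
mae = mean_absolute_error(
    {"Party X": 30.0, "Party Y": 25.0},
    {"Party X": 32.0, "Party Y": 24.0},
)
print(mae)
```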

What makes this project innovative?

Election forecasting based on polls is not new. We and other international media have been building these models for a few years. The Financial Times has published several polls of polls, but no probabilistic models; The New York Times and FiveThirtyEight build models for every US election, and The Economist built a nice model for the French elections. But I believe that no other newsroom has published as many models, for as many countries, as we have at EL PAÍS. This allows us to better measure their accuracy (as explained above) and also makes them familiar to our audience. In addition, of course, each model contributes to our coverage of an election.

What was the impact of your project?

We have published dozens of articles with the polls of polls and the prediction models. Some of them were widely read. The Catalan election forecast was particularly successful: it got a lot of readers and high engagement. The final model update was the most-read article at EL PAÍS on the day it was published. Half of the traffic came from social networks and search engines.

The Catalan model was cited in other international media, such as Bloomberg and The New York Times. I also appeared in several Spanish newspapers, radio stations and TV channels talking about polls. In the past, the models I built for the 2015 and 2016 Spanish elections were also cited in The Washington Post, The Guardian, The Wall Street Journal, Fortune or the Financial Times.

Source and methodology

Sources and methodologies are detailed in each article. The main source is always the polls for each election. They are mostly published by newspapers and other media, but they can often be found compiled by others; Wikipedia, for example, collects a lot of polls. We also use historical poll data from past elections. Two academic papers by Jennings & Wlezien are particularly useful: (Nature Human Behaviour, 2018) and (American Journal of Political Science, 2016), which provide data on thousands of polls for hundreds of elections in dozens of countries.

All models follow a similar four-step procedure. First, we average the polls at the national level (or similar). Second, we project those votes into each district. Third, we incorporate uncertainty based on the historical record of poll accuracy in similar elections. Finally, we simulate the election thousands of times to get a probabilistic prediction of the possible outcomes. A detailed methodology is published with each one (see, for instance, the forecast of the last Catalan elections).
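The second step, projecting national votes into districts, can be done in several ways, and the method varies by model. As one hedged illustration in Python, the sketch below applies a uniform swing: each district's previous result is shifted by the national change for each party. This is an assumed technique chosen for simplicity, not necessarily the one used in these models; `project_to_district` and all the numbers are hypothetical.

```python
def project_to_district(national_now, national_before, district_before):
    """Shift a district's previous shares by each party's national change (uniform swing)."""
    projected = {}
    for party, previous_share in district_before.items():
        swing = national_now[party] - national_before[party]
        # A share cannot be negative, so clamp at zero.
        projected[party] = max(previous_share + swing, 0.0)
    return projected


# Invented example: Party X is up 3 points nationally, Party Y is down 5.
district = project_to_district(
    national_now={"Party X": 33.0, "Party Y": 45.0},
    national_before={"Party X": 30.0, "Party Y": 50.0},
    district_before={"Party X": 40.0, "Party Y": 35.0},
)
print(district)
```

The projected district shares would then feed the seat allocation for that district and, with the polling error added, the simulations.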

Technologies Used

All models are developed and implemented with R, the language for statistical computing and graphics, and RStudio. Both tools are used to compute the polls of polls, to develop every part of the model, and to run the simulations. I also use R and RStudio to produce most charts, and Numbers (the spreadsheet by Apple) for visualization and simpler analysis.

Portfolio

My work on elections is part of my work as a data journalist for EL PAÍS, which can be found here. I write a lot about polls, but also about public opinion, data on timely topics, football analytics, other sports, ideas, etc.

Contact

Kiko Llaneras / Twitter / kiko.llaneras.es