Boosts and bumps in polling
In a previous post, I described how to move from polls on the final day of an election campaign to actual outcomes.
What I didn’t talk about was how to move from polls on the twentieth, or thirtieth, or the fiftieth day of an election campaign, to polls on the final day.
Both tasks involve similar problems. In both cases, we need to ensure that our predictions add up to 100%.
When dealing with final-day polls, I dealt with that problem by using a particular statistical distribution. That solution works, but it’s only really suitable for the last day. If you really amp up the uncertainty, then the predictions escape the bounds suggested by that distribution.
This means that we need a solution for long-range election forecasting — and in particular, long-range forecasting in a multi-party context.
To do that, I need to convince you that to forecast how six (or seven or eight) parties will do, you need to focus on modelling five (or six or seven) outcomes.
It might seem strange to model one fewer outcome than there are parties, but there's a commonly used version of this strategy which generally escapes notice. Sometimes, people model the top however many parties, and then create a left-over category — the Others — which doesn’t have an independent existence, but is just one hundred percent minus the sum of predictions for other parties.
Sometimes, that strategy can work. It breaks down badly when the predictions for the main parties add up to more than 100% of the vote. Then the Others end up with a negative vote share.
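To make that failure concrete, here is a minimal sketch. The party predictions are made-up numbers chosen purely for illustration, and `residual_others` is a hypothetical helper name, not something from the original forecasting code.

```python
# Hypothetical point predictions for the named parties, in percent.
# These are invented for illustration and do not come from any poll.
main_party_predictions = {"Con": 46.0, "Lab": 34.0, "LD": 14.0, "UKIP": 9.0}


def residual_others(predictions):
    """Treat 'Others' as whatever is left over after the named parties."""
    return 100.0 - sum(predictions.values())


others = residual_others(main_party_predictions)
print(others)  # negative, because the named parties sum to 103
```

Nothing in the left-over approach stops each named party's prediction from drifting upwards independently, so the residual category is the one that absorbs the inconsistency.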
The strategy I’ll use is one that’s more common amongst geologists than political scientists. It involves identifying one party as a reference party, and working not with vote shares, but ratios of vote shares compared to the vote share of the reference party. More specifically, we’ll be working with log-transformed ratios.
We owe this approach to a Scottish statistician, John Aitchison, who sadly died at the end of last year. Aitchison realised that log ratios manage to encode all of the information about shares, but don’t suffer from the same problems of interdependence that simple shares do.
How does a log ratio work? Let’s take the Conservatives as our reference party (it doesn’t matter which party we choose; we’ll get the same predictions out at the end), and let’s suppose that the Conservatives are on 45%.
If Labour are on 30%, then the ratio between the two parties is 30/45 = 0.667. If we take the natural log of this, we get roughly -0.4. (If Labour were ahead of the Conservatives, the log would be positive.)
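The arithmetic above, and the trip back from log ratios to shares, can be sketched in a few lines of Python. The 45% and 30% figures are the ones used in the text; the 25% for everyone else is an assumed filler so that the shares sum to 100%, and the function names are my own.

```python
import math


def to_log_ratios(shares, reference):
    """Log of each party's share divided by the reference party's share."""
    ref = shares[reference]
    return {p: math.log(s / ref) for p, s in shares.items() if p != reference}


def to_shares(log_ratios, reference):
    """Invert the transform: the results are positive and sum to 1."""
    ratios = {p: math.exp(v) for p, v in log_ratios.items()}
    ratios[reference] = 1.0  # the reference party's ratio to itself
    total = sum(ratios.values())
    return {p: r / total for p, r in ratios.items()}


# Con and Lab are from the text; "Other" is an assumed remainder.
shares = {"Con": 0.45, "Lab": 0.30, "Other": 0.25}

log_ratios = to_log_ratios(shares, "Con")
print(round(log_ratios["Lab"], 3))  # log(30/45), about -0.405

# However far we push a log ratio, the shares we get back stay valid.
shocked = {p: v + 2.0 for p, v in log_ratios.items()}
print(to_shares(shocked, "Con"))
```

Note that three parties give only two log ratios: the reference party is implicit, which is why we model one fewer outcome than there are parties.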
Here's a table which shows how these would work for GE2015 and current polls, together with the changes they imply.
We're going to use those changes in log ratios, and it'll look a little different to the way we used vote shares last time…
Last time on electionforecast.co.uk…
When I was involved in election forecasting in 2015 (everything went fine, just peachy, no need to ask), we used a linear regression model to explain changes in parties' vote shares as a function of changes in parties' poll shares. It led to my favourite .gif of the 2015 election campaign.
We got changes from historic polls, and actual shifts, for three parties in eight elections and tried to draw the best-fitting straight line which went through those (3x8=24) points for each day. As we get closer to the election, the slope of the line gets steeper. You can think of that slope as the weight to place on changes in the polls. The closer to the election, the more weight you can place on the polls — not because of any methodological changes in the polls, just because there’s less time for public opinion to shift.
That line has the standard equation for a straight line, with a slope (or gradient) and an intercept. In our case, we had to force the intercept to be zero. (If the intercept wasn't zero, that would mean that all parties either gained or lost vote share, and that's just not possible).
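As a rough sketch of why the intercept has to be zero, here is a regression through the origin in plain Python. The changes are invented numbers (in percentage points) for three parties in a single election, not the historic data we actually used, and the function name is hypothetical.

```python
def slope_through_origin(poll_changes, final_changes):
    """Least-squares slope for y = b * x, with the intercept fixed at zero."""
    numerator = sum(x * y for x, y in zip(poll_changes, final_changes))
    denominator = sum(x * x for x in poll_changes)
    return numerator / denominator


# Made-up changes since the last election; each list sums to zero,
# because one party's gain has to be another party's loss.
poll_changes = [4.0, -3.0, -1.0]
final_changes = [2.0, -1.5, -0.5]

b = slope_through_origin(poll_changes, final_changes)

# With no intercept, the predicted changes also sum to zero.
predicted = [b * x for x in poll_changes]
print(b, sum(predicted))

# With any non-zero intercept, every party is shifted the same way,
# so the predicted changes no longer cancel out.
intercept = 0.3
with_intercept = [intercept + b * x for x in poll_changes]
print(sum(with_intercept))
```

A slope below one, as in this toy example, is what "discounting" the polls means: only part of the change seen in the polls is expected to survive to the result.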
With this type of regression, you can't say whether any parties generally improve or worsen over the course of an election campaign. We know that parties that are up in the polls compared to last time generally fall back, and parties which are behind generally catch up — that's just what's implied by discounting changes in the polls. But that doesn't tell us whether, say, the Lib Dems generally have a good run in, or whether Labour generally falter. If we allow for specific parties to have these effects, then we run the risk of creating predictions that enter that forbidden zone.
Regression with log ratios
If we run a regression with (changes in) log-ratios, we can incorporate these effects. Instead of running one jumbo regression, we'll run a series of (related) regressions. Here's what they look like 65 days before the election.
You'll see that there's not much of interest going on for Labour, but that the two plots for the Lib Dems and all others are more interesting. For the Lib Dems, the regression line (the solid line in the picture) is almost always above the dashed line which indicates a 1:1 match between the change implied by current polls and the change in final polls. That means that we should expect the Lib Dems to do better than current polls imply, and that we should discount the polls a little.
Conversely, the line for all others is almost exactly straight, but always below that 1:1 line. That means that the polls are really useful, but that the others are still going to fall back.
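Here is a minimal sketch of how an intercept and a slope for each party could turn current polls into a forecast that still adds up. Every number below is hypothetical: the shares are invented, and the coefficients are picked only to mimic the pattern just described (a boost for the Lib Dems, a penalty for the others), not estimated from data. The real regressions were estimated jointly, which this sketch does not attempt.

```python
import math


def forecast_shares(last_shares, poll_shares, coefs, reference):
    """Apply per-party (intercept, slope) to changes in log ratios,
    then map the predicted log ratios back to shares."""
    ratios = {reference: 1.0}
    for party, (intercept, slope) in coefs.items():
        last_lr = math.log(last_shares[party] / last_shares[reference])
        poll_lr = math.log(poll_shares[party] / poll_shares[reference])
        predicted_change = intercept + slope * (poll_lr - last_lr)
        ratios[party] = math.exp(last_lr + predicted_change)
    total = sum(ratios.values())
    return {p: r / total for p, r in ratios.items()}


# Invented shares at the last election and in current polls.
last_shares = {"Con": 0.40, "Lab": 0.30, "LD": 0.10, "Other": 0.20}
poll_shares = {"Con": 0.45, "Lab": 0.27, "LD": 0.08, "Other": 0.20}

# Hypothetical (intercept, slope) pairs, relative to the Conservatives.
coefs = {"Lab": (0.0, 0.9), "LD": (0.1, 0.5), "Other": (-0.1, 1.0)}

forecast = forecast_shares(last_shares, poll_shares, coefs, "Con")
print({p: round(v, 3) for p, v in forecast.items()})
print(sum(forecast.values()))  # 1, up to floating-point rounding
```

Because the last step divides by the total, each party can have its own intercept without the forecast ever straying outside valid shares. With an intercept of zero and a slope of one for every party, the function simply hands back the current polls.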
We can carry out this analysis for each day of the election campaign, from day zero, one year before the election, to day 365 (the day of the election). Here's what the graph looks like.
The top row shows the weight you should place on the polls (the slope) when trying to predict the final day's polls. The horizontal line shows the point at which you should place full weight on the polls.
The bottom row shows the additional bump each party should expect to get, independently of how they are faring in the polls. The horizontal line shows the zero point, at which a party neither benefits nor falls back.
In both rows, the solid vertical line shows where we are right now, 25 days before the election.
As you can see from the top row, the weight to place on the polls increases over the course of the campaign, except for the Others. However, for the Lib Dems (relative always to the Conservatives), polls are remarkably uninformative as close as two months to the election.
The bottom row shows that part of the reason why polls aren't very informative about Lib Dem prospects is that they generally get a sizeable boost (relative to the Conservatives). The opposite is true for the "Others", who get a sizeable penalty. Labour generally under-performs (i.e., has bad campaigns).
What does this mean?
What does this mean for the present election? At the moment, we're 25 days before the election — day 340 on the graphs above.
This means that we can expect, on the basis of past polls at this point of the race, the Lib Dems to do better in the final polls than they are presently doing. (This does not necessarily carry through to the actual result, as the 2010 election showed). Given that they are currently doing rather badly, this will be welcome news.
This does not mean that the Lib Dems will do tremendously well. At the moment, electionforecast.co.uk shows that the Lib Dems will increase their vote share without necessarily increasing their haul of seats.
It also means that we are at the point at which polls rapidly become more informative about final-day polling. Maybe that's because this is the point at which people start paying attention to the election. Maybe it's because this is the point at which parties' policy offer crystallizes (the publication of manifestos is expected next week). Or maybe it's because this is the point at which we run out of "game-changers", or "dead cats", or other such stuff and nonsense.
I've now tried to explain how I move from final polls to election results, and from current polls to final polls. There are some steps that I've skipped over, like the treatment of the SNP and Plaid Cymru. I've tried to do this without talking explicitly about seemingly unrelated regressions on changes in additive log-ratio transforms. In a future post, I'll try and explain how I make my seat-level predictions.
(Nerd note: I haven't posted any data for this post, but all of the regressions were run using the systemfit package for R)