Final estimates of the Leave vote, or "Areal interpolation and the UK’s referendum on EU membership"
I’ve published a journal article
I’ve had a new academic article published in the Journal of Elections, Public Opinion and Parties (JEPOP). It describes the methodology I used to generate my estimates of how constituencies voted in last year’s referendum on membership of the EU.
Because the estimates associated with the article differ very slightly from the estimates I’ve previously published, I wanted to describe briefly how the different sets of estimates differ.
Before doing that, I wanted to thank the editors and reviewers of JEPOP for reviewing the article so quickly and so diligently. The article was first submitted on the 22nd August. I got two rounds of reviews before final acceptance in the third week of January. (The article has been in production since then.) That may not sound quick, but by the standards of social science peer review, it’s lightning fast.
Since the referendum I’ve published three sets of estimates:
- an initial set of estimates produced the week after the referendum;
- a second set of estimates produced in August, which formed the basis of my initial submission to JEPOP;
- a third set of estimates, which resulted from changes suggested during the review process.
The first set of estimates was based on a linear regression. I built a statistical model which explained the Leave share of the vote in each local authority area using demographic characteristics. I then used that model to extrapolate from the demographic characteristics of Westminster constituencies.
The problem with this first set of estimates was that some of them were demonstrably wrong. Even where a constituency overlapped perfectly with a local authority, this method wasn’t guaranteed to reproduce the (known) local authority result.
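To make the problem concrete, here is a toy sketch in Python. It is not the code used for the article: the data are made up, the area names are hypothetical, and a single predictor (the share of residents with a degree) stands in for the full set of demographic characteristics. It fits a least-squares line to local authority results and then predicts the Leave share for a constituency that coincides exactly with one of those local authorities.

```python
# Toy illustration only: every number below is invented, and one predictor
# stands in for the full set of demographic characteristics.

def fit_ols(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical local authorities: (share with a degree, known Leave share)
local_authorities = {
    "A": (0.20, 0.62),
    "B": (0.30, 0.55),
    "C": (0.40, 0.41),
    "D": (0.50, 0.38),
}

intercept, slope = fit_ols(
    [x for x, _ in local_authorities.values()],
    [y for _, y in local_authorities.values()],
)

def predict_leave_share(degree_share):
    """Extrapolate from demographics using the fitted line."""
    return intercept + slope * degree_share

# Suppose a constituency has exactly the same boundaries as local authority "C".
known = local_authorities["C"][1]
predicted = predict_leave_share(local_authorities["C"][0])
print(f"known {known:.3f}, predicted {predicted:.3f}")
```

The fitted line passes near the points rather than through them, so the prediction for the coinciding constituency differs from the declared result, even though the true answer is known.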
I fixed this problem with the second set of estimates. I built a statistical model which “explained” the number of Leave and Remain voters in each local authority area. I then used that model to extrapolate from the demographic characteristics of groups of Census Output Areas. I then scaled these extrapolations up or down as appropriate to make sure that they added up to the known local authority totals, before adding these scaled extrapolations up to Westminster constituencies.
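The scale-then-aggregate step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the article's code: the area identifiers, the lookup from small areas to local authorities and constituencies, and all of the counts are hypothetical. Only Leave counts are shown; the same function could be applied to Remain counts.

```python
from collections import defaultdict

# Hypothetical inputs. Each small area (a group of Census Output Areas) has a
# parent local authority, a parent constituency, and a raw model prediction
# of the number of Leave votes.
areas = [
    # (area id, local authority, constituency, raw predicted Leave votes)
    ("a1", "LA1", "Con1", 900.0),
    ("a2", "LA1", "Con1", 1100.0),
    ("a3", "LA1", "Con2", 500.0),
    ("a4", "LA2", "Con2", 1200.0),
    ("a5", "LA2", "Con2", 800.0),
]

# Declared Leave totals per local authority (also invented).
known_leave = {"LA1": 3000.0, "LA2": 1800.0}

def scale_and_aggregate(areas, known_totals):
    """Rescale raw predictions to match known local authority totals,
    then sum the rescaled figures by constituency."""
    raw_totals = defaultdict(float)
    for _, la, _, raw in areas:
        raw_totals[la] += raw

    constituency_totals = defaultdict(float)
    for _, la, con, raw in areas:
        factor = known_totals[la] / raw_totals[la]  # above 1 scales up, below 1 scales down
        constituency_totals[con] += raw * factor
    return dict(constituency_totals)

leave_by_constituency = scale_and_aggregate(areas, known_leave)
print(leave_by_constituency)
```

Because every small area is rescaled within its own local authority, a constituency that coincides with a local authority gets exactly the declared total, and the constituency figures sum to the same overall total as the local authority figures.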
The problem with this second set of estimates was that I had not accounted for some relationships between demographic variables. High levels of graduate qualifications mean something different in older constituencies compared to younger constituencies. Older people had fewer chances to go to university, because participation in tertiary education was much lower when they were growing up. If, despite this, an older constituency has lots of graduates, this may matter.
I fixed this problem with the third set of estimates, which included an interaction term between age and the proportion of the population with higher educational qualifications.
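For readers unfamiliar with interaction terms, here is a small sketch of the mechanics. The coefficients are hypothetical and chosen only for illustration; they are not the fitted values from the article. The use of mean age as the age measure, and the sign of the interaction, are likewise assumptions.

```python
# Hypothetical coefficients, chosen only to show the mechanics.
COEF = {
    "intercept": 0.50,
    "age": 0.004,            # main effect of mean age (assumed measure)
    "degree": -0.20,         # main effect of the share with a degree
    "age_x_degree": -0.010,  # interaction between the two
}

def linear_predictor(mean_age, degree_share, coef=COEF):
    """Main effects plus an age x education interaction."""
    return (
        coef["intercept"]
        + coef["age"] * mean_age
        + coef["degree"] * degree_share
        + coef["age_x_degree"] * mean_age * degree_share
    )

def degree_effect(mean_age, coef=COEF):
    """Change in the predictor per unit of degree_share at a given mean age."""
    return coef["degree"] + coef["age_x_degree"] * mean_age

# With an interaction, the effect of education depends on age.
print(degree_effect(35), degree_effect(45))
```

Without the interaction term, `degree_effect` would return the same number for every age. With it, the same share of graduates can count for more or less depending on how old the area is.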
Fortunately, this last change (which emerged during the review process) did not affect the overall story. These estimates are all very similar. The graph below shows, in the lower left, the pairwise scatter-plots of the different sets of estimates, and in the upper right, the correlation between the sets of estimates. The correlation between the second and third sets of estimates is very high indeed.
You can find this third (and final!) set of estimates at Google Sheets. An archival version has been deposited with the Harvard Dataverse, together with the code necessary to produce these estimates.
If you want to use these estimates, I’d ask that you do two things.
First, I’d really appreciate it if you could cite my article.
“Areal interpolation and the UK’s referendum on EU membership”, Chris Hanretty, Journal Of Elections, Public Opinion And Parties, Online Early Access, http://dx.doi.org/10.1080/17457289.2017.1287081
Second, I’d like you to say “probably” before you talk about how a constituency voted, unless I’ve flagged up a result as being known exactly. I don’t have confidence intervals for my estimates: there’s no clear statistical theory underpinning the scaling step. You can look at the empirical distribution of errors from councils which have declared results. Or you can just say, “probably”.
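If you do want to look at the empirical distribution of errors, a minimal sketch might look like this. It assumes you have paired each estimate with a known result for the same constituency. The pairs below are invented for illustration and are not real results.

```python
import statistics

# Hypothetical pairs of (estimated Leave share, known Leave share).
pairs = [(0.52, 0.50), (0.61, 0.63), (0.45, 0.44), (0.38, 0.41), (0.57, 0.57)]

errors = [estimate - actual for estimate, actual in pairs]
abs_errors = sorted(abs(e) for e in errors)

mean_error = statistics.mean(errors)              # signed: shows any overall bias
median_abs_error = statistics.median(abs_errors)  # size of a typical miss
worst_abs_error = abs_errors[-1]                  # largest miss in the sample

print(f"mean error {mean_error:+.3f}")
print(f"median absolute error {median_abs_error:.3f}")
print(f"largest absolute error {worst_abs_error:.3f}")
```

These are summaries of observed errors, not confidence intervals, so they describe how far off the estimates were where the truth is known rather than guaranteeing anything about the rest.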