General Election Polling: the Maltese Context
Separating Science From Voodoo: Know Your Pollster
Sampling is born out of the inability to ask every single person how he or she intends to vote on June 3rd. The underlying principle of any poll done right is the random sample: you should be about as likely to be chosen as your next door neighbour, someone across the island, or even the Prime Minister himself.
To achieve this, pollsters have traditionally relied on databases of telephone numbers, selecting participants at random. Telephone numbers kill two birds with one stone: almost every household has one, and you can carry out the survey over the same line.
Yes, telephone polling is not without its limitations; younger or working people tend to be underrepresented, and somewhere, someone still doesn't have a telephone. But landline polling has been in use for over half a century, and researchers know about these limitations and how to account for them.
Once you have accumulated enough responses in this manner, you can suppose that the variation in your sample for any given viewpoint reflects the variation in the population as a whole. Even then, it's not a precise guess, which is why we have a dedicated metric for the variation that might not have been accounted for: the margin of error.
The three organisations which reliably use telephone polling locally are MaltaToday, The Malta Independent and Xarabank.
What you must take with a grain of salt, if not ignore completely, are internet surveys disseminated through media like Google Forms. While it can be argued that almost everyone has internet access, the people who click on these surveys are (1) self-selecting members of (2) a subgroup of the population, namely your readership or Facebook followers, which (3) probably only caters or appeals to a niche of the general population.
This is a gross violation of the core principle of polling, and anything deduced cannot be generalised to the population as a whole, no matter how large the sample size. A properly conducted survey needs only around 450 responses for a 4% margin of error. Because the margin of error shrinks with the square root of the sample size, many internet surveys claim a 1.5% margin for a 3,000-respondent survey — this supposition is technically true…for a random sample.
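The standard margin-of-error formula makes both figures easy to check. The sketch below assumes the worst-case proportion p = 0.5 and a 90% confidence level (z ≈ 1.645), which roughly reproduces the numbers quoted above; pollsters often quote 95% confidence instead, which gives slightly larger margins.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.645):
    """Half-width of the confidence interval for a proportion p estimated
    from a simple random sample of size n (z = 1.645 -> 90% confidence)."""
    return z * sqrt(p * (1 - p) / n)

print(f"{margin_of_error(450):.1%}")   # ~4% for 450 respondents
print(f"{margin_of_error(3000):.1%}")  # ~1.5% for 3,000 respondents
```

Note the square root: quadrupling the sample only halves the margin, which is why pushing past a few hundred respondents yields rapidly diminishing returns.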
And this is not even considering things like hardcore supporters sharing the survey with their partisan (partitarji) friends to influence the result, or even deploying bots. If you think that's unlikely locally, Times of Malta had to CAPTCHA-protect their weekly strawpoll earlier this year.
One last point in this regard: several French polling firms like Ipsos have had tremendous success using the internet. What they do, however, is treat a list of email addresses like a telephone directory, picking addresses at random and emailing out the survey.
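The mechanics of that idea fit in a few lines of standard-library Python. The panel of addresses here is entirely made up for illustration; the point is only that every address has an equal chance of selection, which restores the random-sample property that open web surveys lack.

```python
import random

# Hypothetical panel of addresses standing in for a telephone directory.
panel = [f"voter{i}@example.com" for i in range(100_000)]

random.seed(3)                        # reproducible draw for the example
invitees = random.sample(panel, 450)  # every address equally likely, no repeats
print(len(invitees))                  # 450
```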
The Many are Better than the One
Strength in numbers works for numbers too. Polls (from here on, that means only telephone probability-sampled ones) naturally fluctuate, and it's all too easy to over-read these peaks and troughs.
But aggregate many of them and a clearer picture starts to emerge once you fit a trend line. This is essentially the approach of the data science offshoots many American newspapers now have. Here are three methods we can use to make better sense of these aggregates.
Moving averages are great at picking out the general shape and direction of a trend, but with polls this sparse they can still jump around from point to point. Here is a moving average of PL and FN over the last 13 months, starting with an April 2016 Malta Independent poll and ending with the May 2017 Malta Independent poll released earlier today.
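For the curious, a moving average is just each poll averaged with its immediate predecessors. A minimal sketch, using made-up PL shares rather than the actual poll figures:

```python
def moving_average(values, window=3):
    """Average each poll with its (window - 1) predecessors; the first
    (window - 1) points are dropped for lack of a full window."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

# Illustrative PL shares from five consecutive polls (not real data)
pl = [50.1, 51.3, 49.8, 52.0, 51.5]
print([round(v, 2) for v in moving_average(pl)])  # [50.4, 51.03, 51.1]
```

Notice how the 49.8 outlier barely dents the averaged series — that smoothing is exactly what the technique buys you.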
When you have a few more data points, you can start to do fancier things like fitting a linear regression line to your scatterplot. The example below is the same graph as above, but with a straight line of best fit through all the points.
This method is used by HuffPollster, among others, when you have enough data points for something fancier than a moving average, but not enough for something really fancy.
The limit of this approach is that a straight line will never fit perfectly to political data like this. The R² figure below the legend indicates how well the line fits the data, with 1 denoting a perfect fit. The line fitted to PL's scatterplot actually performs decently in this regard; however, PN/FN's fit of 0.38 is pretty poor, and stems from the inability of straight lines to account for all the spread in FN's polls.
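To make the R² idea concrete, here is a least-squares fit with the goodness-of-fit computed by hand: R² is one minus the ratio of residual variance to total variance. The poll shares below are invented for the example, not the real series.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight line through (x, y), plus the R^2 score:
    1 minus the ratio of residual variance to total variance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1 - ss_res / ss_tot

# Months since April 2016 vs illustrative PL shares (not the real polls)
months = [0, 2, 4, 6, 8, 10, 12]
shares = [49.5, 50.2, 50.0, 51.1, 50.8, 51.6, 52.0]
slope, intercept, r2 = fit_line(months, shares)
print(f"slope={slope:.2f}/month, R^2={r2:.2f}")
```

A steady upward drift like this fits a straight line well (R² near 0.9); a series that swings up and down, like FN's, would leave far more unexplained spread and a much lower score.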
Given enough data points, we can perform a more sophisticated procedure that gives us a curve, eliminating some of the above problems. There currently isn't enough recent polling data to get a decent result, so I had to go back to January 2015.
Just a caveat here: since this makes use of such old numbers, it's pretty meaningless; I intend it only as a demonstration of what can be done, given enough released polls.
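As a rough sketch of why a curve helps, the example below fits a quadratic polynomial, the simplest fit that can bend, to an invented poll series; aggregators typically use local regression (LOESS) rather than a single polynomial, but the principle is the same. Because the straight line is a special case of the quadratic, the curve's residual error can never be worse.

```python
import numpy as np

# Illustrative poll series (not real data): index 0 is the oldest poll.
x = np.arange(10, dtype=float)
y = np.array([46.0, 46.8, 47.1, 48.5, 48.2, 49.0, 50.3, 50.1, 51.2, 51.8])

# A quadratic can bend where a straight line cannot.
line = np.polyval(np.polyfit(x, y, deg=1), x)
curve = np.polyval(np.polyfit(x, y, deg=2), x)

def resid(fitted):
    """Sum of squared residuals of a fitted series against the polls."""
    return float(np.sum((y - fitted) ** 2))

print(resid(curve) <= resid(line))  # the curve never fits worse than the line
```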
The moving average indicates a 5.5-point lead in favour of PL (51.8–46.3). The linear regression says about the same (~6 points for PL), and this is from polls whose margins of error are in the 3% to 4% region. The margin of error isn't that much of a concern here either, since it's pretty unlikely that Xarabank, the Independent and MaltaToday would all have consistently reported a PL lead when there was none.