Statistics can’t predict big changes very well
A lifetime ago I was at a party in DC. Out on the front steps I met a pollster from the Hillary Clinton campaign and we had a fairly heated discussion about her chances of winning. I was concerned about the Brexit effect — shy Trump voters and a different voting population than we’ve seen in recent memory. He was confident in his campaign’s internal polling because of its superior likely voter model. At the end we agreed to disagree.
I bring this up because it highlights one of the most important caveats of big data analytics and statistical models in general: statistical models cannot predict futures that don’t look like the past. Statistical models (such as those employed by FiveThirtyEight and The Upshot) use the past to predict the future. This can be done in a variety of ways, some more intricate than others, but at the end of the day all statistical models share the same core strategy of using past relationships to make forecasts.
These models work as long as those relationships hold true. For example, a model might predict income (predictand) based on the number of years someone spent in school (predictor). There may be a positive correlation here: the longer someone is in school, the higher their income will be. You might imagine a model like this that uses data from 1990–2010.
It’s not hard to imagine that in the future this relationship won’t be as strong. Perhaps between 2011 and 2020 income falls flat around 5 or 6 years in school, effectively meaning that people with PhDs aren’t making any more money than those with Masters.
How does a model cope with such a changing relationship? Changing trends can be incorporated into models if they’re known when the model is made. They can even be incorporated, to an extent, if they’re suspected; that is, you can hedge against uncertainty. However, if the relationship between the predictor and predictand changes unexpectedly in a way that wasn’t found in the initial set of training data or accounted for in some other way, then the model can’t make an accurate prediction.
This is the Achilles’ heel of statistical models: the less the future looks like the past, the harder it is to predict.
We’ll continue to use statistics to its fullest capacity; it’s an incredible tool that allows us to see and predict the world in incredible ways. But perhaps now we’ll be more cautious about how we interpret statistical results, and that’s not necessarily a bad thing. Science is rooted in a healthy skepticism of the world and data science shouldn’t be any different.