We already had a Sokal moment in “data science,” one far bigger than Sokal’s: “Landon defeats Roosevelt.”
The basic problems are disturbingly similar to what often happens today: the data set assembled by the Literary Digest, the magazine that ran the survey, was huge, its methodology was supposedly validated by its success in predicting previous presidential elections, and so forth. The big problem that was neglected was sampling. The data pools the magazine had used in earlier elections were reasonable approximations of the electorates of their day, but the circumstances specific to 1936 ensured that the sample and the likely electorate were nothing alike. Ironically, George Gallup, a guy with a far smaller sample, a better hunch about the changing relationship between the sample and reality, and a willingness to play that hunch somewhat “unscientifically,” made his name by predicting the result far more accurately than the Big Data (TM) assembled by the Literary Digest.
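To make the sampling point concrete, here is a toy simulation, not a reconstruction of the 1936 polls. Every number in it is made up for illustration: the share of voters reachable through the sampling frame (`p_frame`), the support rates inside and outside that frame, and both sample sizes are assumptions. It compares a huge sample drawn only from an unrepresentative frame with a small sample drawn at random from the whole electorate.

```python
import random


def simulate(seed=1936):
    """Compare a huge biased sample with a small random one (toy numbers)."""
    rng = random.Random(seed)

    # Hypothetical electorate, NOT historical figures:
    p_frame = 0.30           # share of voters reachable via the sampling frame
    support_in_frame = 0.40  # support for the incumbent inside the frame
    support_outside = 0.65   # support for the incumbent outside the frame

    # True support across the whole electorate.
    true_support = p_frame * support_in_frame + (1 - p_frame) * support_outside

    def draw_voter(frame_only):
        # A frame-only poll never reaches voters outside the frame.
        in_frame = True if frame_only else rng.random() < p_frame
        p = support_in_frame if in_frame else support_outside
        return rng.random() < p  # True means "votes for the incumbent"

    n_big, n_small = 200_000, 2_000
    big_biased = sum(draw_voter(True) for _ in range(n_big)) / n_big
    small_random = sum(draw_voter(False) for _ in range(n_small)) / n_small
    return true_support, big_biased, small_random


if __name__ == "__main__":
    true_support, big_biased, small_random = simulate()
    print(f"true support:          {true_support:.3f}")
    print(f"huge biased sample:    {big_biased:.3f}")
    print(f"small random sample:   {small_random:.3f}")
```

Under these assumed numbers, the huge sample converges very precisely on the wrong answer, because more data only shrinks the noise and does nothing about the bias. The small random sample is noisier but lands near the truth.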
We don’t seem to have really learned from this. The same problems showed up in 2016 and, to a lesser degree, in 2012. (The polls underpredicted Obama’s vote by a couple of percentage points, which should have drawn attention, because a small number of Obama voters would switch to Trump four years later. He was predicted to win, but according to the polls the race should have been much closer than it turned out. Rove knew this, which is why he was apoplectic when the results turned out to be much less of a contest than expected. But the gullible public decided that Rove was foolish and that “big data” had correctly predicted the results, notwithstanding errors that were very significant, especially in retrospect.)
The real benefit of data is not that it helps confirm what we think about reality, but that it shows the different ways our preconceptions are wrong. The more learned we are, the subtler our errors become, but those errors also provide clues as to how our present thinking might be wrong in a much bigger, more spectacular fashion once things change. This is often lost in the triumphal talk about how data is changing the world.