Learning From @NateSilver538's OMG-Wrong #Bra vs #Ger Prediction
Sometimes the Hedgehog Knows a Thing or Two
So, that was a bit of a shocker, to say the least. Brazil lost 7-1 to Germany in the World Cup semi-finals—this rarely happens to Brazil, and it rarely happens to any team at that level.
In another shocker, the prediction site FiveThirtyEight got it so wrong, projecting a 65% chance of a win by Brazil, even after adjusting for the loss of their star Neymar and team captain Silva. It’s true the result was an outlier, and the site has its own explanation up, which comes down to “this was an unlikely result.”
And it’s true that soccer is a fickle game.
But that’s not all there is here.
There are systematic reasons why such predictive efforts sometimes go so awry in this particular manner, some of which apply here.
This wasn’t a case of bad luck, a last-minute goal in an evenly matched 1-0 game, but absolute and utter dominance by one team, Germany, which clearly did not have a two-in-three chance of losing this match. So it deserves more examination than “this was The Most Shocking Result in World Cup History”, as FiveThirtyEight argued.
This was more than a fluke—or an errant “Black Swan” event. Rather, as I’ll argue below, it was a combination of measurement error and a revelatory cascade, mixed up, as always, with human psychology—hence an example we can learn from rather than an anomaly to dismiss. Even better, there are steps one can take to be less vulnerable and to recognize the limits of this method (every method has limits), and I’ll even suggest a few.
To start with, I should add that I’m a fan of statistical and predictive analytics in many forms and fields, so this is not a “let’s never do this” but musings on how to do this better, and how to be more informed consumers and producers of statistical predictions.
Problem One: Ignoring Measurement Error in Your Data
The first problem Nate Silver’s site suffered from is ignoring measurement error in the data, otherwise known to researchers as “the map is not the territory.” All measurements are partial, imperfect reflections. We are always in Plato’s Cave. Everything is a partial shadow. There is no perfect data, big or otherwise. All researchers should repeat this to themselves, and even more importantly, to the general public, to avoid giving the impression that some kinds of data have a special magic sauce that makes them error-free. Nope.
Remember when FiveThirtyEight had a story on how Nigerian kidnappings had increased, and most were concentrated around the capital city, Abuja? The problem was they were using GDELT, a database that uses news sources to create aggregate compilations. As such, of course, it is systematically biased: it doesn’t measure kidnappings, it measures news of kidnappings, which of course varies depending on attention and how easy it is for reporters to report. Nearer the capital? More likely to be news. Does it mean more kidnappings overall? Maybe, maybe not. Many substantive experts piped up in objection and FiveThirtyEight later corrected their story.
The measurement error in the World Cup case was simple: FiveThirtyEight and other sites had marked Brazil as having a strong defense, and a solid offense anchored by its star, Neymar, as measured by a statistical amalgamation called the Soccer Power Index. In reality, Brazil had been aggressively fouling its way through the tournament as a means of defense, elbowing and kicking, and not getting called for it by referees. I’m not making this up as day-after-big-loss armchair analysis: most soccer punditry had been clear on this before the game. [Added: Here’s a neat post I just found on the perils of doing Bayesian analysis in a data-poor environment, such as this one, and how “plain evidence of the senses” and “data” can diverge.]
In other words, the statistics were overestimating how good a team Brazil really was, and the expert punditry was fairly unified on this point.
In other words, this time, the hedgehogs knew something the fox didn’t. But this fox is often too committed to methodological singularity and to fighting pundits, sometimes for the sake of fighting them, so it often doesn’t like to listen to non-statistical data. In reality, methodological triangulation is almost always stronger, though harder to pull off.
The Fix? Find Experts You Trust and/or Do Qualitative Pull-Outs
Instead of the aggressive pundit-versus-data stance taken by some big data proponents, it’s important to recognize that substantive area experts are often pretty good at recognizing measurement errors. (Exceptions are cases like the 2012 election, where I solidly came down on the side of Nate Silver-type aggregate analysis, as the pundits had an incentive to make the race seem a lot closer than it really was. In fact, any political scientist worth her salt had put the race down to Obama around July, barring huge surprise events, hence the “October Surprise”.)
If the substantive experts are deemed unreliable, another option is “qualitative pull-outs” of your data to check for measurement error. Watch a game with, say, three experts, and count the uncalled fouls and the specious, undeserved penalty shots as judged by the experts. This can even be quantified as an index of measurement error based on qualitative examination (which will have its own measurement error, because it’s turtles all the way down, folks—but intercoder reliability, a technical way of saying “how much we all agree,” can give a sense of the scale of the error).
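To make that concrete, here is a minimal sketch in Python of how such a pull-out could be scored, assuming three hypothetical experts label the same set of incidents; the labels and the simple pairwise-agreement measure are my own illustrative assumptions, not a claim about how any site actually does this.

```python
# A minimal sketch of a "qualitative pull-out": three hypothetical experts
# watch the same game and label each incident as a foul (1) or not (0).
# Pairwise percent agreement gives a crude sense of how noisy the
# qualitative measurement itself is. The labels below are invented.
from itertools import combinations

expert_calls = {
    "expert_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "expert_b": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    "expert_c": [1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
}

def percent_agreement(calls_a, calls_b):
    """Share of incidents on which two coders gave the same label."""
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return matches / len(calls_a)

pairwise = {
    (a, b): percent_agreement(expert_calls[a], expert_calls[b])
    for a, b in combinations(expert_calls, 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 2))

# Average agreement as a crude intercoder-reliability index:
# low agreement means even the "expert eyes" measurement is shaky.
print("mean agreement:", round(sum(pairwise.values()) / len(pairwise), 2))
```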
Another example I like to give for measurement error in big data is hashtag analyses: for example, of the Gezi protests in Turkey. I was on the ground during the event, doing research, interviewing, and also monitoring social media. People stopped using hashtags at the height of the protests because the topic dominated the conversation online! That form of measurement and reality completely diverged, leading to huge measurement error if you were going by hashtags, as many researchers were.
Such misfires of measurement between offline reality and online imprint, between what the team statistics tell us and what the team really is like, are common, mundane, and unavoidable. Nobody should be blamed for their existence, but everyone should be on the lookout for them. It means that there is an unquantifiable error field around all our statistically calculable error bars. This is hard to quantify, but perhaps predictive sites can add another index: how confident they are in the prediction, the way some weather sites now do. The more measurement error you suspect, the lower the confidence in the prediction. Yes, it is messy and not as neat as a percentage we can trot out, but life is messy.
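Here is a sketch of what such a confidence adjustment could look like; the shrinkage rule and the reuse of the 65% headline number are my own illustrative assumptions, not anyone’s actual method.

```python
def hedged_probability(model_prob, suspected_error, baseline=0.5):
    """Shrink a model's win probability toward an uninformative baseline.

    suspected_error is a judgment call in [0, 1]: 0 means we trust the
    underlying statistics fully, 1 means we think they measure almost
    nothing about the real team. This is an illustrative heuristic,
    not FiveThirtyEight's method.
    """
    if not 0.0 <= suspected_error <= 1.0:
        raise ValueError("suspected_error must be between 0 and 1")
    return (1 - suspected_error) * model_prob + suspected_error * baseline

# If experts think Brazil's defensive stats were inflated by uncalled fouls,
# a 65% headline number might honestly be reported closer to a coin flip.
print(hedged_probability(0.65, suspected_error=0.6))  # -> 0.56
```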
Problem Two: Ignoring Field (or Broad) Effects
Many data analytic efforts look only at the network (or the team at hand) without considering how events affect the whole field.
For example, many analyses of the initial Egyptian uprising that looked at Egyptian networks of information ignore a major event that impacted the whole field: the Tunisian revolution. Let’s say Asma and Bilal in Egypt are talking about joining a protest in very early January of 2011. The same Asma and Bilal talking about joining a protest one week later, after Ben Ali fled Tunisia, are not the same Asma and Bilal, because what happened in Tunisia has affected them both. The whole calculus of possibility has been altered because of Tunisia: the network is not the same network that can be analyzed in simple continuity.
Similarly, the Brazilian team without Neymar and Silva isn’t just the Brazilian team minus two good players, to be replaced by other reasonable players, as FiveThirtyEight suggested in its analyses and tweets.
This was a team that had been subjected to enormous psychological pressure, and it acted like someone had died in a catastrophe. Since humans have this thing called psychology, it’s not easy to run data analysis by replacing one human with another as if they were Lego pieces and looking at the resulting structure. It rarely works that way. In fact, FiveThirtyEight’s original analysis mentions this factor, but their predictive score seems completely unaffected by this reality.
The Fix? Recognize Frailties in Studying Human Endeavors
This one is hard because this is a structural feature of most human endeavors—and one that disciplined efforts like militaries and organized sports try to minimize through extensive training and drilling so people do act like cogs in a machine, even under pressure. Still, though, it’s hard to fully account for. In other words, FiveThirtyEight as a site could get as brilliant as it wants, but this would still add a hard-to-quantify error term to the equation. It’s okay; every method has strengths and limitations and this one affects all methods of studying humans. It’s a reminder to all of us not to oversell one method.
It should be possible, though, over time, to run analyses of how often standardized predictions were wrong, thus developing an index of reliability for a statistical method. Different methods for different fields will have different indices, allowing us to gauge unpredictability. For example, more disciplined and higher-scoring games, say basketball, likely have MORE predictive reliability than soccer. This suggests that analytic teams should keep close track of how wrong they are, and build that into their predictive models.
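One standard way to keep such a score, offered here as a sketch with invented numbers rather than a description of any site’s practice, is a per-sport Brier score: the average squared gap between the stated probability and what actually happened.

```python
# A minimal sketch of a per-field reliability index using the Brier score:
# the mean squared gap between the stated probability and the outcome
# (1 if the favored side won, 0 otherwise). Lower is better; 0.25 is what
# you would get by always saying 50%. All numbers here are made up.
from collections import defaultdict

past_calls = [
    # (sport, predicted probability for the pick, did the pick win?)
    ("basketball", 0.80, 1),
    ("basketball", 0.70, 1),
    ("basketball", 0.65, 0),
    ("soccer", 0.65, 0),
    ("soccer", 0.55, 1),
    ("soccer", 0.60, 0),
]

def brier(records):
    return sum((p - outcome) ** 2 for _, p, outcome in records) / len(records)

by_sport = defaultdict(list)
for record in past_calls:
    by_sport[record[0]].append(record)

for sport, records in by_sport.items():
    print(sport, "Brier score:", round(brier(records), 3))
# A consistently higher soccer score would quantify the intuition that
# low-scoring, fluky games carry less predictive reliability.
```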
Problem Three: Humans Are Not Gases in a Chamber, but Reflexive Beings Who React to Events
A problem with much statistical analysis is ignoring the fact that humans, umm, react to things around them. (The social science jargon for this is reflexivity.) I know this seems so simple, but it’s amazing how often predictive analytics doesn’t factor this in.
For example, I sometimes get summoned to participate in crisis informatics projects, some of which are fine, but many of which aim to do predictive analytics, trying to predict outcomes like who will be a good source of information during a crisis, say an earthquake. As I repeatedly tell such project owners (many of whom are vying for boondoggle contracts from the government), you cannot predict how specific people will behave after a significant trauma, such as living through a major earthquake. You can’t know who will remain calm and collected, who will be devastated and unable to function, and who will have juice left in their battery to be able to act as an information conduit. It’s too stochastic. There is value in studying crisis informatics, but little value in such predictions.
Similarly, a soccer game is composed of humans reacting to events, hence the prediction one makes at the first minute isn’t going to work the same way once the game starts, because players will react to what happens in minute eleven! In psychological ways! Anyone who watched the game could see this: after the first goal, the Brazilian team unraveled, causing more goals, which resulted in a revelatory breakdown.
Revelatory breakdowns are interesting because they reveal things, rather than acting like “Black Swan” events that come from nowhere and don’t necessarily tell us things we did not know or should have known. Instead, these types of events are rare, but not unheard of by any means. In fact, similar seemingly unpredictable but obvious-after-the-fact cascades haunt the study of uprisings and revolutions.
Why did the Iraqi army decide to turn tail and run, seemingly all at once, when ISIS attacked Mosul? Why did the Shah of Iran fall so quickly in 1979 after seeming so stable just a few months earlier? Mubarak’s fall in 2011? The list is long.
Here’s the crux of the issue: such cascades of behavior are often hard to predict, but once they happen, they seem inevitable. Brazil did not have that strong a team, the Iraqi army wasn’t really coherent, the Shah of Iran was widely hated, and Mubarak was a corrupt autocrat.
But nobody predicted all this the week before it happened.
What gives?
This cascading behavior has to do with something political scientists call pluralistic ignorance. This is the scenario in which I’m willing to withhold my private belief that the Shah of Iran is awful, while pretending I’m a regime supporter. Once, however, a few people start a cascade, an offline or online protest, revealing their true beliefs, it becomes easier for me to join them, which in turn strengthens the cascade. The feedback loop then creates a rapid cycle of events that culminates so quickly that it all seems inevitable, yet also appears to come out of the blue.
That is why the event revealed by the cascade seems so obvious after the fact. If you think about it, it’s clear that this is actually a subset of measurement error. The Brazilian team wasn’t that good, and most Iranians wanted the Shah out. Bad refereeing in the former case and repression in the latter made it harder to recognize this measurement error, because the former hid lousy play while the latter hid private dissent. But, once again, substantive area experts are almost always aware of this reality: unfortunately, it doesn’t help predict when a cascade will break through. The event may seem predictable, but the timing, and whether it happens at all, almost certainly are not.
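For readers who want the mechanics, here is a minimal sketch of a threshold cascade in the spirit of Granovetter’s classic riot model; the thresholds are invented, and the point is only that a tiny, invisible difference in private thresholds separates “nothing happens” from “everyone joins.”

```python
# A minimal sketch of a threshold cascade: each person joins once the
# number of people already visibly dissenting reaches their private
# threshold. The two populations below differ by a single threshold,
# yet one produces a full cascade and the other fizzles, which is why
# such events look inevitable only in hindsight. Numbers are illustrative.

def cascade_size(thresholds):
    """Return how many people end up joining, given each person's private
    threshold: the count of prior joiners they need to see before acting."""
    joined = 0
    while True:
        now_joined = sum(1 for t in thresholds if t <= joined)
        if now_joined == joined:
            return joined
        joined = now_joined

# One person with threshold 0 starts; everyone else needs one more joiner
# than the person "before" them.
full_chain = list(range(100))                    # 0, 1, 2, ..., 99
broken_chain = [0, 2] + list(range(2, 100))      # nobody has threshold 1

print(cascade_size(full_chain))    # 100: the whole crowd joins
print(cascade_size(broken_chain))  # 1: only the instigator acts
```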
The Fix? Recognize That Stability and Instability Are Not That Far Apart, for Structural Reasons
This, too, comes under the “hard to predict exact timing, but not hard to substantively discuss the possibility before it happens” category. It means recognizing that models that are snapshots in time are just that—and that events themselves carry force through feedback loops. Sports analytics has often debunked the so-called “hot hand,” in which players erroneously believe that there is some metaphysical reason for their streak of luck when it is just that—luck and ordinary odds. However, the kind of unraveling the Brazil-Germany World Cup semi-final witnessed is not ordinary bad luck, or a mere “unlikely event” as FiveThirtyEight had it. These events are a combination of measurement error and human psychology, exposed in seeming phase transitions from a stable dictatorship to a revolution, or from a World Cup favorite to a second-rate team, almost in the blink of an eye. They are interesting, and rare, but I wouldn’t call them anomalies, as they are a regular feature of human social and group endeavors.
Overall, I hope it is clear that my point is that statistical and predictive analytics of human collective behavior have real strengths and real weaknesses. (Here’s my academic paper that makes some of these arguments in more depth.) Methodological awareness of limitations and calls for multi-method triangulation are not attacks on a method, but the way to make it stronger and more resilient.
I’m not suggesting that this was an easy-to-predict result. Not at all. However, there are some systematic ways that statistical predictions fail, and we can learn from them, rather than dismissing them as mere outliers.
But tweeting OMG is okay. We all did.