Time to stop tipping
Leicester City, Donald Trump, the Chicago Cubs, Brexit, even the Western Bulldogs and the Cronulla Sharks here in Australia… so many events happened this year that supposedly “no one saw coming”. But is that really a fair statement? Most often we simply don’t have enough information to make that call.
Take Nate Silver and his well-publicized forecast of the Electoral College votes in last Tuesday’s US election. On the morning of November 8th it read:
Hillary Clinton: 302.2
Donald Trump: 235.0
This is a projection of the average outcome: the only claim being made is that, if we could repeat this election a very large number of times, Trump would average 235 electoral votes per election and Clinton 302.2. Even in hindsight (latest results have Trump winning 290 votes to Clinton’s 232) there is not much wrong with that projection. The actual result is admittedly far from the model’s expectation, but this could be down to (bad) luck rather than a flaw in the model. A better question to ask is: how unlikely was this event according to Silver’s model? If last Tuesday’s result was not even in the range of outcomes considered possible, then we would have clear proof that the model was wrong. In any other case, more information is needed before any conclusion can be drawn.
To Nate Silver and his team’s credit they did provide information about the modeled distribution of outcomes:
What to do with that? Probably not much at this stage, but the one thing the forecast does show is that Trump winning 290 votes or more was an outcome the model considered possible, albeit rather unlikely. Similarly, Trump exceeded expectations in the popular vote, but not to an extent that was inconceivable under this model. The only conclusion we can draw from the 2016 result alone is that Silver was not blatantly wrong. Any deeper analysis will require more data points.
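To make the “how unlikely was it?” question concrete, here is a toy Monte Carlo sketch. The 235-vote mean comes from the forecast above, but the 60-vote spread and the normal shape are made-up stand-ins for illustration, not FiveThirtyEight’s actual model:

```python
import random

random.seed(42)

MEAN_TRUMP_EV = 235.0   # model's expected electoral votes for Trump (from the forecast above)
SD_TRUMP_EV = 60.0      # hypothetical spread -- NOT the real model's value
N_SIMS = 100_000

# Simulate many "re-runs" of the election under the toy model.
sims = [random.gauss(MEAN_TRUMP_EV, SD_TRUMP_EV) for _ in range(N_SIMS)]

# How often does the toy model produce a result at least as extreme
# as the actual outcome (Trump winning 290 electoral votes)?
p_at_least_290 = sum(s >= 290 for s in sims) / N_SIMS
print(f"P(Trump >= 290 EV) under toy model: {p_at_least_290:.3f}")
```

Under these invented assumptions the actual result sits comfortably inside the simulated range: surprising, but nowhere near “impossible”, which is exactly the point.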
Moving on to the less depressing world of sports, the year has been rich in fairy-tale headlines of underdogs winning against all odds. As we’ve all heard by now, Leicester City won the 2015/2016 English Premier League and clearly no one saw it coming… but should anyone have? Absolutely not, would be my answer. The fact that Leicester did win the league should not disguise the fact that their chance of doing so was likely very (very) small. Any model tipping a first-place finish for Leicester back in August 2015 would have had to consider it the most likely of all possible table positions. Even in hindsight it seems impossible to identify any early signs that would have justified such a bold prediction. Leicester surprised everyone, yes, but they were also very lucky, and that luck component was by definition not predictable.
So how can we judge predictions? First we need to ask for more than straight-up tips. In the spirit of the fivethirtyeight.com figures above, a modeler should ideally provide a likelihood for every possible outcome. In the case of Leicester I suspect most models actually attached a 0% likelihood to Leicester ending the season as champions, but in most cases we will never know, as the only thing recorded is that they “tipped” another team for the title.
Tony Corke, modeler and blogger behind the excellent matterofstats.com website, routinely provides heat maps of projected table positions for Australian Football League (AFL) teams. The Figure below shows such a projection after 14 rounds had been played in the 2016 season.
This is a much more honest type of prediction than a straight tip, as it fully embraces how much uncertainty surrounds the outcome. If asked to pick who would top the table at the end of the season, Tony would have had to go for Geelong at that stage (although he estimated there was a 54.2% chance they would not), throwing away most of his hard work in the process. Yes, it’s great fun when we can claim to have “called the result”, but tipping a single outcome really does not make any sense.
The flip side of this argument is what most people perceive as a lack of commitment: by assigning a non-zero probability to every possible outcome, a modeler can always get away with blaming luck. While this is true when we consider a single result in isolation, the excuse will struggle to hold once enough data is gathered. Think about Leicester City again, and suppose we had a model with the same level of detail as Tony’s. Suppose that model had given Leicester a 1% chance of winning the league back in August 2015 (which would have been pretty optimistic at the time). We might get away with arguing the Leicester title was a 1-in-100 outcome, but what if we had also given a 1% chance to the teams eventually winning the Spanish, German, Italian and French leagues that year? And what if something similar had already happened the year before? The run of luck would start to look extremely unlikely, and so would the idea that the model has any real skill. The true skill of a model can only be identified once we have gathered a large volume of predictions and results, at which stage blaming luck won’t be enough to disguise model flaws.
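The arithmetic behind that “run of luck” argument is simple enough to sketch. The 1% figures are the hypothetical ones from the paragraph above, and independence between leagues is assumed:

```python
# Hypothetical scenario: the model gives each of five league winners
# (England, Spain, Germany, Italy, France) only a 1% chance.
p_single = 0.01
n_leagues = 5

# Joint probability that all five long shots come in, assuming the
# leagues are independent of each other.
p_all = p_single ** n_leagues
print(f"P(all {n_leagues} long shots win) = {p_all:.0e}")
```

Roughly 1 in 10 billion: at that point “we were just unlucky” stops being a credible defence, and the model’s claimed skill is the far more likely casualty.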
Unlike many other fields where predictions are routinely made, sports can quickly provide the large number of data points needed for model evaluation. Yet we still mostly stick to boldly tipping winners and catchy statements such as “no one saw it coming”. Even more frustrating is the fact that the way forward is not new, and has been successfully documented and implemented in forecasting tournaments by Philip Tetlock, including in his work for the US Intelligence Advanced Research Projects Activity (IARPA). Here are some selected quotes from his book “Superforecasting: The Art and Science of Prediction”:
“Fuzzy thinking can never be proven wrong. And only when we are proven wrong so clearly that we can no longer deny it to ourselves will we adjust our mental models of the world — producing a clearer picture of reality. Forecast, measure, revise: it is the surest path to seeing better.”
“Consumers of forecasting will stop being gulled by pundits with good stories and start asking pundits how their past predictions fared — and reject answers that consist of nothing but anecdotes and credentials.”
“if we are serious about measuring and improving, this won’t do. Forecasts must have clearly defined terms and timelines. They must use numbers. And one more thing is essential: we must have lots of forecasts.”
“Laws of physics aside, there are no universal constants, so separating the predictable from the unpredictable is difficult work. There is no way around it.”
“More often forecasts are made and then … nothing. Accuracy is seldom determined after the fact and is almost never done with sufficient regularity and rigor that conclusions can be drawn. The reason? Mostly it’s a demand-side problem: The consumers of forecasting — governments, business, and the public — don’t demand evidence of accuracy. So there is no measurement. Which means no revision. And without revision, there can be no improvement”
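One concrete way to do the “measuring” Tetlock demands is a proper scoring rule such as the Brier score, the metric used in his forecasting tournaments: the mean squared difference between the forecast probability and what actually happened. A minimal sketch, with entirely made-up track records:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities (in [0, 1])
    and outcomes (1 if the event happened, 0 if not). Lower is better."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical forecasters scored over the same ten events:
sharp = [0.9, 0.8, 0.1, 0.7, 0.95, 0.2, 0.6, 0.85, 0.1, 0.75]
hedger = [0.5] * 10  # always says 50-50, so can never be "wrong"
happened = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

print(brier_score(sharp, happened))   # ~0.045: confident and well calibrated
print(brier_score(hedger, happened))  # 0.25: never wrong, never informative
```

Scores like this only become meaningful once accumulated over many forecasts, which is exactly why single tips, and single results, tell us so little.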
After all the debate around prediction models this year, it’s probably time we started embracing this philosophy in mainstream forecasts too. Here is how we might get started:
Let’s stop tipping winners.
Instead, let’s aim to quantify the likelihood of all outcomes (including the very unlikely).
Let’s start recording everything, predictions and results, with great detail.
Let’s be proven wrong, then let’s reflect and learn from it.
Well, basically: “Forecast, measure, revise: it is the surest path to seeing better.”
“Superforecasting: The Art and Science of Prediction”, 2015: Tetlock, P.E. and D. Gardner. New York: Crown.
Articles by Tony Corke for the Guardian Australia:
FiveThirtyEight 2016 Election Forecast:
Also see Simon Gleave’s prediction tournament for the English Premier League to get an idea of how lowly experts, fans and modelers ranked Leicester City’s chances in August 2015: