Co-authored by Tom Liptay and Michael Story
There has been a lot of criticism of experts’ failure to forecast COVID-19. This post explores whether that criticism is justified, how to properly judge forecast accuracy, and how to hold any forecaster accountable.
Judging experts is difficult because very few have made scorable forecasts. Most of the expert analysis reported in the media surrounds modelers, who make assumptions about the virus, testing, social distancing, etc., and then tell us how many people will die if those assumptions hold. This is incredibly useful for thinking through policy response and planning, but it is not a scorable forecast. The modeler isn’t saying what they think will happen; rather, they are saying that assumptions A, B, and C imply outcome D. This distinction is easy to miss: many articles attribute a forecast to a modeler, while the modelers insist they didn’t make a forecast, just a model.
Another type of forecast is a point estimate. For example, how many confirmed cases of COVID-19 will there be in the United States next week? An expert providing a point estimate of 600,000 is potentially useful, but it is hard to hold the expert accountable. If the actual number comes in at 610,000, was the expert’s forecast good? What if it were 601,000? Or 710,000? We don’t know, because the expert did not tell us anything about their uncertainty. Were they 90% sure the outcome would be between 599,000 and 601,000? Or 90% sure the outcome would be between 10,000 and 1,000,000? In situations like COVID-19, where there are so many unknowns, being off by a factor of 2 might be considered accurate. In other cases, an error of 1% might be bad.
To fairly assess experts’ accuracy and calibration, we need to ask experts for their probability estimates or confidence intervals on resolvable questions. For instance, we can ask an expert to assign an 80% confidence interval to the number of cases next week. They would provide a lower and upper bound, such that they believe there is an 80% chance that the outcome will be between those bounds (and a 20% chance that the outcome will be outside the bounds). Alternatively, we could ask a binary question: Will the number of cases exceed 100,000 next week? The expert would answer by providing a probability that the answer will be yes. These kinds of questions allow us to assess forecasts in a way that point estimates do not. Specifically, we can measure an expert’s calibration, i.e., do 80% of the outcomes lie within their 80% confidence intervals? Very few experts are providing forecasts that can be scored, making it quite difficult to hold them accountable.
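Checking calibration against confidence intervals is a simple computation. Here is a minimal sketch in Python; the intervals and outcomes are made-up illustrative numbers, not data from any real survey:

```python
def coverage(intervals, outcomes):
    """Fraction of outcomes that fall within the stated intervals."""
    hits = sum(lo <= x <= hi for (lo, hi), x in zip(intervals, outcomes))
    return hits / len(outcomes)

# Each tuple is one hypothetical expert's (lower, upper) 80% interval.
intervals = [(30_000, 60_000), (25_000, 80_000), (40_000, 150_000),
             (20_000, 50_000), (35_000, 120_000)]
outcomes = [55_000, 90_000, 130_000, 48_000, 70_000]

print(f"Empirical coverage: {coverage(intervals, outcomes):.0%}")
# Well-calibrated 80% intervals should cover ~80% of outcomes in the
# long run; with only a handful of forecasts the estimate is noisy.
```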
One exception is the commendable survey of experts administered by Thomas McAndrew and Nicolas Reich from the University of Massachusetts. Every week they have been asking a panel of experts from industry and academia to provide confidence intervals and probabilities to a range of COVID-19 questions. Both the UMass researchers and the experts involved deserve praise for taking this brave step towards forecast accountability. Individual forecasts are not provided publicly, but it does appear that the anonymized forecasts were provided to 538 for their series of articles on the survey. This allows us to measure their calibration.
On the surface, the outcome for the March 29 US case count looks bad for the experts. Only 3 of the 18 experts had the outcome (122,000 to 139,000 cases, depending on whether you prefer the CDC or the COVID Tracking Project as a source) within their 80% confidence interval, and all estimates were too low. Critics argue that this points to miscalibration. After all, only 3/18 ≈ 17% of the forecasts had the outcome within their 80% confidence interval. Doesn’t that demonstrate miscalibration? Not necessarily.
Imagine we asked 18 expert gamblers for their 10/90 confidence intervals on the spin of a roulette wheel with the numbers 1 to 100. (A properly calibrated forecaster will provide a 10/90 confidence interval such that 10% of outcomes fall below their lower bound, 10% fall above their upper bound, and 80% fall between the bounds.) They might all say 10 to 90, which is the objectively correct answer. Now imagine we spin the roulette wheel and the outcome is 98. It falls outside all 18 of the gamblers’ confidence intervals. Does this imply the gamblers are miscalibrated? No. In this case the outcomes of all 18 forecasts are perfectly correlated: while we have 18 forecasts, they amount to only one independent observation.
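The roulette example can be made concrete with a quick simulation. This sketch contrasts one shared spin (where coverage is all-or-nothing for every gambler) with many independent spins (where coverage converges to the true rate):

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def in_interval(x, lo=10, hi=90):
    """Is the spin inside the [10, 90] interval (inclusive)?"""
    return lo <= x <= hi

# One spin, 18 gamblers with identical [10, 90] intervals: the 18
# "forecasts" share a single outcome, so coverage is 0% or 100%.
spin = 98
one_spin_coverage = sum(in_interval(spin) for _ in range(18)) / 18
print(one_spin_coverage)  # 0.0 despite perfectly calibrated intervals

# Many independent spins: coverage converges to ~81/100
# (81 of the numbers 1..100 lie in [10, 90] counting both endpoints).
spins = [random.randint(1, 100) for _ in range(100_000)]
many_spin_coverage = sum(in_interval(s) for s in spins) / len(spins)
print(round(many_spin_coverage, 2))
```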
Another critique is that the upper bounds on 14 of the 18 forecasts were less than 50,000 cases, which is quite far from the outcome of 139,000. This seems to have some merit. But we can imagine a well-calibrated expert who thought about fat tails and could have had a 90th percentile upper bound of 50,000 with a 95th percentile forecast of 140,000. This hypothetical expert would have had the outcome fall within his 90% confidence interval. While we think it is unlikely this was the case, we can’t rule it out.
No forecast is definitively ‘wrong’ unless it was 0% and the event happened, or 100% and it didn’t. Interestingly, we can never know if a forecast was right, even if someone forecasts 100% and the event occurs. Imagine forecasting a coin flip. We know the correct forecast is 50%. If someone forecasts a 100% chance of heads and it comes up heads, they will appear to have perfect accuracy. But their forecast is still objectively wrong, and their accuracy will be poor over the long term. Drawing a conclusion from a single forecast is close to meaningless.
As the number of independent forecasts increases, we can empirically measure with increasing confidence whether an expert or group of experts is well calibrated. In other words, when they say there is an 80% chance of an outcome falling within their confidence interval, does it actually happen 80% of the time? A common way to visualize this is to plot a calibration curve with forecast % on the x-axis and outcome % on the y-axis. A perfectly calibrated forecaster will have points along the 45-degree diagonal. An overconfident forecaster’s points might slope upward but at a shallower angle, i.e. when they forecast 90%, the outcome occurs only 80% of the time.
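Computing the points of such a calibration curve is straightforward: group binary forecasts by stated probability and compare with the observed frequency. A minimal sketch, using made-up forecasts from a hypothetical overconfident forecaster:

```python
from collections import defaultdict

def calibration_curve(forecasts, outcomes):
    """Map each stated probability to the observed frequency of 'yes'.
    forecasts: stated probabilities; outcomes: 1 for yes, 0 for no."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[p].append(o)
    return {p: sum(os) / len(os) for p, os in sorted(buckets.items())}

# Overconfident: when they say 90%, the event happens 80% of the time;
# when they say 60%, it happens only 50% of the time.
forecasts = [0.9] * 10 + [0.6] * 10
outcomes  = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5
print(calibration_curve(forecasts, outcomes))
# {0.6: 0.5, 0.9: 0.8} -- both points fall below the 45-degree diagonal
```

In practice forecasts are binned into ranges (e.g. 80–90%) rather than grouped by exact value, since real forecasters rarely repeat the same probability.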
While the UMass survey has 38 expert forecasts that have resolved, those forecasts were only on 2 questions — so we can’t draw any meaningful conclusions on expert calibration yet. It is possible that as more questions are resolved, we will find that indeed the experts are well-calibrated. In fact, when the UMass survey was repeated closer to the March 29th deadline and the previous forecasts had been revealed to be too low, the experts updated their forecasts in favour of a higher case count with a wider range.
However, even if experts prove to be well-calibrated, critics might legitimately point out that it is possible to be perfectly calibrated while still knowing nothing useful.
Imagine forecasting whether it will rain in London on a given day. Furthermore, let’s assume that it rains in London 50% of the days each year. Without ever looking at a radar, I can forecast 50% every single day and I will be perfectly calibrated, i.e. when I forecast 50% it happens 50% of the time. But I haven’t offered anything useful, because being properly calibrated is only part of what it takes to be a good forecaster.
We also want our experts to have good discrimination, meaning they can distinguish days when rain is more likely from days when it is less likely. A weather forecaster who always says either a 40% or 60% chance of rain AND is properly calibrated is both more accurate and has better discrimination than someone who always says 50%. A weather forecast with better discrimination matters a lot when you are planning a picnic. One way to measure accuracy is to assign Brier scores to forecasts once the outcome is known.
Mathematically, the Brier score for a yes/no question is twice the squared error of the forecast probability. The best Brier score is 0 and the worst is 2, while a 50% forecast always results in a Brier score of 0.5. Importantly, the Brier score is a “proper” scoring rule, so a forecaster minimizes their expected score by forecasting their true belief. Lower is better, like golf. In fact, meteorologists were among the first to use this scoring technique and as a result were able to improve their accuracy (despite their reputation).
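The two-sided (0-to-2) Brier score described above can be written in a few lines:

```python
def brier(forecast_p, outcome):
    """Two-sided Brier score for a yes/no question (0 best, 2 worst).
    forecast_p: stated probability of 'yes'; outcome: 1 if yes, else 0.
    Sums the squared error over both possible answers, which equals
    twice the squared error of the 'yes' probability alone."""
    return (forecast_p - outcome) ** 2 + ((1 - forecast_p) - (1 - outcome)) ** 2

print(brier(1.0, 1))  # 0.0  (perfect forecast)
print(brier(0.0, 1))  # 2.0  (worst possible forecast)
print(brier(0.5, 1))  # 0.5  (hedging at 50% always scores 0.5)
```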
So, once the outcomes to COVID-19 questions are known and we can assign Brier scores to the experts’ forecasts, then we’ll be able to measure their accuracy, right? Well, it is still tricky — a Brier score without a benchmark doesn’t necessarily tell us much about the ability of the experts in question. For instance, a meteorologist forecasting in the Sahara desert might forecast a 3% chance of rain every day and receive an impressively low Brier score — but given that it rains very infrequently in that part of the world, we shouldn’t be too confident in their meteorological prowess since the score varies by the ease of the question, not just the skill of the forecaster.
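One common way to correct for question difficulty is a skill score: compare a forecaster’s average Brier score to that of a naive baseline that always forecasts the base rate. This sketch applies that idea to a made-up version of the desert example (the numbers are illustrative, not real data):

```python
def brier(p, o):
    """Two-sided Brier score for a single yes/no forecast."""
    return (p - o) ** 2 + ((1 - p) - (1 - o)) ** 2

def skill_score(forecasts, outcomes, baseline_p):
    """Brier skill score vs. a constant base-rate forecast.
    Positive beats the baseline; zero matches it; negative is worse."""
    n = len(outcomes)
    bs = sum(brier(p, o) for p, o in zip(forecasts, outcomes)) / n
    bs_ref = sum(brier(baseline_p, o) for o in outcomes) / n
    return 1 - bs / bs_ref

# Hypothetical desert: rain on 3 of 100 days (3% base rate).
outcomes = [1] * 3 + [0] * 97

# Always forecasting the base rate earns a low Brier score but zero skill.
print(skill_score([0.03] * 100, outcomes, 0.03))       # 0.0

# A forecaster whose probabilities rise on the rainy days shows real skill.
skilled = [0.5] * 3 + [0.01] * 97
print(round(skill_score(skilled, outcomes, 0.03), 2))  # 0.74
```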
To determine the skill of individual experts, we could compare their Brier scores against one another to determine who is the most skilled. However, if we are trying to assess the skill of the experts overall, we need a benchmark to compare them against. That is exactly what we are doing at Maby by publicly forecasting (on Twitter) the same questions they are. Anyone (including the critics) can join us in competing against the experts on the Maby website. Just as we think that the experts who keep score should be listened to over those who don’t, the same goes for the critics.
As the question outcomes become known, we will be able to measure who the best forecasters are on our Brier score leaderboard. Everyone will get to measure their calibration, which does not depend on your knowledge of a subject, but rather on your knowledge of your knowledge. You can practice being calibrated in any domain; you just need to be appropriately humble if you know little.
We don’t know if we’ll be able to beat the experts forecasting COVID-19. There is no question that the experts know more about their domain than we do. Our primary goal is to help people think more clearly about how to judge expert forecasts and to show how forecasting might improve decisions in a wide range of fields. We admit that it is a bit terrifying to be publicly forecasting, knowing that we will be held accountable. We forecast that we will look silly more than a couple times, but we also believe that this feedback will help us improve.
Interestingly, research has shown that forecasting is a skill that can be distinct from subject matter expertise. In fact, we — the authors of this essay — have been practicing forecasting for over 10 years between us. It turns out we’ve done pretty well answering questions across a wide range of areas where we are not experts. While we are clearly at a disadvantage in our knowledge of epidemics, we hope our forecasting ability gives us a fighting chance. We’ll find out the only way it’s possible to find out — by keeping score.