What’s a forecast “skill score”?

george s. · Published in high stakes design · Apr 8, 2020

How we sorted forecasts in the context of Viziflu 2

Infectious disease forecasting is an emerging analytical capability. When coupled with data visualization, it has helped support public health decision-makers during recent Ebola, Zika, and seasonal influenza outbreaks. As the epidemiological community embraces new forecasting methods — and concomitantly, new forecast visualization tools/uncertainty visualization methods — visualizing accuracy-related data in this setting provides important context for forecast consumers. However, in order to do that, we must first take a deeper dive into how forecast accuracy is measured.

This blog post explores one particular approach (forecast skill) that has demonstrated value in the past and that IQT Labs used to sort forecasts in the context of Viziflu 2. (Note that this is only one of several ways to assess the analytical track record of forecasting models.) Anyone interested in learning more about related metrics should carefully review the biostatistics literature cited throughout this post. Since this post coincides with the COVID-19 pandemic, we reiterate the admonition from the Data Visualization Society to #vizresponsibly.

Above: the newly-released Viziflu 2 interface showing forecast accuracy-related data

Since overreliance on low-performing forecast models “can have negative consequences, including the loss of credibility, wasted and misdirected resources, and, in the worst case, increases in morbidity or mortality” (Ray et al., 2017), understanding forecast accuracy is essential for operational decision making. In addition, the lack of best practices for communicating uncertainty creates a risky tradeoff: if forecast uncertainty is not represented visually, it may be overlooked; if it is visualized ineffectively, it may confuse or frustrate users.

In addition, since “[t]he list of organizations that produce or buy forecasts without bothering to check for accuracy is astonishing” (Tetlock and Gardner 2015), the question of how to visualize forecast accuracy over time is fundamental.

Above: Comparing precision and accuracy. (a) is neither precise nor accurate, (b) is precise and accurate, (c) is precise but inaccurate. Source: CK-12 Foundation, Wikimedia Commons license.

Most people are familiar with error-based accuracy assessments and forecast performance measurements that take the difference between a predicted value and an observed value. However, many infectious disease forecasters (and weather forecasters, among other groups) use an alternative approach: skill scores. The reason is that error-based accuracy assessments fail to consider the confidence of the forecast, that is, its distributional sharpness. Thus, when evaluating forecasts of categorical events with mutually exclusive outcomes, error-based metrics “face a challenge, in that the forecasts take the form of probability distributions whereas the observations are real-valued” (Gneiting, Balabdaoui, and Raftery, 2007).

In contrast, forecast skill scores — which reflect how much confidence a probabilistic forecast assigned to the eventually observed outcome — enable us to compare forecasts’ underlying probability distributions. They also reveal which forecasts correctly predicted a given outcome with high confidence and, conversely, which underestimated it, with potentially negative consequences.
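
In concrete terms, the skill score for a forecast of mutually exclusive outcomes boils down to the probability the forecast assigned to the outcome that actually occurred, typically reported on a log scale. The snippet below is a minimal Python sketch of that idea; the function name and data layout are illustrative and are not Viziflu’s or the CDC’s actual scoring code.

```python
import math

def log_skill_score(forecast: dict[str, float], observed: str) -> float:
    """Return the natural log of the probability the forecast assigned
    to the outcome that actually occurred. Scores are <= 0, and values
    closer to zero indicate a better (more confident, correct) forecast."""
    return math.log(forecast[observed])
```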

To give a concrete example of this, Carnegie Mellon University’s Roni Rosenfeld suggests the following thought experiment:

[Imagine you are] predicting the dominant Influenza H3N2 antigenic strain in the coming season (which is important for deciding on vaccine formulation for that season). Assume that three different models (M1, M2, M3) participate in the competition, each providing its respective estimate of the probabilities of each of the three outcomes.

Above: Table adapted from the Rosenfeld et al. 2012 position paper

If asked to predict which strain will dominate, all three models will choose A/California/07/2004, since this is the most likely outcome according to all of them. Note that, whether or not this turns out to be correct, nothing will be learned about the relative accuracy of the models. Instead, [the competition organizers] ask the models to provide their complete probability distribution estimate (the column of three numbers above, which always must sum to 100%). After the actual outcome becomes known, [the organizers can] compare the models on the log likelihood they assign to the event that actually happened.

Prof. Rosenfeld then goes on to explain why this approach to comparison is useful for public health purposes. If, for instance, A/California emerges as the dominant H3N2 strain, then M1 (log(0.9)) outperformed M3 (log(0.7)), which, in turn, outperformed M2 (log(0.55)).

If, however, the dominant strain observed among patients ends up being neither A/California nor A/Korea, then we perform the same calculation focusing on the forecasted probabilities for “None of the above.” In this latter scenario, “models M1 and M2 will be scored the same (log(0.05)), but M3 will be assigned a significantly worse score (log(0.01)).” Finally, Rosenfeld adds that this hypothetical score is “appropriate, because M3 significantly (and potentially dangerously) under-estimated” the outcome.
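
For readers who want to see the arithmetic, the short sketch below reproduces both scenarios using only the probabilities quoted above (the dictionary layout is illustrative, not Viziflu’s input format):

```python
import math

# Probabilities quoted in the thought experiment above.
forecasts = {
    "M1": {"A/California/07/2004": 0.90, "None of the above": 0.05},
    "M2": {"A/California/07/2004": 0.55, "None of the above": 0.05},
    "M3": {"A/California/07/2004": 0.70, "None of the above": 0.01},
}

for observed in ("A/California/07/2004", "None of the above"):
    print(f"Observed outcome: {observed}")
    scores = {model: math.log(probs[observed]) for model, probs in forecasts.items()}
    # Rank models from best (closest to zero) to worst.
    for model in sorted(scores, key=scores.get, reverse=True):
        print(f"  {model}: log({forecasts[model][observed]}) = {scores[model]:.2f}")
```

As in Rosenfeld’s walkthrough, M1 comes out ahead when A/California dominates, and M3 drops to the bottom when neither named strain dominates, because it placed only 1% of its probability on that possibility.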

Thus, even though they are more complex than error metrics such as root mean squared error, forecast skill scores have a distinct advantage: they allow analysts to evaluate and compare probabilistic forecasts on the basis of confidence. For this reason, “[f]orecasts with confidence measures provide the public health decision-maker a more useful product” (Lutz et al., 2019), hence our decision to incorporate skill scores into Viziflu 2.

It remains to be seen if, or how, forecasting techniques, evaluation metrics, and uncertainty visualization approaches developed for seasonal influenza might inform forecasting models for COVID-19. However, IQT Labs and B.Next are honored to be involved in monitoring and supporting response efforts with advanced analytics and insights, and we encourage readers to continue doing their part to flatten the curve by staying indoors and washing hands.

We reiterate that the forecasting models presented in Viziflu are not official CDC forecasts and are not endorsed by either CDC or IQT Labs.

To learn more about IQT Labs and B.Next’s uncertainty visualization research, please explore the following related blog posts:

  1. Bioviz Under the Microscope (Part 1) (high-stakes design, Feb. 2020)
  2. Bioviz Under the Microscope (Part 2) (high-stakes design, Feb. 2020)
  3. Visualizing flu forecast data with Viziflu (high-stakes design, Dec. 2019)
  4. Viziflu — Discussion on the collaborative effort between IQT Labs and the CDC Influenza Division (BioQuest, Mar. 2019)
  5. The Walk of Life (BioQuest, Nov. 2018)

👉 Alternatively, visit high-stakes-design to see other dataviz-related posts from IQT Labs.
