How good is your forecasting? Unpacking metrics to evaluate true business impact of your models

Emilio Lapiello
GAMMA — Part of BCG X
7 min read · Jun 14, 2022


Assessing machine learning models with granular metrics can boost business impact.

by Emilio Lapiello, Mikel Arizaleta, Andra Fehmiu, Alessandro Scaglia, Wenting (Tina) Hou

There are multiple machine learning metrics data scientists use to assess the accuracy of forecasting models. No single metric is considered ideal. Most, including MAPE (Mean Absolute Percentage Error), WMAPE (Weighted Mean Absolute Percentage Error), and R², share a common approach: using the notion of an “average” to provide feedback on model performance. MAPE, for example, describes the average absolute percentage error a model makes when comparing its predictions to actual values in a back-testing effort.
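
For reference, here is a minimal sketch of how MAPE and WMAPE are typically computed, using made-up back-test values (the numbers and function names are ours, purely for illustration):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error: average of |error| / actual."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual)

def wmape(actual, predicted):
    """Weighted MAPE: total absolute error divided by total actual volume."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sum(np.abs(actual - predicted)) / np.sum(actual)

# Hypothetical back-test values
actual = [120, 80, 45, 2]
predicted = [110, 95, 40, 1]
print(f"MAPE:  {mape(actual, predicted):.1%}")   # average of the per-item percentage errors
print(f"WMAPE: {wmape(actual, predicted):.1%}")  # errors weighted by actual volume
```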

Aggregated “on-average” metrics such as MAPE help data scientists compare model performance against a benchmark. But these metrics often lack the detail needed to understand whether the model is providing predictions that are useful from a business perspective. Model predictions are much more valuable when they enhance business knowledge and help users understand how to solve the problem at hand.

The value of model predictions depends strictly on the nature of the sought-after outcome. Say you are a restaurant chain manager and want to understand hourly customer demand to optimize staffing at each of your restaurants. To have business value, your model must be accurate during demand peaks and troughs throughout the day, rather than on average. Knowing that your model’s MAPE is low, for instance, would be meaningless if you can’t determine whether the error is concentrated at high or low demand hours.

Similarly, say you sell canned soups in different flavors and want to forecast the number of boxes of each product your sales representative should be able to sell to any given store in a month. The challenge is that a typical store buys very few of some soup flavors (perhaps one or two boxes), but dozens of boxes of others. In this situation, an aggregated metric can be highly skewed by the percentage errors on the less-popular products, even though those sales have little impact on your business’s overall performance. If, for example, the model predicted that a store would buy two boxes of a product when, in fact, it bought only one, the metric would register a 100% error for that item.
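
A toy calculation, with hypothetical numbers, shows how a single low-volume item can dominate the average while barely moving the volume-weighted view:

```python
import numpy as np

# Hypothetical monthly sales at one store: three popular flavors and one slow seller
actual    = np.array([60.0, 45.0, 38.0, 1.0])   # boxes actually sold
predicted = np.array([57.0, 47.0, 36.0, 2.0])   # model forecasts

pct_errors = np.abs(actual - predicted) / actual
print(pct_errors.round(2))                       # [0.05 0.04 0.05 1.  ]
print(f"MAPE: {pct_errors.mean():.0%}")          # ~29%, dominated by the one-box miss
print(f"Volume-weighted error: "
      f"{np.abs(actual - predicted).sum() / actual.sum():.0%}")   # ~6%
```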

Clearly there is a business need to evaluate forecasting models beyond aggregated mathematical metrics and consider the real business impact forecasting errors make.

Introducing F-MAPE, a new way to evaluate the business impact of forecasting-model errors

To overcome the limitations of the MAPE approach, we have developed a new way to assess model forecasting accuracy, one that we believe helps solve some of the issues that surround aggregated metrics. We call our approach Factorized-MAPE (or F-MAPE) because it unpacks or “factorizes” the summation components of MAPE. In doing so, it provides feedback on model performance that is actionable from both a business and a data science perspective and presents a clearer view of the errors introduced by using the forecasting model.

To demonstrate the F-MAPE approach, we will use the canned soup sales forecast example referred to above. Assume we have built a model that predicts monthly sales volume for each flavor at each store. Our goal, using a back-testing exercise, is to compare the model’s predicted sales volume to the actual sales volume. In doing so, we will focus on two numbers:

1. The error (with sign) of our sales volume prediction for each product-store combination (i.e. actual volume minus predicted volume) in a specific month

2. The actual volume sold for the product-store combination in the same month

For the sake of argument, we will stipulate that the error spans from -147 to 147, and that actual sales span from 3 to 740. We can bin both dimensions and use those bins to create a table. In each cell of this table, we show the percentage of model predictions that fall into the corresponding ranges of both error and actual volume sold.

Figure 1: Percentage of predictions by volume prediction error and actual volume sold

For example, the highlighted cell shows that 1.2% of all our predictions are for products that sold between 10 and 20 boxes and have a prediction error between 1 and 2 boxes.
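
As a rough illustration, a table like this could be assembled in a few lines of pandas. The data, bin edges, and column names below are hypothetical; in practice they would come from the actual back-test:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical back-test: one row per product-store combination in a given month
n = 5_000
actual = rng.integers(3, 741, size=n).astype(float)      # boxes actually sold
predicted = actual * rng.uniform(0.8, 1.2, size=n)       # stand-in forecasts
df = pd.DataFrame({"actual": actual, "error": actual - predicted})

# Bin both dimensions (edges are illustrative, not the article's exact ones)
error_bins  = [-150, -20, -10, -5, -2, -1, 0, 1, 2, 5, 10, 20, 150]
volume_bins = [3, 5, 10, 20, 50, 100, 200, 400, 740]
df["error_bin"]  = pd.cut(df["error"], bins=error_bins)
df["volume_bin"] = pd.cut(df["actual"], bins=volume_bins, include_lowest=True)

# Percentage of all predictions falling into each (volume, error) cell, as in Figure 1
fmape_table = pd.crosstab(df["volume_bin"], df["error_bin"], normalize="all") * 100
print(fmape_table.round(1))
```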

From this table we can also derive a maximum percentage error for each cell by dividing the upper bound of its error range by the lower bound of its actual-volume range. For example, the highlighted cell only contains predictions with a maximum error of 20% (i.e., 2/10).

Figure 2: Maximum prediction error by cell
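
A sketch of this worst-case calculation, using illustrative bin edges (not the article's exact ones): the worst case for a cell is its largest possible absolute error divided by its smallest possible actual volume.

```python
import pandas as pd

error_edges  = [0, 1, 2, 5, 10, 20]      # edges of the absolute-error bins
volume_edges = [3, 5, 10, 20, 50, 100]   # edges of the actual-volume bins

max_pct_error = pd.DataFrame(
    [[err_hi / vol_lo for err_hi in error_edges[1:]] for vol_lo in volume_edges[:-1]],
    index=[f"{lo}-{hi} boxes" for lo, hi in zip(volume_edges[:-1], volume_edges[1:])],
    columns=[f"{lo}-{hi} box error" for lo, hi in zip(error_edges[:-1], error_edges[1:])],
)
print((max_pct_error * 100).round(0))    # e.g. the "10-20 boxes" / "1-2 box error" cell is 20%
```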

How is our model performing?

If we set a model error tolerance threshold of, say, +/- 20%, then we can tell whether the prediction errors in a cell are tolerable or not: whether the cell contains, in effect, “good” or “bad” predictions. In the following table we have highlighted in green the cells that contain good (i.e., tolerable) predictions:

Figure 3: Percentage of predictions with tolerable error (green) vs not (white)

Moreover, by summing the percentages in all such “good” cells, we get a measure of model “accuracy” that is defined strictly by the business-set error tolerance; in this case it is 73.2%.
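
Given the two tables above (share of predictions per cell and worst-case error per cell), the summary accuracy is simply the sum of the shares in the cells that pass the tolerance check. A minimal sketch with hypothetical per-cell values:

```python
import numpy as np

# Hypothetical per-cell values, laid out as volume rows x error columns
pct_of_predictions = np.array([[ 4.0,  1.5, 0.5],
                               [ 6.0,  1.2, 0.8],
                               [55.0, 25.0, 6.0]])   # shares sum to 100%
max_pct_error      = np.array([[0.33, 0.67, 1.67],
                               [0.10, 0.20, 0.50],
                               [0.05, 0.10, 0.25]])  # worst-case error per cell

tolerance = 0.20
good_cells = max_pct_error <= tolerance              # the "green" cells
accuracy = pct_of_predictions[good_cells].sum()
print(f"Share of predictions within tolerance: {accuracy:.1f}%")   # 87.2% in this toy example
```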

Notice that:

  • Businesses can set different tolerance thresholds according to their needs
  • Tolerance thresholds need not be symmetric for over- and under-prediction errors
  • Bin sizes impact the model performance accuracy summary metric described above, because we conservatively assume that all predictions in a cell have the maximum possible error
  • Bin ranges can be chosen according to the required analysis resolution and detail needed
  • It is possible to calculate an accuracy summary metric by using each individual prediction's percentage error and counting how many predictions fall within our tolerance thresholds (see the sketch after this list)
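
A minimal sketch of this per-prediction variant, on hypothetical back-test data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical back-test: actual and predicted boxes per product-store combination
actual = rng.integers(3, 741, size=5_000).astype(float)
predicted = actual * rng.uniform(0.7, 1.3, size=actual.size)

pct_error = (actual - predicted) / actual          # signed percentage error
tolerance = 0.20
within = np.abs(pct_error) <= tolerance
print(f"Predictions within +/-{tolerance:.0%}: {within.mean():.1%}")
```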

In the product-store volume example, we decided not to use predictions for products that had only recently been introduced at specific stores, identified as those selling fewer than 3 boxes. We could also decide that our error tolerance is 20% for over-prediction, but only 10% for under-prediction, so as not to encourage a loss in sales volume. With this business input, we can determine that 68.2% of the time our model predictions are within the business-set margin of error.

Figure 4: Percentage of predictions with tolerable error, asymmetric tolerance thresholds
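
The same per-prediction check can incorporate both business rules from the example above: dropping newly introduced products (fewer than 3 boxes sold) and applying asymmetric thresholds. The data below are hypothetical; the rules and thresholds are the ones stated in the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical back-test data
actual = rng.integers(1, 741, size=5_000).astype(float)
predicted = actual * rng.uniform(0.7, 1.3, size=actual.size)

# Exclude newly introduced products, identified as those selling fewer than 3 boxes
keep = actual >= 3
actual, predicted = actual[keep], predicted[keep]

# Asymmetric tolerance: up to 20% over-prediction, but only 10% under-prediction
pct_error = (actual - predicted) / actual          # positive means under-prediction
within = (pct_error >= -0.20) & (pct_error <= 0.10)
print(f"Predictions within the asymmetric tolerance: {within.mean():.1%}")
```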

The Considerable Benefits of F-MAPE

F-MAPE unpacks the “model accuracy black-box”, which leads to specific actions that both business and data science stakeholders can take. We identified more such benefits while designing and implementing the F-MAPE approach.

First, F-MAPE gives business stakeholders the ability to effectively compare the percentage of good-versus-bad predictions based on those factors that matter most to them. If, for example, the business is willing to accept overpredicting but not underpredicting, then F-MAPE tolerance thresholds can be set to account for that.

Stakeholders can also use F-MAPE to ascertain for which products and store locations the predictions are good or bad. It is possible, for instance, to use the F-MAPE matrix to extract the bad predictions and determine whether they tend to happen for specific products, stores, or any other relevant feature.

Stakeholders can exclude from the model accuracy calculation predictions that, from their business perspective, are not important. If, for example, a model prediction errs on products that sell fewer than 5 boxes and low-sales products are not relevant, the business can simply exclude those predictions from its model accuracy assessment and get a better view of the actual value of the model's predictions.

From a data science perspective, F-MAPE makes it possible to identify potential model bias in predictions. For example, the metric enables data scientists to quickly check whether the right-hand side of the F-MAPE table (with respect to 0 error) includes more predictions than the left-hand side. Data scientists can further quantify this skewness by row and investigate potential trends in model bias.
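
One simple way to quantify that skew, sketched on hypothetical back-test data, is to compute the share of under-predicted (positive-error) cases per volume row of the table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical back-test with a built-in bias: the model tends to under-predict
actual = rng.integers(3, 741, size=5_000).astype(float)
predicted = actual * rng.uniform(0.65, 1.25, size=actual.size)

df = pd.DataFrame({"actual": actual, "error": actual - predicted})
df["volume_bin"] = pd.cut(df["actual"], bins=[3, 10, 50, 200, 740], include_lowest=True)

# Share of positive-error (under-predicted) cases per volume row of the F-MAPE table
skew_by_row = df.groupby("volume_bin", observed=True)["error"].apply(lambda e: (e > 0).mean())
print(skew_by_row.round(2))   # values well above 0.5 suggest systematic under-prediction
```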

F-MAPE also enables data scientists to focus further research and model development on improving the model in those areas where it is not performing well, rather than focusing on improving its performance on average. If the data science team detects that the model is failing in specific areas of the matrix, they can then flag those areas and build a descriptive model to automatically profile instances in which the model fails, and focus improvements on those by, for instance, using more or better data for specific products or stores.
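
As a sketch of what that descriptive profiling could look like, the snippet below fits a shallow decision tree to a hypothetical “bad prediction” flag; the feature names and the rule generating the flag are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)

# Hypothetical product-store features plus a flag from the F-MAPE tolerance check
n = 2_000
features = pd.DataFrame({
    "avg_monthly_volume": rng.integers(3, 741, size=n),
    "months_on_shelf": rng.integers(1, 60, size=n),
    "is_promo_heavy": rng.integers(0, 2, size=n),
})
# Toy rule standing in for the real flag: newer, promo-heavy items are harder to predict
bad_prediction = ((features["months_on_shelf"] < 6) | (features["is_promo_heavy"] == 1)) & (
    rng.uniform(size=n) < 0.7
)

profiler = DecisionTreeClassifier(max_depth=3, random_state=0)
profiler.fit(features, bad_prediction)
print(export_text(profiler, feature_names=list(features.columns)))  # human-readable failure profile
```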

Finally, F-MAPE allows data scientists to isolate predictions for products or stores they know are not correctly predicted by the current model, and then build separate models for those. If the matrix reveals significant errors for products with large sales volumes, the data scientist can, for instance, create a separate model for just those products.

Getting to True Business Value

F-MAPE provides a novel and effective way to unpack aggregated forecasting metrics and improve forecasting model performance using business goals as its compass. We believe this approach of separating and binning dimensions in accuracy metrics such as MAPE can, in general, lead to a better understanding of the business benefits of predictive forecasting models and how to improve them to effectively drive business value.
