Demand Forecasting Evaluation: A Single Metric for Optimal Planning

Slava Bazaliy
GAMMA — Part of BCG X
13 min read · Mar 25, 2022
Demand forecasting always comes with uncertainty. When someone reports that a forecast is 90% accurate, what does it mean? What if another metric says that it is only 60% accurate? Do any of these accuracy numbers translate to actual business value and how? The answers to these questions exist, but they are not simple!

By Viacheslav Bazaliy, Slobodan Milovanovic, Antti Niskanen, Daniel Sack & Jan Beitner

For any business, demand forecasting is a crucial component of an end-to-end (E2E) planning process. It enables optimal decision making amid times of uncertainty, promotes efficient supply chain management, and acts as a real-time indicator of relevant market trends. Whether used for planning sales of mature products in well-known channels or of entirely new products in a pioneering market, demand forecasting adds significant depth to the decision-making process. However, as with all things that concern future events, forecasting brings uncertainty to the planning process. Planning can be optimally performed only if these uncertainties are correctly quantified — shifting the challenge towards obtaining the best prediction accuracy.

Over the past three years alone, BCG GAMMA has completed more than 40 large-scale transformations built on a foundation of improved forecasting. Cumulatively, these transformations have generated more than $10B uplift in revenue. GAMMA’s approach builds value and competitive advantage at the intersection of data science, technology, people, business expertise, processes, and ways of working. We have observed that, with recent trends toward strong digitalization and data democratization, demand forecasting accuracy has improved tremendously — and has done so across numerous industries. This is a welcome development given that the speed of market entry for products accelerates year over year. This improvement in accuracy is largely the result of advancing machine learning (ML) methods. Cutting-edge ML models can now incorporate thousands of factors, learn patterns from past data, and provide a market overview that enables improved business decision making.

When employing ML methods, data scientists tend to report their achievements and progress using out-of-sample metrics such as mean-squared error (MSE). Those with technical backgrounds can usually understand the meaning of metrics like MSE because these concepts are studied in universities and have clear probabilistic interpretations. But it can be nearly impossible for many business leaders — including those with STEM degrees — to understand model performance without clear context on the problem scale, understandable performance benchmarks, and, most importantly, a direct connection to business processes. Company executives usually have solid domain expertise and a great understanding of the business context. However, for many business leaders, assessing the technical context and making informed decisions quickly remain daunting challenges.

So, is there a single evaluation metric to optimally plan a business?

Forecasting from the Executive Perspective

Imagine that we are running a small kiosk that sells coffee, pastries, and a variety of confectionery goods. We have a friend who is a data scientist and who has graciously agreed to help us with our inventory buying decisions. She has trained three models on our past data and has now shown us the backtesting results from the past week for one of the products. From now on, we will use the nicknames No-sales, Average, and ML for the corresponding forecasts displayed in the graph below.

Actual sales and backtesting predictions from three pre-trained models

With this result in hand, we can now choose which of these three forecasting models to use. But which model should we, the executives of this small business, choose? The answer depends heavily on the product nature and business constraints of our coffee kiosk. We will go deeper into the business context of our example in the following sections. But first, we will draw on our extensive BCG GAMMA experience to help our friend the data scientist establish the technical framework for the demand forecasting assessment.

A 3-step Approach to Choosing a Forecast Model

Let’s take a step back and examine how to approach demand forecasting evaluation from a business perspective. This type of evaluation fits into the general ML model assessment framework. The framework’s goal is to construct a procedure that results in an unbiased, out-of-sample accuracy estimate. However, a few aspects complicate the demand forecasting evaluation:

1. A time dimension that imposes additional assumptions on the generating process and restricts us from randomized data splits for out-of-sample assessments of errors

2. The difference between observed demand, which is limited by factors such as stock level and sales, and actual unobserved (unconstrained) demand

3. Zero-inflated data at low granularity levels, and violation of normality assumptions for model residuals

This is an abbreviated list of complications that differentiate demand forecasting from traditional regression problems. Such complications are always present to some degree. To address them systematically, we propose a three-step approach that allows us to evaluate a predictive model from a business perspective. Note that this differs from a modelling or training assessment, which might require entirely different aggregation levels and loss metrics. Let's dive into the details of each step.

Step 1: Select aggregation level

We suggest selecting the aggregation level as the first step because this choice will influence your options for both the validation procedure and the applicable metrics. As stated above, we look at this purely in business terms. From a modelling perspective, this question can be irrelevant; hierarchical machine learning models, for example, can utilize all levels and benefit from reconciliation techniques.

Looking from this angle, the most appropriate aggregation level is naturally defined by the decisions we want to make based on the demand projections. For instance, if we allocate stock across stores, the best option is to look at store-level forecasting errors; errors at the level of the whole chain would be insufficient. One could stop the discussion at this point. Still, it turns out that the statistical properties of less aggregated levels might limit the scope of suitable metrics to sophisticated and unintuitive options. Those can be difficult to communicate to business stakeholders and can undermine the transparency of the evaluation process.

For example, at granular levels, such as daily sales of a specific product in a particular store, we often observe many days with zero sales and only a small fraction of days with positive sales. Such distributions are called zero-inflated and require specific statistical assumptions about the underlying mixture of data-generating processes. In particular, an overdispersed Poisson distribution, such as the Negative Binomial, is a good default option for modelling such data.
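As a minimal illustration of what such data looks like, the sketch below simulates zero-inflated daily sales (all numbers are assumed for illustration, not real kiosk data) and shows the overdispersion that makes a Negative Binomial a more natural fit than a normal error model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily sales of one product in one store over two years:
# most days have no sales at all (zero inflation), and the remaining days
# follow an overdispersed count distribution (Negative Binomial).
n_days = 730
sells_today = rng.random(n_days) < 0.3                        # ~30% of days see any demand
counts = 1 + rng.negative_binomial(n=2, p=0.4, size=n_days)   # positive quantities when demand exists
daily_sales = np.where(sells_today, counts, 0)

print(f"share of zero-sales days: {np.mean(daily_sales == 0):.0%}")
print(f"mean: {daily_sales.mean():.2f}  variance: {daily_sales.var():.2f}")
# Variance well above the mean signals overdispersion: a normal or plain Poisson
# error model would understate the spread that a Negative Binomial captures.
```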

Example of daily sales quantities histogram for a fashion product in one store

However, the most popular evaluation metrics, like mean squared error or R2, assume normally distributed errors. Thus, model selection with these metrics at granular levels can be suboptimal and biased. Luckily, when we aggregate data, the central limit theorem starts to play in our favor. The aleatory uncertainty that dominates at granular levels diminishes when we model a significant volume, and the distribution of aggregated demand converges to normal.
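To make the effect of aggregation concrete, here is a small sketch, again with simulated, illustrative numbers, that rolls a zero-inflated daily series up to weekly totals:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical zero-inflated daily sales for one product (illustrative numbers only).
days = pd.date_range("2021-01-04", periods=364, freq="D")
daily = pd.Series(
    np.where(rng.random(len(days)) < 0.3, 1 + rng.negative_binomial(2, 0.4, len(days)), 0),
    index=days,
    name="units",
)

# Time aggregation: roll daily sales up to weekly totals.
weekly = daily.resample("W").sum()

print(f"daily : {(daily == 0).mean():.0%} zeros, skew = {daily.skew():.2f}")
print(f"weekly: {(weekly == 0).mean():.0%} zeros, skew = {weekly.skew():.2f}")
# Each weekly total pools seven noisy daily observations, so zeros largely
# disappear and the distribution moves much closer to normal (central limit theorem).
```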

Although technically the best evaluation granularity can be derived from the target business decision, in practice we face another trade-off between complexity and a clear understanding of the evaluation process. What is best depends on the exact context. Thus, we should always carefully examine possible aggregation options for both hierarchy (such as product versus product category) and time (such as days versus weeks).

Step 2: Set-up validation procedure

It is important to differentiate between actual (observed) sales, which are limited by stock levels, and unconstrained (unobserved) demand that could be realized under perfect conditions. Stock level is the typical limiting factor for sales, but other events such as in-store operation failures or holidays can also distort the sales picture. We highly recommend that you account for this difference in your modelling and set up an unconstraining procedure for your demand forecast target.

Demand forecasting operates within a time series. We recommend that standard practices such as rolling cross-validation procedures always be applied so that you can construct unbiased, out-of-sample accuracy estimates and prevent data leakage in the evaluation. In order to obtain unbiased validation, the train-test splits have to be representative. In particular, they should account for seasonality, special days, and other relevant systematic differences between periods of time.
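One way to set this up is sketched below, using scikit-learn's TimeSeriesSplit as one possible implementation of an expanding-window split; the demand series, the calendar feature, and the number of folds are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical weekly demand history with a single calendar feature (illustrative).
dates = pd.date_range("2020-01-05", periods=104, freq="W")
y = pd.Series(np.random.default_rng(1).poisson(20, size=len(dates)), index=dates)
X = pd.DataFrame({"week_of_year": dates.isocalendar().week.to_numpy()}, index=dates)

# Rolling (expanding-window) validation: every fold trains on the past only and
# tests on the weeks that immediately follow, preventing leakage from the future.
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    train_end = dates[train_idx][-1].date()
    test_start, test_stop = dates[test_idx][0].date(), dates[test_idx][-1].date()
    print(f"fold {fold}: train up to {train_end}, test {test_start} .. {test_stop}")
    # Fit the model on X.iloc[train_idx], y.iloc[train_idx] and evaluate it on
    # X.iloc[test_idx], y.iloc[test_idx] here; collect metrics across folds.
```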

Step 3: Select evaluation metric

In general, demand forecasting is formulated as a regression problem. Evaluation metrics in regression problems can be split into bias and variation (accuracy) classes, where bias indicates signed deviation from actual values (location), and accuracy evaluates unsigned average deviation (variance of data). Note that this split of metrics is not based on the bias-variance trade-off concept.

In business applications, the selection of evaluation metrics also involves a trade-off between interpretability and statistical rigor. Percentage-based metrics might be more intuitive to interpret, but the actual business KPIs will depend on absolute variation. A KPI chosen for interpretability over rigor might lead to suboptimal hyperparameter selection, but it might also create the transparency that accelerates business adoption of a new machine learning-based forecasting tool. As such, it is essential to have a clear understanding of the underlying probabilistic assumptions behind different KPIs.

Commonly used metrics

Let’s now examine the list of commonly used metrics for demand forecasting evaluation, focusing on point estimates. Some metrics, like MSE or MAE, originate from the log-likelihoods of corresponding probabilistic models, while others, like R2 or MAPE, are preferred for their standardized scale and more intuitive interpretation.

Example: Model selection with common metrics

Now we have all the knowledge we need to select the best forecast from the available options. We want to have a complete view when backtesting, so we suggest to our friend that she calculate the bias and three other common accuracy metrics to help us select the best forecast.

Revisited: actual sales and backtesting predictions from three pre-trained models

Let’s look at a few popular evaluation metrics for our coffee-kiosk business problem; a short sketch showing how each can be computed follows the list.

  • Bias — a metric that, in simple terms, tells us how far off the model predictions are, expressed as a percentage of the average target value.
  • SMAPE — symmetric version of the mean absolute percentage error that compares absolute error to the average between forecast and target. The latter property guarantees that the value always belongs to the 0–200% interval.
  • wMAPE — a weighted version of MAPE, where individual absolute errors are weighted by the target values. Unlike SMAPE, it does not have an upper bound.
  • R2 (coefficient of determination) — estimates the share of variation in the data explained by the model predictions. This coefficient originates from classic OLS (ordinary least squares) methods; it is at most 100%, and it can become arbitrarily negative for a model that performs worse than simply predicting the mean.
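For concreteness, here is a small sketch of how these four metrics can be computed for a single product. The exact bias and SMAPE conventions vary between teams, so treat the formulas as one common choice rather than the definitive definitions, and note that the sample numbers are purely illustrative:

```python
import numpy as np

def bias(actual, forecast):
    """Signed total deviation as a share of total actuals (one common convention)."""
    return (forecast.sum() - actual.sum()) / actual.sum()

def smape(actual, forecast):
    """Symmetric MAPE: absolute error relative to the average of forecast and actual."""
    return np.mean(2.0 * np.abs(forecast - actual) / (np.abs(forecast) + np.abs(actual)))

def wmape(actual, forecast):
    """MAPE with absolute errors weighted by the target values; no upper bound."""
    return np.abs(forecast - actual).sum() / actual.sum()

def r2(actual, forecast):
    """Coefficient of determination: share of variation explained by the forecast."""
    ss_res = ((actual - forecast) ** 2).sum()
    ss_tot = ((actual - actual.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Purely illustrative numbers, not the kiosk data from the charts above.
actual = np.array([4.0, 0.0, 6.0, 3.0, 5.0, 2.0, 1.0])
forecast = np.array([3.0, 1.0, 5.0, 4.0, 4.0, 2.0, 2.0])
for name, metric in [("bias", bias), ("SMAPE", smape), ("wMAPE", wmape), ("R2", r2)]:
    print(f"{name:>5}: {metric(actual, forecast):+.1%}")
```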

It seems that this approach makes things even more confusing. A common combination of SMAPE and bias would select the average forecast in this case, while a machine learning forecast is preferred by wMAPE. On the other hand, wMAPE values for average forecast and no-sales forecast are almost identical, so this metric alone can also be misleading.

How can we resolve this disagreement between different metrics? Let’s return from the world of math to the world of business.

Example: Adding Business Context

Now we will need more context about the product we are modelling. Let’s consider two different scenarios where sales and forecasts correspond to:

  1. Ice cream
  2. Donuts

We will assume that the price point and, thus, the average yearly revenue for both products are quite similar.

Product 1: Ice cream

From a business perspective, if our coffee kiosk has a freezer with sufficient space, we can store unsold ice cream there and won’t have to account for daily sales fluctuations. The main goal of modelling for ice cream, therefore, would be to keep the overall bias close to zero.

Let’s assume that we do daily allocations according to next-day forecasts, and that we can keep unsold products in our on-site freezer.

Ice cream scenario sales and stock allocation for Average and ML forecast

Note: We intentionally used unconstrained demand, which is different from actual sales. Sales on the last day in our example were zero, but in our scenario, this was the result of a stockout.

According to the average forecast model, total sales for the allocation strategy are 15 items, while total sales for the machine learning-based strategy are only 12. Assuming a $3.00 gross margin per item, that is $45 versus $36 in gross margin, so our kiosk business would get a 25% uplift from using the average forecast compared to the machine learning forecast.

Product 2: Donut

Unlike ice cream, donuts should be sold while fresh, with unsold products thrown away at each day’s end. In our kiosk, we do not make donuts ourselves. Therefore, we would pay more attention to daily dynamics in this scenario since overstocking donuts would significantly decrease our profits given the high cost of goods sold (COGS).

Let’s assume that we do the same daily allocation according to the forecast, with the caveat that we must scrap the unsold products at the end of the day.

Donut scenario sales and stock allocation for Average and ML forecast

In this scenario, we sell the same number of items under both the average and the ML forecast allocations. However, the allocation based on the average forecast bought more items, which, being unsold, then had to be thrown away. For this product, the machine learning allocation was more accurate, with an overall gross margin 24% higher than that resulting from the average forecast allocation.

Financial results from backtesting
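To make the two allocation rules behind these results concrete, here is a minimal backtest sketch. The demand and forecast series are placeholders for your own data, and the margin and scrap-cost figures are assumptions for illustration rather than the numbers behind the charts above:

```python
import numpy as np

def backtest_allocation(demand, forecast, margin, scrap_cost, perishable):
    """Daily allocation: top stock up to tomorrow's forecast, then sell min(stock, demand).

    perishable=False -> unsold units carry over in stock (ice cream in the freezer)
    perishable=True  -> unsold units are scrapped at the end of every day (donuts)
    """
    stock, profit = 0.0, 0.0
    for f, d in zip(forecast, demand):
        order = max(f - stock, 0.0)        # order only what the forecast says is missing
        stock += order
        sold = min(stock, d)
        stock -= sold
        profit += sold * margin
        if perishable:
            profit -= stock * scrap_cost   # unsold perishables are thrown away
            stock = 0.0
    return profit

# Placeholder series: replace these with backtested demand and model forecasts.
demand = np.array([3.0, 5.0, 0.0, 4.0, 6.0, 2.0, 4.0])
forecast = np.array([4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0])

print("ice cream (carry-over):",
      backtest_allocation(demand, forecast, margin=3.0, scrap_cost=0.0, perishable=False))
print("donuts (scrap unsold) :",
      backtest_allocation(demand, forecast, margin=3.0, scrap_cost=1.5, perishable=True))
```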

Combining metrics: aggregate and conquer

Generalized, clear, and simple KPIs are crucial ingredients for making educated decisions. At the same time, the discussion above illustrates that demand forecasting requires tedious business case analysis and the selection of tailored metrics. However, we can (and sometimes have to) remove one dimension of complexity for practical reasons. Namely, for granular demand forecasts, we cannot evaluate all the individual category metrics, so we have to combine them into one or several KPIs that we can keep track of. In our toy kiosk example, this would correspond to combining evaluation metrics across the products we are selling: ice cream and donuts.

The simplest option for aggregating metrics, and the one that dominates in practice, is to take the average. This robust and easy-to-explain aggregation provides good insight into performance, but it can often be misleading about the actual incremental value of the model. For example, consider forecasting for two products, one of which has zero sales over the evaluation period. If the model predicts zero sales for both products, the resulting mean metrics might look sensible while the underlying forecast is practically useless.

The previous example illustrates that treating different metrics equally in aggregation can be deceptive. In reality, demand forecasts for some categories are more important than others for various reasons. This can be driven purely by business objectives or by physical constraints such as storage volume for large items. Clearly, storing additional bubble gum on the shelf is easier than finding space for another 5L bottle of milk. To tackle this aspect, it is common to use weighted metrics. They are still easy to grasp and, in practice, provide a fairer estimate of business performance. As an example, the weighted root mean squared scaled error was used for evaluation in the well-known M5 demand forecasting competition on Kaggle, built on Walmart data. The weighting by recent sales was chosen to select "the best performing forecasting methods to drive lower forecasting errors for the series that are more valuable to the company". A weighted combination of metrics is therefore probably the best universal approach to aggregation in general. But can we do anything better by considering the business context, as we did in our previous examples?
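Before turning to that question, here is a quick sketch of revenue-weighted aggregation in the spirit of the M5 weighting; the per-product wMAPE values and revenues are made-up, illustrative numbers:

```python
import pandas as pd

# Hypothetical per-product accuracy and recent revenue (illustrative numbers only).
results = pd.DataFrame(
    {"wmape": [0.00, 0.35, 0.18, 0.42], "revenue": [0.0, 120.0, 900.0, 60.0]},
    index=["discontinued gum", "5L milk", "coffee", "seasonal cake"],
)

simple_avg = results["wmape"].mean()
weights = results["revenue"] / results["revenue"].sum()
weighted_avg = (results["wmape"] * weights).sum()

print(f"simple average wMAPE  : {simple_avg:.1%}")    # flattered by the zero-sales item
print(f"revenue-weighted wMAPE: {weighted_avg:.1%}")  # reflects where the money actually is
```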

Let us again consider the problem of daily stock allocation. Once we have the trained model in hand, we face the following decision for each product: how many items should be delivered tomorrow given the remaining stock level and the forecast for tomorrow?

Despite forecast uncertainty, the optimal business decision remains the same within a certain range of outcomes

The next day's realized sales, as well as the end-of-day stock level, can vary a lot due to imperfect forecasts and the aleatory uncertainty of the sales process. Understocking would mean unfulfilled demand and lost sales, while overstocking might cause problems with storing the remaining items. However, there is a range of outcomes for which our chosen delivery amount remains optimal despite the forecasting error. Therefore, for a given product, we do not need a perfect demand forecast, but one that provides predictions accurate enough to keep stock within a certain predefined limit. For stationary processes, this requirement translates into a threshold on the appropriate accuracy metric, e.g., an MSE limit in the normal-likelihood case. In this scenario, the best model is the one that provides the desired accuracy for all products, and the corresponding KPI is the percentage of products for which the accuracy metric stays below the predefined threshold.
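A minimal sketch of this KPI is shown below; the per-product accuracy values and the threshold are chosen purely for illustration, whereas in practice the threshold would be derived from the stock decision as described above:

```python
import numpy as np

# Hypothetical backtested wMAPE per product and the accuracy limit below which
# the daily allocation decision would not change (both purely illustrative).
product_wmape = np.array([0.12, 0.28, 0.45, 0.08, 0.33, 0.19])
threshold = 0.30

share_within = (product_wmape <= threshold).mean()   # share of products accurate enough for planning
print(f"{share_within:.0%} of products meet the required forecast accuracy")
```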

Once again, we have demonstrated that a tailored analysis of the business decision can lead to a better evaluation of model performance. Note that while this method is specifically designed for our context, the possibilities for tailoring are limitless. Careful examination of the business context will always be rewarded by the value gained from correct decisions.

At BCG GAMMA, we use this value-centric approach in PLAN AI, an end-to-end planning solution. PLAN AI focuses on key planning decisions, brings internal and external data sources together to enable better decision making, and orchestrates not only different accuracy metrics but also different forecasts into a single source of truth.

Learn more about PLAN AI, and feel welcome to reach out to the team by e-mail at PlanAI@bcg.com!

Conclusion

As demonstrated by the examples, demand forecasting evaluation is rarely a straightforward matter, even for simple businesses. Like many other data science applications, such evaluations require a combination of strong modelling skills and sound business acumen. Furthermore, real-world applications come with a variety of products and business constraints which make it extremely hard, if not impossible, to arrive at a perfect metric with a closed-form expression. The aggregation of metrics across groups in the hierarchy is also a challenging problem that lacks a perfect generic solution. Hence, the absence of a well-defined target function for AI engines leaves no room for silver-bullet solutions that would solve the generic demand forecasting problem for all kinds of businesses at once. It is only the combination of domain expertise and data science methods, integrated into the business processes, that can enable businesses to unlock the full value of machine learning-driven demand forecasting.
