How did we predict sales for products with almost no historical data (launches)

Kasra Mansouri
Artefact Engineering and Data Science
10 min readNov 25, 2021


The path to developing a high-performance demand forecasting model - Part 4

Photo by Markus Winkler


Until now we have mainly talked about forecasting regular products that have been on the shelf for quite some time. But what about products that have been very recently launched ?

In this article we will show you how we tackled this issue by proposing an alternative solution to our boosting model based on product look-alikes.
It consists of training a clustering model on old innovations to detect sales pattern similarities, then mapping that to product characteristics with a classification model and then inferring the sales of new innovation with simple statistical methods based on the outputs of model 1 and 2.

Why is forecasting launched products so hard ?

A bit of context

Product launches are very common in retail. It is in CPG companies’ nature to diversify their products. They want to appeal to their customers and adapt their products to behavioural changes. And the company we did the project for was no exception, they nearly release 50–60 innovations (new products) every year.

We define an innovation as being rather a new brand of product being launched or a new product with a new flavour being introduced to an already existing brand. After a certain time (for instance, in our case, 2 months) an innovation becomes a regular product.

There are other cases where a new product code can be generated like

  • Slight change in weight of an existing product
  • Slight change in the formula of an existing product
  • Change of factory producing the product

We do not consider these cases as innovations but renovations and therefore they are out of our scope. We had access to product launches with sales data from 2017 to 2019, so roughly 150 innovation products.

The main challenge

Our job was to make weekly forecasts for the 14 weeks ahead at warehouse/daily/product level.

The first time the model had to be applied was 7 days after its launch, meaning it would only have 7 days of historical data and had to predict sales for the coming 14 x 7 = 98 days ! The last time we had to apply the model was 60 days after the launch. This means that a launched product would be considered as a “regular product” after 2 months since it would have achieved some “stability” in its sales cycles, but of course, this will depend on your company’s activity and product nature. After this 2 months phase, the core model (model developed for regular products) would take over the innovation model and make forecasts for the given launch, this decision was validated by the business teams that had some knowledge on the sales cycles of innovations.

Very limited historical data

The amount of historical data at our disposal is critically low for making accurate forecasts. We have a training dataset which is much smaller than our prediction/test dataset and that often brings very poor forecast accuracy. We are talking about having a few weeks in training data and having to make forecasts for 98 days.

Launched products have very unstable sales at the beginning

Besides, sales data are often very messy during the first weeks of launch as trends can dramatically change. In fact retailers order huge quantities of the innovation in the first 2 weeks to fill in their shelves, they call this the fulfillment phase. After that orders drop generally on W3 and W4 and then they increase a bit.

Again, here a lot can happen : some retailers might realize that the new product is not selling very well during the first weeks in its stores and therefore drastically reduce its orders as a quick reaction. Or a product might have huge volumes of orders from some retailers that the manufacturer might not be able to fully fulfill because they did not anticipate this level of demand. The retailers will keep ordering high volumes during the following weeks to compensate for the volumes for which they were not fully delivered.

Figure 1. Sales of a random innovation during its first weeks

All the elements listed above are a clear explanation of why predicting innovation sales is hard and therefore why we had to develop a specific model for these products.

Evaluation criteria

We had the same evaluation metrics as for regular products (W+1 and W+2 Forecast Accuracies at national/weekly level). As a reminder here is the formula of a forecast accuracy:

We added another dimension to this problem. We calculated these metrics for each prediction date (the date at which we make our prediction) with regards to the launch date.

We monitored W+1 and W+2 forecast accuracies for prediction dates going from D+7 until D+49, meaning that our scorecard would look like this:

Figure 2. Our initial scorecard

The reason we added this dimension is that it is really important to demand planners to monitor the performances of their forecasts during the first weeks of launch. A model that provides good quality forecasts during the first weeks enables them to avoid big replenishment issues (that are common after the launch due to uncertainty) and ultimately succeed in their launch.

Proposed approach

Why did classical ML provide poor results ?

As said before we have a “small data” problem. The small size of our training data and its very erratic nature makes it unsuitable for a machine learning model.

Most often, boosting models require 2–3 years of historical sales data to detect general trends/seasonality effects and at least 2–3 months of Time Series data for one product to make “accurate predictions”. This was not the case for us; our XGBoost model would only see a few weeks of historical sales for each product.

We trained an XGBoost model on innovation products with the following logic

  • We trained a model at product/national/daily level. The idea was to then disaggregate the forecasts at products/daily/retailer level with some business rules.
  • We fed the model with the same features as the core model, with the exception of one additional feature which is the launch date of the product.
  • We performed the train/test based on products
  • All the products launched before 2019 were in the training set, e.g we put their sales during their 2 first months in the training set
  • All the products launched after 2019 were in the test set.
  • The model would be trained several times, each time until the day before the prediction date (D+7, D+14, etc)

And the results were not good at all ! Therefore we decided to completely change our approach for these products.

Figure 3. Our FAs with XGBoost for D+7 and D+14 (the higher the better)

Classical Time Series models did not save the day either

Since our first iteration with XGBoost was a total failure, we decided to try our luck with some good old-fashioned time series models (VARMA, SARIMAX, Prophet, etc). Time Series models have the virtue of being simpler and easier to interpret than boosting models. Also, since they are trained on one product, they can more easily adapt to abrupt trend changes.

The idea was to train one model (a Prophet model for example) per product that would make forecasts at national/daily level. Like above, the model would be trained several times too, each time until the day before the prediction date (D+7, D+14, etc) and make forecasts for the 14 week horizon.

But here, like above, the results were still not satisfactory and the reason was pretty much the same: not enough historical data ! The history was too shallow and the data was too messy for the model to understand a clear pattern and make accurate predictions.

Figure 4. Example of a prophet forecast for one product after D+30

Product clustering based on look-alikes : the Messiah

With our two first attempts at making forecasts being inconclusive, we decided to go completely in another direction.

We analyzed the sales of all innovations from 2017 to 2019 and realized that many had similar sales curves during the first months of their respective launches, this gave us the idea to make forecasts based on product similarities with old innovations. Our intuition was strengthened since we found some research papers of other fellow data scientists suggesting a similar approach (1).

Our approach consisted of 3 fundamental steps

Clustering of old innovations based on sales — Data scope : Innovations launched from 2017 to 2018 included

The idea is to cluster old innovations based on their time series curve into groups of similar products. Products being in the same cluster have similar sales patterns and therefore the sales of a single innovation can be deduced from knowing the sales of older innovations.

Our input data was the MinMax normalized cumulative sales of old innovations during their 5 first months. We looked at cumulative sales because we wanted our clustering model to get a sense of when each product would reach its regular sales pace. We normalized the cumulative sales because some products had the same curve shape but different sales volume and we wanted to put them in the same cluster as we focus here on similarity.

We clustered our products with a K-Means model with the K parameter being chosen automatically through a benchmark using the Silhouette index. The optimal number was K=4. We also tried different distance metrics between euclidean and DTW, with euclidean giving more satisfactory resultats during our tests.

Figure 5. Our clusters
Figure 6. 3D projections of our time series with UMAP. Each color represents an assigned cluster

Classification model predicting a product’s cluster based on its attributes — Data scope : Innovations launched from 2017 to 2018 included

The idea here is to find a link between the cluster the product belongs to and its attributes, like dimension, brand, flavour, etc, as well as some contextual data like launch period, media campaigns, etc. That link would be found with the help of a classification model.

The model would be fed with product attributes alongside with contextual data and it would try to predict an innovation’s cluster based on that. The intuition behind this approach is to see if we can answer the following question :

If we know an innovation’s attributes + contextual data in advance, can we predict the kind of sales behaviour this product will have after its launch ?

Since we had very limited data (roughly 100 products), we trained a simple BaggingClassifier on our products with a 75%-25% train/test split. We ended up with a F1 score of 84% on the test set.

We performed some analysis on product attributes to see if we can find some business explanation for each cluster and here are our findings:

  • Cluster 0 innovations are more expensive than others and are of bigger size. They are usually launched during the winter and start with a slow sales pace.
  • Cluster 1 and 2, which we can qualify as “regular innovations”, have similar characteristics and are usually launched during the spring. Although Cluster 1’s products are a bit more expensive therefore they take off a little slower in sales.
  • Cluster 3, regards products that are sold quickly at the beginning and then delisted. They are usually launched at the beginning of the summer. Basically these are seasonal products, like ice-creams.

Forecasting new innovations sales based on predicted clusters — Data scope : Innovations launched after 2019

Now that we are able to link an innovation’s sales to its attributes, we can apply our model to new innovations to determine their cluster and hence predict their sales.

We started very simple. For each new innovation (launched after 2019), we calculated the average sales of old innovations (launched before 2019) that were in the same cluster as the new one, during their first weeks of launch and considered that as our forecast for the new innovation.

This simplistic approach was actually a good start but had one major flaw : We were able to accurately predict the shape of the new innovation, however the scale of our prediction was not good.

We decided to scale our predictions by comparing the sales of the new innovation and the old similar ones from D+0 until the prediction date (prediction date). Since sell-out data was as insightful as sell-in data in predicting future sell-in in our case, we calculated 2 scaling coefficients, one with sell-in and one with sell-out.

Sell-in coeff = average sell-in of old similar innovations until prediction date / average sell-in of new innovation until prediction date

Sell-out coeff = average sell-out of old similar innovations until prediction date / average sell-out of new innovation until prediction date

We finally upscaled/downscaled our predictions with the mean of these 2 coefficients.

Figure 8. An overview of our approach

Results & conclusions

Here are our results on each prediction date and prediction horizon:

Figure 9. Our final scorecard

We were quite satisfied with these results as our model was able to beat demand planners’ forecast in almost every KPI until prediction date 49.


  1. New Product Forecasting with Analogous Products

I would like to thank everyone who has worked on this project, specially Sylvain Combettes, without whom this article would have never been written.

And thank you for reading that far! I would be glad to hear your feedback. If you enjoyed the read, feel free to check our open positions at Artefact 🙂