Time-Series forecast at scale: data, modeling and monitoring approaches

Vianney BRUNED
Decathlon Digital
Aug 2, 2022

Business Problem

In every Decathlon store, department managers lead various sports over the horizon of a quarter or a semester. This means they make strategic business decisions: store investments, salesman planning optimization, marketing events… A major decision is the presence of salesmen in the store. Indeed, in the digital era, customers who come to physical stores expect to find much more than just a collection of items. The presence of salesmen is key to helping customers find the products that best suit their needs. Good advice and assistance eventually improve client satisfaction and allow Decathlon to gain customer loyalty and boost sales. Moreover, salesmen present at the right time and place can directly collect feedback to enhance the overall customer experience (better product facing, identification of best sellers, and more). For instance, during the winter season, assigning salesmen to the Ski department is obviously more appropriate than assigning them to the Surf department. However, anticipating the most and least crowded departments depending on the context (period of the year, sports events, etc.) is not always as easy as it seems. This is why department managers rely on detailed turnover forecasts to help them best assign salesmen to departments. Their main goal is to avoid a “ghost” problem: salesmen waiting for customers who never appear and, conversely, customers desperately looking for a salesperson. Such situations can cause great frustration on both sides.

The situation we want to avoid: a salesman in an empty store. On the right, the turnover of a department with a strong seasonality effect.

Currently, forecasting the future turnover of a department is done manually, and it is time-consuming for department managers. This time could be better spent on tasks with higher added value, such as advising customers on the best products to buy depending on their sport needs/practice/level. Moreover, department managers only rely on the turnover history of their own store (and department) to make forecasts and, by doing so, miss the “big picture”. Gathering the sales of all products across all Decathlon stores allows us to identify patterns and similarities that could improve the forecast accuracy of every department (especially in the case of newly opened stores and/or departments). Furthermore, the Covid crisis caused major turmoil in the retail industry. Most forecasting techniques used at Decathlon relied on assumptions of stationarity of the demand and/or periodic behaviors, but recent events have made them obsolete, calling for more advanced algorithms. That’s where the Decathlon AI-Lab comes into play! We decided to design a Machine Learning-based forecasting engine with improved forecast accuracy compared to the existing solution.

The scope of the project is to predict the turnover of each store and department (in a defined area) with a weekly step and a horizon of 16 weeks. It is a classical forecasting problem with two hierarchical components:

  • geographical: stores in the same country or region have similarities,
  • business: the product tree is shared across countries.

Data Science Solution

In previous forecasting projects at Decathlon, we tested and implemented several approaches using classical forecasting algorithms (SARIMA, Prophet…), classical machine learning algorithms (tree-based), deep-learning models, and no-code approaches (Amazon Forecast, DataRobot…). The main issue with classical forecasting algorithms is handling the hierarchical aspect of the time series. Of course, top-down forecasting and similar techniques are interesting, but they tend to multiply the number of models, which becomes a problem to monitor during the lifetime of the solution. Tree-based models have shown good results in the M3, M4 and M5 competitions (the reference competitions in forecasting), but those models don’t scale well, i.e. they require substantial feature extraction (for instance using tsfresh) and feature selection work. Deep learning (DL) models (DeepAR, WaveNet, TFT…) handle both the hierarchical aspect and the high data volume. Their main issue is the quality of the data used for training. Furthermore, the technical debt associated with building a DL model from scratch can be significant. Low-code or no-code approaches are interesting to quickly test the feasibility of a model, but they are often expensive and difficult to integrate with other production tools.

We tested the following methods: Amazon Forecast (AutoML), a top-down Prophet, and the SageMaker version of DeepAR; the comparison is detailed in the Modeling section below.

Input data

Let’s start with the most important thing: DATA!

Time series

First, we have a time series for each store/department pair. The starting date of each time series depends on the store opening. We choose not to consider data before 2014, and we filter out store/department pairs whose turnover over the past year is below a threshold defined by the business. We observe a wide range of seasonal behaviors in our dataset: seasonal effects (winter vs. summer sports), Christmas, the soccer World Cup, and sometimes none at all. We do not integrate any related or past covariate time series because we wanted to develop a first version quickly. Figure 1 displays the turnover over time of a soccer department with a clear seasonality.

Figure 1 — Example of a soccer department’s turnover. The effects of the World Cup and the EURO are noticeable in the seasonality of the turnover (higher turnover than in years without a global event).

Static inputs

Associated with the time series, there are many static inputs, either categorical or numerical features: location, country, department label, store id… These features are linked to the geographical hierarchy or to the hierarchy of the departments. We focus our attention on categorical features because we plan to use the SageMaker version of DeepAR (which only handles categorical static features).

We end up with 4 categorical features:

  • Country, with 12 distinct values,
  • Department, with 67 distinct values,
  • Store, with 1,049 distinct values,
  • Universe, with 16 distinct values.

The final size of our dataset is around 51,000 time series with a weekly timestep.
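To make the input format concrete, here is a minimal sketch of one such time series encoded in the JSON Lines format expected by SageMaker DeepAR; the `cat` field carries the four static categorical features as integer indices (all values below are hypothetical):

```python
import json

# One record per store/department pair, in DeepAR's JSON Lines format.
record = {
    "start": "2014-01-06",               # timestamp of the first week
    "target": [1520.0, 1735.5, 1410.0],  # weekly turnover values (truncated here)
    "cat": [3, 41, 512, 7],              # country, department, store, universe indices
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")   # one JSON object per line
```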

Data Cleaning

When dealing with real ML projects, data quality has to be taken into account. In fact, we are facing several issues:

  • outlying turnover values,
  • missing data points inside the time series,
  • different starting dates (the cold-start problem).

And finally, there is the covid effect, with lockdowns and restrictions varying from country to country.

Classical cleaning/preparation

The classical data preparation starts with the detection of outliers. Indeed, negative values or a one-million-euro turnover in a department of a small store are problematic for an ML model. A simple threshold based on the standard deviation (usually +/- 6 standard deviations) is enough to detect the outliers. The detected points are then treated as NaN (Not a Number) values.
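A minimal sketch of this thresholding with pandas (the function name is ours; the +/- 6 standard deviations rule is the one described above):

```python
import pandas as pd

def flag_outliers(turnover: pd.Series, n_std: float = 6.0) -> pd.Series:
    """Replace outlying turnover values with NaN."""
    mean, std = turnover.mean(), turnover.std()
    # Negative turnover or points too far from the mean are outliers.
    is_outlier = (turnover < 0) | ((turnover - mean).abs() > n_std * std)
    return turnover.mask(is_outlier)  # masked points become NaN
```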

Then, we apply a global reconstruction of the time series to clean all the NaN values (missing points or outliers) and to give every series the same starting date: it performs both interpolation (missing points inside the time series) and extrapolation (we build a “fake” history to remove the cold-start problem).

To do so, we compute an aggregated time series at the country/department level. We also compute an average weight for each store/department pair within its country. With these two elements, we fill each missing value with the product of the local weight of the store/department pair and the aggregated country/department time series (see Figure 2). The main assumption of this method is that the aggregated time series is “cleaner” than the local ones (see Figure 3).

Figure 2 — Original turnover in blue, starting in 2017, with outlying data in 2020. In red, the reconstruction.

The reconstruction of Figure 2 is based on the pattern of the department at the country scale.
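As an illustration, here is a minimal sketch of this weighted reconstruction, assuming each series is a pandas Series indexed by week (the helper name and the exact weight computation are simplifications of ours):

```python
import pandas as pd

def reconstruct(local: pd.Series, aggregated: pd.Series) -> pd.Series:
    """Fill gaps in a store/department series from the country aggregate.

    `local` is the (possibly incomplete) store/department turnover and
    `aggregated` the country/department turnover, both indexed by week.
    """
    weight = (local / aggregated).mean()         # average local share in the country
    extended = local.reindex(aggregated.index)   # extend to the full history
    return extended.fillna(weight * aggregated)  # weight * aggregate for the gaps
```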

Figure 3 — Original turnover at the country level. The covid periods are noticeable in 2020 and 2021, with unusual drops in turnover.

This works well when the weight of the store is stable over the years, but it can create strange behaviors otherwise (see Figure 4).

Figure 4 — The time series starts in 2016 (constant values in blue) and we clearly notice a breakpoint in the reconstructed curve (red).

Finally, this classical preparation is not able to correct the covid periods (where we have a lot of NaN values) because all the stores are impacted at the country level (Figures 2, 3 and 4).

Dealing with covid periods

First, the detection of the covid impact on turnover has to be done manually, and it is not an easy task due to the variety of restrictions imposed by governments.

Between 2020 and 2021 in Europe, we observe in Decathlon data two main covid periods:

  • March-May 2020,
  • November 2020-May 2021.

The first lockdown is easy to detect in all countries, but the definition of the second one is more complicated because of the different lockdown strategies adopted by each country and the quick adjustments made by the different governments. For France (see Figure 5), we created three covid periods within November 2020-May 2021:

  • November to mid-December 2020,
  • January to mid-February 2021, for the curfew and the closing of ski resorts,
  • March to May 2021, for the last lockdown.

In Spain (see Figure 6), the situation is quite different: even though many restrictions and perimeter lockdowns were set up, the impact on turnover is limited to January-March 2021.

Figure 5 — Example of turnover variations in France. The three lockdowns are well defined.
Figure 6 — Example of turnover variations in Spain. The first lockdown is easy to spot, and the restrictions in autumn and winter 2020-21 have less impact.

We choose to reconstruct the covid periods because we make the assumption that the restrictions put in place during the first year and a half of the pandemic were exceptional, and that with vaccination, strict lockdowns with such a large impact on our in-store turnover will not happen again. So far, in our perimeter, this assumption has held, with one exception: the Netherlands, with a 4-week lockdown during Christmas 2021.

The so-called “covid reconstruction” is based on a regression technique (the tree-based models mentioned above). The difference with forecasting algorithms is that we know the future: we define a training database that includes points before and after the lockdowns, as well as global features over time. It is more an interpolation than a forecast.

We define a training set with the following features:

  • a time index,
  • the week number,
  • the month,
  • the time series (store/department) id,
  • the department,
  • the store,
  • the universe,
  • turnover lags of 52 and 104 weeks,
  • global statistics of the time series: min, max, mean, median, standard deviation,
  • the average turnover of the department for the current week number over the past years,
  • the average turnover of the store for the current week number over the past years,
  • the average turnover of the universe for the current week number over the past years.

The regression target is the current turnover value of the department/store pair.

After some backtesting experiments (simulation of a lockdown in 2019), it appears that a LightGBM model provides the best results. The most important features are the turnover lags (104 and 52 weeks). Figure 7 displays the result of the reconstruction for a department/store pair, and a code sketch of the approach follows the figure.

Figure 7 — Covid Reconstruction for a French store/department.
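A minimal sketch of this interpolation, assuming a feature table `df` built with the columns listed above and a boolean mask flagging the covid weeks (the column names and LightGBM parameters are illustrative, not the exact production values):

```python
import lightgbm as lgb
import pandas as pd

def covid_reconstruction(df: pd.DataFrame, covid_mask: pd.Series) -> pd.Series:
    """Re-predict the turnover of covid weeks from the non-covid weeks."""
    features = [c for c in df.columns if c != "turnover"]
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    # Train on every point outside the covid periods (before and after).
    model.fit(df.loc[~covid_mask, features], df.loc[~covid_mask, "turnover"])
    reconstructed = df["turnover"].copy()
    # Interpolate the covid weeks with the model's predictions.
    reconstructed[covid_mask] = model.predict(df.loc[covid_mask, features])
    return reconstructed
```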

Modeling

Metrics

The modeling part always starts with the definition of a metric to compare existing solutions, or a very naive baseline, with the developed algorithms. In the literature, there are many forecasting metrics. We choose the Weighted Absolute Percentage Error, or WAPE.

WAPE (or WMAPE) formula [Wikipedia]:

$$\mathrm{WAPE} = \frac{\sum_{t=1}^{n} |A_t - F_t|}{\sum_{t=1}^{n} |A_t|}$$

where $A_t$ is the actual turnover and $F_t$ the forecast at week $t$.

Thanks to the weighting, WAPE can handle the seasonality of the time series. The WAPE is also a score easily understood by the business (a kind of percentage error). In our problem, the algorithms optimize the global WAPE (over all the countries), but for the comparison with the business solution, looking at the local WAPE (the average or median of all the WAPEs by department/store) is also an interesting criterion.
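In code, the metric is essentially a one-liner; a minimal sketch with NumPy:

```python
import numpy as np

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """WAPE = sum(|actual - forecast|) / sum(|actual|)."""
    return float(np.abs(actual - forecast).sum() / np.abs(actual).sum())

# Global WAPE: call wape() on all points of all series at once.
# Local WAPE: call wape() per store/department series, then average.
```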

Models

First, we run an Amazon Forecast experiment using the AWS user interface. We use the AutoML algorithms with 5 successive backtesting periods between April 2019 and July 2021, with a horizon of 16 weeks. Overall, DeepAR+ (the AWS Forecast variant of DeepAR; there are many variants of DeepAR in AWS!) with tuning provides the best results (see Table 1). Classical algorithms like ARIMA or Prophet deliver interesting results too.

Table 1 — WAPE of Amazon Forecast Backtesting (lower is better) for different cutoffs.

The models have some difficulties over the period October 2020 to May 2021 because of the covid effects: even with the reconstruction done, the pattern or the trend is deeply impacted by collateral effects. For instance, in Europe, access to swimming pools was restricted and the 2020/21 skiing season was the worst of the last decade in terms of sales.

Given these preliminary results, we choose to test Prophet with a top-down approach by department/country, and DeepAR as available in AWS SageMaker.

DeepAR vs Prophet Top-Down

We made a backtest comparison between DeepAR (SageMaker version) and a “top-down” Prophet over 2 years: before and after covid. DeepAR parameters were tuned using the methodology described in the DeepAR tuning section below. For Prophet, we took the default parameters for the 804 models (67 departments × 12 countries). Table 2 displays the comparison between the two methodologies, both globally and locally.

Table 2 — Comparison of global WAPE and local WAPE (lower is better) between DeepAR and Prophet. The local WAPE is the mean WAPE of the 51,000 time series.

The results are quite similar to those of Amazon Forecast, and DeepAR is globally and locally better than the top-down Prophet. Besides, DeepAR is one model versus 804 models for the top-down approach. So for the industrialization/deployment part (the number of countries will increase), DeepAR appeared to be the best option.

Backtesting and Comparison with Baseline

The current baseline for turnover prediction is produced semi-manually by the different department managers: it is a mix of a naive baseline (previous-year seasonality) and human corrections. We compare our solution with the predictions made manually starting on week 37 of 2021. Unfortunately, earlier comparisons were not possible due to the lack of history.

For this cutoff (week 36 of 2021), the global WAPE of DeepAR is 0.245. When we compare the two predictions, the number of time series drops to 35,500 because many departments are not monitored and there are outliers in the predictions made by the business (values of -1e18). In Table 3, DeepAR clearly outperforms the business solution both globally and locally. For 76% of the time series, the DeepAR prediction has a lower WAPE than the business solution.

Table 3 — Comparison of the business solution vs DeepAR predictions (cutoff at week 36 of 2021).

Figures 8 and 9 display, respectively, an example where the manual forecast is accurate and an example where DeepAR outperforms it.

Figure 8 — Example of time series where the manual forecast (blue) is better than DeepAR. For this department, DeepAR overestimated the turnover. This issue was found in other stores for the same department. WAPE_DEEPAR = 0.258, WAPE_BUSINESS=0.144.
Figure 9 — Example of time series where DeepAR (red) is better than the manual Forecast (blue). WAPE_DEEPAR = 0.123, WAPE_BUSINESS=0.247.

DeepAR Tuning

To obtain the good results shown above, we tune DeepAR. At first, we perform hyperparameter tuning over:

  • context_length,
  • embedding_dimension,
  • learning_rate,
  • mini_batch_size,
  • num_cells,
  • num_layers.

The backtest period started in May 2021. We used the tuning job solution of Amazon SageMaker, and the best hyperparameters were very different from the default ones (see Table 4). It turned out that these parameters were unstable: we observed significant variability between successive predictions from September to October. We therefore decided to limit the tuning job to the main parameters that AWS advises tuning:

  • context_length,
  • learning_rate,
  • mini_batch_size.

Epochs were limited to 250 with an early_stopping_patience of 50. Training generally ends in fewer than 100 epochs.

Table 4 — The different hyperparameters tested.

Model monitoring

Because the target horizon is long (16 weeks), the WAPE against actuals is not a practical metric to monitor weekly. That is why we investigated an indicator that we could monitor just after inference. To do so, we implemented a baseline “prediction” expressed as a percentage over the rolling year (last 52 weeks) and compared it with our prediction using the WAPE as a metric. We compute the WAPE for each of the 51,000 time series, and the indicator that we call the historical WAPE is simply their average. The idea of this indicator is to flag abnormal predictions (relative to the baseline) at both local and global levels.

The baseline is simply the median percentage of turnover over the past rolling years (with the covid reconstruction). Figure 10 shows the comparison between a prediction and the naive baseline; a sketch of the indicator follows the figure.

Figure 10 — Comparison between a prediction and the naive baseline.
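A minimal sketch of the indicator, assuming two DataFrames with one row per time series and one column per forecasted week (the names are hypothetical):

```python
import pandas as pd

def historical_wape(prediction: pd.DataFrame, baseline: pd.DataFrame) -> float:
    """Mean per-series WAPE between the model prediction and the naive baseline."""
    per_series = (
        (prediction - baseline).abs().sum(axis=1) / baseline.abs().sum(axis=1)
    )
    return float(per_series.mean())  # alert when this drifts above the threshold
```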

The historical WAPE simply tells us whether the predictions are, on average, far from the baseline. It is also interesting to aggregate it at the country, store, or department level to spot abnormal predictions of our model at a higher level. For instance, during the lockdown in the Netherlands (15/12/2021-14/01/2022), the historical WAPE of Dutch stores rose above 35% (see Figure 11), while other European stores had a historical WAPE between 5% and 15%. This problem was expected: because we handle lockdowns with an a posteriori reconstruction of the period, the model saw the raw data (Figure 12) and predicted a post-lockdown recovery at a lower level than the reality.

Figure 11 — High values of the historical WAPE (green: low WAPE, red: high WAPE) for Dutch stores (predictions made at the end of the lockdown: 2022-01-21).
Figure 12 — Turnover of a Dutch store and different predictions during January 2022. The prediction of 2022-02-14 is made with a reconstruction of the last lockdown.

Continuous Training

Based on our monitoring metrics, we define a threshold for alerting and triggering a continuous-training pipeline to restore our model’s performance. To set it, we look at the variations of the historical WAPE: it is generally around 16%, and when we launch a DeepAR training with too few epochs (fewer than 10 instead of 250), it goes above 40-50%. That is why we fix the global threshold at 30%.

The continuous training consists of re-checking the hyperparameters of DeepAR with a hyperparameter tuning job (a sketch is given after the list). The ranges of the tested parameters are:

  • context_length: IntegerParameter(16, 150),
  • mini_batch_size: CategoricalParameter([256, 512, 1024, 2048]),
  • learning_rate: ContinuousParameter(0.0001, 0.1).
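A minimal sketch of such a tuning job with the SageMaker Python SDK, assuming `deepar_estimator` is already configured with the DeepAR container and `train_data`/`test_data` point to the prepared channels (the objective metric and job sizes are illustrative):

```python
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Search space matching the ranges listed above.
hyperparameter_ranges = {
    "context_length": IntegerParameter(16, 150),
    "mini_batch_size": CategoricalParameter([256, 512, 1024, 2048]),
    "learning_rate": ContinuousParameter(0.0001, 0.1),
}

tuner = HyperparameterTuner(
    estimator=deepar_estimator,          # assumed: a configured DeepAR Estimator
    objective_metric_name="test:RMSE",   # one of DeepAR's built-in metrics
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,                         # illustrative tuning budget
    max_parallel_jobs=4,
)
tuner.fit({"train": train_data, "test": test_data})  # assumed data channels
```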

Conclusion

We successfully tested DeepAR to replace a business solution, and given our preliminary results, DeepAR clearly outperforms the current solution as well as other classical forecasting algorithms. During our experimentation, data quality, and especially the covid reconstruction, was the most challenging and time-consuming task. It also rewarded us with good results: our first test without the covid reconstruction led to a WAPE about 25% higher. The hyperparameter tuning of DeepAR was not as successful as we expected, and tuning only the main parameters worked well. The similarity between DeepAR in AWS SageMaker and DeepAR+ in AWS Forecast was very interesting and helped us converge on the right choice of parameters for our final model.

In an upcoming blog post, we are going to share how we deployed this model at scale. Stay tuned!

A very warm thank you to the Decathlon AI-Lab team.
