Sales forecasting in retail: what we learned from the M5 competition

Maxime Lutel
Feb 3, 2021 · 13 min read

Our review of recurrent issues encountered in a sales forecasting project, and how we handled them for the M5 competition.


  • What machine learning model worked the best for this task?
  • Which features had the biggest predictive power?
  • How to tackle a dataset with intermittent sales?
  • How to deal with an extended forecasting horizon?
  • How to ensure model robustness with an appropriate cross-validation?

Using machine learning to solve retailers’ business challenges

The competition aimed at predicting future sales at the product level, based on historical data. More than 5,000 teams of data lovers and forecasting experts spent months discussing the methods, features and models that would work best to address this well-known machine learning problem. These debates highlighted some recurring issues encountered in almost all forecasting projects. Even more importantly, they brought out a wide variety of approaches to tackle them.

This article aspires to summarize some key insights that emerged from the challenge. At Artefact, we believe in learning by doing, so we decided to take a shot and code our own solution to illustrate it. Now let’s go through the whole forecasting pipeline and stop along the way to understand what worked and what failed.

Problem statement

Hierarchical time series forecasting

Our task is to predict sales for all products in each store, on the days right after the available dataset. This means that 30,490 forecasts have to be made for each day in the prediction horizon.

This hierarchy will guide our modeling choices, because interactions within product categories or stores contain very useful information for prediction purposes. Indeed, items in the same stores and categories might have similar sales evolution, or on the contrary they could cannibalize each other. Therefore, we are going to describe each sample by features that capture these interactions, and prioritize machine learning based approaches over traditional forecasting ones, to consider this information when training the model.

Two main challenges: intermittent values and an extended prediction horizon

The first one is that the time series we are working with have a lot of intermittent values, i.e. long periods of consecutive days with no sales, as illustrated on the plot below. This could be due to stock-outs or limited shelf space in stores. In any case, this complicates the task, since the error will skyrocket if sales are predicted at a regular level while the product is off the shelves.

The second one comes from the task itself, and more precisely from the size of the prediction horizon. Competitors are required to generate forecasts not only for the next week, but for a 4-week period. Would you rather rely on the weather forecast for the next day or for 1 month from now? The same goes for sales forecasting: an extended prediction horizon makes the problem more complex as uncertainty increases with time.

Feature engineering — Modeling sales’ driving factors


Calendar events such as holidays or NBA finals also have a strong seasonal impact. One feature has been created for each event, with the following values:

  • Negative values for the 15 days before the event (-15 to -1)
  • 0 on the D-day
  • Positive values for the 15 days following the event (1 to 15)
  • No value on periods more than 15 days away from the event

The idea is to model the seasonal impact not only on the D-day, but also before and after. For example, a product that will be offered a lot as a Christmas present will experience a sales peak on the days before and a drop right after.
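The encoding described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used for the competition: the function name `days_from_event` and the sample dates are hypothetical, and `None` stands in for "no value".

```python
from datetime import date
from typing import Optional, Sequence

WINDOW_DAYS = 15  # window size taken from the article


def days_from_event(day: date, event_dates: Sequence[date]) -> Optional[int]:
    """Signed distance in days to the closest occurrence of an event.

    Negative before the event, 0 on the day itself, positive after it,
    and None when the closest occurrence is more than WINDOW_DAYS away.
    """
    if not event_dates:
        return None
    # Pick the occurrence with the smallest absolute distance to `day`.
    closest = min(event_dates, key=lambda event: abs((day - event).days))
    offset = (day - closest).days
    return offset if abs(offset) <= WINDOW_DAYS else None


# Hypothetical example: two occurrences of the same yearly event.
christmas = [date(2014, 12, 25), date(2015, 12, 25)]
print(days_from_event(date(2015, 12, 20), christmas))  # -5
print(days_from_event(date(2015, 12, 25), christmas))  # 0
print(days_from_event(date(2015, 12, 28), christmas))  # 3
print(days_from_event(date(2015, 7, 1), christmas))    # None
```

A tree-based model can then split on this single feature to separate the build-up before the event from the drop after it.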



  • Relative difference between the current price of an item and its historical average price, to highlight promotional offers’ impact.
  • Price relative difference with the same item sold in other stores, to understand whether or not the store has an attractive price.
  • Price relative difference with other items sold in the same store and same product category, to capture some cannibalization effects.
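As a rough illustration of these three price features, here is a minimal sketch on toy data. All names (`price_features`, `relative_diff`, the dictionaries) and all prices are hypothetical, and the choice to exclude the item itself from the category average is an assumption.

```python
from statistics import mean
from typing import Dict, List, Tuple

Prices = Dict[Tuple[str, str], float]  # (store_id, item_id) -> current price


def relative_diff(value: float, reference: float) -> float:
    """Relative difference of `value` with respect to `reference`."""
    return (value - reference) / reference


def price_features(store: str, item: str, current: Prices,
                   history: List[float], category_of: Dict[str, str]) -> Dict[str, float]:
    """Three relative price features for one (store, item) pair.

    Assumes the comparison groups are non-empty, which holds for the toy data.
    """
    price = current[(store, item)]
    # Same item sold in the other stores.
    elsewhere = [p for (s, i), p in current.items() if i == item and s != store]
    # Other items of the same category sold in the same store.
    neighbours = [p for (s, i), p in current.items()
                  if s == store and i != item and category_of[i] == category_of[item]]
    return {
        "vs_own_history": relative_diff(price, mean(history)),
        "vs_other_stores": relative_diff(price, mean(elsewhere)),
        "vs_same_category": relative_diff(price, mean(neighbours)),
    }


# Hypothetical toy data.
current_price = {("CA_1", "item_A"): 2.0, ("CA_2", "item_A"): 2.5, ("CA_1", "item_B"): 3.0}
category_of = {"item_A": "FOODS", "item_B": "FOODS"}
past_prices = [2.5, 2.5, 2.5, 2.5]  # price history of item_A in CA_1

features = price_features("CA_1", "item_A", current_price, past_prices, category_of)
print(features)
```

Here the item is 20% cheaper than its own historical average and than the same item in the other store, which is the kind of signal a promotion would produce.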

Categorical variables encoding

All categorical variables and some of their combinations have been encoded with this method. This results in very informative features, the best one being the encoding of the product and store combination. If you wish to experiment with other encoders, you can find a wide range of methods here.

One LightGBM model per store

As LightGBM is a tree-based method, it should be able to learn different patterns for each store, based on their specificities. However, we observed that training a model on samples from a single store slightly improved forecast accuracy. Therefore, we decided to train 10 different LightGBM models, one per store, which also had the advantage of reducing the overall training time.

Tweedie loss to handle intermittent values

Without going into the mathematical details, let’s try to understand why this loss function is appropriate for our problem, by comparing sales distribution in the training data and the tweedie distribution:

They look quite similar and both have values concentrated around 0. Setting the tweedie loss as an objective function will basically force the model to maximize the likelihood of that distribution and thus predict the right amount of 0s. Besides, this loss function comes with a parameter, whose values range from 1 to 2, that can be tuned to fit the distribution of the problem at hand:

Based on our dataset distribution, we can expect the optimal value to be between 1 and 1.5, but to be more precise we will tune that parameter later with cross-validation. This objective function is also available for other gradient boosting models such as XGBoost or CatBoost, so it’s definitely worth trying if you’re dealing with intermittent values.
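For readers who want to see what the loss looks like, here is a minimal sketch of the Tweedie negative log-likelihood for a power strictly between 1 and 2, written up to a term that depends only on the observed value. The function name `tweedie_loss` is ours. `objective` and `tweedie_variance_power` are existing LightGBM parameters, but the value 1.2 is only a placeholder, not the value retained after tuning.

```python
def tweedie_loss(y: float, mu: float, power: float) -> float:
    """Tweedie negative log-likelihood of one observation, up to a constant in y.

    `y` is the observed sales value (>= 0), `mu` the predicted mean (> 0)
    and `power` the variance power, strictly between 1 and 2.
    """
    if not 1.0 < power < 2.0:
        raise ValueError("power must be strictly between 1 and 2")
    if mu <= 0:
        raise ValueError("mu must be positive")
    return -y * mu ** (1.0 - power) / (1.0 - power) + mu ** (2.0 - power) / (2.0 - power)


# For a fixed observation, the loss is smallest when the prediction equals it.
candidates = [0.5, 1.0, 2.0, 3.0, 4.0]
best = min(candidates, key=lambda mu: tweedie_loss(2.0, mu, power=1.2))
print(best)  # 2.0

# On a day with zero sales, a prediction closer to 0 is always cheaper.
print(tweedie_loss(0.0, 0.1, 1.2) < tweedie_loss(0.0, 1.0, 1.2))  # True

# How the objective could be selected in LightGBM (placeholder power value).
lgb_params = {"objective": "tweedie", "tweedie_variance_power": 1.2}
```

The variance power is the parameter that would then be tuned by cross-validation.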

How to forecast 28 days in advance?

Making the most out of lag features

This concept is important to understand what features will be available at prediction time. Here, we are on day D and we would like to forecast sales for the next 28 days. If we want to use the same model — and thus the same features — to make predictions for the whole forecasting horizon, we can only use lags that are available to predict all days between D+1 and D+28. This means that if we use the 1-day lag feature to train the model, that variable will also have to be filled for predictions at D+2, D+3, … and D+28, whereas it refers to dates in the future.

Still, lags are probably the features with the biggest predictive power, so it’s important to find a way to make the most out of this information. We have considered 3 options to get around this problem, let’s see how they performed.

Option 1: One model for all weeks

The first option is the most obvious one. It consists of using the same model to make predictions for all weeks in the forecasting horizon. As we just explained, it comes with a huge constraint: only features available for predicting at D+28 can be used. Therefore, we have to get rid of all the information given by the 27 most recent lags. This is a shame, as the most recent lags are also the most informative ones, so we considered another option.

Option 2: Weekly models

This alternative consists in training a different LightGBM model for each week. On the diagram above, every model is learning from the most recent possible lags with respect to the constraint imposed by its prediction horizon. Following the same logic as the previous option, it means that each model can leverage all lags except those that are newer than the farthest day to predict. More precisely:

  • Model 1 makes forecasts for days 1–7, relying on all lags except the 6 most recent ones.
  • Model 2 makes forecasts for days 8–14, relying on all lags except the 13 most recent ones.
  • Model 3 makes forecasts for days 15–21, relying on all lags except the 20 most recent ones.
  • Model 4 makes forecasts for days 22–28, relying on all lags except the 27 most recent ones just like in option 1.

This method allows us to better capitalize on lag information for the first 3 weeks and thus improved our solution’s forecast accuracy. It was worth it because it was a Kaggle competition, but for an industrialized project, questions of complexity, maintenance and interpretability should also come into consideration. Indeed, this option could be computationally expensive, and if we are aiming at a rollout on a whole-country scale, it would require maintaining hundreds of models in production. In that case, it would be necessary to evaluate whether the performance gain is large enough to justify this more complex implementation.
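The lag constraint of the weekly models can be made explicit with a short sketch. The helper names and the toy series below are hypothetical. The rule itself follows the list above: the model for week k can only use lags of at least 7 × k days.

```python
from typing import List, Optional

DAYS_PER_WEEK = 7


def min_usable_lag(week: int) -> int:
    """Smallest lag (in days) available for every day forecast by a weekly model.

    The model for week k forecasts days 7*(k-1)+1 to 7*k after the last known
    day D, so on its farthest day the most recent known value is 7*k days old.
    """
    return DAYS_PER_WEEK * week


def lag_feature(sales: List[float], target_index: int, lag: int) -> Optional[float]:
    """Value of the series `lag` days before `target_index`, or None if unknown.

    `target_index` may point past the end of `sales` (a day to forecast).
    """
    source = target_index - lag
    return sales[source] if 0 <= source < len(sales) else None


history = [3.0, 0.0, 1.0, 4.0, 2.0, 0.0, 5.0, 1.0]  # toy daily sales, last value is day D
last_day = len(history) - 1

for week in range(1, 5):
    print(f"model {week}: lags >= {min_usable_lag(week)} days")

# Farthest day of week 1 (D+7): the 7-day lag is known, the 6-day lag is not.
print(lag_feature(history, last_day + 7, lag=7))  # 1.0
print(lag_feature(history, last_day + 7, lag=6))  # None
```

Option 1 corresponds to using `min_usable_lag(4)` for every day of the horizon.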

Option 3: Recursive modeling

The last option also uses weekly models, but this time in a recursive way. Recursive modeling means that predictions generated for a given week will be used as lag features for the following weeks. This happens sequentially: we first make forecasts for the first week by using all lags except the 6 most recent ones. Then, we predict week 2 by using our previous predictions as 1-week lags, instead of excluding more lags like in option 2. By repeating the same process, we always get recent lags available, even for weeks 3 and 4, which allows us to leverage this information to train the models.

This method is worth testing, but keep in mind that it is quite unstable, as errors spread from week to week. If the first-week model makes large errors, these errors will be taken as the truth by the next model, which will then inevitably perform poorly, and so on. That’s why we decided to stick with option 2, which seems to be more reliable.
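To make the recursive mechanism and its instability concrete, here is a toy sketch in which each weekly "model" is a simple stand-in function rather than a trained LightGBM model. Every name below is hypothetical, and the 10% bias is an arbitrary value chosen to show how an error compounds.

```python
from typing import Callable, List

WeekModel = Callable[[List[float]], List[float]]


def recursive_forecast(history: List[float], predict_week: WeekModel,
                       n_weeks: int = 4) -> List[float]:
    """Forecast week by week, feeding each week's predictions back as inputs."""
    extended = list(history)  # actual values, then earlier predictions
    for _ in range(n_weeks):
        week = predict_week(extended)
        if len(week) != 7:
            raise ValueError("a weekly model must return 7 values")
        extended.extend(week)
    return extended[len(history):]


def seasonal_naive(series: List[float]) -> List[float]:
    """Stand-in model: repeat the last 7 values (the 1-week lags)."""
    return list(series[-7:])


def biased_model(series: List[float]) -> List[float]:
    """Same stand-in, but over-forecasting by 10%."""
    return [1.1 * value for value in series[-7:]]


history = [2.0, 0.0, 1.0, 3.0, 0.0, 4.0, 5.0]  # toy last known week

unbiased = recursive_forecast(history, seasonal_naive)
biased = recursive_forecast(history, biased_model)

# The 10% error of week 1 is reused as an input and compounds every week.
print(biased[0] / unbiased[0])    # about 1.1 on the first day of week 1
print(biased[21] / unbiased[21])  # about 1.46 on the first day of week 4
```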

Ensuring model robustness with an appropriate cross-validation

Why cross-validation is critical for time series

The validation period during which the model is tested also has a greater importance when dealing with time series. Model performance and the optimal set of hyper-parameters can vary a lot depending on the period over which the model is trained and tested. Therefore, our objective is to find which parameters are the most likely to maximize performance not over a random period, but over the period that we want to forecast, i.e. the next 4 weeks.

Adapting the validation process to the problem at hand

Folds 1, 2 and 3 aim at identifying parameters that would have maximized performance over recent periods, basically over the last 3 months. The problem is that these 3 months might have different specificities than the upcoming period that we want to forecast. For example, let’s imagine that stores launched a huge promotional season over the last few months, and that it just stopped today. These promotions would probably impact the model’s behavior, but it would be risky to rely only on these recent periods to tune it because they are not representative of what is going to happen next.

To mitigate this risk, we have also included folds 4 and 5, which correspond to the forecasting period shifted by 1 and 2 years respectively. These periods are likely to be similar because the problem has a strong yearly seasonality, which is often true in retail. If the data had a different periodicity, we could choose any cross-validation strategy that makes more business sense. In the end, we selected the hyper-parameter combination with the lowest error over the 5 folds to train the final model.
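The fold layout can be sketched as follows. The exact fold boundaries are not given in the text, so this assumes that folds 1 to 3 are the three consecutive 28-day windows right before the forecast period, and that a "year" shift is 364 days (52 weeks) to keep weekdays aligned. The function name and the start date are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

HORIZON = timedelta(days=28)
YEAR = timedelta(days=364)  # assumption: 52 weeks, so weekdays stay aligned


def validation_folds(forecast_start: date) -> List[Tuple[date, date]]:
    """(first day, last day) of each validation fold, both inclusive."""
    last_offset = HORIZON - timedelta(days=1)
    folds = []
    # Folds 1 to 3: the three 28-day windows right before the forecast period.
    for k in (1, 2, 3):
        start = forecast_start - k * HORIZON
        folds.append((start, start + last_offset))
    # Folds 4 and 5: the forecast period itself, shifted back by 1 and 2 years.
    for years in (1, 2):
        start = forecast_start - years * YEAR
        folds.append((start, start + last_offset))
    return folds


forecast_start = date(2016, 5, 23)  # hypothetical first day to forecast
folds = validation_folds(forecast_start)
for number, (start, end) in enumerate(folds, start=1):
    print(f"fold {number}: {start} to {end}")
```

For each fold, the model would be trained only on data strictly before the fold's first day, so that validation mimics the real forecasting situation.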


These figures are indicative: the incremental accuracy also depends on the order in which each step is implemented.

Key takeaways

  • Work on a small but representative subset of data to iterate quickly.
  • Be super careful about data leakage in the feature engineering process: make sure that all the features you compute will be available at prediction time.
  • Select a model architecture that allows you to leverage lags as much as possible, but also keep in mind complexity considerations if you intend to go to production.
  • Set up a cross-validation strategy adapted to your business problem to correctly evaluate your experiments’ performance.

Thanks a lot for reading up to now and don’t hesitate to reach out if you have any comment on the topic! You can visit our blog here to learn more about our machine learning projects.

Artefact Engineering and Data Science

Artefact is a tech company dedicated to solving data challenges by combining state-of-the-art Machine Learning and advanced software engineering. We leverage our business knowledge to deliver tailor-made solutions and bring value to our clients. @ Artefact

Written by Maxime Lutel, Data Scientist @Artefact
