5 tips to better take promotional data into account

The path to developing a high-performance demand forecasting model — Part 2

Published in

Artefact Engineering and Data Science

10 min readMay 26, 2021

TL;DR

In this following article of a large series of posts dedicated to demand forecasting, we will focus on how to model promotions, a key driver in sales forecasting, looking at what a typical promo dataset looks like, how features should be crafted, with a step-by-step example in Python, and how complex promotional granularity can be dealt with.

Context

When forecasting demand as retailers, we often can leverage several useful data sources such as historical sell-in, product and customer hierarchies, holidays, sell-out and promotions. It is important to pay particular attention to the latter as, in real life, promotional activity is often much more than just a dummy flag you should add as a feature to your model. It is a real complex business mechanism which can yield additional performance to your model if processed well.

Tips 1: Understanding promotions referentials

Sales dataset often do not include promo data. You need to plug in a specific referential into your training data. Promotional data often come as a set of columns, in the format of a promotion plan with promotion characteristics such as:

SKU concerned by the promotion
Retailer benefiting from the promotion
Start and end dates
Promotion ID (unique ID for identifying a promotion for a given SKU, retailer, start date, end date)
Promotion type (e.g. 3 for 2)

Before starting the feature engineering and modeling parts, it is recommended to perform interviews with the business owners in order to understand how promotion are created and handled. Let’s take the example of dates. In the case of a sell-in prediction, a promotion will be associated with several dates:

official start and end dates for shipping orders from the distributor to the retailers (1)
and official start and end dates of promotions set up in stores by retailers (2)

It is really important to understand which dates impact the most the target variable and to check with the business owners if there are any specificities to take into account (e.g. if some retailers may apply the promotions before or after the official start and end dates).

Performing Exploratory Data Analysis (EDA) can help you understand the fluctuations and the impact promotions can have on the target variable. For example, below one can see that a particular type of promotion has a greater impact than others.

Example of the impact of two different promotions on the sales of a given product

EDA can also be performed to validate insights previously highlighted by business teams. Below, it seems retailers are starting promotions well before the official start date.

Example of difference between official and real start date of promotions

After the exploration, we need to process the raw promotion data in order for it to be used. The dates have first to be expanded and processed in order to have a continuous timeline. Some of them may need to be shifted based on what was found during the exploration and business interview phases.

Illustration of promotion data before and after preprocessing

Tips 2: Choosing the right granularity

As your EDA will show, promotions impacts differ greatly across products, retailers (or stores if you’re working with sell-out data), types of promotions. Ideally, you’d like to be as much precise as possible and keep a SKU x retailer x promotion type granularity.

For example, you may have two different geographic granularity: retailer and warehouse (i.e. a warehouse contains severals retailers). By plotting your time-series at each granularity, you may find out that the impact of promotion is really visible at retailer level but seems smoothed for warehouses. This can be explained by the fact that not all retailers of a the warehouse are impacted by the promotion in the same way. Therefore in that example, it is preferable to work at the retailer granularity.

Difference of the promotion impact depending on the granularity

Once the EDA has been done and the promotion data is at the right granularity, the objective is to create the most relevant features for future planned promotions for which we want to predict the associated sales.

Tips 3: Crafting the right features

People may think that just adding a dummy variable in your training dataset will be enough. This works to guide the model in understanding why demand or sales is higher at one point in time. However, it is a really poor way to model how promotion may impact sales. Typically, some promotion types may be more efficient than others, the impact of promotions may also be higher near the start date, lower after (as there are few people left that could benefit from the reduction).

A more sophisticated feature we found useful when using boosting algorithms is to compute sales rolling means so as to give your model insights about how much each promo type was successful in the past.

1. Theory

The idea behind this kind of feature is to measure, for a given promotion, the average volume recently generated by “similar promotions” in the past. We are going to compute the average historical sell-in on a similar scope (same promotion type, same SKU, same retailer) on a rolling window with a given horizon (e.g. on the 7 past days).

Example of a rolling mean with a 7-days window

For this kind of features, a particular attention should be paid to data leakage, especially when setting the time horizon.

2. Python implementation

Let’s see how to implement a 7-days rolling mean feature step by step in Python. First of all, let’s define our DataFrame with the following information:

SKU
Retailer
Promotion type
Promotion ID
Date
Sell-in (our target variable)

Once created, our DataFrame looks like this:

Next we will create two important columns : the start date of promotion and the rolling mean (empty for now).

Now, the DataFrame should look like this:

*Head of the DataFrame with the two new columns*

From there, we can start filling in the promo_rolling_mean column. Remember that the goal is to compute the mean of sell-in of previous similar promotions but the notion of similarity can be tricky. In the best case scenario, we have in our history a promotion with the same type, for the same retailer, for the same SKU. In worst case scenario, there is a new promotion with a new type for which we do not have any history for any SKU, any retailer. Therefore the idea is to define several levels of granularity for which we will see if we have a history and therefore a possibility to calculate a rolling mean, starting from the most granular level (e.g. SKU x retailer x promotion type) to the least granular level (e.g. SKU).

For example, let’s take SKU : 1, retailer : A, promotion type : 1, date : 2020–01–01. We look for a similar promotion in the past. Lucky for us, there has been a promotion with the same promotion type, for the same SKU, the same retailer (i.e. the most granular level) in 2019 (promotion_id = ‘A1–2019’). Thus we will take the mean of the sell-in for the 7 most recent dates where this kind of promotion happened. In other cases we may not find any match for this granularity so we will be looking for a match by SKU and promotion type only. Again, if there is no match, we will finally take the mean at the SKU level only.

At the end of the for loop, as you merge this new feature with the train set, you must pay attention at data leakage and only take rolling means computed for dates before each observation date (the date for which you want to make forecasts). Here, we have decided to use the merge_asof method. This allows us to merge two datasets avoiding exact matches. The idea behind it is: do not take the exact date match (with the allow_exact_matches=False parameter), but take the previous ones (with the direction=”backward” parameter).

Here is what our dataset looks like with the rolling mean feature filled after this step:

Final DataFrame with the rolling mean feature

We can first see that there are some missing values for the rolling mean feature at the top of the DataFrame. This is normal and is due to the fact that for the first rows, we do not have any history on any SKU, any retailer, therefore no possibility to compute a rolling mean. This is the only case where the rolling mean will be empty, any other case can be handled by the definition of the granularity aggregation levels.

For example, for the specific row we have defined at the beginning (SKU: 1, Retailer A, Promotion type: 3, Date: 2020–06–01), the one for which we don’t know the sell-in yet, the rolling mean value will be the mean of the sell-in for the most similar and recent promotion. In our case, there is no history for the promotion type 3 for the retailer A, but there is for retailer B. Therefore, the rolling mean value will be the mean of sell-in for SKU=1, Promotion type= 3, Retailer=B, here: mean([134, 146]) = 140.

Example for a new promotion type for a couple SKU x retailer

This logic can be extended to several other cases that can be encountered in this type of projects. For example, an additional level could be created which would be the product family and which could be used in the case where there is no history for a given product. In that case we will take the average based on products belonging to the same family. It is therefore important to think about these levels of granularity and prioritize them according to your own definition of “promotion similarity” which can be based on your EDA or business insights, for example.

Instead of the rolling means, you can also measure the promotional uplift (i.e. the additional volume generated by a given promotion) for a given product and a given customer. The idea would be to compute a ratio between sales during a given promotion and the sales without any promotion instead.

Tips 4: Dealing with big data

Working at such a level of granularity can drastically increase complexity and the need for computational power. If like us you are dealing with hundreds of SKUs, retailers and several years of daily historical sell-in, it will be essential to find a way to parallelize computations. For example, if you have no need to get any information from other SKUs for your promotion features, you can partition your data on the sku column and use distributed computing. We found it useful to use Dask for this task:

Tips 5: Taking into account demand transfers across products

Don’t forget that each SKU sales will be impacted by its promotions, but also by promotions from substitutable products, such phenomena is known as cannibalization. In order to have a performant model, it is mandatory to anticipate the potential downlift on some products resulting from cannibalization.

In order to be able to model the phenomenon, we first have to detect the cannibalization relationships between products. Two main approaches can be distinguished:

Use an existing mapping: if such mapping exists, all we have to do is to add this information to our dataset (e.g. a column “associated cannibalizing product”) and create specific features based on that.
Detect the cannibalization relationships automatically: the idea will be to establish relationships by looking at product similarities and behaviors in the past.

Automatic detection deep dive

One possibility to automatically detect cannibalization relationships is to use correlation scores. The idea is to bring together products that are really likely to cannibalize each other, not on the basis of their category, but on the basis of correlations between the evolution of their historical sales. Correlation scores are calculated for each pair of products and if they are strongly negative then we can assume that these products are cannibalizing each other

Cannibalization features deep dive

From these cannibalization relationships we can create features following the same approach as for direct promotions. For example:

Associate to each cannibalized product, the promotion rolling mean computed for the cannibalizing product. The “volume” in those features is therefore the “volume”of the cannibalizing products.
Associate to each cannibalized product, the downlift (loss of volume) that generally appears when the associated cannibalizing product is sold. The “volume” in those features is therefore the “volume”of the cannibalized products.

Results and conclusion

In our projects, we have observed that, in most cases, rolling mean features tended to work better for the model than uplift/downlift features. For example, for a given country, the rolling mean features resulted in a 2.8% increase in forecast accuracy while the uplift features resulted in a 2% increase.

However, each project is different and our main learning is that the exploration phase is essential and serves as a basis for the creation of features later on. It is necessary to really understand how promotions work and their impact in order to model them correctly. This involves discussions with the business owners as well as Exploratory Data Analysis.

A big thank you to all the team who worked on this mission who made it possible to write this article, especially Kasra Mansouri and Cédric Magnan. Stay tuned for the next articles in this series and do not hesitate to reach out if you have any comment.