Forecasting beer sales for HEINEKEN’s customers

Dennis Ramondt
Nov 30, 2017 · 7 min read

As one of the world’s leading brewers, HEINEKEN works together with its customers to offer a diverse array of over 250 brands to consumers in over 170 countries, 70 of which have breweries. One of the main challenges in serving consumers is ensuring on-shelf-availability of its products in retail outlets. Market research has shown that when consumers consistently notice their favourite brand is missing from the shelves, they may quickly switch to a competitor’s brand. This blog is a tutorial for getting started with retail sales forecasting, a necessary component in preventing such situations.

Advanced Analytics for sales and stock predictions

On-shelf-availability is in the interest of both HEINEKEN and its retail customers, which is why they try to leverage their joint analytics expertise and data sources to predict stock-out events and determine which actions to take to prevent them. Data on stocks, sales, orders and deliveries throughout the supply chain are needed to create adequate predictive models and find the cause of stock-outs. For example, production issues at the brewery may cause stocks to run out, but delivery problems or underestimated orders may cause stock-outs at distribution centers and retail outlets. Due to this complexity, lead times are important. If you anticipate unusually high demand at outlets due to promotions, you will need to know several weeks in advance, in order to brew the necessary extra volumes of beer.

The newly established HEINEKEN Insights Lab supports Operating Companies with advanced analytics capabilities. In one of its current experiments, the Insights Lab is creating a forecasting model for stock levels and sales in retail outlets. HEINEKEN on-shelf-availability experts who work internally at customers’ headquarters can then use a dashboard to generate predictive insights, and take action to prevent stock-outs. The tool is meant as an addition to the existing inventory replenishment systems that customers use.

In this blog, we outline a simplified version of the retail sales forecasting approach taken by the Insights Lab tool, an approach that leverages the large amounts of data available in modern supply chains.

Data protection disclaimer

In all of its experiments, the HEINEKEN Insights Lab takes special care to be compliant with current data protection laws, as well as anticipate upcoming regulations, such as the GDPR. Although this blog uses freely available open datasets, in the actual experiment data was shared voluntarily by customers, with agreement on what it would be used for.

Creating a sales forecast: the problem

For a demonstration, we use data from the Walmart Recruiting - Store Sales Forecasting Kaggle competition. It has 3 years of weekly sales by store and department of Walmart stores. Note that our method can be applied to store-product combinations in the same way. Each store-department combination (3331 combinations in total) can be considered an individual time series, which makes the problem more complex. The traditional approach is to fit a simple model to each time series individually, but this does not leverage the information in shared trends and seasonality, nor does it leave room for the more time-consuming training of more complex models.

Our approach here is to train a single model for all time series, including store and department (and indeed other features) as explicit features in the model (besides the usual engineered time series features), and selecting an algorithm that can deal with such a high-dimensional categorical feature space. We also use a cross-validation strategy that mimics the frequent retraining used in real-life situations.

We start by reading in the raw data, and joining the sales data with the provided feature set. These can be found on the Kaggle competition’s page.
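A minimal sketch of this join, assuming the column names used in the Kaggle files (Store, Dept, Date, Weekly_Sales and IsHoliday in the sales file; Store, Date, Temperature and IsHoliday in the feature file). Small in-memory frames stand in for the CSV files so the snippet runs on its own.

```python
import pandas as pd

# Tiny stand-ins for the competition files; in practice you would load them
# with pd.read_csv(..., parse_dates=["Date"]).
sales = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Dept": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"]),
    "Weekly_Sales": [24924.5, 46039.5, 50605.3, 44682.7],
    "IsHoliday": [False, True, False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "Temperature": [42.3, 38.5],
    "IsHoliday": [False, True],
})

# Features are defined per store and week, so we join on those two keys.
# IsHoliday exists in both frames; drop one copy to avoid suffixed duplicates.
df = sales.merge(
    features.drop(columns="IsHoliday"),
    on=["Store", "Date"],
    how="left",
    validate="many_to_one",
)
# Keep each store-department series in chronological order for later steps.
df = df.sort_values(["Store", "Dept", "Date"]).reset_index(drop=True)
```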

Feature engineering

The strength of Machine Learning methods when applied to time series problems is that they don’t suffer as much from the usual challenges of time series forecasting. With serial correlation, non-stationarity and heteroscedasticity, the assumptions of Ordinary Least Squares often don’t hold, and it can be time-consuming to determine the right specification of ARIMA models. These issues can at least in part be tackled using Machine Learning with the somewhat ‘brute force’ approach of adding many features that capture such behaviours, such as lags and datetime indicators. Note that multicollinearity or the curse of dimensionality may still occur, so techniques such as recursive feature elimination or dimensionality reduction should be considered when actually tuning a model like this. As the sales appear to be highly seasonal, we include week numbers as a categorical feature. Each feature engineering step is wrapped in a suitable class, so as to allow the use of pipeline methods.
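As an illustration of such a wrapper, here is a sketch of a week-number step written as an sklearn-style transformer. The class name WeekNumberAdder and the output column name week are our own choices for this example, not names from the original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class WeekNumberAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds the ISO week number as a categorical column."""

    def __init__(self, date_col="Date"):
        self.date_col = date_col

    def fit(self, X, y=None):
        # Nothing to learn; present so the class can be used in a pipeline.
        return self

    def transform(self, X):
        X = X.copy()
        week = X[self.date_col].dt.isocalendar().week.astype("int64")
        # Categorical dtype signals that week 52 is not "larger" than week 1.
        X["week"] = week.astype("category")
        return X


demo = pd.DataFrame({"Date": pd.to_datetime(["2010-02-05", "2010-12-24"])})
out = WeekNumberAdder().fit_transform(demo)
```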

A commonly used but very effective feature is the exponentially weighted moving average. In fact, many traditional sales forecasting systems use this to forecast sales directly. Rather than a normal moving average, which weights all observations in the rolling window equally, it uses all past observations, but with exponentially declining weights the further one goes back in time. This way, it captures a short-term average, without suffering too much from (recent) outliers. The selected alpha value determines how fast these weights decay. The following class adds several smoothed features based on a list of alphas.
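A sketch of what such a class could look like, assuming the data is sorted by date within each store-department series. The class name, parameter names and output column names are illustrative. With adjust=False, pandas applies the recursive form s_t = alpha * x_t + (1 - alpha) * s_(t-1).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ExponentialSmoother(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one smoothed column per alpha, per series."""

    def __init__(self, column="Weekly_Sales", alphas=(0.2, 0.5),
                 group_cols=("Store", "Dept")):
        self.column = column
        self.alphas = alphas
        self.group_cols = group_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        grouped = X.groupby(list(self.group_cols))[self.column]
        for alpha in self.alphas:
            # Smooth within each series so values never leak across series.
            X[f"{self.column}_ewm_{alpha}"] = grouped.transform(
                lambda s, a=alpha: s.ewm(alpha=a, adjust=False).mean()
            )
        return X


demo = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Dept": [1, 1, 1, 1, 1],
    "Weekly_Sales": [10.0, 20.0, 30.0, 100.0, 100.0],
})
out = ExponentialSmoother(alphas=(0.5,)).fit_transform(demo)
```

A higher alpha makes the smoothed value follow recent weeks more closely; a lower alpha gives a slower, more stable average.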

Of course, the most common features in time series Machine Learning are lagged features. The class below creates lagged features based on a dictionary that specifies which columns should be given which lags. It supports past (negative), current (zero) and future (positive) lags. The latter (positive lags 1 and 2) include the target variable, as well as information about the future which we may reasonably expect to have in real life, such as temperatures (from a weather forecast) or whether it is a holiday.
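A sketch of such a lag class, following the sign convention described above (negative lag = past, positive lag = future). The class and column names are illustrative, and rows are assumed to be sorted by date within each series.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LagCreator(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: lags = {column: [lag, ...]}, computed per series."""

    def __init__(self, lags, group_cols=("Store", "Dept")):
        self.lags = lags
        self.group_cols = group_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        grouped = X.groupby(list(self.group_cols))
        for col, lag_list in self.lags.items():
            for lag in lag_list:
                if lag == 0:
                    continue  # lag 0 is the column itself
                # shift(-lag): a positive lag pulls a future value onto the
                # current row, a negative lag pulls a past value forward.
                X[f"{col}_lag_{lag}"] = grouped[col].shift(-lag)
        return X


demo = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Dept": [1, 1, 1, 1, 1],
    "Weekly_Sales": [1.0, 2.0, 3.0, 7.0, 8.0],
})
out = LagCreator({"Weekly_Sales": [-1, 1]}).fit_transform(demo)
```

Rows at the start or end of a series get NaN where no past or future value exists; these need to be dropped or handled before training.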

The following uses a class from the BDR public repository: PdFeatureChain allows the chaining of preprocessing steps, without converting data frames to numpy arrays every time, as sklearn does. Note that the lc1 step creates lags by week number, meaning that they show the sales of last year during that week, which turns out to be quite a powerful feature. lc2 then creates the usual week-on-week lags.
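To illustrate the idea behind lagging by week number, here is a plain-pandas sketch that looks up the sales of the same ISO week one year earlier. It does not use PdFeatureChain, and the column name Weekly_Sales_last_year is our own. Years with an ISO week 53 would need extra care that is left out here.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2011-02-04"]),
    "Weekly_Sales": [100.0, 120.0],
})

# ISO year and week identify "the same week" across years.
iso = df["Date"].dt.isocalendar()
df["year"] = iso["year"].astype("int64")
df["week"] = iso["week"].astype("int64")

# Shift every observation one year forward, then join it back on
# (series, year, week): each row receives last year's value for that week.
last_year = df[["Store", "Dept", "year", "week", "Weekly_Sales"]].copy()
last_year["year"] += 1
last_year = last_year.rename(columns={"Weekly_Sales": "Weekly_Sales_last_year"})

out = df.merge(last_year, on=["Store", "Dept", "year", "week"], how="left")
```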

Creating time series splits

In real applications, it is often desirable to retrain a forecasting model regularly, in order to capture the effect of short-term trends as often as possible. A cross-validation strategy that has test splits for every consecutive week can mimic this behaviour and thus show how the model would perform in reality. The IntervalGrowingWindow class from the BDR public repository allows us to create a growing window of train and test splits on a given time interval (1 week in our case). Note that if you want to use such cross-validation for extensively grid-searching optimal model parameters, you should probably increase test_size to such an extent that only 5–10 splits remain.
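The growing-window idea can be sketched without the library as a small generator. This is not the IntervalGrowingWindow class itself; the function name and the n_train_weeks parameter are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def growing_window_splits(dates, n_train_weeks):
    """Yield (train_idx, test_idx) pairs: train on every week before the
    test week, test on that single week, then move one week forward."""
    dates = pd.Series(pd.to_datetime(dates)).reset_index(drop=True)
    weeks = dates.drop_duplicates().sort_values().tolist()
    for test_week in weeks[n_train_weeks:]:
        train_idx = np.flatnonzero((dates < test_week).to_numpy())
        test_idx = np.flatnonzero((dates == test_week).to_numpy())
        yield train_idx, test_idx


# Two series observed over the same five weeks: ten rows in total.
dates = list(pd.date_range("2010-02-05", periods=5, freq="7D")) * 2
splits = list(growing_window_splits(dates, n_train_weeks=3))
```

Each split trains only on rows dated strictly before the test week, so no future information reaches the model through the training rows.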

All that remains is to convert our data frame to a numpy array and prepare it for cross-validation. The following function forces columns to a floating dtype, and factorizes them if it encounters strings. It’s important that you do this transformation, because mixed dtype (dtype=object) numpy arrays tend to get copied unnecessarily across joblib workers as sklearn parallelizes the cross-validation procedure.
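A sketch of such a conversion, assuming that every non-numeric column may simply be replaced by integer codes. The function name is illustrative.

```python
import numpy as np
import pandas as pd


def to_float_frame(df):
    """Return a copy of df in which every column is float64.
    Numeric and boolean columns are cast; anything else is factorized."""
    converted = {}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            converted[col] = series.astype("float64")
        else:
            # Codes follow order of first appearance; missing values become -1.
            codes, _ = pd.factorize(series)
            converted[col] = codes.astype("float64")
    return pd.DataFrame(converted, index=df.index)


demo = pd.DataFrame({
    "Store": [1, 2, 3],
    "Type": ["A", "B", "A"],
    "IsHoliday": [False, True, False],
})
X = to_float_frame(demo).to_numpy()
```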

This brings our feature set to:

Initializing the model

As we are implementing a very costly cross-validation strategy (84 splits!), we need an algorithm that is both accurate and fast to train. In this example, we use the LightGBM library for gradient boosting, which achieves accuracies close to XGBoost, but with greater speed. It also supports categorical features directly, through a procedure in which multiple categories can be grouped in each tree split. A defining feature of sales forecasting is that we’d rather overforecast than underforecast: if we use the sales forecast to plan restocking, we need to make sure that stores have enough stock and don’t run out over time. To this end, we employ an asymmetric objective function that penalizes underforecasts slightly more than overforecasts, by tweaking the squared loss:
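One way to write such a loss is to multiply the squared error by a weight greater than 1 whenever the forecast is too low. The sketch below does this; the weight of 1.15 is an assumed value chosen for illustration, not the one used in the experiment.

```python
import numpy as np

UNDER_WEIGHT = 1.15  # assumed penalty factor for underforecasts


def asymmetric_squared_loss(y_true, y_pred, under_weight=UNDER_WEIGHT):
    """Mean squared error in which underforecasts (y_true > y_pred) count
    under_weight times as much as overforecasts."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    weights = np.where(residual > 0, under_weight, 1.0)
    return float(np.mean(weights * residual ** 2))


loss_under = asymmetric_squared_loss([10.0], [8.0])   # forecast too low
loss_over = asymmetric_squared_loss([10.0], [12.0])   # forecast too high
```

Both forecasts are off by the same amount, but the underforecast receives the larger loss, which nudges the model towards forecasting slightly high.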

We can feed it to the gradient booster by providing the gradient and hessian (first and second order derivatives):
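A sketch of such an objective, assuming a weighted squared loss w * (y - ŷ)² with w > 1 for underforecasts (the value 1.15 is an assumption). Differentiating with respect to the prediction ŷ gives a gradient of -2w(y - ŷ) and a hessian of 2w. The function follows the callable signature that LightGBM's scikit-learn interface accepts for custom objectives.

```python
import numpy as np

UNDER_WEIGHT = 1.15  # assumed penalty factor for underforecasts


def asymmetric_objective(y_true, y_pred):
    """Gradient and hessian of the weighted squared loss w.r.t. y_pred."""
    residual = (np.asarray(y_true) - np.asarray(y_pred)).astype(float)
    weight = np.where(residual > 0, UNDER_WEIGHT, 1.0)
    grad = -2.0 * weight * residual
    hess = 2.0 * weight
    return grad, hess


# With LightGBM installed, the function can be passed as the objective, e.g.:
#   model = lightgbm.LGBMRegressor(objective=asymmetric_objective)
grad, hess = asymmetric_objective(np.array([10.0, 10.0]), np.array([8.0, 12.0]))
```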

We can further wrap our estimator in sklearn’s MultiOutputRegressor, meaning that we will use the same feature set to forecast multiple target variables separately. This is useful in sales forecasting, because due to production and delivery lead times, we may want to forecast multiple weeks (or indeed days) ahead. Note that in a well-tuned approach one would rather select an individual feature set per day/week-ahead prediction.
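A small sketch of the wrapping on synthetic data. To keep the snippet free of extra dependencies it uses sklearn's GradientBoostingRegressor as a stand-in; in the setting of this blog the wrapped estimator would be the LightGBM regressor.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Two synthetic targets standing in for sales 1 and 2 weeks ahead.
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=200),
    X[:, 1] + 0.1 * rng.normal(size=200),
])

# One independent copy of the base estimator is fitted per target column.
model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=50, random_state=0)
)
model.fit(X, Y)
pred = model.predict(X[:5])  # one column of predictions per horizon
```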

For the cross-validation procedure, we use a slightly adjusted version of sklearn’s cross_val_predict method, as by default this requires the test splits to cover the full dataset exactly once (which is inherently not the case for growing window CV splits). The adjustment can be found below the post.

Evaluating results

We can now append the predictions to the original dataset, and compare them to the target using metrics such as MSE (or MAE, although this makes less sense since our objective function was a form of squared loss), or visualize selected store-department combinations. As it turns out, last year’s sales were by far the most relevant features, especially for predicting sales during holiday peak periods.
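A sketch of such an evaluation on a toy result frame. The column names target and prediction are assumptions; rows that never fell in a test split carry NaN predictions and are dropped before scoring.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

results = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 1, 1, 1],
    "target": [10.0, 20.0, 30.0, 40.0],
    "prediction": [12.0, 18.0, 30.0, float("nan")],
})

# Only rows that received an out-of-sample prediction can be scored.
scored = results.dropna(subset=["prediction"])
mse = mean_squared_error(scored["target"], scored["prediction"])
mae = mean_absolute_error(scored["target"], scored["prediction"])

# Error per store-department series, to spot combinations that forecast badly.
per_series = (
    scored.assign(sq_err=(scored["target"] - scored["prediction"]) ** 2)
    .groupby(["Store", "Dept"])["sq_err"]
    .mean()
)
```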

Concluding: how to create an inventory management system

Following these steps, you now have a working sales forecasting model for retail outlets! Extending this into an inventory management tool is relatively easy. All one needs is an opening stock for the first day of making predictions; by cumulatively adding up the expected sales, one can then make suggestions for deliveries each week (or indeed multiple weeks in advance).
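As a rough sketch of that idea: starting from the opening stock, walk through the forecast weeks and suggest a delivery whenever the projected stock would not cover the expected sales. The function name and the safety_stock parameter are hypothetical additions for this example, and lead times are ignored.

```python
def suggest_deliveries(opening_stock, forecast_sales, safety_stock=0.0):
    """Return one suggested delivery per forecast week, sized so that stock
    covers expected sales plus a safety margin."""
    stock = float(opening_stock)
    deliveries = []
    for expected in forecast_sales:
        # Deliver only what is missing to cover this week's expected demand.
        shortfall = max(0.0, expected + safety_stock - stock)
        deliveries.append(shortfall)
        stock = stock + shortfall - expected
    return deliveries


plan = suggest_deliveries(100.0, [40.0, 50.0, 60.0])
plan_with_buffer = suggest_deliveries(100.0, [40.0, 50.0, 60.0], safety_stock=10.0)
```

In a real tool, deliveries would also need to be shifted earlier by the delivery lead time, which is why multi-week-ahead forecasts are useful.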

Additional material: custom CV predict method
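The original code for this section is not reproduced here. As a hedged sketch of the idea, the function below fits a fresh clone of the estimator on each split and leaves NaN for rows that never appear in a test split. The function name is our own, and unlike sklearn's version it runs sequentially rather than in parallel.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def cross_val_predict_partial(estimator, X, y, cv):
    """Out-of-sample predictions for splits that need not cover all rows.
    cv is an iterable of (train_idx, test_idx) index arrays."""
    y = np.asarray(y, dtype=float)
    preds = np.full(y.shape, np.nan)
    for train_idx, test_idx in cv:
        # A fresh clone per split mimics retraining from scratch each period.
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    return preds


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
cv = [
    (np.arange(0, 6), np.array([6, 7])),
    (np.arange(0, 8), np.array([8, 9])),
]
preds = cross_val_predict_partial(LinearRegression(), X, y, cv)
```

The first six rows only ever serve as training data, so they keep NaN predictions and should be excluded when computing evaluation metrics.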
