[Review] Kaggle Corporación Favorita Grocery Sales Forecasting — Part 2

Post-competition Models and Model Descriptions


The Results from the “Ideal Settings”

To reiterate the settings:

  1. v12_lgb: 2017/07/26. Only used for predicting (store, item) pairs without any sales in the last 56 days.
  2. v13: 2017/07/26 + 56-day filter. For hyper-parameter tuning.
  3. v14: 2016/08/17 + 56-day filter. I expect it to do a little better than v13.

The following table contains the results obtained through Kaggle’s “Late Submission” feature. The numbers in the model columns are the ensemble weights, with the actual number of trained models involved in parentheses. A “comp” in the v12 LGB column means the predictions from the v12 LGB models (3-model average) were used only for (store, item) combinations with no sales in the most recent 56 days. The local CV scores are not comparable between v13 and v14, and do not include the scores for the (store, item) combinations covered by the v12 LGB models.
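The way the ensemble weights and the “comp” fallback combine can be sketched as follows. This is only a minimal illustration, not the actual ensemble script: the function name `blend`, the mask `no_recent_sales`, and the toy numbers are all made up for the example.

```python
import numpy as np

def blend(preds, weights, fallback, no_recent_sales):
    """Weighted average of model predictions, with a fallback model for
    rows ((store, item) pairs) that had no sales in the last 56 days."""
    preds = np.asarray(preds, dtype=float)      # shape: (n_models, n_rows)
    weights = np.asarray(weights, dtype=float)  # shape: (n_models,)
    blended = weights @ preds / weights.sum()   # normalized weighted average
    # Rows flagged by the mask take the fallback prediction instead.
    return np.where(no_recent_sales, fallback, blended)

# Toy numbers, purely illustrative.
preds = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0]]
fallback = np.array([0.0, 0.0, 0.5])          # e.g. an average of LGB models
mask = np.array([False, False, True])         # third row had no recent sales
result = blend(preds, [3, 1], fallback, mask)
```

With weights 3 and 1, the first two rows are blended and the third row is taken from the fallback predictions.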

The local CV scores of the v13 models seem to be consistent with the leaderboard scores, so I’d say my validation is good enough. (The CV scores of v14 are only for reference. They should not be relied upon.)

The final row represents the ensemble I’d most likely have picked in the competition, to hedge against the risk of v14 blind training being a huge mistake. Its private score of .513 means I probably wouldn’t have reached the top 3 even if I had been given more time and hadn’t placed that bad bet. So there’s that.

The training time of my models became too long for v13 and v14 because of the 56-day filter, so I only trained 2 models with different seeds for each hyper-parameter setting. I actually included dozens of models and around 10 types of model variations in the final submission. We might be able to get an even lower score with stronger bagging and more variations (some of which I’ll describe in the following sections), but I am not interested in spending more computing resources to find out.


Many of the features and model structures were inspired by:

  1. Arthur Suilin’s Web Traffic Forecasting winner solution
  2. Sean Vasquez’s Web Traffic Forecasting solution

Features Used in the DNN Models

There are three types of features in my models:

  1. Float series (all sales numbers are log-transformed)
  2. Integer series (mostly categorical variables and dummy variables)
  3. Derived features (features derived from float series that do not have a time dimension)

For float series:

  1. 56-day (store, item) sales prior to the first day we’d like to predict (year 2)
  2. (56+15)-day (store, item) sales in roughly the same date as (1) in the previous year. (year 1)
  3. 56-day (stores in the same cluster, item) total sales (year 1)
  4. (56+15)-day (stores in the same cluster, item) total sales (year 2)
  5. 56-day (store, items with the same class) average sales (year 1)
  6. (56+15)-day (store, items with the same class) average sales (year 2)

For integer series:

  1. (56+15)-day (store, item) onpromotion (year 1)
  2. (56+15)-day (store, item) onpromotion (year 2)
  3. (56+15)-day item sum(onpromotion) across all stores (year 1)
  4. (56+15)-day item sum(onpromotion) across all stores (year 2)
  5. Item class (repeats 56+15 times)
  6. Item family (repeats 56+15 times)
  7. Store type (repeats 56+15 times)
  8. Store cluster (repeats 56+15 times)
  9. Store ID (repeats 56+15 times)
  10. Store city (repeats 56+15 times)
  11. Day of month
  12. Month
  13. Year
  14. Day of week
  15. Whether it has been more than 56 days since the first sale of the item in the store
  16. The current time step in the decoder (15 steps; decoder only; one-hot encoded)

Integer features 4 to 14 are converted to vectors by entity embedding.

For derived features:

  1. Yearly correlation coefficient of float series 1 and float series 2
  2. Yearly correlation coefficient of float series 1 and float series 4
  3. Yearly correlation coefficient of float series 1 and float series 5
  4. The mean of float series 1–6

The alignment of year 1 and year 2 is tricky. To predict the sales at time t, we want the sales at time t-1, all other year 2 features at time t, and the year 1 features at time t-364. So every year 2 sales feature is shifted one day to the left. It’s important to remember this when calculating correlation coefficients, and to shift the year 2 sales series back first.
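Here is a rough sketch of that alignment on a single daily series. The function names and the toy weekly pattern are hypothetical; the only things taken from the text above are the 56-day window, the one-day lag, and the 364-day offset.

```python
import numpy as np

LAG = 364  # 52 weeks, so the day of week stays aligned

def yearly_correlation(sales, start, window=56):
    """Correlation between the `window` days before `start` and the same
    days 364 days earlier (no one-day shift applied here)."""
    year2 = sales[start - window:start]
    year1 = sales[start - window - LAG:start - LAG]
    return np.corrcoef(year2, year1)[0, 1]

def encoder_inputs(sales, start, window=56):
    """For each target day t, pair sales[t - 1] (year 2, lagged one day)
    with sales[t - 364] (year 1)."""
    t = np.arange(start - window + 1, start + 1)
    return sales[t - 1], sales[t - LAG]

# Toy series with a perfect weekly pattern (illustration only).
sales = np.tile(np.arange(7.0), 120)
start = 800
aligned_corr = yearly_correlation(sales, start)
prev_day, last_year = encoder_inputs(sales, start)
# Correlating the lagged model input directly gives a misleading value.
shifted_corr = np.corrcoef(prev_day, last_year)[0, 1]
```

On this toy series the properly aligned correlation is 1, while correlating the lagged input against year 1 gives a much lower number, which is exactly the trap described above.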

The derived features are calculated on the fly in a Dataset method. The other two types of features are written to numpy memmap files on the disk and will be read by the Dataset instance when needed (this saves a tremendous amount of memory).
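A minimal sketch of this setup, using only NumPy so it runs on its own. In practice the class would presumably subclass `torch.utils.data.Dataset`; the class name, file names, dtypes, and array shapes below are assumptions made for the example, and the row mean stands in for the real derived features.

```python
import os
import tempfile
import numpy as np

class MemmapSeries:
    """Dataset-like wrapper around one float memmap and one integer memmap."""

    def __init__(self, float_path, int_path, n_rows, n_steps):
        self.floats = np.memmap(float_path, dtype=np.float32, mode="r",
                                shape=(n_rows, n_steps))
        self.ints = np.memmap(int_path, dtype=np.int16, mode="r",
                              shape=(n_rows, n_steps))

    def __len__(self):
        return self.floats.shape[0]

    def __getitem__(self, idx):
        # Only the requested row is read from disk and copied into memory.
        x_float = np.array(self.floats[idx])
        x_int = np.array(self.ints[idx])
        # Derived feature computed on the fly (here: just the row mean).
        derived = np.array([x_float.mean()], dtype=np.float32)
        return x_float, x_int, derived

# Write two small toy files, then read them back lazily.
tmp_dir = tempfile.mkdtemp()
float_path = os.path.join(tmp_dir, "floats.dat")
int_path = os.path.join(tmp_dir, "ints.dat")
f = np.memmap(float_path, dtype=np.float32, mode="w+", shape=(4, 56))
f[:] = np.arange(4 * 56, dtype=np.float32).reshape(4, 56)
f.flush()
i = np.memmap(int_path, dtype=np.int16, mode="w+", shape=(4, 56))
i[:] = 1
i.flush()

dataset = MemmapSeries(float_path, int_path, n_rows=4, n_steps=56)
x_float, x_int, derived = dataset[1]
```

Because the arrays are opened in read-only mode and indexed row by row, memory use stays roughly proportional to the batch size rather than to the size of the full feature files.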

Not all features are always used. Sometimes some features are dropped when training a model to increase overall variations. A dropout of 0.25 is also applied along the embedding dimension.

Feature Normalization

The float series are normalized by subtracting their (series-wise) means and then dividing by a constant specific to the type of series, e.g. year 2 (store, item) sales. That constant is the standard deviation of all the residuals (after subtracting the means) for that type of series in the training data. The normalized values are clipped to the range (-3, 3) to reduce the influence of outliers.
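In NumPy terms, the normalization could look roughly like this. The function names and the toy data are made up, and `np.log1p` is an assumption about the exact form of the log transform.

```python
import numpy as np

def fit_scale(train_series):
    """One constant per series type: the std of all residuals after
    removing each series' own mean."""
    residuals = train_series - train_series.mean(axis=1, keepdims=True)
    return residuals.std()

def normalize(series, scale, clip=3.0):
    """Subtract the series-wise mean, divide by the shared constant, clip."""
    centered = series - series.mean(axis=1, keepdims=True)
    return np.clip(centered / scale, -clip, clip)

# Toy (store, item) sales, one row per series (illustration only).
raw_sales = np.array([[0.0, 1.0, 3.0, 0.0],
                      [5.0, 5.0, 5.0, 5.0],
                      [0.0, 0.0, 0.0, 2000.0]])
logged = np.log1p(raw_sales)          # assumed form of the log transform
scale = fit_scale(logged)             # computed once on the training data
normalized = normalize(logged, scale)
```

Note that the scale is shared across all series of the same type, so a series with unusually large swings keeps a larger range after normalization, up to the clipping bound.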

For example, this is the distribution of the log-transformed float series 1:

After normalization, it becomes:

Model Structures

  1. Transformer from “Attention is All You Need” implemented by Yu-Hsiang Huang [1]
  2. LSTNet from “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks”. Re-implemented based on GUOKUN LAI’s implementation. [2] (Use GRU as RNN units)
  3. Seq2seq (Encoder/Decoder) with attention (general) [3]. Two decoder variants: (a) only take the hidden states from the last time step (b) feed the predictions from the last time step with scheduled sampling (exponential decay) [4].
  4. Simple RNN/CNN + MLP, with the MLP depending only on the outputs of the RNN/CNN component from the last one (RNN) or seven (CNN) time steps and some non-leaking future features. Also has an optional autoregressive component similar to LSTNet.

An overview of the Long- and Short-term Time-series network (LSTNet) [2]

For model structures 3 and 4, LSTM, GRU, SRU[5], and QRNN[6] are all available. I’ve found that SRU and QRNN can obtain quite good validation loss with less training time, so they were heavily used when doing feature selection. The training time reduction is not as large as the papers reported, though. It might have something to do with my code not being optimized enough.

Model 3(b) with scheduled sampling actually has the best local CV scores, but it takes very long to train, and the decay schedule is very hard to tune. Hence I did not include it in the post-competition models.
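For reference, the exponential decay schedule from the scheduled sampling paper [4] sets the probability of feeding the ground truth at training step i to k^i with k < 1. A minimal sketch follows; the function names and the default value of `k` are assumptions for the example, not the settings I used.

```python
import random

def teacher_forcing_prob(step, k=0.9999):
    """Exponential decay: epsilon_i = k ** i, with k < 1."""
    return k ** step

def choose_decoder_input(truth, prediction, step, rng, k=0.9999):
    """Feed the ground truth with probability epsilon_i, otherwise feed
    the model's own prediction from the previous time step."""
    if rng.random() < teacher_forcing_prob(step, k):
        return truth
    return prediction

rng = random.Random(0)
early = choose_decoder_input("truth", "prediction", step=0, rng=rng)
late = choose_decoder_input("truth", "prediction", step=100, rng=rng, k=0.5)
```

The difficulty is that `k` interacts with the total number of training steps: decay too fast and the decoder sees its own poor predictions before it has learned anything, decay too slowly and it never learns to recover from its own errors.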


As mentioned in Part I, I use ReduceLROnPlateau to schedule the learning rate. For model structures 1 and 4 (CNN), the Adam optimizer is used. The RMSProp optimizer is used for the rest.

I tried YellowFin after reading its paper[7] and thought it was really promising. However, its behavior was really strange. I had to do a lot of hand-tuning to make it almost on par with RMSProp, so it was eventually dropped. Recently I tried the official PTB example with both its PyTorch and TensorFlow implementations, and found that the YellowFin optimizer still underperformed compared to Adam. Not sure what the problem was (I used Python 2.7, TensorFlow 1.1 and PyTorch 0.2.0 as specified in the READMEs).


Mid-competition I found out that the same seed does not generate the same PyTorch model, and spent quite some time trying to find out why. I became pretty sure that the non-deterministic behavior came from the customized weight calculation for perishables. (I explicitly create a weight vector and multiply it with the loss vector before doing back-propagation, and keep the weight vector and the product vector to track the learning curve.) The TensorFlow community has a discussion about non-deterministic mean and sum reductions. I think PyTorch should have a similar problem. It really makes sense, because a parallel summation needs exactly the same workload split and reduction order to guarantee the same result (a consequence of floating-point precision).
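The underlying issue can be shown in plain Python, without any GPU involved. The weighted loss below is only a sketch of the idea: the function name is made up, the 1.25 weight for perishables comes from the competition metric, and dividing by the sum of the weights is an assumption rather than necessarily what my code does.

```python
def weighted_mse(preds, targets, weights):
    """Per-sample squared errors multiplied by sample weights, then reduced."""
    losses = [(p - t) ** 2 for p, t in zip(preds, targets)]
    weighted = [w * l for w, l in zip(weights, losses)]
    return sum(weighted) / sum(weights)

# Perishable item weighted 1.25, non-perishable item weighted 1.0.
loss = weighted_mse([1.0, 2.0], [0.0, 2.0], [1.25, 1.0])

# Floating-point addition is not associative, so the reduction order matters.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
orders_differ = left_to_right != right_to_left
```

Here `orders_differ` is `True`. A parallel reduction that splits the work differently from run to run is effectively changing the grouping of the additions, so tiny differences in the loss can appear and then get amplified over many training steps.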

I used almost the same model structure on another dataset with uniform sample weights, and the trained PyTorch models were perfectly reproducible.

Sample Predictions

(lstm-tf is a single model of model structure 3(a).)

There are some really bizarre series that are almost impossible to predict, for example:

The reason for the abrupt changes might be items going out of stock or being removed from the shelf. We can only guess.

The Code

I have not yet started preparing the code for publication.

There are some experimental code blocks that need to be removed. The model ensemble script currently involves hand-picking models, which should be automated instead. There is still some work to be done.

I’ll update this post with a link to the Github repo, or write a Part III for that. We’ll see then.

Update on 2017-02-11: The incomplete solution on Github:


  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. Attention Is All You Need.
  2. Lai, G., Chang, W.-C., Yang, Y., & Liu, H. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks.
  3. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation.
  4. Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.
  5. Lei, T., Zhang, Y., & Artzi, Y. Training RNNs as Fast as CNNs.
  6. Bradbury, J., Merity, S., Xiong, C., & Socher, R. (2016). Quasi-Recurrent Neural Networks.
  7. Zhang, J., Mitliagkas, I., & Ré, C. (2017). YellowFin and the Art of Momentum Tuning.