[Review] Kaggle Corporación Favorita Grocery Sales Forecasting — Part 2

Post-competition Models and Model Descriptions

Ceshine Lee
Feb 4, 2018 · 8 min read

The Results from the “Ideal Settings”

  1. v12_lgb: 2017/07/26. Used only for predicting (store, item) combinations without any sales in the last 56 days.
  2. v13: 2017/07/26 + 56-day filter. Used for hyper-parameter tuning.
  3. v14: 2016/08/17 + 56-day filter. I expect it to do a little better than v13.

The following table contains the results obtained from the “Late Submission” feature of Kaggle. The numbers in the model columns are the ensemble weights and, in parentheses, the actual number of trained models involved. The “comp” in the v12 LGB column means the predictions from the v12 LGB models (3-model average) were used only for (store, item) combinations with no sales in the recent 56 days. The local CV scores are not comparable between v13 and v14, and do not include the scores for (store, item) combinations covered by the v12 LGB models.

The local CV scores of the v13 models seem to be consistent with the leaderboard scores, so I’d say my validation is good enough. (The CV scores of v14 are only for reference. They should not be relied upon.)

The final row represents the ensemble I’d most likely have picked in the competition, to hedge the risk of the v14 blind training being a huge mistake. Its private score of .513 means I probably wouldn’t have gotten a top-3 result even if I had been given more time and hadn’t placed that bad bet. So there’s that.

The training time of my models became too long for v13 and v14 because of the 56-day filter, so I only trained 2 models with different seeds for each hyper-parameter setting. I actually included dozens of models and around 10 types of model variations in the final submission. We might be able to get an even lower score by using stronger bagging and including more variations (some of which I’ll describe in the following sections), but I am not interested in using more computing resources to find out.


The DNN models draw heavily on these two Web Traffic Forecasting solutions:

  1. Arthur Suilin’s Web Traffic Forecasting winner solution
  2. Sean Vasquez’s Web Traffic Forecasting solution

Features Used in the DNN Models

  1. Float series (all sales numbers are log-transformed)
  2. Integer series (mostly categorical variables and dummy variables)
  3. Derived features (features derived from the float series; they do not have a time dimension)

For float series:

  1. 56-day (store, item) sales prior to the first day we’d like to predict (year 2)
  2. (56+15)-day (store, item) sales on roughly the same dates as (1) in the previous year (year 1)
  3. 56-day (stores in the same cluster, item) total sales (year 2)
  4. (56+15)-day (stores in the same cluster, item) total sales (year 1)
  5. 56-day (store, items in the same class) average sales (year 2)
  6. (56+15)-day (store, items in the same class) average sales (year 1)
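As a rough sketch of how such windows are cut (the array name, prediction index, and the 364-day year offset used below are my assumptions for illustration, not the author’s exact code), the year-2 and year-1 inputs for one (store, item) series might look like this:

```python
import numpy as np

# Hypothetical daily log-sales for one (store, item) pair.
rng = np.random.default_rng(0)
daily_sales = rng.random(800)          # ~2+ years of history
t0 = 750                               # assumed index of the first day to predict

# Year-2 input: the 56 days immediately before the prediction window.
year2 = daily_sales[t0 - 56:t0]                      # shape (56,)

# Year-1 input: the matching dates one year (364 days) earlier,
# extended 15 days to also cover the prediction horizon.
year1 = daily_sales[t0 - 364 - 56:t0 - 364 + 15]     # shape (71,)

print(year2.shape, year1.shape)   # (56,) (71,)
```

The asymmetry is forced by the task: year-2 sales inside the 15-day prediction window do not exist yet, while the year-1 series can safely cover them.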

For integer series:

  1. (56+15)-day (store, item) onpromotion (year 1)
  2. (56+15)-day (store, item) onpromotion (year 2)
  3. (56+15)-day item sum(onpromotion) across all stores (year 1)
  4. (56+15)-day item sum(onpromotion) across all stores (year 2)
  5. Item class (repeats 56+15 times)
  6. Item family (repeats 56+15 times)
  7. Store type (repeats 56+15 times)
  8. Store cluster (repeats 56+15 times)
  9. Store ID (repeats 56+15 times)
  10. Store city (repeats 56+15 times)
  11. Day of month
  12. Month
  13. Year
  14. Day of week
  15. Whether it has been more than 56 days since the first sale of the item in the store
  16. The current time step in the decoder (15 steps; decoder only; one-hot encoded)

Integer features 4 to 14 are converted to vectors by entity embedding.
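Entity embedding here means replacing each categorical ID with a learned dense vector before feeding it to the network. A minimal numpy sketch (table sizes, the embedding dimension, and the store ID below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

n_stores, emb_dim, seq_len = 54, 8, 56 + 15

# Learned lookup table: one emb_dim-vector per store ID.
store_embedding = rng.normal(size=(n_stores, emb_dim))

# Integer feature: the store ID repeated over the 56+15 time steps.
store_ids = np.full(seq_len, 12)       # store #12 at every step

# Embedding lookup turns the (71,) integer series into a (71, 8) float block.
embedded = store_embedding[store_ids]
print(embedded.shape)   # (71, 8)
```

In a real model the lookup table would be a trainable layer (e.g. an `nn.Embedding` in PyTorch) rather than a fixed random matrix.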

For derived series:

  1. Yearly correlation coefficient of float series 1 and float series 2
  2. Yearly correlation coefficient of float series 1 and float series 4
  3. Yearly correlation coefficient of float series 1 and float series 6
  4. The mean of float series 1–6

The alignment of year 1 and year 2 is tricky. To predict the sales at time t, we want the sales at time t-1, all other year 2 features at time t, and year 1 features at time t-364. So every year 2 sales feature is shifted 1 day to the left. It’s important to remember this when calculating correlation coefficients, and to shift the year 2 sales series back first.
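The bookkeeping above can be sketched as follows (array names and exact indexing are illustrative assumptions, not the author’s code):

```python
import numpy as np

rng = np.random.default_rng(1)
sales_y2 = rng.random(57)      # year-2 sales, one extra leading day
sales_y1 = rng.random(56)      # year-1 sales for the matching 56 dates

# Model input at step t carries the sales at t-1, i.e. a 1-day left shift:
y2_input = sales_y2[:-1]       # feature[t] = sales[t-1]

# For the correlation feature, undo the shift so both series cover the
# same calendar dates before computing the coefficient:
y2_aligned = sales_y2[1:]
yearly_corr = np.corrcoef(y2_aligned, sales_y1)[0, 1]
print(round(yearly_corr, 4))
```

Forgetting to shift back would correlate each day’s sales with the previous day’s sales from the other year, quietly corrupting the derived feature.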

The derived features are calculated on the fly in a Dataset method. The other two types of features are written to numpy memmap files on disk and are read by the Dataset instance when needed (this saves a tremendous amount of memory).
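A minimal sketch of that pattern, assuming made-up shapes and a plain Python class in place of the actual PyTorch Dataset (the derived feature computed here is just a mean, as a stand-in):

```python
import numpy as np
import os
import tempfile

# Write the float-series features once to disk as a memmap.
path = os.path.join(tempfile.mkdtemp(), "float_series.dat")
n_series, seq_len, n_feats = 1000, 71, 6
mm = np.memmap(path, dtype="float32", mode="w+",
               shape=(n_series, seq_len, n_feats))
mm[:] = np.random.default_rng(0).random(
    (n_series, seq_len, n_feats), dtype=np.float32)
mm.flush()

class SeriesDataset:
    """Dataset-style wrapper: rows are paged in from disk on access,
    so the full feature tensor never has to live in RAM."""
    def __init__(self, path, shape):
        self.data = np.memmap(path, dtype="float32", mode="r", shape=shape)
    def __len__(self):
        return self.data.shape[0]
    def __getitem__(self, idx):
        x = np.array(self.data[idx])   # copy one sample into memory
        derived = x.mean()             # derived features computed on the fly
        return x, derived

ds = SeriesDataset(path, (n_series, seq_len, n_feats))
x, derived = ds[3]
print(x.shape)   # (71, 6)
```

Only the indexed sample is materialized per `__getitem__` call; the OS page cache handles the rest.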

Not all features are always used. Sometimes some features are dropped when training a model to increase overall variations. A dropout of 0.25 is also applied along the embedding dimension.
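One way to read “dropout along the embedding dimension” is that each embedding channel is dropped for the whole series rather than independently per time step; the masking axis below is my interpretation, not confirmed by the source:

```python
import numpy as np

rng = np.random.default_rng(7)
p = 0.25
embedded = rng.normal(size=(71, 8))    # (time steps, embedding dim)

# One mask entry per embedding channel, broadcast across all time steps.
keep = rng.random(8) >= p
dropped = embedded * keep / (1.0 - p)  # inverted-dropout scaling

print(dropped.shape)   # (71, 8)
```

Dropping whole channels (rather than individual entries) prevents the network from relying on any single embedding dimension being present at every time step.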

Feature Normalization

For example, this is the distribution of the log-transformed float series 1:

After normalization, it becomes:
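The post does not spell out the normalization scheme, so the following is only one plausible sketch: log-transform first, then standardize each series to zero mean and unit variance.

```python
import numpy as np

rng = np.random.default_rng(3)
unit_sales = rng.lognormal(mean=1.0, sigma=1.0, size=56)

# Log transform first (log1p keeps zero-sale days finite) ...
log_sales = np.log1p(unit_sales)

# ... then standardize the series (per-series zero-mean/unit-variance
# is an assumed choice; the author's exact scheme may differ):
normalized = (log_sales - log_sales.mean()) / (log_sales.std() + 1e-8)

print(round(float(normalized.mean()), 6), round(float(normalized.std()), 4))
```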

Model Structures

  1. LSTNet from “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks” [1]. Re-implemented based on Guokun Lai’s implementation. (GRU is used as the RNN unit.)
  2. Seq2seq (encoder/decoder) with (general) attention [2]. Two decoder variants: (a) only take the hidden states from the last time step; (b) feed the predictions from the last time step back in, with scheduled sampling (exponential decay) [3].
  3. Simple RNN/CNN + MLP, with the MLP depending only on the outputs of the RNN/CNN component from the last one (RNN) or seven (CNN) time steps plus some non-leaking future features. Also has an optional autoregressive component similar to LSTNet’s.
An overview of the Long- and Short-term Time-series Network (LSTNet) [1]
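The “general” attention score from Luong et al. is score(h_t, h_s) = h_t^T W_a h_s. A self-contained numpy sketch (dimensions and random states are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(5)
enc_dim = dec_dim = 16
T = 56                                  # encoder time steps

H_enc = rng.normal(size=(T, enc_dim))   # encoder hidden states
h_dec = rng.normal(size=(dec_dim,))     # current decoder state
W_a = rng.normal(size=(dec_dim, enc_dim))

# "general" score: score(h_dec, h_s) = h_dec^T W_a h_s, for every step s.
scores = H_enc @ (W_a.T @ h_dec)        # shape (T,)
weights = softmax(scores)               # attention distribution over encoder steps
context = weights @ H_enc               # weighted sum of encoder states

print(context.shape)   # (16,)
```

The context vector is then combined with the decoder state to produce the prediction for the current step.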

For model structures 2 and 3, LSTM, GRU, SRU [4], and QRNN [5] units are all available. I’ve found that SRU and QRNN can reach quite good validation loss with less training time, so they were heavily used when doing feature selection. The training time reduction is not as large as the papers reported, though; it might have something to do with my code not being optimized enough.

The model 2(b) with scheduled sampling actually has the best local CV scores, but it takes very long to train, and the decay schedule is very hard to tune. Hence I did not include it in the post-competition models.
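With exponential-decay scheduled sampling, the probability of feeding the decoder the ground-truth value (instead of its own previous prediction) decays as k**i over training. A tiny sketch of the schedule (the base k is the assumed knob that is hard to tune):

```python
import numpy as np

k = 0.97          # decay base (assumed value for illustration)
epochs = np.arange(0, 101, 20)
teacher_forcing_prob = k ** epochs   # prob. of using the ground truth

for e, p in zip(epochs, teacher_forcing_prob):
    print(f"epoch {e:3d}: use ground truth with prob {p:.3f}")
```

At each decoder step a Bernoulli draw with this probability decides whether the true value or the model’s own prediction is fed in; too fast a decay destabilizes training, too slow a decay leaves train/inference mismatch.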


I tried YellowFin after reading its paper [6] and thought it was really promising. However, its behavior was really strange. I had to do a lot of hand-tuning to make it almost on par with RMSProp, so it was eventually dropped. Recently I tried the official PTB example with both its PyTorch and TensorFlow implementations, and found that the YellowFin optimizer still underperformed compared to Adam. I’m not sure what the problem was (I used Python 2.7, TensorFlow 1.1, and PyTorch 0.2.0 as specified in the READMEs).


I used almost the same model structure on another dataset without non-uniform sample weights, and the trained PyTorch models were perfectly reproducible.

Sample Predictions

There are some really bizarre series that are almost impossible to predict, for example:

The reason for such abrupt changes might be items going out of supply or being removed from the shelf. We can only guess.

The Code

There are some experimental code blocks that need to be removed, and the model ensemble script currently involves hand-picking models, which should be automated instead. There is still some work to be done.

I’ll update this post with a link to the Github repo, or write a Part III for that. We’ll see then.

Update on 2018-02-11: The incomplete solution on Github:


  1. Lai, G., Chang, W.-C., Yang, Y., & Liu, H. (2017). Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks.
  2. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation.
  3. Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.
  4. Lei, T., Zhang, Y., & Artzi, Y. (2017). Training RNNs as Fast as CNNs.
  5. Bradbury, J., Merity, S., Xiong, C., & Socher, R. (2016). Quasi-Recurrent Neural Networks.
  6. Zhang, J., Mitliagkas, I., & Ré, C. (2017). YellowFin and the Art of Momentum Tuning.


