[Review] Kaggle Instacart Competition

Quick Review and Brief Discussion on Neural Network Models

Ceshine Lee

Published in Veritable, Aug 20, 2017

Instacart Market Basket Analysis

Which products will an Instacart consumer purchase again?

www.kaggle.com

Quick Review

The Instacart Market Basket Analysis competition on Kaggle was a real surprise for me. I was mainly focused on implementing RNN models in PyTorch as practice. The traditional tabular features and gradient boosting models were derived from the data prepared for the RNN models, and were only meant as a baseline. However, after adopting Faron’s F1 optimization implementation, I found my “baseline” very competitive (it reached 10th position at the time), and it looked like it could actually get me my first gold medal, so I invested more time in the gradient boosting models as a result.

The final week was really crazy. One had to constantly improve her models to maintain her leaderboard position. The public scores of my ensemble models became less consistent with local CV towards the end. I was a bit paranoid and suspected a bug in the latest features, so I did not choose the model with the best CV for the final submissions. It would have put me in 16th place, which still would not have been enough for a gold medal anyway.

The whole pipeline can be described as:

1. Merge the data, split it by user, and dump the data into one pickle file per user. (So the data loader for PyTorch can create batches on the fly.)
2. Create tabular features from the dump files and store them in a bcolz ctable. (This allows me to create more features than my RAM can handle and load only a subset of them in the following steps.) Only products that appeared in the last 10 orders are considered as candidates.
3. Train 10-fold LightGBM and MLP models with different subsets of features and different parameters.
4. Train a meta LightGBM model on the predictions of the models from step (3). This yields a gain of roughly 0.0005 in score.
5. Train a None basket classifier based on a subset of the features and the product reorder predictions from step (4).
6. Use the F1 optimization code to evaluate the performance and create submissions.

I didn’t bother to create features specifically for the None classifier in step (5), but in hindsight that might have provided some performance boost. Step (4) could probably be replaced with simple weighted averaging. My final RNN model did not contribute positively to the ensemble, so it was not included.
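To make the last step of the pipeline more concrete, here is a minimal sketch of what "F1 optimization" means: given reorder probabilities for one user, pick the basket that maximizes the expected F1 score. This is not Faron's implementation (which is far more efficient); it is a brute-force illustration that enumerates every possible true basket, assumes the products are bought independently, and simplifies the None handling to "an empty prediction matches an empty basket". The function names are made up for this example.

```python
from itertools import product


def f1(predicted, actual):
    """F1 between two sets; two empty sets count as a perfect 'None' match."""
    if not predicted and not actual:
        return 1.0
    if not predicted or not actual:
        return 0.0
    hits = len(predicted & actual)
    return 2.0 * hits / (len(predicted) + len(actual))


def best_top_k(probs):
    """Return (k, expected F1) for the best 'top-k by probability' basket.

    Brute force over all 2^n outcomes, so only usable for tiny examples.
    Assumes independent purchases (an assumption of this sketch).
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    best_k, best_score = 0, -1.0
    for k in range(len(probs) + 1):
        predicted = set(order[:k])  # k == 0 means predicting None
        expected = 0.0
        for outcome in product([0, 1], repeat=len(probs)):
            p_outcome = 1.0
            for bought, p in zip(outcome, probs):
                p_outcome *= p if bought else 1.0 - p
            actual = {i for i, bought in enumerate(outcome) if bought}
            expected += p_outcome * f1(predicted, actual)
        if expected > best_score:
            best_k, best_score = k, expected
    return best_k, best_score
```

For example, `best_top_k([0.9, 0.1])` picks only the first product, while `best_top_k([0.05, 0.05])` picks the empty basket. The point is that the basket size is chosen per user from the probabilities, instead of applying one global threshold.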

Most top performers used similar models with better features and some special tuning tricks. Many of them generously shared their approaches on the forum, so I won’t spend time listing my somewhat inferior features.

Neural Network Models

Now we come back to what brought me to this competition: figuring out how to learn the features automatically with neural networks, without having to hand-craft them myself.

Sean Vasquez did a fantastic job with his solution, which relies purely on learnt features. The code is superbly written. However, it requires 64GB of RAM to run and I have only 16GB, so I spent some time modifying my code to be closer to Sean’s. You can find it on GitHub (WIP):

ceshine/kaggle-instacart

kaggle-instacart - Solution for Kaggle Instacart Market Basket Analysis Competition

github.com

As mentioned in the last section, the data is split by user and each user’s data is saved as a single pickle file (basket.preprocessing.prepare_users):

joblib.dump(
    res, "data/users/{}/{}.pkl".format(
        df.user_id.iloc[0] // USER_CHUNK, df.user_id.iloc[0])
)

USER_CHUNK is set to 1000, so each subfolder holds at most 1,000 files. If we instead stored all the files in a single folder, (accidentally) opening data/users/ in a file manager might freeze the GUI. (Modern file systems support hundreds of thousands of files in the same folder, but some tools are not comfortable with that.)
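To see how the integer division spreads users across subfolders, here is a small stand-alone sketch. The helper name `user_pickle_path` is made up for this example; only the path format and USER_CHUNK come from the snippet above.

```python
USER_CHUNK = 1000  # same value as in the post


def user_pickle_path(user_id, root="data/users"):
    """Build the sharded path for one user's pickle file.

    Users 0-999 land in folder 0, users 1000-1999 in folder 1, and so on.
    """
    return "{}/{}/{}.pkl".format(root, user_id // USER_CHUNK, user_id)


for uid in (999, 1000, 12345):
    print(uid, "->", user_pickle_path(uid))
```

Note that the subfolders have to exist before anything is dumped into them.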

The features are dynamically assembled and split into batches of a fixed size (basket.models.rnn_product.data_loader). The InstacartDataLoader class is based on PyTorch’s DataLoader. PyTorch’s custom Dataset approach is not applicable here because we want to sample by user and have non-standard return values. One disadvantage is that rows from the same user are highly likely to end up in the same batch, but I think this is a necessary trade-off.
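The idea of sampling by user while still emitting fixed-size batches can be sketched in plain Python. This is not the actual InstacartDataLoader; `iterate_batches` and `rows_by_user` are hypothetical names, and an in-memory dict stands in for loading each user's pickle file.

```python
import random


def iterate_batches(rows_by_user, batch_size, shuffle=True, seed=None):
    """Yield fixed-size batches of rows, visiting one user at a time.

    rows_by_user maps user_id -> list of rows (one row per candidate product).
    """
    user_ids = list(rows_by_user)
    if shuffle:
        random.Random(seed).shuffle(user_ids)  # sample at the user level
    buffer = []
    for uid in user_ids:
        # In a real pipeline this is where the user's pickle would be loaded.
        buffer.extend(rows_by_user[uid])
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:  # last, possibly smaller, batch
        yield buffer
```

Because rows are appended user by user, consecutive rows in a batch tend to belong to the same user, which is exactly the trade-off mentioned above.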

The model combines a 2-layer LSTM and a 3-layer causal CNN with dilated convolution. Sean used a 1-layer LSTM and a 6-layer CNN with dilated convolution, but I find this structure more effective in my setting (which only considers products that appeared in the last 10 orders). In fact, I didn’t find that the causal CNN provided any noticeable gain in performance. This part of the training (basket.models.rnn_product.model) should run smoothly with at least 8GB of RAM.

BTW, I found spotlight to be a really interesting project, and it helped me understand how a causal CNN (which was new to me) can be implemented in PyTorch:

maciejkula/spotlight

spotlight - Deep recommender models using PyTorch.

github.com

Causal CNN with dilated convolution (source: spotlight)
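For readers who are also new to causal convolutions, here is a dependency-free sketch of the core operation. It is neither the spotlight code nor my PyTorch model; it only shows the "look back, never forward" indexing with zero padding on the left. The dilation schedule (1, 2, 4) in the usage note is an assumption for illustration.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """1-D causal convolution over a list of numbers.

    output[t] only depends on x[t], x[t - dilation], x[t - 2 * dilation], ...
    so no information leaks from future time steps.
    """
    k = len(kernel)
    out = []
    for t in range(len(x)):
        total = 0.0
        for j, w in enumerate(kernel):
            # kernel[-1] is applied to the current step; earlier taps look back.
            idx = t - (k - 1 - j) * dilation
            if idx >= 0:  # positions before the sequence start act as zeros
                total += w * x[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input steps visible to one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With a kernel of size 2 and dilations 1, 2 and 4, `receptive_field(2, [1, 2, 4])` is 8: three stacked layers already cover eight past orders, which is why dilation is attractive for sequence data.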

The state of the final fully connected layer is extracted and fed into a LightGBM model. Currently I have only implemented one neural network model, and the performance of the LightGBM model isn’t much different from directly applying a sigmoid function to the last layer. Perhaps LightGBM will perform much better with states from multiple models. But I’ll have to reduce the size of the last layers, or the states won’t fit into memory (16GB of RAM can handle ~80 features).

There’s a lot more to be done. The current result is far from competitive. I might need to implement more ideas from Sean’s solution. And Colin Morris also provided some interesting insights:

Anecdotally, sampling/weighting at the user level seemed to be essential to getting good results with this model.
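One possible reading of "weighting at the user level" is to give every user the same total weight in the loss, no matter how many candidate products (rows) they have. The sketch below is only my interpretation, not Colin's code, and `user_level_weights` is a made-up name.

```python
from collections import Counter


def user_level_weights(user_ids):
    """Per-row weights such that the weights of each user sum to 1."""
    counts = Counter(user_ids)
    return [1.0 / counts[uid] for uid in user_ids]


# A user with three candidate rows and a user with a single row
print(user_level_weights([1, 1, 1, 2]))
```

Weights like these could be passed as per-row sample weights to the loss function or to a gradient boosting model, so that users with long purchase histories do not dominate training.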

However, I’m a bit burnt out on this one and need to take a break. Hopefully I’ll come back later and push this solution to at least around 0.4050 private score.

2017/09/08 Update

I’ve added a Bernoulli mixture model for product-level features. It’s basically an LSTM + FC model structure, but it uses the last FC layer to simultaneously create multiple predictions and the weights for these predictions (each uses half of the layer output). The final prediction is a weighted sum of these predictions.
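The output head can be sketched without any deep learning library. The split into two halves and the weighted sum follow the description above; using a sigmoid for the predictions and a softmax for the weights is an assumption of this sketch, and `mixture_prediction` is a made-up name.

```python
import math


def mixture_prediction(fc_output):
    """Turn one FC output vector into a single reorder probability.

    First half: logits of the component predictions (sigmoid).
    Second half: scores for the mixture weights (softmax, so they sum to 1).
    """
    k, remainder = divmod(len(fc_output), 2)
    if remainder or k == 0:
        raise ValueError("fc_output must have a positive, even length")
    logits, weight_scores = fc_output[:k], fc_output[k:]
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    top = max(weight_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in weight_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * p for w, p in zip(weights, probs))
```

Since the weights are non-negative and sum to one, the result is always a valid probability between the smallest and the largest component prediction.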

I’ve tried combining it with the previous model and feeding both into the GBM meta model, but there’s no significant improvement in either the public or the private score. Maybe I should try adding some tabular features to the meta model.