Generating Features to Predict Instacart Customer Purchases

Dimitri Linde
5 min read · Aug 1, 2017

In my last post, I explored Instacart’s recently released dataset of 3.2 million customer orders. Instacart released this customer data in part to attract would-be machine learning engineers, initiating a Kaggle competition in which modelers predict the previously ordered products in each customer’s latest order. With this post, I’m going to walk through a few simple features that, fed into machine learning algorithms, collectively have a high degree of predictive power for that purpose.

To get a sense of individual ordering behavior, which can vary significantly from the sample as a whole, I first group users by the products in their order history and tabulate, for every user, the number of times that user has ordered each item. Grouping users by their products forms the basis of a binary classification problem: my model will predict whether an item is or is not in a given customer’s latest order. Dividing the number of times a user has purchased an item by their total number of prior orders gives me an order rate as well. I intuitively expect items that a customer orders a lot to be in their next order; things the customer orders infrequently I do not. The order rate is limited, however, in making no distinction between customers with 3 previous orders and customers with 30, the order rate for the latter being intuitively more robust.

Order rates for User 1. User 1 has made 10 orders, all including Beef Jerky but only one of which included Bartlett Pears.
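A minimal pandas sketch of the order-rate computation, using a toy stand-in for the prior-orders table (the column names and data here are assumptions for illustration, not the dataset’s exact schema):

```python
import pandas as pd

# Toy stand-in for the prior-orders table: one row per (order, product)
priors = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2],
    "order_id":   [10, 10, 11, 12, 20, 21],
    "product_id": [100, 200, 100, 100, 100, 300],
})

# How many times each user has ordered each product
times_ordered = (priors.groupby(["user_id", "product_id"])
                       .size().rename("times_ordered"))

# Each user's total number of prior orders
total_orders = (priors.groupby("user_id")["order_id"]
                      .nunique().rename("total_orders"))

# Order rate = times ordered / total prior orders
features = times_ordered.reset_index().merge(
    total_orders.reset_index(), on="user_id")
features["order_rate"] = features["times_ordered"] / features["total_orders"]
```

Here user 1 has three prior orders, all containing product 100, so that pair gets an order rate of 1.0, while product 200, ordered once, gets 1/3.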

The order rate can also mislead by not picking up clustering in ordering behavior. In the fall, each trip I take to the grocery store results in a purchase of honeycrisp apples. But at a certain point the apple season is over, the honeycrisps are no longer in the stores, and I won’t buy another (decent) apple for a while. If you took a snapshot of seven of my orders, though, my last three might not include honeycrisps while my first four did, and my 57% order rate would strongly suggest I’ll order them on my next trip. To counteract this problem with the order rate and send a negative signal about a product being in a customer’s order, I add an orders since feature, which is the number of orders that have elapsed between a customer’s most recent order that included the product of interest and the customer’s latest order. The orders since value in my honeycrisp example is 3.

These apples are electric
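The orders since feature can be sketched as the gap between a user’s latest order number and the last order number containing the product. The toy data below (assumed column names again) mirrors the honeycrisp example: product 100 appears in the first four of seven orders, so three orders have elapsed since it last appeared.

```python
import pandas as pd

# Seven sequential orders for one user; product 100 only in the first four
priors = pd.DataFrame({
    "user_id":      [1] * 7,
    "order_number": [1, 2, 3, 4, 5, 6, 7],
    "product_id":   [100, 100, 100, 100, 200, 200, 200],
})

# The user's most recent order number
latest = priors.groupby("user_id")["order_number"].max().rename("latest_order")

# The most recent order in which the user bought each product
last_with = (priors.groupby(["user_id", "product_id"])["order_number"]
                   .max().rename("last_order_with_product").reset_index())

orders_since = last_with.merge(latest.reset_index(), on="user_id")
orders_since["orders_since"] = (orders_since["latest_order"]
                                - orders_since["last_order_with_product"])
```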

Orders since is a measure of recency, but only captures the last order in which a user ordered an item. Contra the honeycrisp example, a cluster of recent orders intuitively signals that a user is likely to include a product in their latest order. So for each product in a user’s order history, I create features indicating with a binary whether or not an item was in each preceding order — in the last order, in the 2nd last order, etc.
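One way to sketch these binary recency flags is to count how many orders back each purchase occurred and pivot that into per-product indicator columns (toy data and names are assumptions; the real pipeline would extend this to more lookback positions):

```python
import pandas as pd

priors = pd.DataFrame({
    "user_id":      [1, 1, 1, 1],
    "order_number": [5, 6, 7, 7],
    "product_id":   [100, 200, 100, 200],
})

# 1 = the last order, 2 = the second-to-last order, and so on
latest = priors.groupby("user_id")["order_number"].transform("max")
priors["orders_back"] = latest - priors["order_number"] + 1

# One 0/1 column per lookback position, for the three most recent orders
recency = (priors[priors["orders_back"] <= 3]
           .assign(flag=1)
           .pivot_table(index=["user_id", "product_id"],
                        columns="orders_back", values="flag", fill_value=0)
           .add_prefix("in_last_"))
```

For this toy user, product 100 was in the last and third-to-last orders but not the second-to-last, so its flags read 1, 0, 1.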

One of the trickier things to predict in the Instacart dataset is the incidence of orders without reordered products. But plotting the proportion of this incidence across the training sample (a snapshot of 131K+ users) provides some inspiration.

First, looking at the proportion of orders without reordered products by the number of days since a customer’s prior order lets us see that these orders are concentrated at the minimum and maximum proximity to a prior order. A plurality of orders are made at 30 days, making the inflection there particularly significant.

It’s also the case that the proportion of orders without reordered products is far higher for users with fewer orders. And as above, this is significant given that a plurality of users in the dataset have made the minimum number of orders (four, including their order in the training set), and about half of users have made fewer than 10 orders, a range in which the proportion of orders without reordered products is still amplified.

The proportion of orders with no reordered products is always highest for users with the fewest orders, though particularly high on select days. I try to model the relationship between total orders and days since a prior order, as they pertain to orders without reordered products, with a feature called prob_not_none. The prob_not_none value for each user is the probability that their order includes reordered products. For example, a user making their 4th order 30 days after their previous order would have a value of .82, while a user making their 39th order at the same proximity to their last order would show a .96.
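The mechanics can be sketched as an empirical lookup table: group training orders by (order number, days since prior order) and take the mean of an indicator for “contains reordered products.” The toy data and column names below are assumptions; the .82 and .96 values quoted above would come from the full training sample, not from data this small.

```python
import pandas as pd

# Toy training orders: did each order contain any reordered product?
train = pd.DataFrame({
    "order_number":     [4, 4, 4, 39, 39],
    "days_since_prior": [30, 30, 30, 30, 30],
    "has_reordered":    [1, 0, 1, 1, 1],
})

# Empirical probability that an order includes reordered products,
# keyed on (total orders, days since the prior order)
prob_not_none = (train.groupby(["order_number", "days_since_prior"])
                      ["has_reordered"].mean()
                      .rename("prob_not_none").reset_index())
```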

Products with the highest reorder rates

The last feature I’ll cover here is the product reorder rate, which is uniquely derived from all users (as opposed to a subset of users for prob_not_none, or an individual’s order history for the remaining features). The reorder rate, meaning the number of users who have ordered a product multiple times divided by the total number of customers who have ordered the product, tends to be highest for perishables that people consume quickly and often. Reorder rates are lowest for items that most people need occasionally, use infrequently, and keep for a long time.
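A compact sketch of that definition (toy data and column names are assumptions): count each user’s purchases of a product, then take the share of those users who bought it more than once.

```python
import pandas as pd

priors = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3],
    "product_id": [100, 100, 100, 100, 100],
})

# Times each user ordered each product
per_user = priors.groupby(["product_id", "user_id"]).size()

# Share of a product's buyers who bought it more than once
reorder_rate = (per_user.gt(1).groupby("product_id").mean()
                        .rename("reorder_rate"))
```

Here product 100 was bought by three users, two of whom bought it twice, so its reorder rate is 2/3.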

Logistic regression coefficients for model with 10 or more customer orders

To predict the previously purchased grocery items for each customer, I split users into two groups, those with fewer than 10 orders and those with 10 or more, and run classifiers on each. For a logistic regression fit on a sample of users with 10 or more orders, the number of orders since ordering a product is a particularly strong negative predictor, while the product’s reorder rate and the overall order rate are more predictive than the other recency measures. For users with fewer than 10 orders, orders since is a much diminished though still salient negative predictor, and while the reorder rate is likewise the highest-weighted positive predictor, the presence of an item in a user’s last order is the second-strongest positive predictor.
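A minimal sketch of fitting such a classifier with scikit-learn, on synthetic data built so that reordering is less likely as orders since grows (the features and their generating process here are assumptions, not the actual Instacart data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Two hypothetical features from the post: orders_since and order_rate
orders_since = rng.integers(0, 10, n)
order_rate = rng.uniform(0, 1, n)
X = np.column_stack([orders_since, order_rate])

# Simulated target: reordering is less likely the more orders have
# elapsed, and more likely the higher the historical order rate
logit = -0.8 * orders_since + 2.0 * order_rate + 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
```

Inspecting `model.coef_` on data like this recovers a negative weight on orders_since, mirroring the role it plays as a negative predictor in the fitted models described above; the real pipeline would fit one such model per user group.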

You can get a pretty good result with these 10 simple features and a single classifier. Good luck!
