Predicting Instacart Customer Purchases and Analyzing The Output

Dimitri Linde
Aug 9, 2017 · 6 min read

This post is the third of three in my series about Instacart’s recently released “Online Grocery Shopping Dataset 2017.” In the first post, I explored the dataset, which includes 3.2 million orders across more than 206,000 users. Next, I derived features from the dataset to predict the contents of each customer’s next order. In this post, I analyze the output of my predictive model. I check the model’s performance at the product level: what types of products the model correctly identified, what types it overestimated (in other words, frequently guessed incorrectly), and what types it underestimated (in other words, largely missed). I also check how the model predicts orders without reordered products and how it performs with respect to different customer and order attributes.


Recalling the previous post: I first grouped users by the products in their order history and then added a number of features to each user-product pair. Before analyzing the model output, I’d like to briefly run through these features, which should, after all, shape the result:

  • The number of times a user has ordered a specific item;
  • The order rate: the number of times a user has ordered a specific item divided by the total number of orders the user has made;
  • The number of orders since a user last ordered an item;
  • For each of the past 5 orders a user has made, whether the product was included (e.g. each user’s last order is one feature, 2nd to last order another feature, and so on);
  • The reorder rate: the number of users who have ordered a product multiple times divided by the total number of users who have ordered the product; and
  • The probability of an order including reordered products. A certain portion of orders do not contain any of the user’s previously ordered products, though these ‘None’ orders are concentrated in users with few orders who order in minimum (between 0–2 days) or maximum (30 days) proximity to their last order.

In addition to the above features, I created a new product, ‘None,’ to track the incidence of customer orders without reordered products. Such orders can occur during any order except a customer’s first order.
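To make the feature list concrete, here is a hypothetical sketch of deriving the first three features with pandas on a toy orders table. The column names mirror the Instacart dataset (`user_id`, `order_number`, `product_id`), but the data and variable names are illustrative, not the actual pipeline from the previous post:

```python
import pandas as pd

# Toy order history: one row per product per order for a single user.
orders = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 1, 1],
    'order_number': [1, 1, 2, 3, 3, 4],
    'product_id':   ['banana', 'butter', 'banana', 'banana', 'butter', 'banana'],
})

# Total orders per user (the highest order_number seen).
n_orders = orders.groupby('user_id')['order_number'].max().rename('n_orders')

# Per user-product pair: how often it was ordered and when it last appeared.
feats = (orders.groupby(['user_id', 'product_id'])
               .agg(times_ordered=('order_number', 'count'),
                    last_ordered=('order_number', 'max'))
               .join(n_orders))

feats['order_rate'] = feats['times_ordered'] / feats['n_orders']
feats['orders_since'] = feats['n_orders'] - feats['last_ordered']
print(feats)
```

For this toy user, banana appears in all four orders (order rate 1.0, zero orders since its last appearance), while butter appears in two and was last seen one order ago.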

To score my model, I first split users into two groups, those with less than 10 prior orders and those with 10 or greater, and then ran classifiers on 15,000 users in each group. The evaluation metric was the F1 score, the harmonic mean of precision (true positives / (true positives + false positives)) and recall (true positives / (true positives + false negatives)). In both cases, logistic regression was the best-performing classifier, yielding F1 scores (according to the local validation function I leveraged) of:

  • .419 for the model with less than 10 orders per customer; and
  • .4224 for the model with 10 or greater orders per customer.

The F1 scores mask significant differences in the precision and recall associated with each model. While both models are stronger in recall at the probability threshold I set (in which all products above the threshold are included in the prediction), the model for users with less than 10 orders has a starker split (.510/.356) than the model for users with 10 or greater orders (.465/.387).
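As a quick sanity check, the reported F1 scores can be recovered from the recall/precision splits quoted above, since F1 is the harmonic mean of the two:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Users with less than 10 orders: recall .510, precision .356
print(round(f1(0.356, 0.510), 3))  # → 0.419, matching the reported score
# Users with 10 or greater orders: recall .465, precision .387
print(round(f1(0.387, 0.465), 3))  # → 0.422
```

Note how the harmonic mean pulls the score toward the weaker of the two components, which is why the lopsided .510/.356 split still lands close to the more balanced .465/.387 one.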

None Handling

For the model covering customers with less than 10 orders, a group that accounts for about half of the users in the sample but a substantial majority of orders without reordered products, precision (.298) and recall (.345) in ‘None’ cases (where ‘None’ is the predicted or actual product) are worse than for the model as a whole. The results are nonetheless a vast improvement over earlier model iterations, which achieved only ~.135 recall on ‘None’ orders. The improvement in recall comes at some cost to precision, which declined from .36 to .298, and is probably attributable to:

  • adding a feature for the probability that an order contains no reordered products (i.e. is a ‘None’ order);
  • treating ‘None’ instances as a product; and
  • intermixing ‘None’ with product predictions in cases where both ‘None’ and products had a predicted probability above a certain threshold of being in a customer’s next order.
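The third bullet can be sketched as follows: ‘None’ is scored like any other candidate product, and every candidate whose predicted probability clears the threshold goes into the predicted basket. The threshold value and function names here are illustrative assumptions, not the actual values from my model:

```python
# Assumed cutoff for illustration; the post does not state the exact value.
THRESHOLD = 0.22

def predict_basket(candidate_probs: dict) -> list:
    """candidate_probs maps product id (or 'None') -> predicted probability.

    Returns every candidate above the threshold; 'None' competes on
    equal footing with real products, so both can appear together.
    """
    basket = [p for p, prob in candidate_probs.items() if prob >= THRESHOLD]
    # Fall back to the single most likely candidate if nothing clears the bar.
    return basket or [max(candidate_probs, key=candidate_probs.get)]

print(predict_basket({'banana': 0.61, 'None': 0.30, 'butter': 0.12}))
# → ['banana', 'None']
```

Because ‘None’ can be emitted alongside products, the model hedges in ambiguous cases rather than committing to one outcome, which is consistent with the recall gain (and precision dip) described above.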

‘None’ order prediction is consistent across the range of days since a prior order as well as the total number of user orders. Precision (.207) and recall (.256) for ‘None’ instances are noticeably lower for users with 10 or more orders, though such instances are about 1/3 as common as for the group of users with fewer orders.

Department Analysis

Looking first at the model with users with less than 10 orders, recall is highest in the three largest volume departments: produce, dairy and eggs, and beverages. Recall in each of these departments surpasses .6. These departments generally supply perishable products that people order frequently and consume quickly, aligning well with the features in the model.

Turning to where the model performs poorly, the major failure is in the recall of non-perishable products in departments like pantry, personal care, and household, with recall below .3 across six departments. Products like spices, conditioner, and tile cleaner are purchased infrequently and oftentimes never purchased again. They also keep around the house for a long time. The features in the model are a poor fit for predicting the recurrence of these products. Though these are lower-volume departments, substantial gains in overall recall could be made by improving the model’s performance in them.

Precision tells a similar story across departments, though nearly every department, including those with low recall, scores above .3 precision.

Though the feature coefficients for each model differ, the features are the same for both, and the model’s strengths and weaknesses are consistent. The model with customers with 10 or more orders tends to be a bit more precise at the expense of an equivalent amount of recall, but the departments with high as well as low precision and recall are the same as above.
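This kind of department-level residual analysis can be sketched as a simple groupby over per-product predictions. The frame below is toy data and the column names are illustrative; in practice the predictions would be joined to Instacart’s `products.csv` and `departments.csv` lookup tables to attach a department to each product:

```python
import pandas as pd

# Toy per-product predictions: 1 = product in the (predicted/actual) basket.
preds = pd.DataFrame({
    'department': ['produce', 'produce', 'pantry', 'pantry', 'household'],
    'predicted':  [1, 1, 1, 0, 0],
    'actual':     [1, 0, 0, 1, 1],
})

def precision(g):
    tp = ((g.predicted == 1) & (g.actual == 1)).sum()
    fp = ((g.predicted == 1) & (g.actual == 0)).sum()
    return tp / (tp + fp) if (tp + fp) else float('nan')

def recall(g):
    tp = ((g.predicted == 1) & (g.actual == 1)).sum()
    fn = ((g.predicted == 0) & (g.actual == 1)).sum()
    return tp / (tp + fn) if (tp + fn) else float('nan')

by_dept = pd.DataFrame(
    {dept: {'precision': precision(g), 'recall': recall(g)}
     for dept, g in preds.groupby('department')}).T
print(by_dept)
```

In this toy example, produce has perfect recall but imperfect precision, while pantry and household are missed entirely, mirroring (in exaggerated form) the perishable/non-perishable split described above.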

By Number of Orders and Days Since Prior Order

As with the ‘None’ cases, the model performs fairly consistently regardless of the total number of user orders.

For users with less than 10 orders

Looking at precision and recall by days since prior order, however, scores are markedly lower for orders made 1–3 days after a previous order than for any other gap. This is an odd but consistent finding: it occurs for all types of users and merits further investigation.

For users with 10 or greater orders

Conclusion

The final iteration of my model excels at recalling products that are ordered often and consumed quickly. It performs poorly on products not meeting these criteria. The residual analysis above demonstrates clear areas for continued model improvement in identifying orders that contain no reordered products, which EDA previously revealed are concentrated among users with few orders, especially at minimum or maximum proximity to a previous order. Residual analysis lastly reveals that the model performs poorly for users making orders 1–3 days after their last order.

To address the problem of identifying infrequently ordered products, I’ll need to derive additional features. One option would be to try to discover patterns for some infrequently ordered items. Butter, for example, might consistently be reordered for many users after a fixed number of orders or days. Another option would be to model special cases, including ‘None’ instances and orders made 1–3 days after their previous order, separately.
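The first option, spotting items reordered at a roughly fixed cadence, could be sketched as below. The gap data, function name, and tolerance are all hypothetical, purely to illustrate the idea of flagging user-product pairs whose reorder intervals cluster tightly:

```python
from statistics import mean, pstdev

def regular_interval(order_gaps, tolerance=2.0):
    """order_gaps: days between successive orders containing the product.

    Returns the mean gap if the gaps cluster tightly (low spread),
    else None, meaning no stable reorder cadence was found.
    """
    if len(order_gaps) < 3:
        return None  # too few reorders to call it a pattern
    return mean(order_gaps) if pstdev(order_gaps) <= tolerance else None

print(regular_interval([14, 13, 15, 14]))  # butter-like: reordered ~every 2 weeks → 14
print(regular_interval([3, 30, 7, 21]))    # no stable cadence → None
```

A detected cadence could then feed the model as a new feature, e.g. days elapsed since the last reorder divided by the typical gap, so that infrequently but regularly ordered items stop being invisible to it.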

That’s all. Thank you for reading!
