Creating a Multifaceted Grocery Recommender System

Melody Peterson · Published in CodeX · Mar 11, 2021


For this project I created a multifaceted grocery recommender system based on the Instacart data used in the Kaggle competition of 2017.

Being a lifelong “foodie” with a marketing education, I was immediately drawn to the idea of creating a grocery recommendation system while going through my data science coursework. I regularly receive coupons in the mail from my local grocery store that are specific to either items I have purchased in the past or items that the store thinks “I would be interested in”. I also get emails from the local drug store offering coupons and letting me know when “an item I might like” has gone on sale. These are examples of recommendation systems in action.

The Data

Useful information can be derived just from exploring the purchasing patterns in the data. We can see which aisles and departments are ordered from the most, right down to the product level, as well as the typical number of items in each order and how many days users go between orders. As an example, the most ordered products can be seen below.
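This kind of exploration is a few lines of pandas. A minimal sketch, using a tiny made-up slice of the order-products table rather than the real Instacart files:

```python
import pandas as pd

# Hypothetical slice of the Instacart order-products data
order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 3, 3],
    "product":  ["Banana", "Milk", "Banana", "Eggs", "Banana", "Milk", "Bread"],
})

# Most ordered products
top_products = order_products["product"].value_counts()
print(top_products.head())

# Typical basket size: number of items per order
basket_size = order_products.groupby("order_id").size()
print(basket_size.describe())
```

The same `value_counts` / `groupby` pattern extends to aisles, departments, and the days-since-prior-order column.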

Customer Clustering

My first modeling step was to use the K-Means clustering algorithm to group my users by similarity. In a business setting, this information would allow you to personalize marketing efforts for different clusters of users. One cluster was defined by large purchases of baby products; another made many alcohol purchases. Some clusters were large, and some were very small but specific. I felt it was important to look at each cluster’s “buying power,” which I defined by the number of users in the cluster, the percentage of orders and products that came from that cluster, and other purchase metrics seen below.

NLP Product Search Engine

The size of the data, with over 32 million order id and product id combinations, made it prohibitive to use a memory-based recommender such as KNN, which relies on customer-to-customer and item-to-item similarities. However, the idea of comparing similar items led me to use Natural Language Processing to create a search engine in which one can enter any text value and get recommended products. Specifically, I tokenized and stemmed the aisle, department, and product name for each product, then used CountVectorizer to calculate a matrix of cosine similarities between products. Example search output can be seen below.
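The core of such a search engine fits in a few lines. A sketch with a tiny hypothetical catalog (stemming omitted for brevity; each "document" concatenates product name, aisle, and department, as described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog: product name + aisle + department per product
products = [
    "Organic Bananas fresh fruits produce",
    "Banana Chips snacks chips pretzels snacks",
    "Whole Milk milk dairy eggs",
    "Almond Milk soy lactosefree dairy eggs",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(products)

def search(query, top_n=2):
    """Return indices of the products most similar to a free-text query."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, doc_term).ravel()
    return sims.argsort()[::-1][:top_n]

# The two milk products rank first for this query
print([products[i] for i in search("milk")])
```

Vectorizing the query with the fitted vocabulary keeps it in the same term space as the catalog, so cosine similarity ranks products by shared tokens.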

SVD Recommender

Next I used Singular Value Decomposition, a matrix factorization method that finds the latent features of the customers and items while reducing the dimensionality of the data, to generate product “ratings” for each user. In the absence of explicit product ratings, I used the number of times a user purchased a particular product as a proxy for a rating, giving me a rating scale of 1–100. My initial RMSE was 3.46, which didn’t seem like a large error on a scale of 1–100; however, upon further inspection I realized that the items with the higher ratings had very large prediction errors.
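To illustrate the mechanics (not the actual pipeline, which used a much larger matrix and a tuned model), here is a bare-bones SVD recommender on a random stand-in ratings matrix: center the observed ratings, factorize, truncate to a few latent features, and reconstruct to get predictions:

```python
import numpy as np

# Hypothetical user-item "ratings": purchase counts on a 1-5 scale,
# with 0 meaning "never purchased"
rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(20, 12)).astype(float)

# Center observed ratings, factorize, and reconstruct predictions
mask = ratings > 0
mean = ratings[mask].mean()
centered = np.where(mask, ratings - mean, 0.0)

U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 4  # number of latent features to keep
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + mean

# RMSE on the observed entries
rmse = np.sqrt(((pred[mask] - ratings[mask]) ** 2).mean())
print(f"RMSE on a 1-5 scale: {rmse:.2f}")
```

Libraries such as Surprise wrap this idea with proper train/test splits and bias terms; the sketch above only shows the factorize-and-reconstruct core.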

So I rescaled my ratings to a scale of 1–5 and, along with a hyperparameter grid search, got my RMSE down to 1.26. However, a large popularity bias in the data was causing the most popular items (e.g. Bananas) to be highly recommended for every user. After some research, I decided to perform a personalized re-ranking of the recommended products for each user by designating a percentage of the recommendations to come from the “long tail” of the product distribution. I defined the “short head” of the products to be the top 15% of products, which accounted for the top 80% of orders.
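One way to sketch that re-ranking, using a synthetic popularity curve in place of the real order counts (the cutoff and percentages below are illustrative):

```python
import numpy as np

# Hypothetical order counts per product, sorted descending (popularity curve)
rng = np.random.default_rng(1)
order_counts = np.sort(rng.pareto(1.2, size=1000))[::-1] * 100

# "Short head" = the most popular products covering the top 80% of orders
cum_share = np.cumsum(order_counts) / order_counts.sum()
short_head = set(np.flatnonzero(cum_share <= 0.80))
print(f"{len(short_head)} of 1000 products make up the short head")

def rerank(recs, long_tail_pct=0.4, n=10):
    """Fill a fixed share of the final list from long-tail products."""
    head = [p for p in recs if p in short_head]
    tail = [p for p in recs if p not in short_head]
    n_tail = int(n * long_tail_pct)
    # Promote long-tail items to the front, then fill with head items
    return (tail[:n_tail] + head + tail[n_tail:])[:n]
```

Here the candidate list `recs` is assumed to already be ordered by predicted rating, so the function only reshuffles which segment of the catalog each slot comes from.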

For a new customer, I created a function that lets them submit ratings for a certain number of products (from a specific aisle if desired); the model then generates a desired number of recommendations (also from a specific aisle if desired), with a specified percentage of the products drawn from the long tail of the distribution.

Association Rules — Market Basket Analysis

I found some great blogs here on Medium from other data scientists who created association rules for the products in this data. These rules are created by calculating the frequency and support for each product and product combination, as well as the confidence that product B will be purchased with product A, and then the lift for the product combination, which is basically how often the products were purchased together divided by the probability of them occurring together if they were independent. Mostly reusing their code (which I credit in the notebook), I generated product association rules for each pair of products purchased together within each cluster of my users. Below are the item combinations with the highest lift scores for cluster number 19.

I appreciated that the way these calculations are made accounts for popularity bias in the data. So even though bananas are purchased with almost everything, their confidence and lift scores are discounted based on their relative frequency.
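The three metrics are simple enough to compute by hand. A sketch on a handful of made-up baskets (support = co-occurrence rate, confidence = support divided by the antecedent's rate, lift = support divided by the rate expected under independence):

```python
from itertools import combinations
from collections import Counter

# Hypothetical baskets from one cluster of users
baskets = [
    {"bananas", "milk"},
    {"bananas", "eggs", "milk"},
    {"chips", "salsa"},
    {"chips", "salsa", "bananas"},
    {"chips", "salsa"},
]
n = len(baskets)

item_counts = Counter(i for b in baskets for i in b)
pair_counts = Counter(frozenset(p) for b in baskets
                      for p in combinations(sorted(b), 2))

def rule_metrics(a, b):
    """Support, confidence, and lift for the rule a -> b."""
    support = pair_counts[frozenset((a, b))] / n
    confidence = support / (item_counts[a] / n)
    lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
    return support, confidence, lift

print(rule_metrics("chips", "salsa"))
```

Dividing by both item frequencies is exactly what discounts a ubiquitous product like bananas: its co-occurrence with everything is high, but so is its expected co-occurrence, so its lift stays modest.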

Flask Application

Finally, I created a Flask application to deploy my models to a web browser. This app has a home page as well as separate pages for the NLP and SVD recommenders. There is also a ratings page where a new user can submit ratings and then receive recommendations from the SVD model. I hope to add the association rules recommender as a separate page in the future.
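The skeleton of such an app is small. A hedged sketch, with a placeholder `nlp_search` standing in for the trained search engine (the route names are illustrative, not the app's actual URLs):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def nlp_search(query):
    # Placeholder standing in for the CountVectorizer search engine
    return [f"result for {query!r}"]

@app.route("/")
def home():
    return "Grocery Recommender"

@app.route("/nlp")
def nlp_page():
    # e.g. /nlp?q=banana
    query = request.args.get("q", "")
    return jsonify(nlp_search(query))

# To serve locally: app.run(debug=True)
```

Each recommender becomes one route that parses the request, calls the pre-trained model, and renders or returns the results.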

Thanks for Reading!

View the Github here
Connect with me on LinkedIn here
I am currently open to employment!


Data Science Student, Stay-at-Home Mom, Former Management Consultant