Instacart: A Not-So Insta-Analysis

Madison John
9 min read · Jun 23, 2020
Image from Pexels

Introduction

For this project, which applies both supervised and unsupervised learning algorithms to a single dataset, I wanted to analyze data that was COVID-19 adjacent. In this time of social distancing, alternatives to the traditional grocery experience have come to the forefront of national attention. Instacart provides such an alternative.

What is Instacart?

Instacart was founded in 2012 by Apoorva Mehta as a grocery fulfillment and delivery service.

Users place their orders via a website or mobile application on Android and iOS. Personal shoppers then go to the store of choice, purchase the items on the order list, and deliver the goods to users’ doorsteps.

Instacart currently has partnerships with over 350 retailers throughout 5,500 cities across the United States and Canada. This translates to over 25,000 grocery stores, including such well-known chains as Albertsons, Costco, H-E-B, Kroger, Target, and many more.

Project Goals

The main goal of this project is to determine whether machine learning techniques can be applied to the Instacart dataset to improve customer retention and increase product usage.

This can be accomplished by learning more about the users themselves:

  • Can the users be divided into groups of users with similar characteristics?
  • Can we make predictions on future purchases based on ordering history?

Dataset

The data used for this project is from the 2017 competition that Instacart hosted on Kaggle.com. The Instacart Market Basket Analysis data set consists of several csv files:

  1. aisles.csv
  2. departments.csv
  3. orders.csv
  4. products.csv
  5. order_products__*.csv (where * = prior or train)

Note: While all work in this project was completed on data from 2017, the process can be repeated on any similarly collected and organized data.

Model Preparation

Photo by Anna Shvets from Pexels

Dataset Inspection

Number of Unique Values for Each ID Variable

Above is a table showing the number of unique values for each of the identifier variables (aisle_id, department_id, user_id, etc.).

We will aggregate the data several times in the upcoming sections so we do not have to work with millions of observations. I am doing this work on my mid-tier home PC after all.

Dataset Transformations

Transforming From CSVs into Dataframes

The above graphic shows how information from each of the csv files was combined into three dataframes to better associate orders, users, and products ordered.

Additionally, the products ordered were grouped into the departments from which they came and aggregated as sums. This reduced the observation count from ~32 million to ~3.4 million.

Finally, a new variable num_items was added to record the total number of items ordered per order.
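The merge-and-aggregate step above might look something like this in pandas. The toy frames below mimic the Kaggle schema (products.csv and order_products__*.csv); the `d4`/`d16` column names and the exact dataframe layout are my assumptions, not the author's code:

```python
import pandas as pd

# Toy stand-ins for products.csv and order_products__prior.csv.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "department_id": [4, 4, 16],   # e.g. 4 = produce, 16 = dairy/eggs
})
order_products = pd.DataFrame({
    "order_id":   [100, 100, 100, 101],
    "product_id": [1, 2, 3, 1],
})

# Attach each ordered product to its department...
merged = order_products.merge(products, on="product_id")

# ...then collapse the ~32M product rows into per-order department counts.
per_order = (
    merged.groupby(["order_id", "department_id"])
          .size()
          .unstack(fill_value=0)
          .add_prefix("d")
)

# New num_items variable: total items in each order.
per_order["num_items"] = per_order.sum(axis=1)
```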

Data Visualization

Now let’s take a look at some pretty pictures to learn a little bit about the data.

Top 10 Aisles by Number of Products

Above are the ten aisles with the most products. It is good to see vitamins and supplements round out the top three behind candy, chocolate, and ice cream.

Top 10 Departments by Number of Products

Above are the ten departments with the most products.

Top 10 Products Ordered

Above are the ten products ordered most often. Fruits and vegetables are quite popular with Instacart users.

Top 10 Departments by Products Ordered

Now we have the ten departments with the most products ordered. As alluded to in the previous graphic and confirmed here, produce items are the most ordered on Instacart.

Now let us start looking at the orders with respect to time.

Number of Orders by Day of Week

The graphic above shows that the greatest number of orders are placed on weekends, specifically Saturdays and Sundays. The number of orders decreases toward the middle of the workweek, with Wednesday having the fewest orders.

Number of Orders by Hour of Day

The plot above indicates there is a peak ordering period between 10 AM and 4 PM, with the number of orders decreasing gradually toward midnight and a dead period following that until 6 AM.
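Charts like the two above boil down to a simple groupby-and-count. Here is a minimal sketch using a toy orders frame with the `order_dow` and `order_hour_of_day` columns from orders.csv (the toy values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for orders.csv; in the Kaggle data, order_dow runs 0-6.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_dow": [0, 0, 1, 3, 6],
    "order_hour_of_day": [10, 14, 9, 23, 15],
})

# Count orders per day of week; the bar chart is then one line of pandas plotting.
orders_by_dow = orders["order_dow"].value_counts().sort_index()
# orders_by_dow.plot(kind="bar")  # uncomment to draw the chart
```

The same pattern with `order_hour_of_day` produces the hourly plot.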

Number of Days Elapsed Since Previous Order

Looking at the order lag time, or the number of days elapsed since the previous order, we can note several groups:

  • monthly shoppers
  • shoppers every ~3 weeks
  • shoppers every ~2 weeks
  • weekly shoppers
  • shoppers every 3–4 days

Number of Orders by Order Number

Above, we see that all users in the dataset placed at least four orders. After that, there is a marked decline.

Number of Orders by Number of Items Ordered/Re-Ordered

Finally, we have the distribution of the number of items ordered and re-ordered. Note the long tails to the right from the peaks of five and two items ordered and re-ordered respectively.

Clustering

Now that we have visualized the data, let’s get down to the meat and potatoes of this project: applying machine learning techniques.

Photo by Giftpundits.com from Pexels

The goal of clustering is to segment the users into groups using the order information to identify similarities between customers within the same cluster.

Model Definition

First, we needed to define the model. Below are the steps taken.

  1. Group the data by user_id, reducing the observation count further from ~3.4 million to ~206 thousand.
  2. Aggregate the features as shown in the table below:
Feature Aggregation
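The grouping step above can be sketched with a pandas named aggregation. The exact aggregation table is in the image, so the specific features here (`d4`, `d16`, mean lag, order counts) are plausible assumptions rather than the author's exact choices:

```python
import pandas as pd

# Toy per-order frame: one row per order, department counts plus metadata.
orders = pd.DataFrame({
    "user_id":      [1, 1, 2],
    "order_id":     [10, 11, 12],
    "d4":           [2, 1, 5],
    "d16":          [0, 3, 1],
    "num_items":    [2, 4, 6],
    "days_elapsed": [7, 14, 30],
})

# Collapse ~3.4M orders into ~206K users: sum the item counts,
# average the lag, and count orders per user.
per_user = orders.groupby("user_id").agg(
    d4=("d4", "sum"),
    d16=("d16", "sum"),
    num_items=("num_items", "sum"),
    mean_lag=("days_elapsed", "mean"),
    num_orders=("order_id", "count"),
)
```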

Feature Transformation

Next, to address the issues indicated below, we transformed the features: dimensionality reduction (PCA) to diminish the impact of feature sparsity, and a log transformation to approximate a normal distribution.

Feature Transformation
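In scikit-learn, the two transformations amount to a couple of lines. This sketch uses synthetic sparse counts in place of the department features, and the 90% variance threshold for PCA is my assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy user-feature matrix: sparse, right-skewed counts like the department sums.
X = rng.poisson(lam=0.5, size=(200, 21)).astype(float)

# Log transform pulls the long right tail toward a normal shape
# (log1p handles the many zero counts safely).
X_log = np.log1p(X)

# PCA reduces the 21 sparse department features to a few dense components.
pca = PCA(n_components=0.9, random_state=0)  # keep 90% of the variance
X_reduced = pca.fit_transform(X_log)
```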

Algorithm Selection

Of the algorithms below, k-means resulted in the greatest similarity score. The Elbow Method and silhouette analysis determined that four clusters were optimal.

  • hierarchical (agglomerative) clustering
  • Gaussian mixture model
  • DBSCAN
  • k-means

Elbow Method & Silhouette Analysis Plots
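The elbow and silhouette sweep above can be sketched as follows. Synthetic blobs stand in for the transformed user features, and the range of k values tried is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with four true groups stands in for the transformed user features.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                          # elbow: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)   # higher is better

best_k = max(silhouettes, key=silhouettes.get)
```

Plotting `inertias` against k gives the elbow curve; `silhouettes` gives the silhouette curve.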

Cluster Evaluation

Now that we have our clusters, let’s compare and evaluate them.

Number of Items Ordered & Re-Ordered by Cluster

The charts above show that Cluster 1 users ordered and re-ordered the fewest number of items, while Cluster 2 ordered and re-ordered the greatest.

Note also that Cluster 0 and Cluster 2 have tighter densities relative to the other two clusters.

Mean Order Lag by Cluster

Here we see that while Cluster 1 users ordered the fewest number of items, they had a wide range in lag time. Cluster 2 users, on the other hand, had a wide range of number of items ordered but the least lag time.

Again, Cluster 0 and Cluster 2 have tighter densities relative to the other two clusters.

Number of Orders vs Mean Number of Items Ordered by Cluster

Cluster 1 and Cluster 3 users placed the fewest number of orders, though Cluster 3 users ordered more items.

Cluster 2 placed the greatest number of orders, though there are two densities of users within this cluster:

  • users that placed ~100 orders
  • users that placed ~30–80 orders.

Note: When tried, the 5-cluster solution did not separate Cluster 2 users into the two densities shown above. Instead, it added a 5th cluster overlapping Cluster 0 and Cluster 3.

Number of Orders vs Mean Order Lag by Cluster

Finally, we have the number of orders plotted against the lag time. As pointed out in previous density plots, Cluster 0 and Cluster 2 have tighter densities relative to the other clusters. More importantly, the tighter densities indicate stronger relationships between the various features.

So what does this all mean? The table below summarizes the findings based on the various cluster plots we have reviewed.

Summary of Clusters

Predictions

Now that we have segmented our user population, it is time to make some predictions based on the order information. Can we accurately predict the order frequency or lag time of each customer? Let’s find out.

Model Definition

As with clustering, the first thing we do is define our model. The steps below should be familiar.

  1. Group the data by user_id, reducing the observation count further from ~3.4 million to ~206 thousand.
  2. Aggregate the features as shown in the table
  3. Select days_elapsed as the output variable

Feature Aggregation

Since the output variable is a continuous variable, we will be doing a regression.
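Setting up the regression problem is then a matter of splitting features from the target. The toy per-user table below (and the test split ratio) is an illustration, not the author's exact setup:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy per-user table; days_elapsed (mean order lag) is the continuous target.
per_user = pd.DataFrame({
    "d4":           [10, 3, 25, 7, 12, 1],
    "d16":          [4, 0, 9, 2, 5, 1],
    "num_items":    [20, 5, 60, 12, 30, 3],
    "num_orders":   [5, 2, 12, 4, 7, 1],
    "days_elapsed": [7.0, 30.0, 3.5, 14.0, 7.5, 30.0],
})

X = per_user.drop(columns="days_elapsed")
y = per_user["days_elapsed"]          # continuous output -> regression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
```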

Feature Transformation

Since we are dealing with the same features, we need to do the same transformations to address the issues we found earlier.

Feature Transformation

Note: Experiments showed that executing PCA did not significantly impact the regression results. Due to this, PCA was skipped for the final analysis.

Algorithm Selection

After testing the predictive algorithms below on a subset of the data, the k-nearest neighbors (KNN) algorithm was selected for the final analysis, as it resulted in the lowest error.

  • Random Forest
  • Gradient Boosting
  • Support Vector Machine
  • KNN

Algorithm Selection Metrics

In the first plot above, the standard deviation across folds of a 10-fold cross-validation run for each algorithm decreases with sample size. At 50% sampling, the SVM algorithm has the smallest value at 1.25% while the other three algorithms converge to ~1.9%.

The bar plots show that the mean absolute error (MAE) and the root mean squared error (RMSE) are consistent across sample size; however, the KNN algorithm has consistently lower error.
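The comparison above can be reproduced with scikit-learn's cross-validation helpers. Synthetic regression data stands in for the sampled user features, and the hyperparameters are defaults rather than the author's tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Toy regression data stands in for the sampled user features / lag target.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "SVM": SVR(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# 10-fold cross-validation; sklearn reports negated MAE, so flip the sign.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_absolute_error")
    results[name] = (-scores.mean(), scores.std())
```

Swapping the scoring string for `"neg_root_mean_squared_error"` yields the RMSE comparison.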

Feature Importance

Even though KNN will generate the final predictions, the Random Forest and Gradient Boosting implementations in sklearn include a feature_importances_ attribute that ranks the most impactful features.

Random Forest Feature Importance
Gradient Boosting Feature Importance

Both algorithms consider d4 (produce), d16 (dairy/eggs) and the number of items to be the most influential features. Note that Instacart users ordered the greatest number of items from the produce and dairy/egg departments.
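Extracting the importances looks like this; the synthetic data and feature count are placeholders for the real user features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data with two genuinely informative features out of five.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; larger values mean more influential features.
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda t: t[1], reverse=True)
```

The same attribute exists on `GradientBoostingRegressor`.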

Error Evaluation

Next, we take a look at the prediction errors.

In summary, Cluster 0 and Cluster 2 predictions have lower absolute errors than Cluster 1 and Cluster 3. This is presented in the charts below as the width of the clusters in each plot.

Additionally, the greater the number of items ordered or the greater the number of orders placed, the smaller the magnitude of the error will be.
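The per-cluster error comparison amounts to grouping absolute residuals by cluster label. The values below are made up for illustration; the real ones come from the fitted KNN model:

```python
import numpy as np

# Toy true/predicted lag values with the cluster label of each user.
y_true   = np.array([7.0, 14.0, 30.0, 7.5, 3.5, 21.0])
y_pred   = np.array([8.0, 12.0, 25.0, 7.0, 4.0, 28.0])
clusters = np.array([0, 1, 2, 0, 2, 3])

abs_err = np.abs(y_true - y_pred)

# Mean absolute error per cluster, to compare error across the four segments.
cluster_mae = {c: abs_err[clusters == c].mean() for c in np.unique(clusters)}
```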

Residual Analysis of KNN Prediction Results
Photo by Gustavo Fring from Pexels

Conclusions

Findings

It all boils down to this.

  • What have we learned about Instacart users?
  • What can we do to address our findings?

Summary of Findings and Recommendations

Future Work

The following is a grocery list of items to try in order to improve the similarity score of our clusters and/or to reduce the error in our predictions.

  • Address the sparsity in the features more effectively. Though we aggregated the products ordered into their respective departments, clustering required further reduction with PCA. Keeping all 21 departments as features resulted in clusters with very low to negative similarity scores.
  • Repeat clustering and regression analysis on Cluster 3. Perhaps by reducing the observations to just the priority cluster, we may be able to further optimize the model without worsening compute and time requirements.
  • Use products or aisles as features instead of departments. Sparsity should be addressed first, as using either of these two variables as features will increase the sparsity. There are almost 50,000 products and 134 aisles. Using either as features should provide more granularity into what users are most interested in ordering.

Madison John
husband. father. enginerd. not necessarily in that order.