Making Sense of 33 Million Rows of Instacart Data: Exploratory Data Analysis Using Plotly

Published in

Geek Culture

6 min readAug 22, 2021

Instacart is one of the hottest Silicon Valley startups of the last decade and is now valued at whopping $39 billion. In 2017 they hosted a Kaggle competition challenging data scientists to predict which products will be reordered for a given user. For this purpose they open sourced a dataset comprising anonymised information about 3 million+ orders from their 200,000+ users.

The first step in understanding data is to perform exploratory data analysis on the dataset and have a cursory insight of it. We shall use python’s plotly library for data visualisation and also datapane library to make our plotly charts interactive at some places here.

The dataset in question can be downloaded from Kaggle website. It contains 6 CSV files holding data pertaining to orders, products, departments and aisles.

Please note that I chopped off parts of plotly code in github gists that is related to embellishing the charts. This is to reduce the size of this page (it is already too long). Complete code can be found on my github.

Importing the required libraries:

Loading the files as pandas dataframes:

Finding out the shapes of dataframes:

Printing out first few rows of the dataframes gives us an idea of the contents inside:

We can see orders_df contains data about user orders between 4 and 100. The last order of each user is split into ‘train’ and ‘test’ sets for the purpose of predicting reorders and the rest into ‘prior’ set.Detailed information about orders like individual products in an order basket are given in op_train_df and op_prior_df. And information related to aisles, departments and products is furnished in separate dataframes.

Order Count:

Let’s find out the number of orders by day of week and hour of the day.

It is evident from the above charts that days 0 and 1 (surprisingly no information on what days they are) have higher number of orders than other days of week. We may assume they are weekends — Saturdays and Sundays.
Most of the orders are placed between 9 AM and 5 PM (office hours?).

We can combine the above data and get even clearer idea about the order timings. This is achieved by using groupby, aggregate and pivot functions of pandas on orders_df.

The busiest hours for Instacart are 1 PM — 4 PM on Days 0 (Saturdays?) and 10 AM — 11 AM on Days 1 (Sundays?).

Let us now pit departments against day of the week and see how order count varies.

It is apparent that dairy eggs and produce are busy throughout the week.

Orders Per Customer:

We are given between 4 and 100 orders of users in the dataset. Let’s plot it and see how the distribution looks like.

Majority of the users have placed 10-20 orders

Days Since Prior Order:

In orders_df, information on ‘days since prior order’ for a given order and user is given. Let’s plot it and see how the distribution looks like.

In the above chart, we can see peaks at 7 and 30. This indicates most grocery purchases are made on weekly and monthly basis.
Another peak at 14 means significant number of purchases are on fortnightly basis.

Train/Test/Prior Set:

As already mentioned, in orders_df, orders (total — 3,421,083) are categorised into train, test and prior brackets. Let us find out the exact breakup details.

Top-Selling Products, Aisles & Departments:

Now we concatenate op_train_df and op_prior_df and then ‘inner’ merge the resulting dataframe with products_df, aisles_df and departments_df. The resulting dataframe will have 33 million+ rows and 9 columns. Please note that orders which were categorised into ‘test’ set don’t have information regarding the individual products sold in them. So those numbers won’t reflect in the below calculations we perform.

Let us look at top selling products and their aisles and departments.

A word cloud of top 50 products:

**Top Selling Products** — Bigger the font, more the order count

Fresh fruits and vegetables are top selling products and interesting pattern is most of them are organic. The aisles and departments charts reflect the same trend.

No. of Products in Order Basket:

Next we shall calculate the number of products in order baskets. First ‘order_id’ and ‘add_to_cart_order’ are grouped together and the highest value of ‘add_to_cart_order’ for a given order_id ,which is equal to total number of products in a given order basket, is picked using pandas aggregate function. This distribution is plotted below.

It can be inferred that 4–7 products are the most common in order baskets and peaks at 5 products.

Re-ordering Patterns:

Now we look at re-ordering patterns. A very interesting insight is how ‘add-to-cart-number’ impacts whether a product will be reordered or not. Find that out below.

Milk, water, fresh fruits and eggs have >70% chance of re-ordering.

The usual suspects for high re-order percentage are unsurprisingly departments of dairy eggs, beverages and produce.
Personal care features at the last when it comes to re-ordering. This is no-brainer since cosmetics are very less likely to purchased often compared to food items.

Products which are initially put in the order basket are more likely to reordered. This is pretty obvious for us.
An interesting pattern we see is products which are put in the order basket at the end also have high reorder %. This may mean those products are placed on the way to checkout counters or near them (Kinder Joy, Dairy Milk, Snickers etc. can be spotted right adjacent to checkout counters.)

Re-order ratio doesn’t seem to vary much with day of the week.

Even with hour of the day, re-order % doesn’t vary much. Slight peak during 6 AM to 8 PM indicates products might be from dairy eggs department.

On an average 59% of items in an order basket are reordered ones.
12% of orders don’t have any reordered products. This also means 88% of orders have reordered products.

Complete code for the above analysis can be found on my github.