Basket Analysis Introduction | Instacart Dataset

Mert Barbaros
Published in Star Gazers
Mar 30, 2021 · 12 min read

I. PROJECT DEFINITION

This project aims to define and categorize behavioral marketing campaigns (BMCs) based on a literature search and my professional experience, and then to create a framework for assigning these BMCs to the customer base in the retail industry. For this purpose, I used the Instacart market basket analysis dataset, which is available on Kaggle [1]. In this context, I will analyze consumer behavior in an e-retailer setting.

Hawkins and Mothersbaugh defined consumer behavior as the study of individuals, groups or organizations and the processes they use to select, secure, use and dispose of products, services, experiences and ideas to satisfy needs, and the impacts that these processes have on the consumer and society [2]. As a science, ethology analyzes animal behavior in the animal’s natural habitat. In the same spirit, I will treat an e-commerce store, Instacart, as the consumers’ natural habitat. In this habitat, Instacart users performed several behavioral actions during the dataset year, 2017. Those actions tell us about their habits and product usage patterns, and according to those patterns we will assign several behavioral campaigns to each user. Behavioral campaigns are campaigns that aim to trigger certain customer actions in line with the goals of the company, which is Instacart in this project. Finally, I aim to create a framework for behavioral campaign creation and assignment for the retail industry.

Notes:

You can find the Jupyter Notebook on my GitHub page.

II. PROJECT INTRODUCTION

The main motivation of this project is to analyze customers in a behavioral setting and to plan marketing efforts for each customer based on their habits and product usage patterns. There are multiple well-known campaign KPIs (Key Performance Indicators) and campaign types in the retail industry. However, creating campaigns based on behavioral data for each user is a personalization problem, and I will try to explore efficient methods for it in this project.

I used an open dataset called Instacart Market Basket Analysis, which is available on Kaggle [1]. Instacart is an online grocery delivery platform that operates in multiple cities in the U.S. The company’s business model is delivering grocery items from small and major stores to customers on demand. Instacart created a Kaggle competition in 2018 and shared its 2017 order data, which includes three million customer orders. I used this data to perform my analyses. In this project I worked on five tables with the following information:

TABLE 1: Initial Features of Instacart 2017 dataset

TABLE 2: Initial Features of Columns of Instacart 2017 dataset

Notice that the dataset does not include any monetary information, which is the main limitation of this project. For this reason, we will not include full monetary analyses. The dataset includes nearly fifty thousand products; however, I will add some prices to products according to the Instacart website for partial monetary analyses. In the data pre-processing section, you will see data denormalization and generation operations. Those operations increase the number of columns in the dataset.

III. PROJECT LITERATURE

Hawkins and Mothersbaugh defined consumer behavior as the study of individuals, groups or organizations and the processes they use to select, secure, use and dispose of products, services, experiences and ideas to satisfy needs, and the impacts that these processes have on the consumer and society [2]. In retail, we mostly encounter behavioral analysis in customer relationship management (CRM) applications and loyalty marketing programs. However, for a retailer, behavioral analysis is also an important element of a sustainable growth strategy. In this project, I will review behavioral analysis and marketing campaigns as a part of growth strategies, especially scaling strategies. Growth is a way to tell what a business wants to do in accordance with its goals and objectives [3]. Companies can grow through scaling (organic growth), market entry, acquisition or innovation. Growth through scaling is a type of organic growth [3]; it describes a company’s efforts to increase its revenue by doing more of the same business it is already doing, without significantly increasing resources or costs [4].

If a company is growing through a successful scaling strategy that uses behavioral marketing campaigns, we should measure its performance accurately. The literature offers different approaches to measuring a retailer’s performance, including but not limited to the following:

  1. Recency, Frequency and Monetary (RFM): Recency is the number of days since the customer made his/her last purchase. Frequency is the number of orders divided by the number of customers during the specified period (week, month, year, etc.). Monetary value is the revenue generated by the customer. RFM is a simple measure obtained by multiplying the recency, frequency and monetary value of the customer [6]. Please note that each element of RFM is itself a key performance indicator for a retailer.
  2. Customer Retention: A measure of the number of customers that a company continues to do business with over a given period of time [7].
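
To make the recency and frequency definitions above concrete, here is a minimal sketch in plain Python. The toy orders, the customer IDs and the reference date are all hypothetical and are not taken from the Instacart data. Monetary value is left out because it needs revenue figures.

```python
from datetime import date

# Hypothetical toy orders: (customer_id, order_date). Not Instacart data.
orders = [
    ("a", date(2017, 1, 5)),
    ("a", date(2017, 3, 1)),
    ("b", date(2017, 2, 10)),
    ("b", date(2017, 2, 20)),
    ("b", date(2017, 3, 20)),
    ("c", date(2017, 1, 15)),
]
reference_date = date(2017, 3, 31)  # assumed "today" for the recency calculation


def recency_by_customer(orders, reference_date):
    """Days between the reference date and each customer's latest order."""
    last_order = {}
    for customer, day in orders:
        if customer not in last_order or day > last_order[customer]:
            last_order[customer] = day
    return {c: (reference_date - d).days for c, d in last_order.items()}


def purchase_frequency(orders):
    """Number of orders divided by the number of distinct customers."""
    customers = {customer for customer, _ in orders}
    return len(orders) / len(customers)


recency = recency_by_customer(orders, reference_date)
frequency = purchase_frequency(orders)
print(recency)
print(frequency)
```

Keeping recency per customer and frequency per period mirrors the definitions above, so each value can also be tracked as a separate KPI.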

Notice that I limited the performance indicators to those our dataset supports. However, each element can be adapted to a retailer’s core business channels and dimensions, such as mobile application, website, mobile website, product, department, and so on. Another important element of performance indicators is the customer journey. The customer journey defines the life of the customer, which starts with the initial purchase and ends with the final order. According to Hillstrom, for a vast majority of businesses the customer journey ends after the first purchase. In his book, he defines loyalty clusters based on the probability that the customer purchases again in the following year. Based on his research, loyalty A customers have an eighty percent or greater chance of buying again, loyalty B customers sixty percent or more, loyalty C customers forty percent or more, and loyalty F customers a zero to twenty percent chance. For most businesses, most customers will be in cluster F and few will be in cluster A [8]. One of the most challenging operations for a retailer is upgrading customers from cluster F to cluster A.
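
The cluster thresholds quoted above can be written as a small lookup. This is only a sketch: the function name is hypothetical, the repurchase probability is assumed to be estimated elsewhere, and the band between twenty and forty percent is returned as unassigned because the text above does not describe it.

```python
def loyalty_cluster(repurchase_probability):
    """Map a next-year repurchase probability (0.0 to 1.0) to a loyalty cluster.

    Thresholds follow the text: A >= 80%, B >= 60%, C >= 40%, F = 0-20%.
    """
    p = repurchase_probability
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p >= 0.8:
        return "A"
    if p >= 0.6:
        return "B"
    if p >= 0.4:
        return "C"
    if p <= 0.2:
        return "F"
    return None  # the 20-40% band is not described in the text above


for probability in (0.85, 0.65, 0.45, 0.10, 0.30):
    print(probability, loyalty_cluster(probability))
```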

Customers move up from one cluster to another through behavioral marketing campaigns. Cuthbertson and Laine classify behavioral marketing campaigns, as loyalty marketing strategies, into five distinct categories:

1) Pure Loyalty Strategies: Primarily aimed at existing customers and focus on the retailer’s product and service offer.

2) Push Loyalty Strategies: Primarily aimed at pushing customers towards the retailer and focus on the retail location or channel.

3) Pull Loyalty Strategies: Primarily aimed at pulling customers towards the retailer or particular purchases and focus on retailer promotion.

4) Purchase Loyalty Strategies: Primarily aimed at increasing the number and value of purchase transactions, regardless of which individual retailer benefits.

5) Purge Loyalty Strategies: Primarily aimed at the retailer purging all unnecessary costs and focus on providing all customers with the lowest possible price. [5]

Cuthbertson and Laine analyzed the importance of each CRM practice for each loyalty strategy. The following list shows the analyzed CRM practices:

The relative importance of CRM practices changes from strategy to strategy. Pure loyalty strategies focus on the existing customer base and are not concerned with customer acquisition, whereas for the other strategies customer acquisition becomes increasingly important. We encounter purchase loyalty strategies in coalition programs such as Nectar in the U.K. Push loyalty strategies are often adopted by retailers whose core product range is largely considered interchangeable with those of competitors, and for whom the visibility and accessibility of the retail brand is crucial in pushing customers towards the retailer. Pull loyalty strategies are often adopted by retailers whose core product range is likewise largely considered interchangeable with those of competitors, and for whom the promotion of the retail brand, often through association with other brands, is crucial in pulling customers towards the retailer. Pure loyalty strategies are often adopted by retailers for whom the physical exchange of product and service with a particular retailer is crucial. The service provided by store staff plays a critical role in such relationships; department stores are an example [5].

In that setting, a pull loyalty strategy is the most appropriate for Instacart because of its business model, and tailored promotions are critical for successful growth. We can create tailored campaigns based on purchase value, frequency, recency (to decrease the churn ratio), cross-sell and up-sell constraints. In this project we will apply those types of constraints to assign campaign types to users.

IV. DATA PRE-PROCESSING

I started the data pre-processing with missing-data handling. In the “orders” table, the “days_since_prior_order” column had 206,209 NaN values. These values are missing because the corresponding orders were each customer’s first order, so there is no prior order to measure from. This is why I filled those NaN values with zero.
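
As a rough illustration of this step, the sketch below fills the missing values in plain Python. It is not the notebook’s code. The rows are hypothetical, `None` stands in for NaN, and the `user_id` and `order_number` fields are assumed from the public dataset schema.

```python
# Hypothetical rows shaped like the "orders" table; None stands in for NaN.
orders = [
    {"order_id": 1, "user_id": 1, "order_number": 1, "days_since_prior_order": None},
    {"order_id": 2, "user_id": 1, "order_number": 2, "days_since_prior_order": 7.0},
    {"order_id": 3, "user_id": 2, "order_number": 1, "days_since_prior_order": None},
]


def fill_first_order_gaps(rows, fill_value=0.0):
    """Return new rows with missing days_since_prior_order replaced by fill_value."""
    filled = []
    for row in rows:
        days = row["days_since_prior_order"]
        # Copy the row so the input list is left untouched.
        filled.append({**row, "days_since_prior_order": fill_value if days is None else days})
    return filled


filled = fill_first_order_gaps(orders)
missing_before = sum(row["days_since_prior_order"] is None for row in orders)
print(missing_before, [row["days_since_prior_order"] for row in filled])
```

One trade-off of filling with zero is that a first order becomes indistinguishable from a same-day reorder, so it can help to keep the order number alongside the filled value.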

After handling the NaN values, I started the denormalization operations on the dataset. In the “orders” table, “order_dow” holds the number of the day on which the order was placed, for example zero for Sunday. I created a new column called “order_day” holding the actual day name for clearer analysis. Similarly, “order_hour_of_day” in the “orders” table holds the hour of the day at which the order was placed. I segmented the hours (for example, the hours between twelve and eighteen are the afternoon) and created a new column called “day_segment”. After the pre-processing operations, you can see the first five rows of the “orders” table in Table 4:

In the “products” table, there were NaN values in the product name column. I dropped the rows with those NaN values.
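
The two derived columns on the “orders” table, “order_day” and “day_segment”, can be sketched as follows. Only “zero for Sunday” and “twelve to eighteen is the afternoon” come from the text. The other segment boundaries, the half-open treatment of hour eighteen and the helper names are my assumptions for illustration.

```python
# 0 = Sunday comes from the text; the remaining days are assumed to follow in order.
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def day_segment(hour):
    """Map an hour (0-23) to a day segment.

    Only the afternoon range (12 up to, but not including, 18) follows the text;
    the night, morning and evening boundaries are assumed.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def denormalize(order):
    """Return a copy of the order with order_day and day_segment added."""
    return {
        **order,
        "order_day": DAY_NAMES[order["order_dow"]],
        "day_segment": day_segment(order["order_hour_of_day"]),
    }


sample = {"order_id": 1, "order_dow": 0, "order_hour_of_day": 13}
print(denormalize(sample))
```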

V. DATA ANALYSIS & VISUALIZATION

In this part of the project, I visualized the performance indicators relevant to the project. Frequency is a highly important behavioral metric for pull loyalty strategies. It shows the number of orders divided by the number of customers in a given time period. Before calculating the purchase frequency, it is useful to see how many customers placed each number of orders.

Figure 1: Number of Orders vs. Number of Customers
Figure 2: Number of Orders vs. Number of Customers Making These Orders, Frequency Bins

The data clearly shows that, in 2017, Instacart users made a minimum of four and a maximum of one hundred purchases. As in Hillstrom’s theory [8], the majority of customers are at the minimum number of purchases, which is four. Because the data spreads over a long range, it is wise to bin the number of orders. Looking at the number of orders, it is appropriate to create bins with the following edges: 3, 10, 21, 31, 41, 51, 61, 71, 81, 91, 101

When we turn the number of orders into frequency bins, we get a clearer picture of Instacart’s frequency performance. 50.68% of the customers are in the 3–10 orders interval, and only 8.25% of customers have more than forty orders during 2017.
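
The binning step can be sketched with the standard library. The edges are the ones listed above. The right-inclusive interval convention (an order count of 10 falls in the first bin), the label format and the toy order counts are assumptions, so the labels start at 4 rather than 3.

```python
from bisect import bisect_left
from collections import Counter

# Bin edges as listed in the text.
EDGES = [3, 10, 21, 31, 41, 51, 61, 71, 81, 91, 101]


def bin_label(order_count, edges=EDGES):
    """Label the interval (low, high] that contains order_count (assumed convention)."""
    index = bisect_left(edges, order_count) - 1
    if index < 0 or index >= len(edges) - 1:
        raise ValueError(f"{order_count} is outside the bin range")
    return f"{edges[index] + 1}-{edges[index + 1]}"


def bin_shares(order_counts):
    """Percentage of customers that fall in each bin."""
    counts = Counter(bin_label(n) for n in order_counts)
    total = len(order_counts)
    return {label: 100 * count / total for label, count in counts.items()}


# Hypothetical number of orders per customer, not Instacart data.
orders_per_customer = [4, 4, 5, 10, 11, 25, 45, 100]
shares = bin_shares(orders_per_customer)
print(shares)
```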

In this setting, we can say that the company shows a fairly standard retailer profile and needs behavioral campaigns that aim to increase the number of orders of current customers.

Creating campaigns based on the day and time of day is popular in the retail industry. In such campaigns, the retailer generally pursues two targets. The first is increasing orders during the idle time of the e-commerce store, and the second is increasing the number of orders from customers whose pattern matches the day and time. For this purpose, I created the day and time analysis of customer purchases shown in Figure 3.

Figure 3: Number of Orders According to Date

Instacart can create time-based campaigns according to the days of the week and hours of the day. For instance, night orders are low on every day of the week. Also, orders decrease as we get closer to the end of the week, on Friday and Saturday.
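
One simple way to look for idle time is to count orders per day-and-hour slot and list the quietest slots. The sketch below assumes hypothetical (order_dow, order_hour_of_day) pairs. The function names are made up for illustration, and slots with no orders at all are counted as zero.

```python
from collections import Counter


def orders_per_slot(rows):
    """Count orders for each (day of week, hour of day) pair."""
    return Counter((dow, hour) for dow, hour in rows)


def quietest_slots(counts, n=3):
    """Return the n slots with the fewest orders, including slots with none."""
    all_slots = [(dow, hour) for dow in range(7) for hour in range(24)]
    # A Counter returns 0 for missing keys, so empty slots sort first.
    return sorted(all_slots, key=lambda slot: (counts[slot], slot))[:n]


# Hypothetical (order_dow, order_hour_of_day) pairs, not Instacart data.
rows = [(0, 10), (0, 10), (1, 14), (5, 2)]
counts = orders_per_slot(rows)
quiet = quietest_slots(counts)
print(counts[(0, 10)], quiet)
```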

Recency is the number of days that have passed since the customer’s last purchase. In our “orders” table, the “days_since_prior_order” column holds this value. If we take the maximum of this column for each customer, we get that customer’s maximum recency. When we create a histogram of it:

Figure 4: Maximum Recency vs. Number of Customers

We can clearly see that, for the majority of customers, the maximum recency is 30 days; that is, at some point they waited 30 days before making their next purchase.
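
The maximum recency per customer and its histogram can be sketched like this. The (user_id, days_since_prior_order) pairs are hypothetical, and first orders are assumed to be filled with zero already, as in the pre-processing section.

```python
from collections import Counter


def max_recency_per_user(rows):
    """Largest days_since_prior_order value seen for each user."""
    result = {}
    for user_id, days in rows:
        result[user_id] = max(result.get(user_id, 0), days)
    return result


# Hypothetical (user_id, days_since_prior_order) pairs, not Instacart data.
rows = [(1, 0), (1, 7), (1, 30), (2, 0), (2, 14), (3, 0), (3, 30), (3, 5)]
max_recency = max_recency_per_user(rows)

# Histogram: how many users share each maximum recency value.
histogram = Counter(max_recency.values())
print(max_recency)
print(sorted(histogram.items()))
```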

Figure 5: Top 20 Products

Similarly, we can see the most popular aisles:

Figure 6: Top 20 Aisles

Both the popular products and the popular aisles are from the fruit & vegetable category. The data implies that short shelf-life categories are the most frequently ordered because of their short lifetime. We can see the order performance of products, aisles and departments in the following figure.

Figure 7: Top 20 Products, Aisles & Departments

In Figure 7, we can see that the produce and dairy eggs departments were the best-performing departments during 2017. If we look at the departments more closely, we can see each department’s share of orders:

Figure 8: Departments Performance

29.2% of orders come from the produce category and 16.7% of orders come from the dairy eggs category. Both categories include daily consumption products. Instacart has snack and beverage alternatives for all consumer types (health sensitive or not). It is very clear that appropriate campaigns can create sales opportunities for the company.

Figure 9: Departments Reorder Pattern

Figure 9 shows that each department’s reorder pattern is related to users’ consumption habits. For example, in the snacks, dairy eggs, and bakery departments, biweekly consumption is high. However, weekly orders are the most popular category in every department. The figure suggests that, with appropriate campaign types, biweekly customers can develop a weekly consumption habit in each department. Another important question is how consumers’ order behavior changes according to the sequence in which they add products to their cart. In the following table, you will find the reorder pattern of customers based on their add-to-cart sequence, for up to 20 products in the basket.

TABLE 5: Reorder pattern based on add to cart sequence

It is very clear that consumers add their urgent/necessary products to the basket first, and then add products that are not part of their habitual behavior or not on their shopping list. In such scenarios, personalized recommendation systems are handy for marketers.
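
A reorder rate per add-to-cart position, as summarized above, can be computed along these lines. The column names “add_to_cart_order” and “reordered” are assumed from the public dataset schema, the rows are hypothetical, and the cut-off of 20 positions follows the text.

```python
from collections import defaultdict


def reorder_rate_by_position(rows, max_position=20):
    """Share of reordered products for each add-to-cart position up to max_position."""
    totals = defaultdict(int)
    reorders = defaultdict(int)
    for position, reordered in rows:
        if position > max_position:
            continue  # ignore positions beyond the cut-off
        totals[position] += 1
        reorders[position] += reordered
    return {position: reorders[position] / totals[position] for position in sorted(totals)}


# Hypothetical (add_to_cart_order, reordered) pairs; reordered is 1 or 0.
rows = [(1, 1), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (25, 1)]
rates = reorder_rate_by_position(rows)
print(rates)
```

If the rate falls as the position grows, that is consistent with habitual products being added to the cart first.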

In this part, I tried to analyze and visualize the consumer behavior and company order performance based on features in the dataset.

VI. CONCLUSION

Based on the analysis of the dataset, I identified several problems and improvement points:

1. According to the dataset, the company is performing well in terms of the number of orders per user. Users made a minimum of 4 orders during 2017. However, this is very unusual for the retail industry. The dataset gives no explanation about discarded users, but in general we expect to find 30–50% one-time consumers in such datasets. Here, 50% of consumers made between 3 and 10 orders in 2017, and 27% made between 11 and 20. This shows that the company can create activation marketing campaigns based on product recommendation algorithms and decrease the share of customers with 3–10 orders.

2. Time-limited campaigns can increase the number of orders. According to the dataset, the order share of night and evening hours was small compared with other hours during 2017. Based on a profitability analysis, the company can create time-limited campaigns for such hours.

3. The best-performing product and department categories are daily consumption products. For those products, as for other products, the weekly reorder pattern has the greatest share. The company can create time-limited campaigns for consumers who make monthly purchases in such categories.

4. According to the dataset, consumers add their urgent/necessary products to their carts first, and then they start to look around for other products. Better recommendation performance can increase the number of orders in the long term.

Because the dataset does not include any revenue feature, it is impossible to carry out a revenue-based analysis of the user base or to create customer upgrade or ticket price upgrade campaigns based on this data. However, this work can be extended with clustering algorithms, with which we can match consumers with appropriate behavioral campaigns to increase the number of orders and change consumer behavior according to our agenda.

Further Readings & References

[1] Instacart, “Instacart Market Basket Analysis (dataset, accessed January, 2021),” in Kaggle website, 2017

[2] D. I. Hawkins, D. L. Mothersbaugh, “Consumer Behavior Building Marketing Strategy (Book),” 17th ed., McGraw-Hill, 2010, pp. 6–7

[3] A. Ilhan, Y. Durmaz, “Growth Strategies In Businesses and A Theoretical Approach (Article),” 2015

[4] University of Virginia, Coursera, “Business Growth Strategy (Online Course, accessed October 2020),” 2019

[5] R. Cuthbertson, A. Laine, “The Role of CRM Within Retail Loyalty Marketing (Article),” Journal of Targeting, Measurement and Analysis for Marketing Vol. 12, 3, 2003, pp. 290–304

[6] H. Kim, Y. Kim, “A CRM performance measurement framework: Its development process and application (Article),” Industrial Marketing Management 38, 2009 pp. 477–489

[7] B. Baer, “How to Calculate Your Customer Retention Rate (Article),” Zendesk The Library Website, 2020

[8] K. Hillstrom, “Hillstrom’s Loyalty: Measuring Why it is So Hard to Grow a Business via Loyal Customers (Book),” CreateSpace Independent Publishing Platform, 2015, pp. 7–45
