Instacart Market Basket Analysis: What is in your basket apart from Bananas?

Or bread, for that matter? In this blog post, we look at which products customers buy together, using a data mining technique called association rule mining to estimate which product B a customer is likely to buy given that they buy product A.

Without an ML model, we can say with 95% confidence that 99% of the readers here have received recommendations on Amazon. Haven’t you?

Association rule mining helps derive purchasing patterns from customers’ market baskets, in terms of which products they usually buy together. In this blog post we try to answer the following questions about customers’ product purchase patterns.

  • What are the most popular product categories in terms of number of orders?
  • When are most orders placed, i.e. when are users most active in making a purchase?
  • Which products are most frequently bought in pairs, i.e. which two products appear together most often in customer purchases?

We analysed the data from the Instacart (online grocery store) Market Basket Analysis competition, which is a relational set of files describing customers’ orders over time.

Let’s look at the entire product portfolio a customer can choose from:

Product Portfolio (Size proportional to number of products in each aisle and department)

Personal care, snacks and pantry are the departments with the highest number of products on offer, with vitamins supplements, candy chocolate and spices seasonings being the aisles with the highest number of products in each of those departments, respectively.

What are the top departments and aisles by number of orders?

Product Portfolio (Size proportional to number of orders of products in each aisle and department)

When does a customer do all his shopping?

Number of orders by day of the week
Number of orders by hour of the day
Number of orders by day of the week and hour of the day
  • Orders start to pick up at 8 am, as people wake up
  • More orders are placed on weekends (days 0 and 1) than on other days, since customers usually do their grocery shopping on weekends
  • Orders pick up rapidly from 10 am
  • Orders keep coming in through the morning, dip slightly at noon and begin to drop from 4 pm
  • 10 am to 4 pm are the busiest hours of the day in terms of number of orders received
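
The counts behind charts like these can be sketched in a few lines. The snippet below is only an illustration on a made-up list of (day of week, hour of day) pairs, one per order; the sample values and variable names are hypothetical and are not taken from the Instacart files.

```python
from collections import Counter

# Hypothetical sample: one (day_of_week, hour_of_day) pair per order.
sample_orders = [(0, 10), (0, 14), (1, 10), (1, 15), (3, 9), (0, 10), (5, 20)]

# Number of orders by day of the week.
orders_by_day = Counter(day for day, _ in sample_orders)

# Number of orders by hour of the day.
orders_by_hour = Counter(hour for _, hour in sample_orders)

# Number of orders by (day, hour) combination, as used for a heatmap.
orders_by_day_hour = Counter(sample_orders)

busiest_day, _ = orders_by_day.most_common(1)[0]
busiest_hour, _ = orders_by_hour.most_common(1)[0]
print(busiest_day, busiest_hour)
```

On real data, the same three counters would be filled from the day and hour recorded for each order.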

What are the most frequently bought products?

We want to know which two products are most frequently bought together, for which we will use association rule mining.

Association Rules Mining:
Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}. One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

support
This is the percentage of orders that contain the item set. Consider a small example with 5 orders in total, in which {apple,egg} occurs in 3 of them, so:

support{apple,egg} = 3/5 or 60%

The minimum support threshold required by the Apriori algorithm can be set based on knowledge of your domain. In this grocery dataset, for example, there could be thousands of distinct items and an order can contain only a small fraction of them, so setting the support threshold to 0.01% may be reasonable.
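
As a minimal sketch of the support calculation, the snippet below uses five hypothetical orders chosen only so that they match the counts quoted in the text (apple in 4 orders, egg in 3, both together in 3). The `support` helper and the order contents are illustrative assumptions, not part of the original analysis.

```python
from fractions import Fraction

# Hypothetical orders matching the counts in the text:
# 5 orders, apple in 4, egg in 3, {apple, egg} together in 3.
orders = [
    {"apple", "egg", "milk"},
    {"apple", "egg"},
    {"apple", "egg", "bread"},
    {"apple", "milk"},
    {"milk", "bread"},
]

def support(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for order in orders if itemset <= order)
    return Fraction(hits, len(orders))

# 0.01% expressed as a fraction, the threshold discussed in the text.
min_support = Fraction(1, 10000)

pair_support = support({"apple", "egg"}, orders)
keeps_pair = pair_support >= min_support
print(pair_support, keeps_pair)
```

`Fraction` is used instead of floats so that the result stays exactly 3/5 rather than a rounded decimal.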

confidence
Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that item A was purchased. This is expressed as:

confidence{A->B} = support{A,B} / support{A}

Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased:

confidence{B->A} = support{A,B} / support{B}

In our example, the percentage of times that egg is purchased, given that apple was purchased is:

confidence{apple->egg} = support{apple,egg} / support{apple}
= (3/5) / (4/5)
= 0.75 or 75%
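
The directionality of confidence is easy to check in code. This is a sketch on the same kind of hypothetical five-order example (apple in 4 orders, egg in 3, both in 3); the helper names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical orders matching the counts in the text.
orders = [
    {"apple", "egg", "milk"},
    {"apple", "egg"},
    {"apple", "egg", "bread"},
    {"apple", "milk"},
    {"milk", "bread"},
]

def count(itemset, orders):
    """Number of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for order in orders if itemset <= order)

def confidence(a, b, orders):
    """confidence{a->b} = support{a,b} / support{a}.

    The total number of orders cancels out, so raw counts are enough.
    Raises ZeroDivisionError if `a` never appears in any order.
    """
    return Fraction(count({a, b}, orders), count({a}, orders))

apple_to_egg = confidence("apple", "egg", orders)
egg_to_apple = confidence("egg", "apple", orders)
print(apple_to_egg, egg_to_apple)
```

In this toy data every order containing egg also contains apple, so confidence{egg->apple} is 1 while confidence{apple->egg} is 3/4.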

lift
Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items are occurring together in the same orders simply by chance (i.e. at random). Unlike the confidence metric, whose value may vary depending on direction (e.g. confidence{A->B} may be different from confidence{B->A}), lift has no direction. This means that lift{A,B} is always equal to lift{B,A}:

lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})

In our example, we compute lift as follows:

lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg}) = (3/5) / (4/5 * 3/5) = 1.25

In summary, lift can take on the following values:

  • lift = 1 implies no relationship between A and B (i.e. A and B occur together only by chance)
  • lift > 1 implies that there is a positive relationship between A and B (i.e. A and B occur together more often than random)
  • lift < 1 implies that there is a negative relationship between A and B (i.e. A and B occur together less often than random)

In our example, apple and egg occur together 1.25 times more than random, so we conclude that there exists a positive relationship between them.
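
The symmetry of lift can be verified with a short sketch. As before, the five orders are hypothetical and only chosen to reproduce the counts in the text; `lift` is an illustrative helper, not a library function.

```python
from fractions import Fraction

# Hypothetical orders matching the counts in the text.
orders = [
    {"apple", "egg", "milk"},
    {"apple", "egg"},
    {"apple", "egg", "bread"},
    {"apple", "milk"},
    {"milk", "bread"},
]

def support(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for order in orders if itemset <= order)
    return Fraction(hits, len(orders))

def lift(a, b, orders):
    """lift{a,b} = support{a,b} / (support{a} * support{b})."""
    return support({a, b}, orders) / (support({a}, orders) * support({b}, orders))

apple_egg = lift("apple", "egg", orders)
egg_apple = lift("egg", "apple", orders)
print(apple_egg, float(apple_egg), apple_egg == egg_apple)
```

Swapping the arguments only swaps the two factors in the denominator, which is why the two values always agree.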

Which products are most frequently bought in pairs, i.e. which two products appear together most often in customer purchases?

Most frequently bought items
  • The top item pairs bought together contain bananas, suggesting that customers often buy another fruit along with bananas
  • A purchase of avocado is often accompanied by a purchase of another vegetable or fruit, such as banana or spinach
Item Pairs with lift>3
  • Organic Strawberry Cottage cheese and Organic cottage cheese blueberry have the highest lift, as people tend to buy different flavors of the same product
  • The same trend shows up in item pairs of Unsweetened Whole Milk products
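
A ranking of item pairs by lift, like the one summarised above, could be produced along the following lines. This is a brute-force sketch that counts every pair directly rather than running Apriori, and it is not necessarily how the charts in this post were generated; the `pair_rules` function, its output keys and the toy orders are all hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def pair_rules(orders, min_support=Fraction(0)):
    """Return support, confidence and lift for every item pair, sorted by lift."""
    n = len(orders)
    # How many orders each single item appears in.
    item_counts = Counter(item for order in orders for item in set(order))
    # How many orders each unordered pair appears in (items sorted for a stable key).
    pair_counts = Counter(
        pair for order in orders for pair in combinations(sorted(set(order)), 2)
    )
    rules = []
    for (a, b), n_ab in pair_counts.items():
        pair_support = Fraction(n_ab, n)
        if pair_support < min_support:
            continue  # drop rare pairs before ranking
        rules.append({
            "pair": (a, b),
            "support": pair_support,
            "confidence_a_to_b": Fraction(n_ab, item_counts[a]),
            "confidence_b_to_a": Fraction(n_ab, item_counts[b]),
            # support{a,b} / (support{a} * support{b}), written with counts
            "lift": Fraction(n_ab * n, item_counts[a] * item_counts[b]),
        })
    return sorted(rules, key=lambda r: r["lift"], reverse=True)

# Hypothetical toy orders, not Instacart data.
toy_orders = [
    {"apple", "egg", "milk"},
    {"apple", "egg"},
    {"apple", "egg", "bread"},
    {"apple", "milk"},
    {"milk", "bread"},
]

rules = pair_rules(toy_orders, min_support=Fraction(2, 5))
for rule in rules:
    print(rule["pair"], rule["support"], rule["lift"])
```

Counting all pairs directly is fine for a toy example; on a dataset with thousands of products, a minimum support threshold matters because it prunes the large number of rare pairs before any ranking is done.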

Conclusion

We see that the top associations are not surprising, with one flavor of an item being purchased with another flavor from the same item family (e.g. Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, Chicken Cat Food with Turkey Cat Food, etc.).

As mentioned, one common application of association rule mining is in the domain of recommender systems. Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales and, hopefully, along the way, also introduce customers to items they never would have tried before or even imagined existed!
