A Gentle Introduction on Market Basket Analysis — Association Rules

Susan Li
Towards Data Science
Sep 25, 2017
Source: UofT

Introduction

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association rules are widely used to analyze retail basket or transaction data. They aim to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.

An example of Association Rules

  • Assume there are 100 customers
  • 10 of them bought milk, 8 bought butter and 6 bought both of them.
  • bought milk => bought butter
  • support = P(Milk & Butter) = 6/100 = 0.06
  • confidence = support/P(Milk) = 0.06/0.10 = 0.60
  • lift = confidence/P(Butter) = 0.60/0.08 = 7.5

Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
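As a quick sanity check, the three measures for the rule "bought milk => bought butter" can be computed directly in R. This is only a minimal sketch: the variable names are made up for illustration, and the counts are the ones from the example.

```r
# Counts from the example (variable names are illustrative)
n_customers <- 100
n_milk      <- 10   # customers who bought milk
n_butter    <- 8    # customers who bought butter
n_both      <- 6    # customers who bought both

# Rule: bought milk => bought butter
support    <- n_both / n_customers                   # P(Milk & Butter)
confidence <- support / (n_milk / n_customers)       # P(Butter | Milk)
lift       <- confidence / (n_butter / n_customers)  # confidence relative to P(Butter)

round(c(support = support, confidence = confidence, lift = lift), 2)
```

A lift well above 1 means milk buyers purchase butter far more often than customers in general, which is what makes the rule interesting.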

OK, enough theory. Let’s get to the code.

The dataset we are using today comes from the UCI Machine Learning Repository. The dataset is called “Online Retail” and can be found here. It contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retailer.

Load the packages

library(tidyverse)
library(readxl)
library(knitr)
library(ggplot2)
library(lubridate)
library(arules)
library(arulesViz)
library(plyr)

Data preprocessing and exploring

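# Read the raw data and drop rows with missing values
retail <- read_excel('Online_retail.xlsx')
retail <- retail[complete.cases(retail), ]
# Treat descriptions and countries as categorical
retail <- retail %>% mutate(Description = as.factor(Description))
retail <- retail %>% mutate(Country = as.factor(Country))
# Split the invoice timestamp into separate date and time columns
retail$Date <- as.Date(retail$InvoiceDate)
retail$Time <- format(retail$InvoiceDate,"%H:%M:%S")
# Invoice numbers that are not purely numeric (e.g. cancellations prefixed with "C") become NA here
retail$InvoiceNo <- as.numeric(as.character(retail$InvoiceNo))
glimpse(retail)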

After preprocessing, the dataset includes 406,829 records and 10 fields: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country, Date, Time.

What time do people often purchase online?

In order to find the answer to this question, we need to extract “hour” from the time column.

retail$Time <- as.factor(retail$Time)
a <- hms(as.character(retail$Time))
retail$Time <- hour(a)
retail %>%
  ggplot(aes(x=Time)) +
  geom_bar(fill="indianred")
Figure 1. Shopping time distribution

There is a clear relationship between the hour of day and order volume. Most orders happened between 10:00 and 15:00.
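If you would rather skip the round trip through a time string, the hour can also be pulled straight from a date-time value. Here is a minimal base-R sketch; the timestamps are made up and only stand in for retail$InvoiceDate.

```r
# Hypothetical timestamps standing in for retail$InvoiceDate
invoice_time <- as.POSIXct(c("2010-12-01 08:26:00",
                             "2010-12-01 12:05:00",
                             "2010-12-01 12:47:00"), tz = "UTC")

# Extract the hour of day as an integer
invoice_hour <- as.integer(format(invoice_time, "%H"))

# Count orders per hour of day
table(invoice_hour)
```

lubridate's hour() can be called on a date-time in the same way, so either route should give the same hourly counts.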

How many items does each customer buy?

detach("package:plyr", unload=TRUE)
retail %>%
  group_by(InvoiceNo) %>%
  # mean(Quantity) is the average quantity per invoice line
  summarize(n_items = mean(Quantity)) %>%
  ggplot(aes(x=n_items)) +
  geom_histogram(fill="indianred", bins = 100000) +
  geom_rug() +
  coord_cartesian(xlim=c(0,80))
Figure 2. Number of items per invoice distribution

Most invoices show fewer than 10 items, measured here as the mean quantity per invoice line.
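The mean of Quantity is only one way to measure the size of an invoice. As a small sketch on made-up invoice lines (the data frame and column names below are hypothetical, not the real data), here is how it compares with the number of lines and the total units per invoice:

```r
library(dplyr)

# Hypothetical invoice lines (not the real data)
toy_lines <- data.frame(
  InvoiceNo = c(1, 1, 1, 2),
  Quantity  = c(2, 4, 6, 10)
)

per_invoice <- toy_lines %>%
  group_by(InvoiceNo) %>%
  dplyr::summarize(
    mean_qty    = mean(Quantity),  # average quantity per invoice line
    n_lines     = n(),             # number of lines (distinct products) on the invoice
    total_units = sum(Quantity)    # total units bought on the invoice
  )

per_invoice
```

The explicit dplyr:: prefix is a precaution: if plyr happens to be attached after dplyr, its summarize() masks the dplyr one and ignores the grouping.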

Top 10 best sellers

tmp <- retail %>% 
group_by(StockCode, Description) %>%
summarize(count = n()) %>%
arrange(desc(count))
tmp <- head(tmp, n=10)
tmp
tmp %>%
ggplot(aes(x=reorder(Description,count), y=count))+
geom_bar(stat="identity",fill="indianred")+
coord_flip()
Figure 3. Top 10 best sellers

Association rules for online retailer

Before using any rule mining algorithm, we need to transform the data from the data frame format into transactions, so that all the items bought together end up in one row. For example, this is the format we need:

Source: Microsoft
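# Sort by customer, then collapse each customer's purchases on a given date into one string
retail_sorted <- retail[order(retail$CustomerID),]
library(plyr)
itemList <- ddply(retail_sorted, c("CustomerID","Date"),
                  function(df1) paste(df1$Description, collapse = ","))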

The function ddply() accepts a data frame, splits it into pieces based on one or more factors, computes on the pieces, and then returns the results as a data frame. We use “,” to separate different items.
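To see the split, compute, and combine idea on something small, here is a minimal sketch that uses base R's aggregate() instead of ddply(). The toy data frame is made up for illustration, and aggregate() is a stand-in for the same grouping step, not the code used in this post.

```r
# Hypothetical purchases (not the real data)
toy_purchases <- data.frame(
  CustomerID  = c(1, 1, 1, 2),
  Date        = as.Date(c("2010-12-01", "2010-12-01", "2010-12-02", "2010-12-01")),
  Description = c("MILK JUG", "BUTTER DISH", "TEA TIN", "MILK JUG"),
  stringsAsFactors = FALSE
)

# Split by customer and date, then paste each group's items into one string
baskets <- aggregate(Description ~ CustomerID + Date, data = toy_purchases,
                     FUN = paste, collapse = ",")
baskets
```

Each row of the result is one basket: everything a customer bought on a given date, joined by commas. One thing to watch for is a product description that itself contains a comma, since it would later be read as two separate items.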

We only need item transactions, so remove the CustomerID and Date columns.

itemList$CustomerID <- NULL
itemList$Date <- NULL
colnames(itemList) <- c("items")

Write the data frame to a CSV file and check whether our transaction format is correct.

# row.names = FALSE keeps the row numbers out of the file, so they are not read back as items
write.csv(itemList,"market_basket.csv", quote = FALSE, row.names = FALSE)

Perfect! Now we have our transaction dataset, and it shows the matrix of items being bought together. We don’t actually see how often they are bought together, and we don’t see rules either. But we are going to find out.

Let’s have a closer look at how many transactions we have and what they are.

tr <- read.transactions('market_basket.csv', format = 'basket', sep=',')
tr
summary(tr)

We see 19,296 transactions, and this is the number of rows as well. There are 7,881 items — remember items are the product descriptions in our original dataset. Transactions here are the collections or subsets of these 7,881 items.

The summary gives us some useful information:

  • density: The proportion of non-empty cells in the sparse matrix. In other words, the total number of items purchased divided by the total number of cells in the matrix (transactions × items). We can use the density to calculate how many items were purchased: 19,296 × 7,881 × 0.0022 ≈ 334,558.
  • The most frequent items should be the same as our results in Figure 3.
  • Looking at the size of the transactions: 2247 transactions were for just 1 item, 1147 transactions for 2 items, all the way up to the biggest transaction: 1 transaction for 420 items. This indicates that most customers buy a small number of items in each transaction.
  • The distribution of the data is right skewed.
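The density figure is easy to reproduce by hand. Below is a minimal sketch on four made-up baskets (the object names are hypothetical), followed by the same arithmetic applied to the numbers reported in our summary.

```r
library(arules)

# Hypothetical baskets (not the real data)
toy_trans <- as(list(c("A", "B"), c("A"), c("A", "B", "C"), c("C")), "transactions")

n_trans <- length(toy_trans)               # rows of the sparse matrix
n_items <- length(itemLabels(toy_trans))   # columns of the sparse matrix
sizes   <- size(toy_trans)                 # number of items in each transaction

# Density = filled cells / all cells
toy_density <- sum(sizes) / (n_trans * n_items)
toy_density

# Going the other way with the figures from the summary:
# transactions x items x density = approximate number of purchased items
19296 * 7881 * 0.0022
```

The table of sizes in the summary is the same information as sizes here, just tabulated: how many transactions contain 1 item, 2 items, and so on.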

Let’s have a look at the item frequency plot, which should align with Figure 3.

itemFrequencyPlot(tr, topN=20, type='absolute')
Figure 4. A bar plot of the absolute frequency of the 20 most frequent items bought.

Create some rules

  • We use the Apriori algorithm in the arules package to mine frequent itemsets and association rules. The algorithm employs a level-wise search for frequent itemsets.
  • We pass supp=0.001 and conf=0.8 to return all the rules that have a support of at least 0.1% and confidence of at least 80%.
  • We sort the rules by decreasing confidence.
  • Have a look at the summary of the rules.
rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8))
rules <- sort(rules, by='confidence', decreasing = TRUE)
summary(rules)

The summary of the rules gives us some very interesting information:

  • The number of rules: 89,697.
  • The distribution of rules by length: a length of 6 items has the most rules.
  • The summary of quality measures: ranges of support, confidence, and lift.
  • The information on data mining: total data mined, and the minimum parameters we set earlier.
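To see what a rule set looks like at a size that can be read in full, here is a minimal sketch of the same apriori() call on four made-up baskets. The baskets and the thresholds are chosen purely for illustration; they are far too small to say anything about real data.

```r
library(arules)

# Hypothetical baskets (not the real data)
toy_baskets <- list(
  c("milk", "butter"),
  c("milk", "butter", "bread"),
  c("bread"),
  c("milk", "bread")
)
toy_tr <- as(toy_baskets, "transactions")

# Keep rules with at least 2 items, support >= 40% and confidence >= 60%
toy_rules <- apriori(
  toy_tr,
  parameter = list(supp = 0.4, conf = 0.6, minlen = 2),
  control   = list(verbose = FALSE)
)

# Show every rule, strongest lift first
inspect(sort(toy_rules, by = "lift"))
```

Setting minlen = 2 drops rules with an empty left-hand side, which otherwise show up whenever a single item is frequent enough on its own.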

We have 89,697 rules. I don’t want to print them all, so let’s inspect the top 10.

inspect(rules[1:10])

The interpretation is pretty straightforward:

  • 100% of customers who bought “WOBBLY CHICKEN” also bought “DECORATION”.
  • 100% of customers who bought “BLACK TEA” also bought “SUGAR JAR”.

And plot these top 10 rules.

topRules <- rules[1:10]
plot(topRules)
plot(topRules, method="graph")
plot(topRules, method = "grouped")

Summary

In this post, we have learned how to perform Market Basket Analysis in R and how to interpret the results. If you want to implement this in Python, Mlxtend is a Python library that has an implementation of the Apriori algorithm for this sort of application. You can find an introductory tutorial here.

If you would like the R Markdown file used to make this blog post, you can find it here.

Reference: R and Data Mining
