Association Rules: Unsupervised Learning in Retail

Manil Wagle
10 min read · Mar 25, 2020
1. Introduction to Association Rules

Association rule mining is unsupervised learning: the algorithm learns without a teacher because the data are not labelled. It is a descriptive rather than a predictive method, generally used to discover interesting relationships hidden in large datasets. These relationships are usually represented in the form of rules or frequent itemsets.

Association rule mining is used to identify new and interesting relationships between objects in a set, or frequent patterns in transactional data or any sort of relational database. It is commonly used for market basket analysis (which items are bought together), customer clustering in retail (which stores people tend to visit together), price bundling, assortment decisions, cross-selling, and more. It can be thought of as an advanced form of a what-if scenario: if this, then that.

Association Rules Mining

2. How Association Rule Works

There are a few key terms we need to be familiar with to understand how association rules work.

Apriori: one of the original and oldest algorithms for building association rules. We will use Apriori to build all the rules in this blog.

Itemsets: collections of items. An n-itemset is a set of n items; simply put, it is the set of items purchased by a customer.

Support: the percentage of transactions in which X and Y occur together, out of all transactions.

Support(X => Y) = (Frequency of X and Y) / (Total # of transactions)

Confidence: a measure of the certainty associated with each discovered rule. It is the percentage of transactions that contain both X and Y, out of all transactions that contain X.

Confidence(X => Y) = (Frequency of X and Y) / (Frequency of X)

Lift: a measure of whether X and Y are genuinely related rather than coincidentally occurring together. It measures how many times more often X and Y occur together than expected if they were statistically independent of each other. This measure will be our main focus when evaluating the algorithm's results.

Lift (X => Y) = Confidence(X => Y) / Support(Y)
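To make the three measures concrete, here is a minimal sketch in Python (the analysis later in this post uses R; the items and baskets below are made up purely for illustration):

```python
# Toy baskets: each set is one customer's transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    # Fraction of baskets that contain all the given items.
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(x, y):
    # Of the baskets containing X, the fraction that also contain Y.
    return support(x | y) / support(x)

def lift(x, y):
    # How much more often X and Y co-occur than if they were independent.
    return confidence(x, y) / support(y)

print(support({"bread", "milk"}))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of 3 bread baskets -> ~0.667
print(lift({"bread"}, {"milk"}))        # 0.667 / 0.75 -> ~0.889
```

With bread in 3 of 4 baskets and milk in 3 of 4, independence would predict the pair in 0.5625 of baskets; the observed 0.5 is slightly lower, so the lift comes out just below 1.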

Minlen: the minimum number of items in a rule

Maxlen: the maximum number of items in a rule

Target: the type of association to be mined (e.g., rules or frequent itemsets)

Frequent Itemset Generation: find the most frequent itemsets in the data, based on a predetermined minimum support and the minimum and maximum itemset sizes.

Rule Generation: this step involves generating all the rules from the frequent itemsets. We can control the number of rules generated by adjusting the support, confidence, or lift thresholds.
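The two steps can be sketched as a bare-bones Apriori in Python (a toy illustration of the idea, not the arules implementation used later in this post):

```python
from itertools import combinations

def apriori_frequent(baskets, min_support, max_len=3):
    # Level-wise search: a (k+1)-itemset can only be frequent if its
    # k-subsets are, so we only extend itemsets that met the threshold.
    n = len(baskets)
    level = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    for size in range(1, max_len + 1):
        counted = {s: sum(s <= b for b in baskets) / n for s in level}
        kept = {s: sup for s, sup in counted.items() if sup >= min_support}
        frequent.update(kept)
        # Candidate generation: unions of surviving itemsets, one item larger.
        level = {a | b for a in kept for b in kept if len(a | b) == size + 1}
    return frequent

def rules(frequent, min_conf):
    # Split every frequent itemset into LHS => RHS and keep confident rules.
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), round(conf, 3)))
    return out

baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
freq = apriori_frequent(baskets, min_support=0.5)
for lhs, rhs, conf in rules(freq, min_conf=0.6):
    print(lhs, "=>", rhs, "confidence:", conf)
```

On this toy data every pair of items clears the support threshold but the triple {a, b, c} does not, so only the pairwise rules survive.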

LHS => RHS: the left-hand side and right-hand side are used to express how often item A and item B occur together, and in which order. If we are trying to understand how often people go to store A after going to store B, store B would be on the LHS and store A on the RHS. Similarly, if we are trying to understand which stores people usually visit before going to store A, store A would be on the RHS and the other stores on the LHS.
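A quick sketch (again in Python, with invented data) shows why the direction matters: confidence changes when LHS and RHS are swapped, while lift does not:

```python
# Toy visit data: store "A" is popular, store "B" is niche.
baskets = [{"A"}, {"A"}, {"A"}, {"A", "B"}]

def support(items):
    # Fraction of baskets containing all the given items.
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # Of the baskets containing the LHS, the fraction also containing the RHS.
    return support(lhs | rhs) / support(lhs)

print(confidence({"B"}, {"A"}))  # B => A: every B visitor also visits A -> 1.0
print(confidence({"A"}, {"B"}))  # A => B: only a quarter of A visitors -> 0.25
print(confidence({"B"}, {"A"}) / support({"A"}))  # lift of B => A -> 1.0
print(confidence({"A"}, {"B"}) / support({"B"}))  # lift of A => B -> 1.0
```

Because lift is symmetric, it tells us whether two stores are related at all; the LHS/RHS split is what encodes the before/after question.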

3. Real World Problem

In this blog, I will be using a shopping mall dataset with the goal of uncovering interesting shopping behaviour and understanding the associations between different stores. Some of the questions we are looking to answer: which clusters of stores do people tend to visit together, and which stores do people usually visit after, before, or alongside store A?

The dataset comes from an Australian shopping mall. All the analysis will be performed in R. Let's get started.

# Load all the required libraries
library(arules)
library(arulesViz)
library(visNetwork)
library(igraph)
library(data.table)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(plyr)
library(dplyr)
library(RColorBrewer)

The arules package contains the Apriori algorithm, and arulesViz contains the plotting functions; all the others are supporting libraries for specific analyses. Most of the data are transactional, so the first thing we need to do is convert the transaction data into basket format.

# Load the data and see the structure
retail <- read.csv("List.csv")
str(retail)
Retail Datasets

We see that it's pretty big data, with 2.4 million records and 2 variables. The ID column is a hashed column that identifies a unique customer, and the Store Name column contains the stores those people visited and spent at least 7 minutes in, in chronological order. The 7-minute threshold is arbitrary and was used to filter out passers-by, as the goal is to understand the behaviour of true customers. The dataset contains customer flow information for a few months. Let's see the first few rows.

First few rows

If we look at the first 2 rows, we see that ID 6ea was captured at Pickle Barrel and at Honey, so the table is in tall format and needs some cleaning. In the steps below, we group the data by the ID column and combine all the stores for each ID, so that every ID's visited stores end up in a single row.

transactionData <- ddply(
  retail, c("ï..ID"),  # the "ï..ID" name is a BOM artifact from the csv header
  function(df1) {
    paste(df1$Store_Name, collapse = ",")
  }
)

Now, let's drop the ID column and rename the store-name column to tenants.

# Drop the ID column
transactionData$ï..ID <- NULL
# Rename the remaining column to tenants
colnames(transactionData) <- c("tenants")
# Show dataframe transactionData
transactionData
Few rows of transaction Data

Finally, we have all the stores visited by each unique ID in a single row; our data is now ready for association rule mining. But first, let's save it as a csv.

write.csv(transactionData, "tenantfinalmedium.csv", quote = FALSE, row.names = FALSE)
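For readers who don't use R, the group-and-concatenate step above can be sketched with the Python standard library (the example rows are invented; only their shape matches the dataset):

```python
def to_baskets(rows):
    # Group (ID, store) rows into one comma-separated basket per ID,
    # preserving the chronological order of visits.
    baskets = {}
    for cust_id, store in rows:
        baskets.setdefault(cust_id, []).append(store)
    return [",".join(stores) for stores in baskets.values()]

rows = [("6ea", "Pickle Barrel"), ("6ea", "Honey"), ("7fb", "Nike")]
print(to_baskets(rows))  # ['Pickle Barrel,Honey', 'Nike']
```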

4. Association Rule Mining

Let's start by loading the file we just saved, i.e. the tenantfinalmedium csv file, and looking at a summary of the data.

# Now lets read the data
tr <- read.transactions("tenantfinalmedium.csv", format = "basket", sep = ",")
summary(tr)
Quick summary of the data

The summary function gives us important information about the dataset. We can quickly see that there are 826,568 rows (unique IDs) and a total of 33,545 store names (columns) in the dataset. Density is the percentage of non-zero cells in the sparse matrix; it can be read as the total number of store visits divided by the total number of cells in the matrix. So 826,568 × 33,545 × 0.00007026528 ≈ 1,948,262 store visits occurred during the dataset's time frame.
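The density arithmetic is easy to double-check (the figures come straight from the summary; the tiny discrepancy is due to the truncated density value):

```python
# Recover the visit count from the sparse-matrix dimensions and density.
rows, cols, density = 826_568, 33_545, 0.00007026528
visits = rows * cols * density
print(f"{visits:,.0f}")  # about 1.95 million store visits
```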

Let's inspect a few rows of the data.

# Inspect rows 150 to 160
inspect(tr[150:160])
Few rows of the dataset

Everything looks good. Now, let’s create frequency plots to see the top stores generally visited by people.

# Absolute Frequency Plot
itemFrequencyPlot(tr, topN = 25, type = "absolute", col = brewer.pal(8, "Pastel2"), main = "Absolute Item Frequency Plot")
Absolute Frequency Plot

The frequency plot can be based on either absolute counts or relative percentages. With absolute counts, it plots the raw frequency of each item independently, as in the figure above; with relative frequencies, it plots how often each item appears compared to the others, as in the figure below.

# Relative Frequency Plot
itemFrequencyPlot(tr,
  topN = 25, type = "relative", col = brewer.pal(8, "Pastel2"),
  main = "Relative Item Frequency Plot"
)
Relative Frequency Plot

The Cheesecake Factory and LensCrafters seem to be the two most visited stores. As you can see, it is very easy to pick out the top 25 most visited stores, and we can do the same for any number of stores. It's time now to build some association rules and get deeper insights into how people visit different stores.

# Generating Rules
association.rules <- apriori(tr, parameter = list(supp = 0.001, conf = 0.2, maxlen = 10))
summary(association.rules)

Since we went through all the parameters used when generating rules in the first section of this blog, this section will focus on the rules themselves.

Summary of Algorithm

Looking at the summary table above, the model generated 70 rules. We can always play around with the generated rules by changing the parameters (support, confidence, and lift). The rule length distribution tells us that rules of length three are the most common, with 56 rules; in other words, people generally spend 7 or more minutes in 3 stores. Let's explore the top 10 rules in a little more detail, as a real business would generally be interested in only a few effective rules.

# Lets see top 10 rules
inspect(association.rules[1:10])
Top 10 Rules

As mentioned at the beginning of the blog, our measure of main interest is lift. Looking at the first few rows, we can say that people who go to Birks are 12 times more likely to also go to Nordstrom (based on 1,079 observations), and people who go to Sandro are likewise 12 times more likely to go to Nordstrom (based on 1,991 observations). So, if I were the Nordstrom manager, I would work on a strategy to attract and retain these groups of customers, or take a different approach and target the customers in rules where lift is less than 1. Let's visualize the top 10 rules.

Top 10 Rules

5. More About Rules

Let’s plot these rules to understand them better.

# plot rules
plot(association.rules)
Rules Generated by the Model

The graph above displays all the rules generated by the model: the horizontal axis is support, the vertical axis is confidence, and the colour shading is lift. Lift values are scattered; there are a few instances of high lift at low support and low confidence, but the majority of high-lift rules occur at low support and high confidence.

Let’s understand the quality of these rules

Scatter plot Matrix on Support, Confidence and Lift

The matrix graph above shows that lift is proportional to confidence, with a few linear groupings among the rules. When the support of Y stays the same, lift is proportional to confidence, and the slope of the linear trend is the reciprocal of Support(Y), since Lift = Confidence / Support(Y).

# compute the 1/support(y)
slope <- sort(round(association.rules@quality$lift / association.rules@quality$confidence, 2))
# Display number of times each slope appears in the dataset
unlist(lapply(split(slope, f = slope), length))
Calculation of Support and Slopes

The support and slope results above show that, of the 70 rules, there are 12 distinct values of 1/Support(Y), with most rules at slopes of 12.47, 15.25, 17.88, and 18.27, as seen in the scatter plot matrix above. Another interesting way to visualize and understand rules is a parallel coordinates plot.

subRules2 <- head(association.rules, n = 25, by = "lift")
plot(subRules2, method = "paracoord")
Parallel Coordinates Plot

The parallel coordinates plot helps us visualize the rules. For example, if I go to Joey Restaurant, I am likely to also go to H & M and Microsoft; and if I go to Canada Goose, I am likely to go to Nordstrom.

6. Understanding Different Scenarios

So far we have looked at rules and general patterns, but what if we want to dig deeper into a particular store? This sort of analysis is particularly helpful for store managers who want to understand where people go before or after visiting their stores. In the section below, we will pick one store, Nike, and look at where people go before and after visiting it, focusing on the top 20 rules. We will start with where people go before visiting Nike.

## Finding where customers go before going to Nike
Goose.association.rules <- apriori(tr,
  parameter = list(supp = 0.0001, conf = 0.0001),
  appearance = list(default = "lhs", rhs = "Nike")
)
nike_20 <- head(sort(Goose.association.rules, by = "lift"), 20)
inspect(nike_20)
plot(head(sort(Goose.association.rules, by = "lift"), 20),
  method = "graph"
)
Top 20 rules before visiting Nike

The table above shows the combinations of stores people usually visit before visiting Nike. The top rule tells us that people visiting Dynamite and LensCrafters are 9.7 times more likely to visit Nike, followed by people visiting Gap Kids and H & M. Let's graph the rules.

Association of Stores before Nike

The inward pointing arrow in the graph shows people movement before visiting Nike store. Now, let’s understand where people visit after visiting Nike or alongside Nike.

## Finding where customers go after going to Nike, or other stores they frequent alongside Nike
withnike.association.rules <- apriori(tr,
  parameter = list(supp = 0.0001, conf = 0.0001),
  appearance = list(lhs = "Nike", default = "rhs")
)
nikewith_20 <- head(sort(withnike.association.rules, by = "lift"), 20)
inspect(nikewith_20)
plot(head(sort(withnike.association.rules, by = "lift"), 20),
  method = "graph"
)
Top 20 Rules after Visiting Nike

The table above shows people's store-visit patterns after visiting Nike. The top rules tell us that people are 8.7 times more likely to visit Jack & Jones after visiting Nike (based on 202 observations) and 4.1 times more likely to visit Skechers after visiting Nike (based on 129 observations).

Association of Stores after Nike

7. Conclusions

It is always important to use business knowledge and context when interpreting association rules, as they can be influenced by confounding factors, i.e. hidden variables not included in the analysis, which can at times cause an observed relationship to disappear or even reverse (commonly known as Simpson's Paradox).

There are different association rule algorithms; the one discussed in this blog was Apriori. Others that can be used include the PCY algorithm (which improves on Apriori by using hash tables) and FP-Growth (which builds a compressed representation of the database, the FP-tree, and then mines it with a divide-and-conquer strategy).

References: Queen's University Master of Management Analytics Program (Course: Big Data Analytics)
