Web Behavior Analytics — 104

Basic intro to product recommendation with association rules

This series of blog posts introduces the four areas that make up an integrated CRM approach: web behavior, social behavior, transaction, and demographic.

The purpose of this post is to explain and showcase association rules, mined with the Apriori algorithm, applied to the product pages viewed by each website visitor.

Tools used: Google Analytics, Excel, R


Most eCommerce websites nowadays accumulate vast amounts of customer data every day. Such data includes the purchase history of individual customers, commonly known as market basket transactions.

Screenshot of market basket transaction data example

Marketers have long tried to understand their customers' purchase behavior by studying the relationships between products bought together. However, this kind of association is not limited to actual purchases; it can also be applied to web pages viewed, movies watched, and songs listened to.


An eCommerce website wants to know which product pages are viewed together by its customers, so that it can offer them a more personalized experience by recommending other relevant products.


Instead of evaluating every possible combination of items in the dataset, we will rely on the Apriori principle, which states that if an itemset is frequent, then all of its subsets must also be frequent, as explained here.

Each letter in the diagram below represents a particular product or page.

Screenshot of an illustration of the Apriori principle. If {c,d,e} is frequent, then all subsets of this itemset are frequent.
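
To make the principle concrete, here is a minimal base R sketch on made-up transactions. The data and the helper name support_of are assumptions for illustration only, not part of the dataset used in this post. It checks that every subset of {c,d,e} appears in at least as many transactions as {c,d,e} itself.

```r
# Toy transactions (hypothetical data): each element is one visitor's set of pages
transactions <- list(
  c("a", "c", "d", "e"),
  c("c", "d", "e"),
  c("b", "c", "d"),
  c("c", "e"),
  c("a", "b")
)

# Share of transactions that contain every item in `itemset`
support_of <- function(itemset, transactions) {
  mean(sapply(transactions, function(tr) all(itemset %in% tr)))
}

full_set <- c("c", "d", "e")

# Enumerate all non-empty proper subsets of {c, d, e}
subsets <- unlist(
  lapply(1:(length(full_set) - 1),
         function(k) combn(full_set, k, simplify = FALSE)),
  recursive = FALSE
)

full_support <- support_of(full_set, transactions)
subset_supports <- sapply(subsets, support_of, transactions = transactions)
names(subset_supports) <- sapply(subsets, paste, collapse = ",")

print(full_support)
print(subset_supports)

# No subset is less frequent than the full itemset
print(all(subset_supports >= full_support))
```

Read in the other direction, this is what lets Apriori prune the search: once an itemset falls below the support threshold, none of its supersets needs to be counted.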


Support is the fraction of transactions that contain a given itemset. Support counting is the process of determining this frequency for every candidate itemset and keeping only those that meet the minimum threshold we choose. For example, if we only want to examine itemsets that occur in at least 1 out of 5 transactions, then the minimum support should be set to 0.2, or 20%.
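
Written as a formula, support could be expressed as follows. The notation is an assumption of this sketch rather than something taken from the post: T is the set of all transactions and X is an itemset.

```latex
\mathrm{supp}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|}
```

For instance, an itemset that appears in 200 of 1,000 transactions has support 200 / 1000 = 0.2, which just meets a 20% threshold.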


Confidence is an indication of how often the rule has been found to be true. For a rule {c} => {d}, where {c} is the LHS (left-hand side) and {d} is the RHS (right-hand side), it can be interpreted as the conditional probability of {d} given {c}: of all the transactions that contain {c}, the share that also contain {d}. We will first look into rules with confidence above 0.5, or 50%.
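
In formula form, using supp(X) for the fraction of transactions that contain itemset X (notation assumed here for illustration):

```latex
\mathrm{conf}(\{c\} \Rightarrow \{d\})
  = \frac{\mathrm{supp}(\{c, d\})}{\mathrm{supp}(\{c\})}
  = P(\{d\} \mid \{c\})
```

As a made-up example, if {c} appears in 40 transactions and 30 of those also contain {d}, the confidence of {c} => {d} is 30 / 40 = 0.75, which clears a 50% threshold.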


Lift is an indicator that takes into account both the confidence of the rule and how frequent {d} is in the entire dataset: it is the confidence of {c} => {d} divided by the support of {d}. It measures how far {c} and {d} are from being independent. If lift is 1, {c} and {d} are independent of each other and no useful rule can be drawn; if lift is greater than 1, {c} and {d} are positively correlated and the rule could be useful for future predictions; if lift is less than 1, they are negatively correlated.
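
The three measures can be computed by hand on a small example. The sketch below uses base R only and a made-up list of ten transactions; the data and the helper support_of are assumptions for illustration, not the arules implementation.

```r
# Toy transactions (hypothetical data, not the dataset used in this post)
transactions <- list(
  c("c", "d"),
  c("c", "d"),
  c("c", "d"),
  c("c"),
  c("d"),
  c("a"),
  c("a", "b"),
  c("b"),
  c("a", "c", "d"),
  c("b", "e")
)

# Share of transactions that contain every item in `itemset`
support_of <- function(itemset, transactions) {
  mean(sapply(transactions, function(tr) all(itemset %in% tr)))
}

supp_c  <- support_of("c", transactions)          # 5 of 10 transactions
supp_d  <- support_of("d", transactions)          # 5 of 10 transactions
supp_cd <- support_of(c("c", "d"), transactions)  # 4 of 10 transactions

# Confidence of {c} => {d}: share of c-transactions that also contain d
confidence <- supp_cd / supp_c

# Lift of {c} => {d}: confidence relative to the baseline frequency of d
lift <- confidence / supp_d

print(c(support = supp_cd, confidence = confidence, lift = lift))
```

Here the lift is above 1, so in this toy data visitors who view {c} are more likely than average to view {d}.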

Data Preparation

Product pages have unique product IDs in their URLs that may look like this:


where pid?193 represents a product with ID#193.

From the previous post, we know how to reorganize the pages viewed.

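Screenshot of page product path for each customer in original format

#Load the plyr package, which provides ddply
library(plyr)
#Extract only the id number from pages; .*= represents the string before the id number
mydata$pages <- sub(".*=","",mydata$pages)
#Reorganize the data by customer_id: one row per customer, with pages joined by commas
path <- ddply(mydata, "customer_id", summarise,
pages = paste(pages, collapse = ","))
#Save the file in .csv format (no header or quotes, so each page id is read as a separate item)
write.table(path, "path.csv", sep=",", quote=FALSE, row.names=FALSE, col.names=FALSE)

Screenshot of page product path for each customer in desired format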
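
To see the reshaping step in isolation, here is a small self-contained sketch. The URLs and customer IDs are made up, and it uses base R's aggregate in place of plyr's ddply so that it runs without extra packages; the result has the same one-row-per-customer shape.

```r
# Hypothetical page-view log: one row per page view (URLs are made up)
mydata <- data.frame(
  customer_id = c(1, 1, 2, 2, 2, 3),
  pages = c(
    "https://example.com/product?pid=193",
    "https://example.com/product?pid=36",
    "https://example.com/product?pid=36",
    "https://example.com/product?pid=39",
    "https://example.com/product?pid=193",
    "https://example.com/product?pid=39"
  ),
  stringsAsFactors = FALSE
)

# Keep only what follows the last "=" in each URL, i.e. the product ID
mydata$pages <- sub(".*=", "", mydata$pages)

# One row per customer, with the viewed product IDs joined by commas
path <- aggregate(pages ~ customer_id, data = mydata,
                  FUN = paste, collapse = ",")

print(path)
```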

Model Implementation

#Download the arules package
install.packages("arules")
#Load the package
library(arules)
#Import the .csv file into transaction format
transaction <- read.transactions(file="path.csv",rm.duplicates=TRUE,format="basket",sep=",",cols=1)
#Apply the apriori algorithm
rules <- apriori(transaction,parameter=list(support=0.1,confidence=0.5,target="rules"))
#Display the rules
inspect(rules)

Unfortunately, no rules are found with 10% support and 50% confidence.

Screenshot of no rule returned with 10% support and 50% confidence

However, with 0.1% support and 40% confidence, we found 2 rules.

Screenshot of 2 rules returned with 0.1% support and 40% confidence
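
For reference, a run with these lower thresholds could look roughly like the sketch below. It uses a small made-up set of baskets rather than the data behind the screenshots, so the rules it returns are not the two shown above. The object baskets and the product IDs in it are assumptions, and minlen = 2 is added here only to skip rules with an empty left-hand side.

```r
library(arules)

# Hypothetical baskets: each element is the set of product IDs one customer viewed
baskets <- list(
  c("36", "39"),
  c("36", "39"),
  c("36", "12"),
  c("12", "15"),
  c("15"),
  c("12"),
  c("36", "39", "15"),
  c("12", "15")
)
trans <- as(baskets, "transactions")

# Same thresholds as in the text: 0.1% support and 40% confidence
rules <- apriori(trans,
                 parameter = list(support = 0.001, confidence = 0.4,
                                  minlen = 2, target = "rules"))

# Show the strongest associations first
inspect(sort(rules, by = "lift"))
```

Sorting by lift puts the rules that deviate most from independence at the top, which is usually a more useful view than the default order when the support threshold is very low.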

While the confidence and lift of these rules are relatively promising, their low support is probably because products #36 and #39 are viewed less frequently than others.

Performance Improvements

Keep in mind that the demo above used pages viewed by all customers. We can further segment our customers by other dimensions, such as demographics and product category. For example, if we only want to study the associated product pages viewed by male customers aged 20–40, or we only want the association rules for the computer product pages, then we can import only the segmented data into the model instead of the full dataset.
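
As a rough sketch of that segmentation step, the filter can be applied before the page paths are built. The column names gender and age and all values below are hypothetical, since the post does not show the demographic fields; base R's aggregate again stands in for ddply.

```r
# Hypothetical page views joined with demographic attributes
mydata <- data.frame(
  customer_id = c(1, 1, 2, 3, 3, 4),
  gender      = c("M", "M", "F", "M", "M", "M"),
  age         = c(25, 25, 31, 38, 38, 52),
  pages       = c("193", "36", "36", "39", "36", "193"),
  stringsAsFactors = FALSE
)

# Keep only male customers aged 20 to 40
segment <- subset(mydata, gender == "M" & age >= 20 & age <= 40)

# Build the per-customer page path for the segment only
segment_path <- aggregate(pages ~ customer_id, data = segment,
                          FUN = paste, collapse = ",")

print(segment_path)
```

The resulting segment_path can then be saved and passed to the model in the same way as the full dataset.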

Other advanced techniques can be applied as well, such as multiple-level and multi-dimensional association rules.

In addition to association rules, another popular approach to pattern discovery is the Markov model, although it is often criticized for its lack of predictive accuracy (it is, after all, the algorithm used in weather forecasting!). However, there are also methods that combine the two algorithms.

This post is the last post in the web behavior analytics 100 level series. The upcoming posts will focus on more advanced web analytics techniques.

Questions, comments, or concerns?