Association Analysis — Primer

Marek Kilanowski
Sep 9, 2018 · 8 min read

The Purpose

The purpose of Association Analysis is to discover hidden relationships in data. A retailer, for example, can use it to learn about the purchasing behavior of its customers. The acquired knowledge can support various business applications, such as organizing promotional marketing campaigns, managing inventory levels, and maintaining customer relationships.

Data Formats

The basic concepts and terminology of Association Analysis are explained using a small mock-up dataset: a sample of purchase data collected at the checkout counter of a grocery store. Transaction data can generally be presented in one of three formats: horizontal, vertical, and binary, as explained below.

Horizontal Data Format

In the horizontal format, each data row represents a single transaction with its list of items. The Transaction ID (TID) is a unique transaction identifier. In the example, the second transaction (TID=2) contains only two purchased items: Eggs and Cola.

Vertical Data Format

In the vertical format, the distinct items found across the pool of transactions are identified and each is shown as a column header. Each column lists the identifiers of the transactions that contain the item in its header. For example, across all 10 transactions there are only two in which milk was purchased (TID=4 and TID=10).

Binary Data Representation (Cross Reference Matrix)

Data can also be presented as a cross reference table between transactions and items. All transaction identifiers are listed vertically and the distinct items identified across all transactions are listed horizontally, so each row represents a transaction and each column corresponds to an item. A cell at the intersection of a transaction (row) and an item (column) holds the binary value 1 if the item is present in the transaction and 0 otherwise.
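The three formats carry the same information, so one can be derived from another. The sketch below converts a horizontal dataset into the vertical and binary forms. The ten transactions are made up for illustration: they agree with the facts stated in this article (TID=2 holds Eggs and Cola, milk appears only in TID=4 and TID=10), but they are not the article's original table. The function names `to_vertical` and `to_binary` are likewise just illustrative choices.

```python
# Hypothetical purchase data in horizontal format: TID -> items purchased.
# Made up for this sketch; not the article's original table.
horizontal = {
    1: ["Flour", "Eggs", "Sugar"],
    2: ["Eggs", "Cola"],
    3: ["Flour", "Diapers", "Beer"],
    4: ["Flour", "Eggs", "Milk"],
    5: ["Cola", "Diapers", "Beer"],
    6: ["Flour", "Sugar"],
    7: ["Flour", "Eggs", "Cola"],
    8: ["Cola", "Sugar"],
    9: ["Flour", "Cola"],
    10: ["Milk", "Sugar"],
}


def to_vertical(horizontal):
    """Vertical format: map each item to the sorted list of TIDs containing it."""
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, []).append(tid)
    return {item: sorted(tids) for item, tids in vertical.items()}


def to_binary(horizontal):
    """Binary format: one row per TID, one 0/1 column per distinct item."""
    items = sorted({item for row in horizontal.values() for item in row})
    matrix = {
        tid: [1 if item in row else 0 for item in items]
        for tid, row in horizontal.items()
    }
    return items, matrix


vertical = to_vertical(horizontal)
items, binary = to_binary(horizontal)

print(vertical["Milk"])  # [4, 10]
print(items)             # column order of the binary matrix
print(binary[2])         # row for TID=2: 1 only in the Cola and Eggs columns
```

The binary matrix is convenient for counting, while the vertical format makes it easy to see which transactions an item belongs to.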

Basic Concepts and Terminology

Itemset

Let the set of transactions be denoted as

T = {t_1, t_2, ..., t_N}

and let the set of distinct items identified across all transactions be

I = {i_1, i_2, ..., i_k}

Each transaction in T has a unique Transaction ID (TID) and contains a subset of the items from set I. In our example we have 10 transactions and 7 items, thus N=10 and k=7.

𝐼 = {𝐹𝑙𝑜𝑢𝑟, 𝐸𝑔𝑔𝑠, 𝐶𝑜𝑙𝑎, 𝑀𝑖𝑙𝑘,𝐷𝑖𝑎𝑝𝑒𝑟𝑠, 𝐵𝑒𝑒𝑟, 𝑆𝑢𝑔𝑎𝑟 }

An itemset is any subset of I, including the empty set.

For example, the set { 𝐸𝑔𝑔𝑠, 𝐶𝑜𝑙𝑎 } is a subset of I that represents the second transaction.

Itemset Support Count

Support count for an itemset is the number of transactions containing this itemset.

For example, to calculate the support count of the itemset { 𝐹𝑙𝑜𝑢𝑟, 𝐸𝑔𝑔𝑠 }, one needs to identify all transactions in which both Flour and Eggs were purchased. The straightforward method is to count the rows of the cross reference matrix that have a 1 in both the Flour and the Eggs columns.

The itemset {Flour, Eggs} is contained in three transactions, so its support count equals 3. More formally, the support count of an itemset X is the number of transactions that contain X as a subset:

support count(X) = |{t ∈ T : X ⊆ t}|

Here |.| denotes the number of elements in a set.

For example, support count({ 𝐹𝑙𝑜𝑢𝑟, 𝐸𝑔𝑔𝑠 }) = 3.
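The row-counting method described above can be sketched directly on a cross reference matrix. The matrix below is hypothetical: it is made up so that {Flour, Eggs} occurs in three transactions as stated in the text, but it is not the article's original table. The names `ITEMS`, `MATRIX` and `support_count` are illustrative.

```python
# Hypothetical cross reference matrix (made up for this sketch).
# Columns follow ITEMS; each row is one transaction keyed by its TID.
ITEMS = ["Flour", "Eggs", "Cola", "Milk", "Diapers", "Beer", "Sugar"]
MATRIX = {
    1:  [1, 1, 0, 0, 0, 0, 1],
    2:  [0, 1, 1, 0, 0, 0, 0],
    3:  [1, 0, 0, 0, 1, 1, 0],
    4:  [1, 1, 0, 1, 0, 0, 0],
    5:  [0, 0, 1, 0, 1, 1, 0],
    6:  [1, 0, 0, 0, 0, 0, 1],
    7:  [1, 1, 1, 0, 0, 0, 0],
    8:  [0, 0, 1, 0, 0, 0, 1],
    9:  [1, 0, 1, 0, 0, 0, 0],
    10: [0, 0, 0, 1, 0, 0, 1],
}


def support_count(itemset, items=ITEMS, matrix=MATRIX):
    """Count the rows that have a 1 in every column named in `itemset`."""
    columns = [items.index(item) for item in itemset]
    return sum(1 for row in matrix.values() if all(row[c] == 1 for c in columns))


print(support_count(["Flour", "Eggs"]))  # 3
```

With an empty itemset the function returns the total number of rows, which matches the definition: every transaction contains the empty set.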

Itemset Support

Support for an itemset X is the ratio of the support count of X to the total number of transactions, |T|:

support(X) = support count(X) / |T|

The support of an itemset X can be interpreted as the probability that the items in X occur together in a transaction, considering all transactions.

For example, support({ 𝐹𝑙𝑜𝑢𝑟, 𝐸𝑔𝑔𝑠 }) = 3/10 = 0.3.
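As a small sketch of the ratio, the function below divides the support count by the number of transactions and keeps the result as an exact fraction. The transactions are hypothetical, made up to agree with the figures quoted in the text (10 transactions, {Flour, Eggs} in three of them), and `support` is an illustrative name.

```python
from fractions import Fraction

# Hypothetical transactions (made up for this sketch): TID -> set of items.
TRANSACTIONS = {
    1: {"Flour", "Eggs", "Sugar"},
    2: {"Eggs", "Cola"},
    3: {"Flour", "Diapers", "Beer"},
    4: {"Flour", "Eggs", "Milk"},
    5: {"Cola", "Diapers", "Beer"},
    6: {"Flour", "Sugar"},
    7: {"Flour", "Eggs", "Cola"},
    8: {"Cola", "Sugar"},
    9: {"Flour", "Cola"},
    10: {"Milk", "Sugar"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    count = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(count, len(transactions))


s = support({"Flour", "Eggs"})
print(s, float(s))  # 3/10 0.3
```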

Association Rule

An association rule is a way of representing an uncovered data relationship. It has the form of an implication expression in which the antecedent X and the consequent Y are disjoint itemsets:

X ⇒ Y, where X ∩ Y = ∅

The antecedent X is also called the left-hand side (LHS) and the consequent Y the right-hand side (RHS).

For example, the association rule {𝐷𝑖𝑎𝑝𝑒𝑟𝑠, 𝐵𝑒𝑒𝑟} ⇒ {𝑆𝑢𝑔𝑎𝑟} can be interpreted as follows: if a customer purchased both diapers and beer, then she is likely to buy sugar as well.

Association Rule Support Count

For a given association rule 𝑋 ⇒ 𝑌 we can ask the following question: how many transactions contain items belonging to both itemsets X and Y?

For a given association rule 𝑋 ⇒ 𝑌 and a given transaction t, the rule is said to apply to t if t contains both itemsets X and Y or, stated differently, if both X and Y are subsets of t. More formally:

X ⊆ t and Y ⊆ t, or equivalently X ∪ Y ⊆ t

The support count of an association rule is the number of transactions to which the rule applies. More formally:

support count(X ⇒ Y) = |{t ∈ T : X ∪ Y ⊆ t}|

Example: Consider the association rule {𝐷𝑖𝑎𝑝𝑒𝑟𝑠} ⇒ {𝐵𝑒𝑒𝑟}. The union of its antecedent and consequent is the set {𝐷𝑖𝑎𝑝𝑒𝑟𝑠, 𝐵𝑒𝑒𝑟}. The task is to identify and count the transactions in T that have {𝐷𝑖𝑎𝑝𝑒𝑟𝑠, 𝐵𝑒𝑒𝑟} as a subset. There are two such transactions, hence the support count of the rule is 2.
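The "rule applies to a transaction" test and the resulting count can be sketched as two small functions. The transactions are hypothetical, made up so that {Diapers, Beer} occurs in two of them as in the example above; `applies` and `rule_support_count` are illustrative names.

```python
# Hypothetical transactions (made up for this sketch): TID -> set of items.
TRANSACTIONS = {
    1: {"Flour", "Eggs", "Sugar"},
    2: {"Eggs", "Cola"},
    3: {"Flour", "Diapers", "Beer"},
    4: {"Flour", "Eggs", "Milk"},
    5: {"Cola", "Diapers", "Beer"},
    6: {"Flour", "Sugar"},
    7: {"Flour", "Eggs", "Cola"},
    8: {"Cola", "Sugar"},
    9: {"Flour", "Cola"},
    10: {"Milk", "Sugar"},
}


def applies(antecedent, consequent, transaction):
    """True when X and Y are both subsets of the transaction (X ∪ Y ⊆ t)."""
    return (set(antecedent) | set(consequent)) <= transaction


def rule_support_count(antecedent, consequent, transactions=TRANSACTIONS):
    """Number of transactions to which the rule X => Y applies."""
    return sum(1 for t in transactions.values() if applies(antecedent, consequent, t))


matching = [tid for tid, t in TRANSACTIONS.items() if applies({"Diapers"}, {"Beer"}, t)]
print(matching)                                   # [3, 5]
print(rule_support_count({"Diapers"}, {"Beer"}))  # 2
```

Note that the count depends only on the union of X and Y, so {Diapers} ⇒ {Beer} and {Beer} ⇒ {Diapers} have the same support count.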

Association Rule Support

Support for an association rule measures how often the rule can be applied to the transactions in T, relative to the total number of transactions in T. It can be interpreted as the probability that the rule applies to a transaction.

It is computed as the ratio of the rule’s support count to the number of all transactions in the dataset. Formally:

support(X ⇒ Y) = support count(X ⇒ Y) / |T|

High support means that the items in the antecedent and the consequent of the rule often occur together. Such a rule may be worth investigating since it potentially applies to a large share of the transactions in T. A retailer can use it, for example, to promote items that are purchased together.

On the other hand, rules with very low support can be excluded from further analysis, since they may represent items that occurred together simply by chance.

Support for an association rule X ⇒ Y can also be viewed as the probability 𝑃(𝑋∧𝑌), where 𝑋∧𝑌 is the event that the combined items of the antecedent and the consequent are a subset of a transaction. For example, the rule {𝐷𝑖𝑎𝑝𝑒𝑟𝑠} ⇒ {𝐵𝑒𝑒𝑟} applies to 2 of the 10 transactions, so its support is 2/10 = 0.2.
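Rule support can also be read off the vertical data format: the transactions that contain all items of X and Y are the intersection of the items' TID lists. The sketch below takes that route. The TID lists are hypothetical (made up to agree with the figures in the text), and `rule_support` is an illustrative name. It assumes the rule has at least one item, since an empty list of TID sets cannot be intersected this way.

```python
# Hypothetical vertical format (made up for this sketch): item -> set of TIDs.
VERTICAL = {
    "Flour":   {1, 3, 4, 6, 7, 9},
    "Eggs":    {1, 2, 4, 7},
    "Cola":    {2, 5, 7, 8, 9},
    "Milk":    {4, 10},
    "Diapers": {3, 5},
    "Beer":    {3, 5},
    "Sugar":   {1, 6, 8, 10},
}
N_TRANSACTIONS = 10


def rule_support(antecedent, consequent, vertical=VERTICAL, n=N_TRANSACTIONS):
    """Support of X => Y as |TIDs containing every item of X and Y| / n."""
    items = set(antecedent) | set(consequent)
    tid_sets = [vertical[item] for item in items]
    matching = set.intersection(*tid_sets)  # TIDs shared by all items
    return len(matching) / n


print(rule_support({"Diapers"}, {"Beer"}))  # 0.2
print(rule_support({"Flour"}, {"Eggs"}))    # 0.3
```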

Association Rule Confidence

A rule’s confidence measures how often the rule has been found to be true. In particular, the confidence of an association rule 𝑋⇒𝑌 with respect to a set of transactions T measures how frequently the items of itemset Y appear in transactions that contain itemset X. Formally:

confidence(X ⇒ Y) = support count(X ∪ Y) / support count(X)

Confidence measures the reliability of the inference made by the rule: the higher the confidence, the more reliable the inference. Thus for an association rule 𝑋⇒𝑌 with high confidence, a transaction that contains itemset X is very likely to contain itemset Y as well.

The confidence of an association rule X ⇒ Y can also be viewed as the conditional probability 𝑃(𝑌|𝑋): the probability of seeing the rule’s consequent in a transaction given that the transaction contains the antecedent.

Association rules with the highest confidence are those whose consequent most reliably accompanies the antecedent, that is, those with the highest estimated conditional probability of Y given X. This is why rules with high confidence are the ones to be considered for analysis.

Example 1: Consider association rule: {𝐹𝑙𝑜𝑢𝑟}⇒{𝐸𝑔𝑔𝑠}

Example 2: Consider association rule: {𝐸𝑔𝑔𝑠}⇒{𝐹𝑙𝑜𝑢𝑟}

Observe that the rule in Example 2 has higher confidence than the rule in Example 1. Both rules share the same numerator, the support count of {𝐹𝑙𝑜𝑢𝑟, 𝐸𝑔𝑔𝑠}, but the antecedent in Example 2 has a lower support count than the antecedent in Example 1.
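The asymmetry between the two examples is easy to reproduce in code. The sketch below computes confidence as the support count of X ∪ Y divided by the support count of X. The transactions are hypothetical, so the printed values are not the article's figures; the data was only chosen so that Eggs has a lower support count than Flour, as the text states. `support_count` and `confidence` are illustrative names, and the function assumes the antecedent occurs in at least one transaction.

```python
# Hypothetical transactions (made up for this sketch): TID -> set of items.
TRANSACTIONS = {
    1: {"Flour", "Eggs", "Sugar"},
    2: {"Eggs", "Cola"},
    3: {"Flour", "Diapers", "Beer"},
    4: {"Flour", "Eggs", "Milk"},
    5: {"Cola", "Diapers", "Beer"},
    6: {"Flour", "Sugar"},
    7: {"Flour", "Eggs", "Cola"},
    8: {"Cola", "Sugar"},
    9: {"Flour", "Cola"},
    10: {"Milk", "Sugar"},
}


def support_count(itemset, transactions=TRANSACTIONS):
    """Number of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """support_count(X ∪ Y) / support_count(X); X must occur at least once."""
    x = set(antecedent)
    xy = x | set(consequent)
    return support_count(xy, transactions) / support_count(x, transactions)


print(support_count({"Flour"}), support_count({"Eggs"}))  # 6 4
print(confidence({"Flour"}, {"Eggs"}))  # 0.5
print(confidence({"Eggs"}, {"Flour"}))  # 0.75
```

Both calls share the numerator 3; only the denominator changes, which is why the rule with the rarer antecedent ends up with the higher confidence.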

Association Rule Lift

Confidence ignores the support of the itemset in the rule’s consequent, which can lead to misleading conclusions. This is addressed by a metric known as Lift, which is the ratio between the rule’s confidence and the support of the itemset in the rule’s consequent:

lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y)

Lift measures the strength of the correlation between the antecedent and the consequent of the rule. A lift equal to 1 means the antecedent and the consequent are independent, a lift above 1 means they are positively correlated, and a lift below 1 means they are negatively correlated. Rules with a lift well above 1 represent patterns that are unlikely to occur by chance and may reflect hidden relationships in the data.

Example: Consider the association rule {𝐹𝑙𝑜𝑢𝑟, 𝐸𝑔𝑔𝑠} ⇒ {𝐶𝑜𝑙𝑎}.

Since the antecedent and the consequent are negatively correlated, the conclusion is that a customer who buys flour and eggs is less likely than an average customer to buy cola.
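A short sketch of the lift computation follows. The transactions are hypothetical, so the numbers are not the article's; the data was only chosen so that the lift of {Flour, Eggs} ⇒ {Cola} falls below 1, in line with the conclusion above. The name `lift` is illustrative, and the function assumes that both the antecedent and the consequent occur in at least one transaction.

```python
# Hypothetical transactions (made up for this sketch): TID -> set of items.
TRANSACTIONS = {
    1: {"Flour", "Eggs", "Sugar"},
    2: {"Eggs", "Cola"},
    3: {"Flour", "Diapers", "Beer"},
    4: {"Flour", "Eggs", "Milk"},
    5: {"Cola", "Diapers", "Beer"},
    6: {"Flour", "Sugar"},
    7: {"Flour", "Eggs", "Cola"},
    8: {"Cola", "Sugar"},
    9: {"Flour", "Cola"},
    10: {"Milk", "Sugar"},
}


def support_count(itemset, transactions=TRANSACTIONS):
    """Number of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)


def lift(antecedent, consequent, transactions=TRANSACTIONS):
    """confidence(X => Y) divided by support(Y)."""
    x, y = set(antecedent), set(consequent)
    confidence = support_count(x | y, transactions) / support_count(x, transactions)
    consequent_support = support_count(y, transactions) / len(transactions)
    return confidence / consequent_support


print(round(lift({"Flour", "Eggs"}, {"Cola"}), 3))  # 0.667, below 1: negative correlation
print(lift({"Diapers"}, {"Beer"}))                  # 5.0, above 1: positive correlation
```

In this made-up data cola appears in half of all transactions but in only one third of those containing flour and eggs, which is exactly what a lift below 1 expresses.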

Formulating the Association Rule Mining Problem

Association rules with both high support and high confidence are the most attractive for business analysis. In practice, however, rules with high support often do not have high confidence, and rules with high confidence often do not have high support.

For these reasons, the problem of mining association rules is formulated using threshold values for support and confidence. More formally: given a set of transactions T, find all association rules whose support is at least a minimum support threshold (minsup) and whose confidence is at least a minimum confidence threshold (minconf).
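To make the formulation concrete, here is a brute-force sketch that enumerates every candidate rule and keeps those meeting both thresholds. It is an exhaustive search for illustration only, since the number of candidates grows exponentially with the number of items; it is not an efficient mining algorithm. The transactions are hypothetical, and `mine_rules`, `minsup` and `minconf` are illustrative names.

```python
from itertools import combinations

# Hypothetical transactions (made up for this sketch): TID -> set of items.
TRANSACTIONS = {
    1: {"Flour", "Eggs", "Sugar"},
    2: {"Eggs", "Cola"},
    3: {"Flour", "Diapers", "Beer"},
    4: {"Flour", "Eggs", "Milk"},
    5: {"Cola", "Diapers", "Beer"},
    6: {"Flour", "Sugar"},
    7: {"Flour", "Eggs", "Cola"},
    8: {"Cola", "Sugar"},
    9: {"Flour", "Cola"},
    10: {"Milk", "Sugar"},
}


def mine_rules(transactions, minsup, minconf):
    """Return (X, Y, support, confidence) for every rule meeting both thresholds."""
    n = len(transactions)
    all_items = sorted(set().union(*transactions.values()))

    def count(itemset):
        return sum(1 for t in transactions.values() if itemset <= t)

    rules = []
    for size in range(2, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            itemset = frozenset(candidate)
            itemset_count = count(itemset)
            # Skip itemsets that never occur or fall below the support threshold.
            if itemset_count == 0 or itemset_count / n < minsup:
                continue
            # Split the itemset into every non-empty antecedent / consequent pair.
            for k in range(1, size):
                for lhs in combinations(candidate, k):
                    x = frozenset(lhs)
                    conf = itemset_count / count(x)
                    if conf >= minconf:
                        rules.append((sorted(x), sorted(itemset - x),
                                      itemset_count / n, conf))
    return rules


rules = mine_rules(TRANSACTIONS, minsup=0.2, minconf=0.7)
for x, y, sup, conf in rules:
    print(x, "=>", y, "support:", sup, "confidence:", conf)
```

On this made-up data only three rules pass both thresholds, which illustrates how the thresholds prune the very large space of possible rules down to a short list worth inspecting.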
