Fundamentals of Association Rule Mining

Sandaruwan Herath
Data Science and Machine Learning
10 min read · Jan 21, 2024

Association rules offer a powerful tool for data analysis, providing insights into patterns and relationships within large datasets. While they are a staple in market basket analysis, their application extends across various domains, offering invaluable insights into customer behaviour and beyond.

Figure 1: Illustration generated with DALL·E

Introduction to Association Rules in Data Mining

Association rule mining is a data mining technique for discovering interesting relationships, frequent patterns, associations, or correlations between variables in large datasets. It is widely used in fields such as market basket analysis, web usage mining, and bioinformatics. The basic idea is to find rules that predict the occurrence of an item based on the occurrences of other items in the same transaction.

Understanding the Basics

To explain association rule mining, we can use a simple example of a grocery store's transaction data. Let's start by defining a sample transaction table and then discuss the itemsets and association rules derived from this data.

Imagine a small dataset representing transactions in a grocery store:

Table 1: Sample Transaction Table

Transaction ID   Items Purchased
T1               Milk, Bread, Butter
T2               Bread, Diapers, Beer
T3               Milk, Diapers, Beer
T4               Bread, Milk, Diapers, Beer
T5               Bread, Milk, Diapers, Cola

In this table, each row represents a transaction (a customer’s purchase), and each transaction has a unique ID. The ‘Items Purchased’ column lists the items bought in that transaction.

Concept of Itemset

An 'itemset' is a collection of one or more items found within a dataset. For example, in a dataset of groceries, an itemset could be a combination like {Cheese, Tomato}.

The 'length' of an itemset is the number of items it contains. Thus, {Cheese, Tomato} is a 2-itemset.

· 1-itemsets: {Milk}, {Bread}, {Butter}, {Diapers}, {Beer}, {Cola}

· 2-itemsets: {Milk, Bread}, {Bread, Butter}, {Diapers, Beer}, etc.

· 3-itemsets: {Milk, Bread, Butter}, {Bread, Diapers, Beer}, etc.
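As a minimal illustration (assuming the six items from Table 1), these k-itemsets can be enumerated with Python's itertools:

```python
from itertools import combinations

# Item universe from Table 1
items = ["Milk", "Bread", "Butter", "Diapers", "Beer", "Cola"]

# Enumerate every itemset of length k for k = 1, 2, 3
for k in (1, 2, 3):
    itemsets = [set(c) for c in combinations(items, k)]
    print(f"{k}-itemsets: {len(itemsets)} total, e.g. {itemsets[0]}")
```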

Association Rules

· An association rule is a fundamental concept in data mining that reveals how items within a dataset are connected. It asserts a strong, potentially useful relationship between two sets of items.

· These rules are expressed in the form of “If-Then” statements, typically written as {X} → {Y}, where X and Y are different sets of items.

Example

To illustrate, consider a rule like {Diapers} → {Baby Wipes}. This rule suggests that in transactions where diapers are bought, there's a strong likelihood that baby wipes are also purchased.

Market Basket Analysis

This is a primary application of association rules, focusing on analyzing purchase patterns. By examining combinations of items that frequently occur together in purchases, businesses can gain insights into marketing and sales strategies.

Figure 2: Market Basket Analysis [https://www.sciencedirect.com/topics/computer-science/market-basket-analysis]

Components of Association Rules — Antecedent and Consequent:

{BREAD, MILK} → {BEER}

Antecedent → Consequent

· Every association rule has two parts: the antecedent (if) and the consequent (then). For instance, in the rule {Pasta} → {Sauce}, 'Pasta' is the antecedent and 'Sauce' is the consequent.

· Antecedent (X): This is the first part of the rule, the condition. It’s the set of items found in the database that you are examining for patterns. In the rule {X} →{Y}, X is the antecedent.

· Consequent (Y): This is the second part of the rule, which is inferred from the presence of the antecedent in transactions. In the rule {X} →{Y}, Y is the consequent.

· These sets are disjoint, meaning they do not overlap.

How Association Rules are Evaluated

The strength and reliability of an association rule are measured using three key metrics: support, confidence, and lift.

Association rule measures

· Support: This measures how frequently the itemset (X and Y together) appears in the dataset, i.e., the proportion of transactions containing both X and Y out of all transactions. High support indicates that the combination is common.

· Confidence: This measures the likelihood of finding the consequent in transactions that also contain the antecedent. For example, in a rule like {Bread} → {Butter}, the confidence measures how often butter is bought when bread is bought.

· Lift: This measures how much more often the antecedent and consequent occur together than we would expect if they were statistically independent. A lift greater than 1 indicates a positive association between X and Y; a lift near 1 indicates independence (see the formulas and sketch below).
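In formula form, for a rule {X} → {Y}:

Support(X → Y) = (transactions containing both X and Y) / (all transactions)
Confidence(X → Y) = Support(X → Y) / Support(X)
Lift(X → Y) = Confidence(X → Y) / Support(Y)

The following minimal Python sketch computes all three metrics from scratch. The transaction list mirrors Table 1, and the helper names are illustrative, not from any particular library.

```python
# Transactions from Table 1, represented as Python sets
transactions = [
    {"Milk", "Bread", "Butter"},            # T1
    {"Bread", "Diapers", "Beer"},           # T2
    {"Milk", "Diapers", "Beer"},            # T3
    {"Bread", "Milk", "Diapers", "Beer"},   # T4
    {"Bread", "Milk", "Diapers", "Cola"},   # T5
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent does."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to what pure chance would predict."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Bread", "Milk"}))       # 0.6
print(confidence({"Bread"}, {"Milk"}))  # 0.75
print(lift({"Bread"}, {"Milk"}))        # ~0.94, slightly below 1 in this tiny sample
```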

Association Rules: A Step-by-Step Example

Now, let’s derive some potential association rules from Table 1 data:

Recall that each transaction is a record of different items bought together. By applying association rules, we can discover patterns like "customers who buy Tea often also buy Honey".

Example 01

· Rule: {Bread} →{Milk}

· Interpretation: Customers who buy Bread are likely to buy Milk as well.

· Support: Calculated as the number of transactions containing both Bread and Milk divided by the total number of transactions.

Identify Transactions Containing Both Bread and Milk:

o T1 (Milk, Bread, Butter)

o T4 (Bread, Milk, Diapers, Beer)

o T5 (Bread, Milk, Diapers, Cola)

Here, 3 out of 5 transactions contain both Bread and Milk.

Calculate Support:

Support = 3 (transactions with both items) / 5 (total transactions) = 0.6 or 60%

· Confidence: The number of transactions with both Bread and Milk divided by the number of transactions containing Bread.

Identify Transactions Containing Bread: T1, T2, T4, T5

4 transactions contain Bread.

Calculate Confidence:

Confidence = 3 (transactions with both Bread and Milk) / 4 (transactions with Bread) = 0.75 or 75%

· This calculation shows that not only is the combination of Bread and Milk common in this dataset but also that there’s a high likelihood of Milk being purchased when Bread is bought.

Example 02

· Rule: {Diapers, Beer} →{Milk}

· Interpretation: Customers who buy Diapers and Beer together are likely to buy Milk too.

· Support: The support for this rule is 0.4, meaning that Diapers, Beer, and Milk are bought together in 40% of all transactions. This is calculated by dividing the number of transactions that include all three items (Diapers, Beer, and Milk) by the total number of transactions.

· Confidence: The confidence for this rule is approximately 0.67 (or 66.67%), indicating that in about 67% of the transactions where Diapers and Beer are bought, Milk is also purchased. This is calculated by dividing the number of transactions that include all three items by the number of transactions that include Diapers and Beer.

Example 03

· Rule: {Milk, Bread} →{Diapers}

· Interpretation: Customers who buy Milk and Bread are likely to buy Diapers.

· Support: The support for this rule is also 0.4, which means that Milk, Bread, and Diapers are purchased together in 40% of all transactions. This is computed in the same way as above, considering the transactions that include Milk, Bread, and Diapers.

· Confidence: The confidence for this rule is also about 0.67, suggesting that in approximately 67% of the transactions where Milk and Bread are bought, Diapers are also purchased. This is found by dividing the number of transactions with all three items (Milk, Bread, and Diapers) by the number of transactions with Milk and Bread.
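As a quick check, the helper functions from the earlier sketch reproduce all three examples (this assumes the same `transactions`, `support`, and `confidence` definitions from that sketch):

```python
# Reuses `support` and `confidence` from the earlier sketch
rules = [
    ({"Bread"}, {"Milk"}),             # Example 01
    ({"Diapers", "Beer"}, {"Milk"}),   # Example 02
    ({"Milk", "Bread"}, {"Diapers"}),  # Example 03
]
for antecedent, consequent in rules:
    s = support(antecedent | consequent)
    c = confidence(antecedent, consequent)
    print(f"{antecedent} -> {consequent}: support={s:.2f}, confidence={c:.2f}")
```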

Note:

· In this example, we would calculate the support and confidence for each rule to determine their strength and relevance.

· It’s important to note that these are just hypothetical rules. Actual analysis may require more complex algorithms like Apriori or FP-Growth and consideration of additional factors like lift.

Practical Applications

· Market Basket Analysis: Understanding customer purchasing patterns, like finding that customers who buy pasta also often buy tomato sauce.

· Retail Insights: Identifying products that are often purchased together to optimize store layout or promotions.

· Cross-Selling Strategies: Suggesting additional products to customers based on their current selections.

· Healthcare and Research: In medical research, association rules can help in identifying correlations between different symptoms or drug interactions.

· Fraud Detection: Identifying patterns in transactions that might indicate fraudulent activity.

· Recommendation Systems: Enhancing customer experience by recommending relevant items. For example, online streaming services recommend movies based on viewing history.

Critical Considerations

· Non-causality: It’s crucial to remember that these rules indicate correlation, not causation. They highlight patterns of co-occurrence, not the reasons behind them.

· Domain Knowledge: To derive meaningful conclusions, domain expertise is essential to interpret these patterns correctly.

Challenges in Generating Association Rules

· Large Number of Rules: Even with a small number of items, the potential number of rules can be vast, many of which may not be useful.

· Setting Thresholds: Determining the right support and confidence thresholds requires domain knowledge and can significantly impact the quality of the generated rules.

· Interpretation and Actionability: Not all discovered rules are useful or actionable. Some might be trivial or require further investigation to understand their practical implications.

Frequent Itemset Generation

Continuing from the discussion on association rules and how they are evaluated, let's delve deeper into the concept of frequent itemset generation, a crucial step in mining association rules.

Understanding Frequent Itemset

· Frequent Itemsets: These are combinations of items that appear frequently in your dataset. Frequency is determined by a user-defined threshold, known as the support threshold.

· Support Threshold: This is a critical parameter for identifying itemsets that are significant for analysis. It is a percentage indicating how often an itemset must appear in the dataset to be considered frequent.

Example of Frequent Itemsets

Let's return to our grocery store example. Suppose we set a support threshold of 50%. An itemset like {Milk, Bread} is considered frequent if it appears in 50% or more of the transactions.

Significance of Frequent Itemsets

· Filtering Out Noise: By focusing on frequent itemsets, we can ignore item combinations that occur rarely and are likely insignificant or coincidental.

· Efficiency: This approach greatly reduces computational complexity by limiting the number of itemsets and association rules we need to consider.

Calculating Support

· Support Calculation: The support of an itemset is the number of transactions in which the itemset appears divided by the total number of transactions.

· Example Calculation: If there are 100 transactions and {Milk, Bread} appears in 40 of them, the support is 40/100 = 0.4 or 40%.

Approaches to Frequent Itemset Generation

· Brute Force Approach:

o This involves calculating the support for every possible itemset, which becomes computationally impractical for large datasets.

o For example, a dataset with just 20 distinct items already yields over a million possible itemsets (2^20 − 1) and billions of candidate rules; a brute-force sketch follows this list.

· Frequent Itemset Generation Approach:

o This method first identifies all itemsets that meet the minimum support threshold and then generates rules only from those itemsets.

o It significantly reduces the number of itemsets to consider but can still be computationally demanding for large datasets.
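The brute-force idea, sketched in Python on the Table 1 transactions (variable names are illustrative):

```python
from itertools import combinations

# Brute force: score EVERY candidate itemset against the data.
# The candidate count grows as 2^n in the number of distinct items,
# which is why this is only viable for toy datasets.
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Diapers", "Beer"},
    {"Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
min_support = 0.6
items = sorted(set().union(*transactions))

frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        s = sum(set(candidate) <= t for t in transactions) / len(transactions)
        if s >= min_support:
            frequent[candidate] = s

for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, round(s, 2))
```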

Advanced Techniques for Efficiency

Apriori Algorithm:

· This algorithm leverages the principle that all subsets of a frequent itemset must also be frequent (the Apriori property). It iteratively extends frequent itemsets one item at a time and prunes candidates whose subsets fail to meet the support threshold.

· Example: If {Milk, Bread} is frequent, then both {Milk} and {Bread} must also be frequent. Conversely, if {Butter} is infrequent, any itemset containing Butter can be pruned without ever counting it.
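In practice you would rarely hand-roll Apriori. One option (an illustrative choice, not something this article prescribes) is the open-source mlxtend library, sketched here on the same toy transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Milk", "Bread", "Butter"],
    ["Bread", "Diapers", "Beer"],
    ["Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Step 1: frequent itemsets at a 60% support threshold
frequent_itemsets = apriori(onehot, min_support=0.6, use_colnames=True)

# Step 2: rules from those itemsets, filtered by confidence
rules = association_rules(frequent_itemsets, metric="confidence",
                          min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```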

FP-Growth Algorithm:

· This approach uses a compact data structure called the FP-tree (Frequent Pattern tree) to compress the dataset, allowing faster generation of frequent itemsets without candidate generation.

· It’s often faster than the Apriori algorithm, especially for large datasets.
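mlxtend also ships an FP-Growth implementation with the same interface, so the previous sketch can be switched over by replacing one call (again an illustrative choice, reusing the `onehot` DataFrame from above):

```python
from mlxtend.frequent_patterns import fpgrowth

# Drop-in replacement for apriori(): same input, same output format,
# but builds an FP-tree instead of generating candidate itemsets.
frequent_itemsets = fpgrowth(onehot, min_support=0.6, use_colnames=True)
```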

Example

Let's create a concrete example covering the process of frequent itemset generation, particularly focusing on the use of the Apriori algorithm. This example will illustrate how we move from a raw dataset to identifying frequent itemsets and then to generating association rules.

Sample Dataset

Table 2: Sample bookstore transaction table

Imagine we have a small dataset from a bookstore. Each transaction records books purchased together by a customer.

Setting a Support Threshold

Suppose we set our support threshold at 60%. This means an item set must appear in at least 60% of all transactions to be considered frequent.

Step 1: Identify All Possible Itemsets

First, we list all possible itemsets:

· 1-itemsets: {History}, {Science}, {Math}, {Art}

· 2-itemsets: {History, Science}, {History, Math}, {History, Art}, {Science, Math}, {Science, Art}, {Math, Art}

· 3-itemsets: {History, Science, Math}, {History, Science, Art}, {History, Math, Art}, {Science, Math, Art}

· 4-itemset: {History, Science, Math, Art}

Step 2: Calculate Support for Each Itemset

Next, we calculate the support for each itemset. For instance:

· Support({History}) = 3/5 = 60% (Appears in T1, T3, T4)

· Support({Math}) = 4/5 = 80% (Appears in T2, T3, T4, T5)

· Support({History, Math}) = 2/5 = 40% (Appears in T3, T4)

Step 3: Determine Frequent Itemsets

Now, we identify which itemsets meet our support threshold:

· Frequent 1-itemsets: {History}, {Science}, {Math}, {Art}

· Frequent 2-itemsets: None meet the 60% threshold.

· Frequent 3-itemsets or larger: None.

Step 4: Generate Association Rules from Frequent Itemsets

Since no 2-itemsets or 3-itemsets meet the threshold, we are left with only 1-itemsets. However, an association rule needs at least two items, one for the antecedent and one for the consequent. The lack of qualifying 2-itemsets therefore means there are no strong associations between pairs of books at our support threshold.

Explanation

This example illustrates how the Apriori algorithm helps in identifying frequent itemsets in a dataset. In our bookstore scenario, no pair or trio of book categories co-occurred frequently enough to clear our threshold, indicating the diverse interests of the bookstore's customers. While simple here, this process becomes complex and computationally intensive with larger datasets, necessitating efficient algorithms like Apriori or FP-Growth.

Summary

Frequent itemset generation is a foundational step in association rule mining. By focusing on significant item combinations and using efficient algorithms like Apriori and FP-Growth, we can extract meaningful, actionable insights from large datasets. This process not only aids in business decision-making but also contributes to various fields like web mining, bioinformatics, and more, by unveiling hidden patterns and associations in data.

