Association Rule Mining — concept and implementation

A comprehensive guide to solving Market Basket Analysis problem

Amardeep Chauhan
Analytics Vidhya
9 min read · Apr 26, 2020


Association rule mining is one of the major concepts of data mining and machine learning. It is used to identify occurrence patterns in a large dataset: we establish a set of rules to find out how the positioning of different items affects one another. These patterns can be of many types, such as telephone calling patterns, suspicious-activity patterns, patterns in the symptoms of a disease, and customer shopping patterns. Here we will focus on customer shopping patterns, for which the more common term is Market Basket Analysis.

Market Basket Analysis is one of the most popular techniques for finding the best product placement in a store and deciding on offers that increase overall sales. The idea is to bring together sets of products that have some kind of interdependency in terms of their use. Doing so can boost sales because placing them together reminds or encourages customers of their need for the associated product. To solve this, we generate all possible association rules over the products and find the most effective ones. Now the question is: how do we develop these association rules and measure their effectiveness? The answer is the Apriori algorithm.

Apriori Algorithm

Concept:

The algorithm was introduced by R. Agrawal and R. Srikant in 1994. They called it ‘Apriori’ because it uses prior information, i.e., existing transactions, to find associations and patterns.

Before we dive into how it works, we should know its key properties:

  • Downward closure property: every subset of a frequent itemset must also be frequent. For example, if {Perfume, Soap} is frequent, then {Perfume} and {Soap} must be frequent too.
  • Conversely, every superset of an infrequent itemset must also be infrequent.

Association measures:

The algorithm uses three measures to find associations. Let’s understand how these measures are calculated with one small example set:

Items bought in different transactions

It contains only 10 transactions, so it will be easier to calculate and understand Support, Confidence, and Lift.

  1. Support: It is the occurrence percentage of an item or itemset, or in simple terms, the popularity of the item(s). It is calculated as the proportion of transactions containing the item(s).
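
In formula form:

Support(X) = (Number of transactions containing X / Total number of transactions) * 100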

For example

Support (Soap) = (4/10)*100 = 40%

This means 40% of the transactions include Soap.

2. Confidence: It is the likelihood or trustworthiness of an association rule; in simple terms, it tells us how often our rule is valid. Say we are looking for patterns where item Y is bought along with item X; then Confidence can be calculated as:
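
Confidence(X -> Y) = Support(X U Y) / Support(X)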

For example:

Confidence(Perfume -> Soap) = Support (Perfume U Soap)/Support (Perfume)

= (30/40) * 100 = 75%

This means that 75% of the time a customer bought Perfume, they ended up buying Soap as well.

3. Lift: It is nothing but the ratio of the Confidence of an association rule to its Expected Confidence. Here Expected Confidence assumes that no such pattern exists, i.e., that sales of the items in the association rule are independent. Look at the formula below; it is best understood with an example:
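
Lift(X -> Y) = Confidence(X -> Y) / Expected Confidence = Confidence(X -> Y) / Support(Y)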

In this formula, Expected Confidence is nothing but the occurrence of item Y in transactions, independent of any association with item X.

If we get:

Lift ≈ 1 means the sale of item Y is independent: it would have happened in the same amount anyway, irrespective of its association with any other item, so there is no meaningful association between these items.

Lift > 1 means the items in the association rule have a strong positive relationship; in other words, people tend to buy these items together, so placing X alongside Y boosts the sale of Y as well.

Lift < 1 means the items in the association rule have a negative/inverse relationship, i.e., they are substitutes for each other, and the presence of one can bring down the sale of the other.

Lift (Perfume -> Soap) = Confidence (Perfume -> Soap)/Support (Soap)

= 75/40 = 1.875

This means Perfume and Soap have a strong relationship, and the presence of Perfume boosts the sale of Soap.
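
To make these numbers concrete, here is a minimal Python sketch of all three measures. The 10-transaction list is hypothetical, invented only so that the Soap and Perfume counts match the example above:

transactions = [
    {'Perfume', 'Soap'},
    {'Perfume', 'Soap'},
    {'Perfume', 'Soap', 'Chocolate'},
    {'Perfume'},
    {'Soap'},
    {'Chocolate'},
    {'Chocolate', 'Milk'},
    {'Milk'},
    {'Bread'},
    {'Bread', 'Milk'},
]

def support(itemset):
    # proportion of transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # how often the rule held among transactions containing the antecedent
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # observed confidence relative to the consequent's baseline support
    return confidence(antecedent, consequent) / support(consequent)

print(round(support({'Soap'}), 3))                  # 0.4   -> 40%
print(round(confidence({'Perfume'}, {'Soap'}), 3))  # 0.75  -> 75%
print(round(lift({'Perfume'}, {'Soap'}), 3))        # 1.875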

Now that we understand the measures used to identify strong association rules, we are ready to walk through the complete working of the Apriori algorithm.

Complete working:

This flow chart explains the working of the algorithm. We will walk through it with an example, but first we need to choose threshold values for Support and Confidence.

Support Threshold (or min_supp) = 30%

Confidence Threshold (or min_conf) = 70%

These thresholds are the minimum criteria used during pruning to pick popular itemsets and strong association rules. Note that the threshold values should be based on the kind of items and the market size: in a real store there will be many items, and not all of them are daily needs, so a support threshold as high as 30% would be unrealistic there.

Step 1:

A. Create 1-Itemset candidates and calculate support for all the items.

B. Perform pruning to create the L1 Frequent Itemset. In pruning, we filter out all items with Support less than the min_supp value (30%).

1-Itemset Candidates and L1 Frequent Itemset

Step 2:

A. Create 2-Itemset candidates from L1 Frequent Itemset and calculate support for all of them.

B. Perform pruning to create the L2 Frequent Itemset. As before, we again filter out all the itemsets with Support less than the min_supp value (30%).

2-Itemset Candidates and L2 Frequent Itemset

Step 3:

A. Create 3-Itemset candidates from L2 Frequent Itemset and calculate support for all of them.

B. Perform pruning to create the L3 Frequent Itemset. But here, Support is less than the min_supp value (30%) for all the 3-itemsets, so we cannot go any further and need to derive association rules from the L2 Frequent Itemset only.

3-Itemset Candidates

Final Step: Create Association Rules

A. We need to calculate the Confidence for all combinations of items in the L2 Frequent Itemset.

Obtained Association Rules

B. Perform pruning again, this time to filter out all association rules with Confidence (%) less than min_conf (70%).

Additionally, we have also calculated Lift values to better understand the impact each association rule will make on the sale of individual items. However, we are not using Lift as a rule-selection criterion here.

Strong Association Rules

So finally, we have obtained association rules that can be used in a store to boost sales.
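
Before we jump into the library-based implementation, here is a compact, deliberately unoptimized Python sketch of the generate-and-prune loop we just walked through. It assumes transactions are represented as Python sets, as in the earlier toy example; a production implementation such as mlxtend’s apriori is far more efficient:

def apriori_frequent_itemsets(transactions, min_supp=0.3):
    n = len(transactions)

    def supp(itemset):
        # proportion of transactions containing every item in `itemset`
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent 1-itemsets (L1)
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_supp}
    all_frequent = []
    k = 2
    while current:
        all_frequent.append(current)
        # Join: union pairs of frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune by downward closure: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates if all(c - {x} in current for x in c)}
        # Keep only the candidates that meet the support threshold (Lk)
        current = {c for c in candidates if supp(c) >= min_supp}
        k += 1
    return all_frequent

With the toy transactions from the earlier sketch and min_supp=0.3, this stops at the 2-itemset level, exactly as in the manual walkthrough.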

Implementation in Python

We will perform a simple Market Basket Analysis on a small dataset that contains ~7,500 transactions over ~120 items. Below is the link to the dataset:

https://www.kaggle.com/roshansharma/market-basket-optimization

Library Import

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_rows', None)
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Read data

market_basket_df = pd.read_csv('./Market_Basket_Optimisation.csv', header=None)
market_basket_df.head()

As you can see, the dataset has no column or index labels, so let’s treat each row as a transaction or basket. We need to bring it into a proper structure before we can do further analysis: we can create a column for each item and set its row value to True/False based on its occurrence in that transaction.

basket_items = []
for index, row in market_basket_df.iterrows():
    # drop the NaN padding that read_csv adds to shorter rows
    cleansed_items = [item for item in row if str(item) != 'nan']
    basket_items.append(cleansed_items)
basket_items[:3]

We’ll use the TransactionEncoder imported from mlxtend and pass it the basket_items list we created. It will one-hot encode the transaction column values based on item occurrence, as discussed above.

tran_encod = TransactionEncoder()
tran_encod_list = tran_encod.fit(basket_items).transform(basket_items)
transaction_df = pd.DataFrame(tran_encod_list, columns=tran_encod.columns_)
transaction_df.head()

Creating DataFrame for item frequency

item_count = {}
for col in transaction_df.columns:
    item_count[col] = transaction_df[col].sum()
item_freq_df = pd.DataFrame(data=list(item_count.values()), index=list(item_count.keys()), columns=['frequency']).sort_values(by='frequency', ascending=False)
item_freq_df.shape, item_freq_df.head(10)

OK, so we have 120 unique items. Let’s check their frequency in a bar plot; the insight can be helpful in further decision making.

plt.figure(figsize=(16,7))
sns.barplot(y=item_freq_df.index[:10], x=item_freq_df.frequency[:10])
plt.xticks(rotation=90)

This plot contains only the top 10 items; please don’t put any limit when you are plotting in your own Jupyter notebook. A couple of interesting observations (from the complete plot):

  • People are becoming health conscious; they prefer green tea over tea. Oh, hold on... they are consuming much more spaghetti, french fries, chocolate, burgers, cake, and cookies compared to oatmeal and veggies... coincidence? Nah, this is how it happens: we eat lots of junk and try to balance it out with green tea. We are SMART 😅
  • This was the world before Covid-19: napkins were not in much demand.

OK, let’s come to the point. We have a total of 7,501 transactions and only 7 items with a frequency greater than 750. That means only 7 items have Support greater than 10%. Let’s verify:

apriori(transaction_df, min_support=0.1, use_colnames=True)

So now what? Well, we need to decide on a realistic min_support; only then will we be able to find useful association rules.

print(f'freq>200: {item_freq_df[item_freq_df.frequency>200].shape[0]} items')
print(f'freq>100: {item_freq_df[item_freq_df.frequency>100].shape[0]} items')
print(f'freq>50: {item_freq_df[item_freq_df.frequency>50].shape[0]} items')

It looks like if we take a frequency cutoff of ~225 (i.e., min_support = 0.03 over 7,501 transactions) we get ~50 unique items, which should give a decent number of frequent itemsets.

pd.set_option('display.max_rows', 15)
freq_itemset_support = apriori(transaction_df, min_support=0.03, use_colnames=True)
freq_itemset_support

Finding the best association rules with a min_confidence threshold of 20%:

overall_association_rules = association_rules(freq_itemset_support, metric="confidence", min_threshold=0.2)
overall_association_rules

Well, with 20% as the confidence threshold, the association rules are mostly dominated by mineral water. Mineral water is already associated with most of the products, so we had better exclude it from the transactions to find other meaningful association rules.

trans_wo_water_df = transaction_df.drop(columns=['mineral water'])
freq_itemset_wo_water_supp = apriori(trans_wo_water_df, min_support=0.02, use_colnames=True)
freq_itemset_wo_water_supp
wo_water_assoc_rules = association_rules(freq_itemset_wo_water_supp, metric="confidence", min_threshold=0.2)
wo_water_assoc_rules

Hmm, let’s order it by confidence and then by lift:

wo_water_assoc_rules.sort_values('confidence', ascending=False)
wo_water_assoc_rules.sort_values('lift', ascending=False)

OK, now we see a few meaningful associations, like:

  • ground beef -> spaghetti
  • herb & pepper -> ground beef
  • red wine -> spaghetti
  • tomatoes -> frozen vegetables
  • frozen vegetables -> spaghetti
  • (chocolate, spaghetti) -> milk
  • burgers -> eggs
  • burgers -> french fries
  • pancakes -> french fries
  • milk -> chocolate
  • milk -> eggs
  • olive oil -> spaghetti

There are a few weird associations as well (at least to Indian taste):

  • ground beef -> milk
  • champagne -> chocolate
  • olive oil -> chocolate
  • shrimp -> milk
  • green tea -> french fries

If you’ve noticed, these association rules are still dominated by a few of the most frequent products:

  • eggs
  • spaghetti
  • chocolate
  • milk
  • ground beef
  • frozen vegetables

You can narrow it down further and apply various filters based on confidence or lift to generate other association rules (see the sketch below). Just note that the quality of an association rule depends on the quality, or rather the authenticity, of the data. For actual problems and real datasets, you can’t simply decide min_support and min_confidence on a whim; it requires some critical thinking. However, you have to start somewhere, and that’s what I’ve tried to give you here. Later we’ll explore market basket differential analysis.
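
For instance, here is one way to apply such a filter to the rules DataFrame; the 0.3 confidence and 1.5 lift cutoffs are arbitrary values chosen purely for illustration:

strong_rules = wo_water_assoc_rules[
    (wo_water_assoc_rules['confidence'] >= 0.3)
    & (wo_water_assoc_rules['lift'] >= 1.5)
].sort_values(['lift', 'confidence'], ascending=False)
strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]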

Got any questions or feedback? Feel free to comment.

Happy learning.

