Association Analysis using Apriori Algorithm with example
Generating association rules in Data Mining
Consider yourself in a supermarket!!
Have you ever given thought to neighboring items in the biscuits section? They would mostly be Nam-keen packets or some other snacking item.
Why?
Usually on websites & mobile apps, recommending personalized items for each user is pretty much possible. But is this possible for a physical store(say SuperMarket) to do so?
No !!
So what should such physical stores do?
Just arrange everything randomly? Like biscuits near vegetables/fruits section?
Or can some sort of strategic arrangement done to increase sales? This can be done in the below ways
- Either arrange all the popular products together on the Main Gate.
- Find patterns in the sales & place items that are bought together closer.
I will be discussing the 2nd approach which can be taken as a sort of common Recommendation for everyone visiting the store where personalized recommendations aren’t possible.
Association analysis basically helps to determine the ‘association’ between different items whether it be Grocery, Medical Science, etc.
Let’s get started with a few basic concepts:
Let I={I1, I2, I3, etc} be a collection of all possible items in a given environment. Say, if our environment is SuperMarket, I am the set of all possible items in the supermarket. Do assume
I = {Bread, Diaper, Milk, Beer, Eggs, Cola} for the rest of the post
- Item-set: Any subset of I is called an Item-set. Ex: {Bread, Diaper},{},{Eggs}, etc are Item-sets as all of them are subsets of I. Also, a ‘k’ element subset will be called the K item set.
For example: {beer, eggs} will be called a 2 item-set
- Disjoint Item-sets: Any two item-sets without any common item. Ex: {Bread, Diaper} & {Eggs, Beer} are disjoint from each other.
- Transactions: Transactions are Item-sets in the training data using which we need to figure out the association between items. Let's assume we have a SuperMarket ‘M’ where the below transactions happened yesterday. TID is transaction ID (ignore it for now)
- Association Rule
Let I1 & I2 be 2 disjoint item-sets. If I1 & I2 are associated, then, I1 →I2.
Ex: If I1={Beer} & I2={Diaper, Eggs} & if they are associated, then,
{Beer} →{Diaper, Eggs}. This means if a transaction has ‘Beer’, It might have ‘Diaper’ & ‘Eggs’ as well but not vice-versa. How can know such relations helpful?
- If we are able to establish such rules, we can say that if Beer is bought, diapers & Eggs are also bought & hence all these can be kept close.
- Can be helpful in building sections like ‘Customers who viewed this also viewed this’ like in Amazon.
But, how to figure out whether item-set I1 & I2 are associated?
we have a couple of factors for this. To know how strong the rule I1 →I2 can be, we calculate 2 things:
Let I1={Bread} & I2={Milk}. We need to figure out whether Bread →Milk holds or not.
- Support: For a rule I1 →I2, it is =
Frequency(I1 U I2)/N
Where:
N=Totals transactions(5 in our case, remember the above table)
I1 U I2 = {Bread} U {Milk}={Bread, Milk}.
As, frequency for {Bread,Milk}=3, Support=3/5=0.6
Note: Support for an item-set=Frequency(Item-set)/Total Transactions
- Confidence : For a rule I1 →I2, it is =
Frequency(I1 U I2)/ Frequency(I1) or
Frequency({Bread,Milk})/Frequency({Bread})=3/4=0.75
Hence, for {Bread} →{Milk}, we have Support=0.6, Confidence=0.75
Association Rule Discovery
Given a set of Transactions T, find out all association rules R such that
- Support for any rule R’>Support_Threshold &
- Confidence for the rule R’>Confidence_Threshold
Everything seems fine for now. Now, consider a real-world SuperMarket. It would have as many as 1000s items. If we go by brute force to find out association rules, it will take ages. As a matter of fact, with any environment having ‘d’ items, the total rules to be checked = 3ᵈ — 2ᵈ x 2 +1. With d=1000, this is equal to 5.15 x 10⁴⁷ !!!
So, we usually don’t use Brute Force.
To make things faster, determining strong rules can be split into 2 parts
- Generating Frequent item-set that have a Support>Support_Threshold set
- Calculating Confidence for only frequent item sets. If their confidence is above the Confidence_Threshold set, take them as a strong rule.
- Hence, instead of calculating Support & Confidence simultaneously, do them 1 step at a time.
Here comes Apriori Principle which makes things much faster for generating a Frequent item-set.
Apriori Principle
Here, an item set is called frequent if it has a Support>Support_Threshold set.
It is a composite of 2 major points
- If an Item-set is frequent (Support above Support_Threshold), all its subsets are also frequent. Hence, we don’t need to calculate Support for each rule. Ex:
If Support{Bread, Beer, Eggs} is above the Support_Threshold, then, {Bread, Beer},{Beer, Eggs},{Bread, Eggs}, etc also have their Support above the threshold. Hence, a lot of comparisons were saved.
In the above image, if ‘cde’(last greyed cell) is frequent, all the greyed item sets are also frequent automatically
2. If an item-set A is infrequent, then all its super-sets (sets for which A is a subset) are also infrequent. Hence, if {Bread} is infrequent, {Bread, Eggs},{Bread, Diapers, Beer}, etc all are infrequent & hence their Support shouldn’t be calculated. Again, saving comparisons.
If ‘ab’ is infrequent, every super-set is infrequent & can be ignored then & there.
But why are the above two points valid?
Due to the anti-monotone property of the Support measure.
Monotonicity
- If X⊆Y, then f(X)≤f(Y). If any measure ‘f()’ owes this property, the measure is said to have monotone property.
For ex: Let f() be Count. Then if X(say {1,2})⊆Y(say {1,2,3,4,5}), X can never have more elements than Y. Hence Count(X)≤Count(Y) always holds.
- If X⊆Y, then f(Y)≤f(X). If any measure ‘f()’ owes this property, the measure is said to have anti-monotone property.
For ex: Let f() be Count⁻¹ i.e 1/Count. Hence, if X⊆Y, than Count⁻¹(X)>Count⁻¹(Y) always.
Support also holds this property as if consider {Beer}⊆{Beer, Milk} then, the occurrence of {Beer} will be always more than {Beer, Milk} in overall transactions.
Now, as we look all set now, Its time for the algorithm
Apriori Algorithm
The algorithm helps us to get to the Frequent item set for which Confidence can be calculated to accept as Association Rules very fast. Let’s have a look;
1. Find Frequent 1-item-set F_1.
2. For i=2->k: #k is number of unique items 'CANDIDATE-SET GENERATION'
A. Generate Candidate 'i'-item-set by F_(i-1) X F_(i-1)
#Exception case for i=2 mentioned below
'CANDIDATE SET PRUNING'
B. Find Frequent 'i'-item-set(F_i) from Candidate 'i'-item-set C. If Frequent 'i'-item-set is {}:
Finish & return F_(i-1)
Demystifying it isn’t rocket science.
- F_(i-1) X F_(i-1) is something interesting. If i=2, Every element from F_1 will be considered. Now as i=3, we will be merging the Frequent 2-item-set whose 1st i-2 element is common. Confused? say if I have
{Beer, Diapers} ,{Beer, eggs}, {Eggs, Cola} as Frequent 2-item-set. Now to generate Candidate 3-item-set, it would be just {Beer,Diapers,Eggs} & we will be ignoring {Eggs, Cola} as its 1st element doesn’t matches with other two frequent sets.
An example will clear everything !!
Let us recollect the Transaction set T:
Let Support_Threshold=0.6 & Confidence_Threshold = 0.6. As total_transactions=5, An item set needs to have a frequency of at-least 3(3/5=0.6) to be called a Frequent item set.
- Find all frequent 1-item-sets (subsets with one element that have Support>Support_Threshold). For this, we 1st need to calculate the frequency of the 1-item-set.
- Select the 1-item-sets that are frequent: The greyed values in the above table would be rejected from further consideration due to low Support (< 3).
- Next, to create a Candidate 2-item-set, we will be calculating F_(1)xF_(1). As it is an exception, we will be cross-joining all frequent 1-item-sets hence,
One thing you might observe is in the Candidate 2-item-set we didn’t consider {Diaper, Beer}, {Milk, Diapers}, etc. This is because {Diapers, Beer} is the same as {Beer, Diapers} and hence to avoid such repeated cases, we would be arranging elements of item-sets in some lexical order (say in alphabetical order as in the above table).
- Now, to generate Candidate 3-item-set, we will be going for F_(2)xF_(2). Hence, we will be merging the Frequent 2-item-set with the 1st element same.As only {Bread,Diapers} & {Beer,Milk} have common 1st element, they would be merged to form:
- Now, as we have just one rule in the Frequent 3-item-set, the Candidate 4-item-set isn’t possible. Hence, we are done !!
- If we have more item-set in the Frequent 3-item-set, we would have generated the Candidate 4-item-set by merging item sets with the first 2 items same. Like: if we had {A,B,C} , {A,B,D}, {A,C,E}, Candidate 4-item-set would have {A,B,C,D} ignoring {A,C,E}
So, Apriori has given us {Bread, Diapers, and Milk} as output.
But this is just the frequent item set. We still need Association rules.
How to generate rules from the Frequent item-set?
Now, if Apriori gives us a 100-item-set as output, even calculating Confidence can be slow. How can this phase be made faster? Using the below theorem
Theorem
If Confidence for X→Y-X < Confidence_Threshold, than even Xx →Y-Xx can be rejected where Xx ⊆X.
What does this mean?
Say {Bread,Beer} →{Diapers}-{Bread,Beer} doesn’t hold. Than {Bread} →{Diapers}-{Bread} can’t hold.
A couple of questions?
What’s the proof?
Confidence for any rule X →Y= Frequency(X U Y)/Frequency(X)
Now, if we have
X →Y-X=Frequency(X U Y -X)/Frequency(X)=F(Y)/F(X)
Xx →Y-Xx=Frequency(Xx U Y -Xx)/Frequency(Xx)=F(Y)/F(Xx)
But as Frequency(Xx)>Frequency(X) as Xx⊆X hence,
F(Y)/F(Xx)≤F(Y)/F(X) & if X →Y-X is rejected, the other is rejected as well.
Why would I be calculating X →Y-X rather than just X →Y i.e how is this theorem useful in our case?
Looking at the 1st greyed cell, as bcd →a is rejected, the rest of greyed cells(6 rules) are rejected. Taking one of them, cd →ab is rejected as cd⊆bcd & Confidence cd →ab = Frequency({c, d, a, b}-{c, d})/Frequency({c, d}). Hence we need to calculate Confidence for just one rule rather than 7 !!
= Frequency({a, b})/Frequency({c, d}) which would always be less than Frequency({a, b})/Frequency({b, c, d})
In a similar fashion as above, we can start with {Bread, Diapers, Milk} →{} & find Association rules with Confidence>Confidence_Threshold.
If you feel a bit confused using the above method for finding Association rules from a Frequent item set, you can use the Brute Force method which will calculate Confidence for each & every possible rule that can be formed from the Frequent item set obtained. Like as we have {Bread,Diapers, Milk}, the following rules are to be checked if using Brute Force:
{Bread} →{Diapers},{Diapers} →{Bread},
{Bread} →{Milk},{Milk} →{Bread},
{Milk} →{Diapers},{Diapers} →{Milk},
{Bread,Diapers} →{Milk},{Milk} →{Bread,Diapers},
{Bread} →{Diapers,Milk},{Diapers,Milk} →{Bread},
{Bread,Milk} →{Diapers},{Diapers} →{Bread,Milk}
And with this, its a wrap !!