Apriori Algorithm

Ufuk Saraç
Feb 10, 2022 · 6 min read

Large quantities of data are collected from businesses' day-to-day operations in the retail industry: customer purchases in a grocery store, orders on Amazon or at a bookstore, films watched on Netflix. These kinds of data are called market basket transactions, and many companies rely on market basket analysis to produce meaningful product recommendations.

Uncovering interesting relationships hidden in a large dataset is known as Association Analysis. The main goal is to detect which items are frequently purchased together by customers, so that businesses can develop marketing strategies and increase sales and profit. The important point here is that an implication between products means co-occurrence, not causality.

We need to know some terminology before diving into the algorithm.

· Itemset is a collection of one or more items. A k-itemset is an itemset that contains k items.

· Support Count is the frequency of occurrence of an itemset.

· Support is the fraction of transactions that contain an itemset. It refers to how often a given rule appears in the database being mined. In many instances you will want high support, to make sure the relationship is useful. However, low support can also be valuable if you are trying to find “hidden” relationships.

· Confidence shows how frequently the rule head (consequent) occurs among the transactions that contain the rule body (antecedent). Confidence is a measure of the reliability of the rule.

· Lift is the ratio of the observed support of a rule to the support expected if its two sides were independent.

· Frequent Itemset is an itemset whose support is greater than or equal to a minsup threshold.

An Association Rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets. You can see the rule and the definitions of its components in the diagram below.

[Figure: rule components (support, confidence, lift) with example calculations]
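To make these definitions concrete, here is a minimal Python sketch that computes support, confidence, and lift for a hypothetical rule {milk} → {bread} over a made-up five-transaction basket (the data and names are mine, purely for illustration):

```python
# Hypothetical five-transaction basket, used only to illustrate the metrics.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
N = len(transactions)
X, Y = {"milk"}, {"bread"}          # the rule X -> Y

count_X = sum(X <= t for t in transactions)          # transactions containing X: 4
count_XY = sum((X | Y) <= t for t in transactions)   # transactions containing X and Y: 3
support_Y = sum(Y <= t for t in transactions) / N    # support of Y alone: 4/5 = 0.80

support = count_XY / N               # 3/5 = 0.60
confidence = count_XY / count_X      # 3/4 = 0.75
lift = confidence / support_Y        # 0.75 / 0.80 ≈ 0.94 (< 1: slightly negative association)
print(support, confidence, lift)
```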

Given a set of transactions T, the goal of association rule mining is to find all rules having

– support ≥ minsup threshold

– confidence ≥ minconf threshold

Generally, Association Rule Mining can be viewed as a two-step process:

1. Frequent Itemset Generation

– Generate all itemsets whose support ≥ minsup

2. Rule Generation

– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the frequent itemset into an antecedent and a consequent.
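To illustrate the rule-generation step, here is a minimal sketch of binary partitioning in Python (the function name and the `support_count` mapping are my own illustrative choices, not a library API); it assumes the support counts of all frequent itemsets were already collected in step 1:

```python
from itertools import combinations

def generate_rules(itemset, support_count, min_conf):
    """Split a frequent itemset into every antecedent/consequent pair X -> Y
    and keep rules with confidence = count(itemset) / count(X) >= min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                  # every non-empty proper subset
        for X in map(frozenset, combinations(itemset, r)):
            conf = support_count[itemset] / support_count[X]
            if conf >= min_conf:
                rules.append((set(X), set(itemset - X), conf))
    return rules

# Example with support counts taken from the walkthrough later in this article:
counts = {frozenset(s): c for s, c in [
    ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2),
    ({"I1", "I2"}, 4), ({"I1", "I5"}, 2), ({"I2", "I5"}, 2),
    ({"I1", "I2", "I5"}, 2),
]}
print(generate_rules({"I1", "I2", "I5"}, counts, min_conf=0.6))
```

Note that this step needs no extra database scan: every count it uses was already computed during frequent itemset generation.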

Frequent itemset generation is computationally expensive. In the classical brute-force approach, you list all possible association rules, compute the support and confidence for each rule, and then prune the rules that fail the minsup and minconf thresholds. This is prohibitively expensive because the number of rules that can be extracted from a dataset grows exponentially.

For example, given d unique items:

– Total number of itemsets = 2^d

– Total number of possible association rules: R = 3^d − 2^(d+1) + 1

With d = 6, that is already 3^6 − 2^7 + 1 = 602 possible rules.

This is where the Apriori Algorithm comes into play. The Apriori principle briefly says that if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure: for any two itemsets X and Y,

X ⊆ Y ⇒ s(X) ≥ s(Y)

That means:

– Support of an itemset never exceeds the support of its subsets

– This is known as the anti-monotone property of support

[Figure: an illustration of the Apriori principle]

If {c,d,e} is frequent, then all subsets of this itemset are frequent. Conversely, if an itemset such as {a,b} is infrequent, then all of its supersets must be infrequent too and can be pruned immediately, as shown in the figure below.

Pruning infrequent supersets

The Apriori algorithm uses support-based pruning to control the exponential growth of candidate itemsets. Initially, every item is considered a candidate 1-itemset (k = 1), and the frequent itemsets of length 1 are generated. Next, candidate (k+1)-itemsets are generated from the frequent k-itemsets, and any candidate containing an infrequent k-subset is pruned. The support of each remaining candidate is counted by scanning the database, and infrequent candidates are eliminated, leaving only the frequent ones. This process repeats until no new frequent itemsets are identified.
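The loop just described can be sketched compactly in Python (a didactic version with my own function names, not an optimized implementation):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori) generation of all frequent itemsets."""
    transactions = [set(t) for t in transactions]

    # k = 1: every single item is a candidate 1-itemset.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s for s, n in counts.items() if n >= min_count}
    frequent = {s: counts[s] for s in Lk}

    k = 2
    while Lk:
        # Join step: union pairs of frequent (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count support with one scan of the database; keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent
```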

For example, consider the following dataset (the classic nine-transaction example also used in the GeeksforGeeks article cited below); we will find its frequent itemsets and generate association rules from them.

T1: I1, I2, I5
T2: I2, I4
T3: I2, I3
T4: I1, I2, I4
T5: I1, I3
T6: I2, I3
T7: I1, I3
T8: I1, I2, I3, I5
T9: I1, I2, I3

minimum support count = 2
minimum confidence = 60%

Step-1: K=1

  • Create a table containing the support count of each item present in the dataset; this is called the candidate set C1.
  • Compare each candidate's support count with the minimum support count (here min_support = 2; items whose support count is less than min_support are removed). This gives us the itemset L1. Here every item meets the threshold, so L1 = C1 = {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2.

Step-2: K=2

  • Generate the candidate set C2 by joining L1 with itself (the join step). The condition for joining two (K-1)-itemsets is that they have (K-2) elements in common.
  • Check whether all subsets of each candidate are frequent; if not, remove that candidate. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are both frequent. Check this for each itemset.)
  • Find the support count of the remaining candidates by scanning the dataset.
  • Compare each C2 candidate's support count with the minimum support count (here min_support = 2; candidates below it are removed). This gives us the itemset L2: {I1, I2}: 4, {I1, I3}: 4, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2.

Step-3: K=3

  • Generate the candidate set C3 using L2 (join step). The joining condition is again (K-2) common elements, so here the first element of the two L2 itemsets must match. Joining L2 with itself gives {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}.
  • Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (The subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Checking every itemset this way leaves only {I1, I2, I3} and {I1, I2, I5}.)
  • Find the support count of the remaining itemsets by scanning the dataset.
  • Compare each C3 candidate's support count with the minimum support count (here min_support = 2). This gives us the itemset L3: {I1, I2, I3}: 2, {I1, I2, I5}: 2.

Step-4: K=4

  • Generate the candidate set C4 using L3 (join step). The condition for joining (K=4) is (K-2) common elements, so here the first two items of the two L3 itemsets must match.
  • Check whether all subsets of these itemsets are frequent. (The only itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So C4 is empty.
  • We stop here because no further frequent itemsets can be found.
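As a sanity check, the whole walkthrough can be reproduced by brute force (using the nine-transaction table from above):

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
items = sorted(set().union(*transactions))
min_count = 2

# Count every k-itemset's support and keep those meeting min_count.
for k in (1, 2, 3, 4):
    counts = {c: sum(set(c) <= t for t in transactions)
              for c in combinations(items, k)}
    print(f"L{k}:", {c: n for c, n in counts.items() if n >= min_count})
```

This prints the five frequent items of L1, the six pairs of L2, the two triples {I1, I2, I3} and {I1, I2, I5} of L3, and an empty L4, matching the steps above.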

The Apriori Algorithm is easy to implement, but it requires many database scans, so it can be really slow on large datasets. For use in Python, the MLxtend library is very practical and easy to apply. I highly recommend you visit and try the Python implementation of the Apriori algorithm at https://goldinlocks.github.io/Market-Basket-Analysis-in-Python/
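For instance, a minimal MLxtend sketch for the same nine-transaction example might look like this (minimum support 2/9 ≈ 0.22 and minimum confidence 0.6, as above):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
                ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
                ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# MLxtend expects min_support as a fraction: a count of 2 out of 9 transactions.
frequent = apriori(onehot, min_support=2 / 9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```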

Conclusion

In this article, we covered the basics of association analysis and the Apriori algorithm, both of which can be very helpful in the retail industry. By analyzing customers' past buying behavior, companies can find out which products are frequently bought together and increase their sales through cross-selling.

Sources:

https://www.saedsayad.com/association_rules.htm

https://pbpython.com/market-basket-analysis.html

https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-market-basket-analysis/

https://goldinlocks.github.io/Market-Basket-Analysis-in-Python/

https://www.geeksforgeeks.org/apriori-algorithm/

Tan, Steinbach, Kumar, Introduction to Data Mining, 2006
