Apriori Algorithm in Data Mining: Part 01

Concept of Apriori Algorithm for Frequent Itemset Mining and Association Rule Generation

Dushanthi Madhushika
LinkIT
5 min read · May 18, 2021



Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it. Data Mining is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.

Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. The basic process of the Apriori Algorithm is to identify the frequent individual items in the dataset and extend them to larger and larger itemsets as long as those itemsets appear sufficiently often in the dataset. The algorithm terminates when no further successful extensions are found.
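To make this level-wise idea concrete, here is a minimal Python sketch of the whole procedure. It is my own illustration rather than code from this article, and the function and variable names are placeholders; each step of the loop is worked through by hand in the example that follows.

from itertools import combinations

def apriori(transactions, min_support):
    # A minimal sketch of the level-wise Apriori search described above.
    # Returns every frequent itemset (as a frozenset) with its support count.
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items and keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        prev = list(frequent)
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev))
                      for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k}
        # Prune: drop candidates with an infrequent (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count support and keep only the candidates that meet min_support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result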

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property (or Apriori Principle) is used.

Apriori Principle: Any subset of a frequent itemset must also be frequent.

The two main steps followed by this algorithm are Joining and Pruning.

Let’s understand the algorithm through the following example.

Consider the following dataset for the example. Take minimum support = 2 and minimum confidence = 70%.

Dataset — Image by Author

Step 1: Generate the one-itemset frequent patterns

Create a table that includes each item and its corresponding support count (C1).

C1— Image by Author

Next, compare each candidate's support count with the minimum support count (2). Since none of the support counts is less than the minimum, every candidate qualifies, and the result is L1.

L1 — Image by Author

In the first iteration of the algorithm, each item is a member of the set of candidates.
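A quick Python illustration of this step, using a small hypothetical transaction list of my own (not the article's dataset, so the counts will differ from the tables above):

from collections import Counter

# Hypothetical transactions for illustration only.
transactions = [{"A", "B", "E"}, {"B", "D"}, {"B", "C"},
                {"A", "B", "D"}, {"A", "C"}, {"B", "C"}]
min_support = 2

# C1: support count of every individual item.
c1 = Counter(item for t in transactions for item in t)

# L1: keep only the items whose support count is at least min_support.
l1 = {item: count for item, count in c1.items() if count >= min_support}
print(l1)  # here E (count 1) is dropped; A, B, C, D remain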

Step 2: Generate the two-itemset frequent patterns.

For this, the algorithm uses L1 join L1 to generate a candidate set of two-itemsets, C2. Then find the support count of each itemset.

Check whether all subsets of each itemset are frequent or not; if not, remove that itemset. (For example, the subsets of {A, B} are {A} and {B}, and both are frequent. Do this check for each itemset.)

C2 — Image by Author

After finding C2, compare each candidate itemset's support count with the minimum support count (2) and remove the itemsets whose support count is less than the minimum.

L2 — Image by Author
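Here is the corresponding sketch of the join-and-filter step in Python, again on the hypothetical transactions from the previous snippet, so the numbers will differ from the article's tables:

from itertools import combinations

transactions = [{"A", "B", "E"}, {"B", "D"}, {"B", "C"},
                {"A", "B", "D"}, {"A", "C"}, {"B", "C"}]
min_support = 2
l1 = {"A", "B", "C", "D"}  # frequent single items from the previous step

# Join step: L1 join L1 gives every pair of frequent items (C2).
c2 = [frozenset(pair) for pair in combinations(sorted(l1), 2)]

# Count the support of each candidate pair, then filter by min_support (L2).
support = {c: sum(1 for t in transactions if c <= t) for c in c2}
l2 = {c: s for c, s in support.items() if s >= min_support}
print(l2)  # only the pairs appearing in at least 2 transactions survive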

Step 3: Generate the three-itemset frequent patterns.

Here we use the Apriori property to generate the candidate set of three-itemsets (C3). To generate C3, compute L2 join L2. The condition for joining two itemsets from Lk-1 is that they must have (k-2) elements in common. So here, for L2, the first element should match.

C3 = {{A, B, C}, {A, B, E}, {A, C, E}, {B, C, D}, {B, C, E}, {B, D, E}}

Now that the Join step is complete, move on to the Prune step to reduce the size of C3.

Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!

Check whether all subsets of these itemsets are frequent or not; if not, remove that itemset. (Here the subsets of {A, B, C} are {A, B}, {A, C}, and {B, C}, which are all frequent. For {A, C, E}, the subset {C, E} is not frequent, so remove {A, C, E}. Similarly, check every itemset.)

So, the C3 will be as follows.

C3 = {{A, B, C}, {A, B, E}}

C3 — Image by Author

Compare each candidate's support count with the minimum support count (2). Since none of the support counts is less than the minimum, L3 is the same as C3.
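The join condition and the pruning principle for this step can be written generically. The sketch below (generate_candidates is my own helper name, not from the article) reproduces C3 from the article's L2:

from itertools import combinations

def generate_candidates(prev_frequent, k):
    # Join: two (k-1)-itemsets are combined only if their first k-2
    # items (in sorted order) match; the union then has exactly k items.
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:
                candidates.add(frozenset(prev[i]) | frozenset(prev[j]))
    # Prune: a candidate survives only if every (k-1)-subset is frequent.
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

l2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("A", "E"),
                             ("B", "C"), ("B", "D"), ("B", "E")]}
c3 = generate_candidates(l2, 3)
print(sorted(sorted(c) for c in c3))  # [['A', 'B', 'C'], ['A', 'B', 'E']]

Calling generate_candidates with L3 and k = 4 in the same way produces the single joined candidate {A, B, C, E} and then prunes it because {A, C, E} is not frequent, which is exactly the Step 4 result below.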

Step 4: Generate the four-itemset frequent patterns.

To generate C4, compute L3 join L3. After the Join step is complete, move on to the Prune step to reduce the size of C4.

Check whether all subsets of these itemsets are frequent or not. Here, the only itemset formed by joining L3 with L3 is {A, B, C, E}, and it contains {A, C, E}, which is not frequent. So there is no itemset in C4.

The algorithm terminates because no further frequent itemsets are found.

Following the last step, we have found all the possible frequent item sets from the given dataset.

What’s next?

It is time to generate strong association rules from the frequent itemsets we have found. For that, we must calculate the confidence of each rule.

Confidence(X => Y) = Support_Count(X ∪ Y) / Support_Count(X)

Step 5: Generate the association rules from the frequent itemsets.

L = {{A}, {B}, {C}, {D}, {E}, {A, B}, {A, C}, {A, E}, {B, C}, {B, D}, {B, E}, {A, B, C}, {A, B, E}}

Let’s take {A, B, E}

All non-empty proper subsets of the selected frequent itemset are:

{A}, {B}, {E}, {A, B}, {A, E}, {B, E}

As mentioned in the beginning, we consider the minimum confidence to be 70%.

R1 -> {A} ^ {B} => {E}; 
confidence = sup_c({A, B, E})/sup_c({A, B}) = 2/4 * 100 = 50%
R2 -> {A} ^ {E} => {B};
confidence = sup_c({A, B, E})/sup_c({A, E}) = 2/2 * 100 = 100%
R3 -> {B} ^ {E} => {A};
confidence = sup_c({A, B, E})/sup_c({B, E}) = 2/2 * 100 = 100%
R4 -> {A} => {B} ^ {E};
confidence = sup_c({A, B, E})/sup_c({A}) = 2/4 * 100 = 50%
R5 -> {B} => {A} ^ {E};
confidence = sup_c({A, B, E})/sup_c({B}) = 2/6 * 100 = 33%
R6 -> {E} => {A} ^ {B};
confidence = sup_c({A, B, E})/sup_c({E}) = 2/2 * 100 = 100%

So, the rules whose confidence meets the minimum confidence (70%) are selected. The selected rules for the {A, B, E} itemset are R2, R3, and R6.
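Using the support counts from the calculations above, the same rule selection can be sketched in Python (the variable names are mine):

from itertools import combinations

# Support counts taken from the worked example above.
support = {frozenset("ABE"): 2,
           frozenset("AB"): 4, frozenset("AE"): 2, frozenset("BE"): 2,
           frozenset("A"): 4, frozenset("B"): 6, frozenset("E"): 2}
itemset = frozenset("ABE")
min_confidence = 0.70

# For every non-empty proper subset X, form the rule X => (itemset - X)
# and keep it when confidence = sup(itemset) / sup(X) meets the threshold.
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(sorted(itemset), r)):
        confidence = support[itemset] / support[antecedent]
        status = "selected" if confidence >= min_confidence else "rejected"
        print(sorted(antecedent), "=>", sorted(itemset - antecedent),
              f"{confidence:.0%}", status)

Only the three rules with 100% confidence pass the 70% threshold, matching R2, R3, and R6.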

Pretty cool right?

Now you understand how the Apriori Algorithm works to find association rules hidden in a dataset.

Let’s discuss implementation of Apriori Algorithm in part 02 of this article.

Thank you for reading so far and I hope you learned something. If you enjoy my article, make sure to hit the clap button.

Happy Mining😜!
