Apriori Algorithm in Data Mining (Candidate Generation and Testing Approach)

Jun 10, 2022

->The Apriori algorithm is a core algorithm for mining boolean association rules.

->It uses a level-wise mining technique: the frequent itemsets of size k are used to build the candidate itemsets of size k+1.

->It is a pioneering algorithm proposed for frequent itemset mining.

->It is an iterative technique that works on a horizontal data format, in which each transaction is stored as a row listing the items it contains.

->The key property behind the Apriori algorithm is as follows:

Downward Closure Property

The property of frequent patterns that Apriori relies on is the downward closure property, which says:

Every subset of a frequent itemset must also be frequent.

For example, if the itemset {peanut butter, bread, jam} is frequent, then its subset {bread, jam} must be frequent as well.

This holds because every transaction (in a convenience store, say) that contains {peanut butter, bread, jam} also contains {bread, jam}, so the subset appears at least as often as the superset.
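To make the property concrete, here is a minimal Python sketch that counts support for the two itemsets from the example. The transactions and the helper name `support_count` are made up for illustration; they are not part of any library.

```python
from typing import FrozenSet, List

# Hypothetical convenience-store transactions, invented for illustration.
transactions: List[FrozenSet[str]] = [
    frozenset({"peanut butter", "bread", "jam"}),
    frozenset({"bread", "jam"}),
    frozenset({"peanut butter", "bread", "jam", "milk"}),
    frozenset({"milk", "bread"}),
]


def support_count(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Return the number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


superset = frozenset({"peanut butter", "bread", "jam"})
subset = frozenset({"bread", "jam"})

print(support_count(superset, transactions))  # 2
print(support_count(subset, transactions))    # 3
```

The subset can never have a lower count than the superset, because each transaction counted for the superset is also counted for the subset.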

The Apriori algorithm follows a two-step generate-and-test strategy, usually referred to as join and prune.

Because of the downward closure property, no superset of an infrequent itemset needs to be generated or tested.

The Apriori algorithm builds itemsets level by level: first singletons, then pairs, then triplets, and so on.
Candidate generation is the join step: candidate k-itemsets are formed by combining pairs of frequent (k-1)-itemsets, not by combining arbitrary transactions.
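The join and prune steps can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it joins by taking unions of frequent (k-1)-itemsets, whereas textbook versions usually join itemsets that share a sorted prefix. The function name `generate_candidates` and the sample itemsets are assumptions made for this example.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def generate_candidates(frequent_prev: Iterable[FrozenSet[str]], k: int) -> Set[FrozenSet[str]]:
    """Join frequent (k-1)-itemsets into k-itemsets, then prune by downward closure."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            # Join: keep only unions that have exactly k items.
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of the candidate must itself be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets.
frequent_2 = [
    frozenset({"bread", "jam"}),
    frozenset({"bread", "peanut butter"}),
    frozenset({"jam", "peanut butter"}),
    frozenset({"bread", "milk"}),
]

candidates_3 = generate_candidates(frequent_2, 3)
print([sorted(c) for c in candidates_3])  # [['bread', 'jam', 'peanut butter']]
```

Candidates such as {bread, jam, milk} are discarded without scanning the data, because their subset {jam, milk} is not among the frequent 2-itemsets.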

Data Collection: Collect the transactions (itemsets) on which the Apriori algorithm will run.
Prune: Eliminate itemsets whose frequency in the transactions falls below the minimum support level.
Join: Generate the candidate k-itemsets, where k is the itemset size at the current level (it starts at 1 and increases by one in each iteration), by joining the frequent (k-1)-itemsets.
Test the candidate itemsets: Check whether the frequency of each candidate is above or below the support level. If it is at or above the support level, the itemset is kept and used in the next Join step. If it is below the support level, the itemset goes to the Prune step and is eliminated.
Frequent Itemsets: The itemsets that remain after meeting the support level are the frequent itemsets.
Terminate: When no more candidate itemsets or frequent itemsets can be created, the algorithm stops.
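The steps above can be put together in a short Python sketch. It is a minimal illustration under a few assumptions: `min_support` is an absolute transaction count rather than a fraction, the join uses set unions, and the sample transactions are invented. Production implementations use more efficient counting structures.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return {itemset: support count} for every frequent itemset."""
    data: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts: Dict[FrozenSet[str], int] = {}
    for t in data:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:  # Terminate when a level yields no frequent itemsets.
        prev = set(current)
        # Join: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Test: one full scan of the data to count each surviving candidate.
        counts = {c: sum(1 for t in data if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions, invented for illustration.
sample = [
    {"peanut butter", "bread", "jam"},
    {"bread", "jam"},
    {"peanut butter", "bread", "jam", "milk"},
    {"milk", "bread"},
]

result = apriori(sample, min_support=2)
print(len(result))  # 9
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 2, this sample yields four frequent singletons, four frequent pairs, and one frequent triplet; the loop ends at k = 4 because no candidate can be formed from a single frequent 3-itemset.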

Let us cement this concept with the help of a diagram:

Advantages of the Apriori algorithm:

The knowledge obtained from the algorithm is intuitive and simple to comprehend.
The algorithm is easy to implement, even for large datasets containing many itemsets.
It is straightforward both theoretically and programmatically.

Disadvantages of the Apriori algorithm:

The technique does not perform well on very small datasets, where it is quite likely to produce spurious associations.
A complete scan of the dataset is required at every level to count support.
As the number of items increases, the number of candidate itemsets, and with it the runtime, can grow exponentially.
Every candidate itemset must be held in memory, so memory consumption is high.
If the dataset is large and the minimum support level is low, the Apriori approach has a significant computing cost.
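The exponential growth mentioned above can be made concrete with a small calculation. With n distinct items there are 2^n - 1 possible non-empty itemsets, which is the worst-case size of the space that candidates are drawn from; pruning usually keeps the real number far lower. The function name `count_possible_itemsets` is made up for this sketch.

```python
from math import comb


def count_possible_itemsets(n_items: int) -> int:
    """Number of non-empty itemsets over n_items distinct items (equals 2**n_items - 1)."""
    return sum(comb(n_items, k) for k in range(1, n_items + 1))


for n in (5, 10, 20):
    print(n, count_possible_itemsets(n))
# 5 31
# 10 1023
# 20 1048575
```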

To overcome these drawbacks, several variations of the algorithm can be used:

Variations that incorporate hashing and transaction reduction can improve the algorithm's efficiency.

Sampling and partitioning, on the other hand, can reduce the number of data scans to one or two.
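As a rough illustration of the partitioning idea, the sketch below makes two passes over the data: the first finds itemsets that are frequent inside at least one partition, and the second counts only those candidates over the whole dataset. It rests on several assumptions: `min_fraction` is a relative support threshold, the local miner is a brute-force stand-in that is only practical for tiny examples, and all names and data are invented for illustration.

```python
from itertools import combinations
from math import ceil
from typing import Dict, FrozenSet, Iterable, List, Set


def local_frequent(partition: List[FrozenSet[str]], min_count: int) -> Set[FrozenSet[str]]:
    """Brute-force stand-in for a local miner: count every subset of every transaction."""
    counts: Dict[FrozenSet[str], int] = {}
    for t in partition:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}


def partitioned_frequent(
    transactions: Iterable[Iterable[str]], min_fraction: float, n_partitions: int
) -> Dict[FrozenSet[str], int]:
    """Two-scan mining: local candidates per partition, then one global count."""
    data = [frozenset(t) for t in transactions]
    if not data:
        return {}
    size = ceil(len(data) / n_partitions)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]

    # Scan 1: an itemset that is frequent globally must be frequent
    # in at least one partition, so the union of local results is a
    # complete candidate set.
    candidates: Set[FrozenSet[str]] = set()
    for part in partitions:
        candidates |= local_frequent(part, ceil(min_fraction * len(part)))

    # Scan 2: count the candidates over the full dataset.
    global_min = ceil(min_fraction * len(data))
    counts = {c: sum(1 for t in data if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}


# Hypothetical transactions, invented for illustration.
sample_transactions = [
    {"peanut butter", "bread", "jam"},
    {"bread", "jam"},
    {"peanut butter", "bread", "jam", "milk"},
    {"milk", "bread"},
]

partitioned_result = partitioned_frequent(sample_transactions, min_fraction=0.5, n_partitions=2)
print(len(partitioned_result))  # 9
```

In practice each partition would be sized to fit in memory and mined with Apriori itself rather than by brute force.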

These variations of the Apriori algorithm are discussed in the next article.