ML Algorithms: One SD (σ)- Association Rule Learning Algorithms

An intro to machine learning association rule learning algorithms

Sagi Shaier
Mar 18 · 5 min read

The obvious questions to ask when facing a wide variety of machine learning algorithms, is “which algorithm is better for a specific task, and which one should I use?”

This is one section of the many algorithms I wrote about in a previous article. In this part I tried to display and briefly explain the main algorithms (though not all of them) that are available for association rule learning tasks as simply as possible.

Association rule learning:

· Apriori

To start with, the name of this algorithm is Apriori because it uses prior knowledge of frequent itemset properties. It is useful in mining frequent itemsets (a collection of one or more items) and relevant association rules. You usually use this algorithm on a database containing a large number of transactions. For example, the items customers buy at a supermarket. The Apriori algorithm reduces the number of candidates with the following principle: If an itemset is frequent, ALL of its subsets are frequent.

Some definitions:

Example:

Now generate Association rule:

To find the confidence:

Some things to consider:

It has great significance in data mining.

The resulting rules are intuitive and easy to communicate to an end user.

It is easy to implement.

It doesn’t require labeled data as it is fully unsupervised.

It can be very slow.

If the dataset is small it can find many false associations that happened by chance.

It requires many database scans.

· ECLAT (Equivalence Class Transformation)

The algorithm scans the dataset and finds itemsets that occur more frequently in the transaction than a given threshold. The biggest difference from the Apriori algorithm is that it uses Depth First Search instead of Breadth First Search. In the Apriori algorithm, the element based on the product is used, but in Eclat algorithm, the transaction is passed on by the elements. ECLAT improves Apriori in the step of Extracting frequent itemsets (Apriori has to scan the data multiple times, but ECLAT don’t need to).

Some things to consider:

Apriori use large data set while Eclat use small and medium data set.

Aprioir is slower than Eclat.

· FP (Frequent Pattern) Growth

FP growth algorithm is an improvement of Apriori algorithm. Similarly to Apriori, it helps perform a market basket analysis on transaction data. Basically, it’s trying to identify sets of products that are frequently bought together. Both Apriori and FP have the same input and the same output: The input is a transaction database and a minimum support threshold. The output is the set of itemsets having a support no less than the minimum support threshold. The difference between these algorithms is how (the process) they generate the output.

To compare the two: Apriori uses a level-wise approach (it will generate patterns containing 1 items, then 2 items, 3 items, etc.). Moreover, it will recurrently scan the database to count the support of each pattern. FP Growth though uses a depth-first search instead of a breadth first search and uses a pattern-growth approach (meaning that unlike Apriori, it only considers patterns actually existing in the database).

FP uses a compressed representation of the database using an FP-tree. Once it builds the tree, it uses a recursive divide-and-conquer method to mine the frequent itemsets. FP-Growth is preferred to Apriori because Apriori takes much more execution time for repeated scanning of the transaction dataset to mine the frequent items.

Example:

Some things to consider:

Studies have shown that it is an order of magnitude faster than Apriori.

No candidate generation.

Only two passes over dataset.

FP tree may not fit in memory.

FP tree is very expensive to build.

Until next time,

Bobcat.

Sagi Shaier

Written by

Recent Summa Cum Laude graduate in Computational and Applied Math, minor in Statistics, Pre-Med track. My sights are set on helping people through Data Science.