3.4) Association Rule Mining using ECLAT Algorithm

Hello! This article is a part of a full tutorial of Recommendation Systems which you can find here. I strongly recommend to follow the agenda of this tutorial if you want to master this topic.

In the previous article we discussed the Apriori algorithm, now we will study the Eclat algorithm and main differences between them.

Types of Association Rule Mining Algorithms

Horizontal vs Vertical Data Format

Existing mining algorithm of association rules can be broadly divided into two main categories: — Horizontal format mining algorithms and Vertical format mining algorithms. We have a matrix which shows transactions with items, these kind of matrix can be represented a horizontal or a vertical way.

The most commonly used layout is the horizontal data layout. That is, each transaction has a transaction identifier (TID) and a list of items occurring in that transaction, i.e., {TID:itemset}. Another commonly used layout is the vertical data layout n which the database consists of a set of items, each followed by the set of transaction identifiers containing the item, i.e., {item:TID_set}. Table 1 shows the horizontal layout and table 2 shows the vertical layout:

Table 2. Horizontal Layout
Table 2. Vertical Layout

Apriori algorithm uses horizontal format while Eclat can be used only for vertical format data sets. A number of vertical mining algorithms have been proposed recently for association mining, which has shown to be very effective and usually outperform horizontal approaches.

The Intuition of ECLAT Algorithm

Eclat algorithm is data mining algorithm which is used to find frequent items. As we already know, some association rule mining algorithm uses horizontal data format and some of them uses a vertical data format for generation of frequent itemsets. Eclat can not use horizontal database. If there is any horizontal database, then we need to convert into vertical database.

This vertical approach of the ECLAT algorithm makes it a faster algorithm than the Apriori (but sometimes when intermediate results of vertical tid lists become too large for memory, thus affecting the algorithm scalability). In Apriori algorithm we need to scan database again and again for finding frequent itemsets, this limitation is reduced by using vertical dataset in Eclat. Eclat needs to scan the database only once.

While the Apriori algorithm works in a horizontal sense imitating the Breadth-First Search of a graph, the ECLAT algorithm works in a vertical manner just like the Depth-First Search of a graph which is usually more fast than Breadth-First search.

We already discuss Breadth-First Search in Apriori, but the itemset lattice can also be traversed in a depth-first manner as shown in the picture below. The algorithm can start from, say, node a and counts its support to determine whether it is frequent. If so the algorithm progressively expands to the next level of nodes i.e ab, abc, and so on, until an infrequent node is reached , say abcd. It then backtracks to another branch, say abce, and continues the search from here.

Generating candidate itemsets using the depth-first approach

Eclat uses a purely vertical transaction representation. No subset tests and no subset generation are needed to compute the support. Generally, Transaction Id sets which are also called as tidsets are used to calculate the value of Support value. The support of item sets is determined by intersecting transaction lists.

Eclat: Calculate support count for an itemset by intersecting tid-lists of two of its (k-1) subsets

In the first call of function, all single items or data are used along with their respective tidsets. Then the function is called recursive, in each recursive call, each item in tidsets pair is verified and combined with other item in tidsets pairs. This process is repeated until no candidate item in tidsets pairs can be combined.

The main advantage of the vertical format is support for fast frequency counting via intersection operations on transaction ids (tids) and automatic pruning of irrelevant data. The main problem with these approaches is when intermediate results of vertical tid lists become too large for memory, thus affecting the algorithm scalability.

Let’s see an example of how Eclat works on vertical database:

Note: Using Eclat we only count support, because we only have item-sets and their supports. While we are not creating the rules, we do not need to calculate the confidence.

  1. Vertical data format. It is important to note that an item should not be appear more than once in the same transaction and also the items are assumed to be sorted by lexicographical order in a transaction.
Here we have our transactions data represented in a vertical format in a right side{item:TID_set}

2. Here is the first call of function and all single items or data are used along with their respective tidsets. Each frequent itemset is marked with its corresponding support value. The support of an itemset is given by number of times the itemset appears in the transaction database. Let’s say min_support=2

2) Frequent 1-itemsets. We have itemsets with lenght=1 and their supports. Itemsets whose support ≤ min_support will be filtered out

3) Here is recursive call of function (now k=2) and itemsets are generated. Let’s say min_support=2. So all itemsets whose support<min_support will be filtered out.

3) Frequent 2-itemsets. Generate itemsets (k=2) from previous frequent item-sets

4)Here is recursive call of function (now k=3) and frequent itemsets are generated.

4) Frequent 4-itemset

This process repeats, with k incremented by 1 each time, until no frequent items or no candidate itemsets can be found.

The end result of Eclat algorithm is frequent item-sets with their support. We are not creating the rules with calculating confidence. We assume that frequent itemset is important and support is enough information to generate recommendations.

Advantages of Eclat:

  1. Since the Eclat algorithm uses a Depth-First Search approach, it consumes less memory than the Apriori algorithm
  2. The Eclat algorithm does not involve in the repeated scanning of the data in order to calculate the individual support values
  3. Eclat algorithm scans the currently generated dataset unlike Apriori which scans the original dataset

Eclat vs Apriori Use Case:

The Eclat algorithm is naturally faster compared to the Apriori algorithm while the size of data set is small or medium, in case of large dataset there is a chance that Apriori performs more faster, because intermediate Tidsets which are created in Eclat algorithm consumes more space in memory than Apriori and when we lave a large dataset intermediate results of vertical tid lists become too large for memory, thus affecting the algorithm scalability. So Eclat algorithm is better suited for small and medium datasets where as Apriori algorithm is used for large datasets.

Note: If you want to master different techniques of data mining and association rules in general you can read this articles: link0, link1 and link2.

Congratulations!

You have completed non-personalized recommendations and the final part is the implementation of Apriori and Eclat algorithms using Python which we will discuss in the next article. You can find the python codes here.

--

--