Data Streams Mining

Kashif Rabbani
Published in The Startup · Jul 21, 2020

1 Introduction

The pace of technological change is at its peak, and Silicon Valley keeps introducing new challenges that must be tackled in new and efficient ways. Research is continuously carried out to improve existing tools, techniques, and algorithms. Streaming data has remained a challenge for decades: plenty of stream-based algorithms have been introduced, yet researchers are still striving for better results and performance. When water from a fire hose starts hitting your face, your chances of measuring it shrink rapidly; that is the torrent-like nature of streams, and it raises new challenges for analyzing and mining them efficiently. Stream analysis has been made easier to some extent by new tools introduced to the market recently, which follow different approaches and algorithms that are being improved continuously. When it comes to mining data streams, however, it is not possible to store and iterate over the data as traditional mining algorithms do, because of their continuous, high-speed, and unbounded nature.

Due to the irregularity and variation of the arriving data, memory management has become the main challenge. Applications such as sensor networks cannot afford mining algorithms with a high memory cost. Time management, data preprocessing techniques, and the choice of data structure are likewise considered major challenges in stream mining algorithms. Summarization techniques derived from statistics are therefore used to deal with the memory limitation, while techniques from computational theory help make algorithms more time- and space-efficient. Another challenge is the consumption of available resources; to cope with it, resource-aware mining was introduced, which ensures that an algorithm always works within the resources it has available.

Since a data stream is seen only once, it must be mined in a single pass, which calls for extremely fast algorithms that avoid problems such as data sampling and shredding. Such algorithms should also be able to run over streams partitioned across many distributed processing units in parallel. Infinite, high-volume data streams are produced by many online and offline real-time applications and systems, and their update rates are time-dependent. Because of this volume and speed, a special mechanism is required to extract knowledge from them.

Many stream mining algorithms have been developed by the machine learning, statistics, and theoretical computer science communities. The question is: how do we know which algorithm deals best with the challenges mentioned above, and what is still missing in the field? This document intends to answer these questions. As the research area is vast, deciding on the best algorithm is not straightforward. We compare the most recently published stream mining algorithms across three categories: classification, clustering, and frequent itemset mining, the last being the category of algorithms used to gather statistics about streaming data.

2 Classification

The classification task is to decide the proper label for any given record from a dataset. It is part of supervised learning: the algorithm learns patterns and important features from a set of labeled data (ground truths), resulting in a model, which is then used for classification. Various metrics rate the performance of such a model. Accuracy, for example, focuses on maximizing the number of correct labels, while specificity focuses on minimizing the mislabeling of the negative class. A few factors are crucial in deciding which metric to use, such as the label distribution and the purpose of the task itself.

There are also several families of classification algorithms, such as decision trees, logistic regression, neural networks, and Naive Bayes. In this work, we focus on decision trees.

In a decision tree, the learning algorithm constructs a tree-like model in which each internal node tests a splitting attribute and each leaf holds a predicted label. To classify an item, the tree routes it according to the splitting attributes down to the leaf containing the predicted label.

2.1 Hoeffding Trees

Classical decision tree algorithms such as ID3 and C4.5 build trees from large amounts of data by recursively selecting the best attribute to split on, using metrics such as entropy-based information gain or the Gini index. However, these algorithms are not suitable when the training data cannot fit in memory.

There exist a few incremental learning methods in which the learning system, instead of fitting the entire dataset in memory at once, learns continuously from the stream of data. However, such models have been found to lack the correctness guarantees that batch learning provides for the same amount of data.

Domingos and Hulten [1] formulated a decision tree algorithm called the Hoeffding Tree. In a Hoeffding Tree, the records (training instances) themselves are not kept in memory; only the tree nodes and their statistics are stored. Its most interesting property is that, given sufficiently massive data, its output converges to the tree built by a batch learning algorithm.

The training method for this tree is simple: for each sample, sort it down to the corresponding leaf and update that leaf's statistics.

Two conditions must be fulfilled for a leaf to be split:

1. There is impurity in the leaf node, i.e. not every record sorted to that leaf has the same class.

2. The difference ∆G between the evaluation function's value for the best attribute and for the second-best attribute is greater than E, where

E = √( R² ln(1/δ) / 2n )

Here R is the range of the attribute evaluation measure, δ (provided by the user) is the acceptable probability that the estimate is off by more than E, and n is the number of samples collected at that node.

In the paper, it is rigorously proven that the error of this tree is bounded by the Hoeffding inequality. Another excellent property follows from the formula: to shrink the failure probability δ exponentially, the sample size n only needs to grow linearly.
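To make the split test concrete, here is a minimal Python sketch of the bound and the resulting decision; the function names and the example numbers are ours, not from the paper:

```python
import math

def hoeffding_bound(R, delta, n):
    """E = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of a quantity with range R is within E of the mean
    observed over n samples."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_G, second_G, R, delta, n):
    """Split when the observed gain gap Delta-G exceeds the bound E."""
    return (best_G - second_G) > hoeffding_bound(R, delta, n)

# Example: gains 0.25 vs 0.17; R = log2(2) = 1 for info gain on 2 classes
print(should_split(0.25, 0.17, R=1.0, delta=1e-6, n=5000))  # True
```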

2.2 VFDT Algorithm

Domingos and Hulten further introduced a refinement of the Hoeffding Tree called VFDT (Very Fast Decision Tree). The main idea remains the same as the Hoeffding Tree's.

The refinements are the following (a short sketch after the list illustrates the first two):

Ties. VFDT introduces an extra parameter τ, used when the gains of the best and second-best attributes are too similar to separate: once the bound E falls below τ, waiting for more samples is pointless and the leaf is split on the current best attribute.

G computation. Another parameter, nmin, denotes the minimum number of new samples that must arrive before G is recomputed. The computation of G can thus be deferred instead of being repeated every time a new sample arrives, which reduces the overall time spent on frequent recalculations of G.

Memory. VFDT introduces a mechanism for pruning the least promising leaves whenever the maximum available memory is fully utilized. The criterion used to decide whether a leaf is to be pruned is the product pl × el, where pl is the probability that a random example reaches that leaf and el is its observed error rate. The leaf with the lowest value of this criterion is considered the least promising and is deactivated.

Dropping poor attributes. Another performance boost comes from dropping unpromising attributes early: if the difference between an attribute's evaluation value and the best attribute's value is greater than E, that attribute can be dropped. However, the paper does not explain the exact parameters or situations under which an attribute is dropped.

Initialization. VFDT can be bootstrapped with a tree produced by an existing memory-based decision tree learner, allowing it to reach the same accuracy with a smaller number of examples. No detailed algorithm is provided, however.
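A hedged sketch of how the τ tie-break and the nmin gate could fit together; the Leaf class, parameter defaults, and method names are illustrative assumptions, not the paper's code:

```python
import math

def vfdt_should_split(best_G, second_G, R, delta, n, tau=0.05):
    """Hoeffding test plus VFDT tie-breaking: also split on a near-tie
    once the bound E is tighter than tau (tau=0.05 is an arbitrary pick)."""
    E = math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))
    return (best_G - second_G) > E or E < tau

class Leaf:
    """Leaf that defers the expensive gain recomputation until nmin new
    samples have arrived, per VFDT's nmin refinement."""
    def __init__(self, nmin=200):
        self.n_seen = 0
        self.nmin = nmin

    def update(self, x, y):
        self.n_seen += 1
        # ... update per-attribute sufficient statistics with (x, y) ...
        if self.n_seen % self.nmin == 0:  # G recomputed only every nmin samples
            self.attempt_split()

    def attempt_split(self):
        pass  # recompute G per attribute, then consult vfdt_should_split(...)
```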

2.3 Hoeffding Adaptive Trees

One common fallacy in data mining is the assumption that the distribution of the data remains stationary. This is often not the case: consider supermarket retail data, which can change rapidly from season to season. Such a phenomenon is called concept drift.

As a solution, Bifet and Gavaldà [2] proposed sliding-window and adaptivity-based enhancements to the Hoeffding Tree. The algorithm to build such a tree relies on the authors' previous work, the ADWIN (Adaptive Windowing) algorithm, a parameter-free algorithm to detect and estimate changes in a data stream.

In building an adaptive learning algorithm, one needs to be able to decide three things:

What needs to be remembered?

When is the correct time to update the model?

How should the model be updated?

Therefore a procedure is needed that can predict and detect changes in the data distribution. Here, this role is served by the ADWIN algorithm mentioned before.
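For intuition, here is a heavily simplified, illustrative change detector in the spirit of ADWIN. The real ADWIN examines every split of an exponentially compressed window; this sketch only tests the half/half split of a bounded window of values in [0, 1], with our own parameter choices:

```python
import math
from collections import deque

class SimpleADWIN:
    """Illustrative sketch, not the real ADWIN algorithm."""

    def __init__(self, delta=0.002, max_len=1000):
        self.delta = delta
        self.window = deque(maxlen=max_len)

    def add(self, x):
        """Feed one value (e.g. 0/1 prediction errors); True if drift found."""
        self.window.append(x)
        return self._change_detected()

    def _change_detected(self):
        n = len(self.window)
        if n < 10:
            return False
        half = n // 2
        w = list(self.window)
        m0 = sum(w[:half]) / half          # mean of the older half
        m1 = sum(w[half:]) / (n - half)    # mean of the newer half
        m = 1.0 / (1.0 / half + 1.0 / (n - half))  # harmonic sample size
        eps = math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))
        if abs(m0 - m1) > eps:             # Hoeffding-style threshold
            self.window.clear()            # simplified: restart after a change
            return True
        return False
```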

The main idea of the Hoeffding Adaptive Tree is that, aside from the main tree, alternative trees are created as well. An alternative tree is created as soon as a distribution change is detected in the data stream, and it replaces the main tree when there is evidence that it is more accurate. Ultimately, the change and adaptation of trees happen automatically, judged from the timing and nature of the data rather than from prior knowledge supplied by the user. Note that, having said that, the algorithm still grows and splits the tree according to the Hoeffding bound, similar to VFDT.

In experiments, the authors mainly compared the Hoeffding Adaptive Tree with CVFDT (Concept-adapting Very Fast Decision Tree). CVFDT was formulated by the authors of VFDT and is basically VFDT extended to handle concept drift. In terms of error rate on a synthetically generated dataset with massive concept change, the Hoeffding Adaptive Tree achieved a lower error rate more quickly than CVFDT, i.e. it adapted faster, and it also halved memory consumption. The drawback is running time: about four times longer than CVFDT.

3 Clustering

Clustering partitions a given set of objects into groups, called clusters, such that each group contains similar objects and is clearly different from the other groups. The core of the clustering process is to simplify the data by grouping similar objects so that a cluster can be replaced by one or a few representatives. Clustering algorithms are thus tools for grouping high-volume datasets. We have selected three recent clustering algorithms and compared them with others based on performance criteria: efficient creation of clusters, the capability to handle a large number of clusters, and the chosen data structure.

3.1 Stream KM++ Clustering Algorithm

The Stream KM++ clustering algorithm is based on the ideas of k-means++ and Lloyd's algorithm (also called the k-means algorithm) [3]. Lloyd's algorithm is one of the most famous clustering algorithms: it alternates between assigning each point to the nearest center in the current set of centers and recomputing each center as the mean of the points assigned to it. k-means++ serves as a seeding method for Lloyd's algorithm; it gives good practical results and guarantees a quality solution. Neither algorithm is suitable for data streams, as both require random access to the input data.

Stream KM++ computes a small weighted sample representative of the data points (known as a coreset) via non-uniform sampling in one pass; it then runs k-means++ on the computed sample, and in a second pass points are greedily assigned to the nearest cluster center. Non-uniform sampling is time-consuming, but the use of coreset trees decreases this time significantly. A coreset tree is a binary tree associated with a hierarchical divisive clustering of a given set of points: one starts with a single cluster containing all points and successively partitions existing clusters into two sub-clusters, such that the points in one sub-cluster are far from the points in the other. The stream is processed with the merge-and-reduce technique: whenever two samples representing the same number of input points exist, their union is taken in the merge step, and a new, smaller sample is produced in the reduce step using coreset trees [4].
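For intuition, a minimal Python sketch of the k-means++ seeding step that Stream KM++ runs on the coreset; this plain version ignores the coreset weights, which the real algorithm takes into account:

```python
import random

def kmeans_pp_seed(points, k):
    """k-means++ seeding: first center uniform at random, each further
    center chosen with probability proportional to the squared distance
    to its nearest already-chosen center."""
    centers = [random.choice(points)]
    while len(centers) < k:
        # squared distance of every point to its nearest chosen center
        d2 = [min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                  for c in centers) for p in points]
        r = random.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 0)]
print(kmeans_pp_seed(pts, 3))  # three well-spread seeds, with high probability
```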

Compared with BIRCH (a top-down hierarchical clustering algorithm), Stream KM++ is slower, but in terms of the sum of squared errors it computes solutions better by up to a factor of 2, and it requires no trial-and-error parameter tuning. The quality of the StreamLS algorithm is comparable to Stream KM++, but the running time of Stream KM++ scales much better with the number of cluster centers. Stream KM++ is faster on large datasets and computes solutions on a par with k-means++.

3.2 Adaptive Clustering for Dynamic IoT Data Streams

A dynamic environment such as the IoT, where the distribution of the data streams changes over time, requires a clustering algorithm that can adapt to the flowing data. Many stream clustering algorithms depend on parameterization to find the number of clusters in a stream, and determining that number in unknown, flowing data is one of the key tasks in clustering. To deal with this problem, an adaptive clustering method was introduced by Puschmann, Barnaghi, and Tafazolli [5], designed specifically for IoT stream data. The method updates the cluster centroids upon detecting a change in the data stream by analyzing its distribution: it creates dynamic clusters and assigns data to them by examining the data distribution at a given time instance. It specializes in adapting to data drifts, where data drift describes real concept drift (explained in 2.3) caused by changes in the streaming data. It uses the data distribution and a measure of cluster quality to detect the number of categories inherent in the stream, and it works without prior knowledge about the data, thus discovering those inherent categories on its own.

A set of experiments was performed on synthesized data and a live intelligent-traffic scenario in [5]. The experiments ran both the adaptive and non-adaptive clustering algorithms and compared the results on a cluster quality metric, the silhouette coefficient. The results showed that the adaptive clustering method produces clusters of 12.2 percent better quality than the non-adaptive ones.

In comparison with the Stream KM++ algorithm explained in 3.1, it is worth noting that Stream KM++ is not designed for evolving data streams.

3.3 PERCH: An Online Hierarchical Algorithm for Extreme Clustering

The number of applications requiring clustering algorithms is increasing, and their requirements keep changing as the data they handle grows rapidly. Such modern clustering applications need algorithms that scale efficiently with data size and complexity.

While many of the currently available clustering algorithms can handle large, high-dimensional datasets, very few can handle datasets with many clusters; the same is true of stream clustering algorithms. Since streaming data can contain many clusters, this problem of having a large number of data points N together with many clusters K is known as the extreme clustering problem. The PERCH (Purity Enhancing Rotations for Cluster Hierarchies) algorithm scales mildly with both N and K and thus addresses the extreme clustering problem. Researchers at the University of Massachusetts Amherst published it in April 2017 [6].

This algorithm maintains a large tree data structure efficiently. The tree is constructed and grown incrementally over the incoming data points by routing them to leaves, while its quality is maintained via rotation operations. The choice of a rich tree data structure provides efficient (logarithmic-time) search that scales to large datasets, along with multiple clusterings that can be extracted at various resolutions. Greedy incremental clustering of this kind introduces some errors, which the rotation operations can recover from.

The paper [6] claims that this algorithm constructs a tree with perfect dendrogram purity regardless of the number of data points and without knowledge of the number of clusters, achieved through recursive rotation procedures. For scalability, another type of rotation operation is introduced that encourages balance, together with an approximation that enables faster point insertion. The algorithm also has a leaf-collapsing mode to cope with limited memory, i.e. when the dataset does not fit in main memory (as with data streams). In this mode, the algorithm expects an extra parameter, an upper bound on the number of leaves in the cluster tree; once the balance rotations are performed, a COLLAPSE procedure is invoked that merges leaves as necessary to respect the bound.

In comparison with other online and multipass tree-building algorithms, PERCH achieves the highest dendrogram purity while remaining efficient, and it is competitive with all other scalable clustering algorithms. Among algorithms both with and without a compact tree data structure, PERCH scales best with the number of clusters K. BIRCH, a top-down hierarchical method whose internal nodes summarize their leaves by mean and variance statistics used to insert points greedily, performs no rotation operations and produced the worst clusterings among the competitors. The comparison with Stream KM++ shows that coreset construction is an expensive operation that does not scale to the extreme clustering problem where K is very large.

The authors applied the PERCH algorithm to a variety of real-world datasets and showed it to be both accurate and efficient [6].

4 Frequent Itemset Mining

Frequent itemset mining refers to mining patterns or items that appear frequently in a dataset. Formally, assume there exists a set I of n distinct items {i1, i2, . . . , in}. A subset X ⊆ I is called a pattern. The data to be mined consists of transactions. If a pattern X is a subset of a transaction t, X ⊆ t, then X is said to occur in t. The basic metric of frequent itemset mining is support: the support of a pattern is the number of transactions in which that pattern occurs. Given a natural number min_sup as a parameter, any pattern whose support is greater than or equal to min_sup is called a frequent pattern.
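To make the definitions concrete, a minimal sketch of support counting over toy transactions (the data is invented for illustration):

```python
def support(pattern, transactions):
    """Support = number of transactions containing every item of the pattern."""
    p = set(pattern)
    return sum(1 for t in transactions if p <= set(t))

transactions = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "eggs"}]
print(support({"milk", "bread"}, transactions))  # 2: frequent if min_sup <= 2
```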

One of the most famous data structures for frequent itemset mining is the FP-tree [7]. However, the FP-tree requires multiple scans of the transaction database, which is very costly for fast-moving data streams; an ideal stream algorithm should work in a single pass.

A common recurring ingredient in data stream mining is the use of window models. According to Jin and Agrawal [8], there are three types of window models (a small sketch of the third follows the list):

1. Landmark window. The focus is to find frequent itemsets from a starting time point a to the current time point b. Consequently, if a is set to 1, the mining algorithm mines the entire data stream.

2. Sliding window. Given a time point b and a window length a, the algorithm mines items from time b − a + 1 to b; in other words, it only considers the items currently inside the window.

3. Damped window. In this model, more weight is given to newly arrived items, which can be achieved simply by assigning a decay rate to the itemset counts.
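As referenced above, a small sketch of the damped window model; the decay rate and the eager update strategy are illustrative choices, not prescribed by [8]:

```python
def damped_counts(stream, decay=0.98):
    """Decayed support counting: every arrival multiplies all existing
    counts by `decay`, so recent itemsets weigh more. This eager full
    pass is O(#itemsets) per arrival; real implementations usually decay
    lazily via timestamps."""
    counts = {}
    for itemset in stream:
        for key in counts:
            counts[key] *= decay
        counts[itemset] = counts.get(itemset, 0.0) + 1.0
    return counts

stream = [frozenset("AB"), frozenset("BC"), frozenset("AB")]
print(damped_counts(stream, decay=0.9))  # AB ~ 1.81, BC ~ 0.9
```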

4.1 Mining Maximal Frequent Itemsets in Data Streams Based on FP-Tree

This work [9] introduces a new algorithm, FpMFI-DS, an improvement of the FpMFI (frequent pattern tree for maximal frequent itemsets) algorithm [10] for data streams. FpMFI itself is an algorithm that compresses the FP-tree and checks superset patterns optimally.

FpMFI-DS is designed to store the transactions of a landmark window or sliding window. A consequence of adopting windows for mining is that the FP-tree must be updated when a transaction falls out of the window. This is handled with a tidlist, a list of transaction IDs each with a pointer to the last node of that transaction in the tree. Another important detail of FpMFI-DS: because a one-pass algorithm is required, items in the FP-tree are ordered lexicographically rather than by frequency.

A further improvement of FpMFI-DS is the introduction of a new technique called ESEquivPS. In the authors' experiments, it reduced the size of the search space by about 30%.

4.2 Mining Maximal Frequent Patterns in Transactional Databases and Dynamic Data Streams: A Spark-based Approach

In this work [11], Karim et al. describe how to build a tree-like structure to mine maximal frequent patterns effectively. A maximal frequent pattern is a frequent pattern none of whose proper supersets is itself frequent.

For example, assume that our transaction database contains the three patterns AB, BC, and ABD with supports 7, 5, and 3. With a minimum support of 2, all of them are frequent patterns. However, AB is not a maximal frequent pattern, since it is a subset of ABD, which is itself frequent.
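A tiny snippet reproducing the reasoning of this example (same toy data):

```python
patterns = {("A", "B"): 7, ("B", "C"): 5, ("A", "B", "D"): 3}
min_sup = 2
frequent = {p: s for p, s in patterns.items() if s >= min_sup}
# maximal = frequent patterns with no frequent proper superset
maximal = [p for p in frequent
           if not any(set(p) < set(q) for q in frequent)]
print(maximal)  # [('B', 'C'), ('A', 'B', 'D')]: AB is subsumed by ABD
```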

The authors use prime numbers to gain faster computation and a lower memory footprint. The idea is that each distinct item in the database is represented by a distinct prime number, and a transaction is represented by the product of the primes of its items, called its Transaction Value (TV). This encoding has a few interesting properties, illustrated by the sketch after the list:

1. A huge number of possible distinct items. For 32-bit integers there are 105,097,565 primes, so theoretically around 100 million different items can be represented. However, computing a Transaction Value may cause integer overflow, so a class like BigInteger is used.

2. No two different itemsets have the same Transaction Value. Since a Transaction Value is a product of distinct primes, unique factorization makes the mapping from itemsets to Transaction Values one-to-one.

3. The greatest common divisor yields the common items. If δ is the GCD of the Transaction Values of transactions α and β, the items common to the two transactions can be recovered by factoring δ.
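The sketch below illustrates these properties with a toy item-to-prime mapping; it is an illustration of the encoding, not the authors' implementation:

```python
from math import gcd
from itertools import count

def primes():
    """Naive prime generator (plenty fast for an illustration)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

items = ["A", "B", "C", "D"]                 # toy item universe
gen = primes()
code = {item: next(gen) for item in items}   # A->2, B->3, C->5, D->7

def transaction_value(transaction):
    """Transaction Value: product of the primes of the transaction's items."""
    v = 1
    for item in transaction:
        v *= code[item]
    return v

t1 = transaction_value(["A", "B", "D"])  # 2*3*7 = 42
t2 = transaction_value(["B", "C"])       # 3*5   = 15
print(gcd(t1, t2))                       # 3, i.e. the common item is B
```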

From the Transaction Values, a tree-like structure called the ASP-tree is constructed, in which each Transaction Value and its count are preserved. Furthermore, the tree maintains the following invariants:

1. Every node α is a descendant, direct or indirect, of all nodes whose TV is a multiple of the TV of α.

2. The count of each node is the total support of the transaction represented by its TV

The authors also introduce the MFPAS algorithm to generate the maximal frequent itemsets from the ASP-tree. The algorithm simply scans the tree bottom-up and does the necessary pruning to obtain the relevant Transaction Values, which are then decoded back into item lists. Interestingly, all the information needed to find the frequent itemsets is available in the tree, with no need to rescan the database.

The procedure is suitable for both batch and data stream environments, and the authors include a Spark implementation. The difference between the batch and stream settings lies only in using the appropriate Spark API, i.e. the Spark Streaming API for stream data, while the proposed algorithm itself remains intact.
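A hedged sketch of the batch variant of the Transaction Value counting step in plain PySpark; the item encoding and the data are toy assumptions, and this is not the authors' implementation:

```python
from math import prod
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="tv-count-sketch")
prime_of = {"A": 2, "B": 3, "C": 5, "D": 7}   # hypothetical item encoding

transactions = sc.parallelize([["A", "B", "D"], ["B", "C"], ["A", "B", "D"]])
tv_counts = (transactions
             .map(lambda t: (prod(prime_of[i] for i in t), 1))  # TV per transaction
             .reduceByKey(add)       # support per identical Transaction Value
             .collect())
print(tv_counts)                     # e.g. [(42, 2), (15, 1)]
```

For the stream case, the same map/reduce logic would run over a Spark Streaming DStream instead of an RDD.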

5 Summary Table
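Compiled from the sections above:

Algorithm | Category | Key idea | Concept drift
Hoeffding Tree / VFDT | Classification | One-pass tree induction; splits justified by the Hoeffding bound | Not handled
Hoeffding Adaptive Tree | Classification | ADWIN-monitored alternate subtrees replace the main tree | Handled
Stream KM++ | Clustering | Coreset via merge-and-reduce, then k-means++ on the coreset | Not handled
Adaptive IoT clustering [5] | Clustering | Centroids updated when the stream's distribution changes | Handled
PERCH | Clustering | Incremental cluster tree with rotations; targets large K | Not addressed
FpMFI-DS | Frequent itemsets | Windowed FP-tree, lexicographic ordering, ESEquivPS pruning | Window-based
MFPAS / ASP-tree [11] | Frequent itemsets | Prime-number Transaction Values; Spark batch and streaming | Via Spark Streaming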

6 Conclusion

In this report, we have surveyed recent streaming data algorithms. Each algorithm was explained briefly, with key points and comparisons against other algorithms of the same class, and a summary table condenses all the algorithms covered. We found that recently introduced algorithms have addressed the data problems (e.g. concept drift, data shredding, and sampling) and some of the main challenges (e.g. memory limitation and data structure choice) that were considered drawbacks of algorithms only a few years back. As the wheel of advancement has no destination, we expect further evolution in data stream mining algorithms, opening lines for further research.

References

[1] P. Domingos and G. Hulten, “Mining high-speed data streams,” in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’00. New York, NY, USA: ACM, 2000, pp. 71–80. [Online]. Available: http://doi.acm.org/10.1145/347090.347107

[2] A. Bifet and R. Gavaldà, "Adaptive learning from evolving data streams," in Advances in Intelligent Data Analysis VIII, N. M. Adams, C. Robardet, A. Siebes, and J.-F. Boulicaut, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 249–260.

[3] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007, pp. 1027–1035.

[4] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler, "StreamKM++: A clustering algorithm for data streams," 2010. [Online]. Available: https://pdfs.semanticscholar.org/0286/a1efab096db135715ee99e6ce122174b9f3e.pdf

[5] D. Puschmann, P. Barnaghi, and R. Tafazolli, "Adaptive clustering for dynamic IoT data streams," 2016. [Online]. Available: http://dx.doi.org/10.1109/JIOT.2016.2618909

[6] A. Kobren, N. Monath, A. Krishnamurthy, and A. McCallum, "An online hierarchical algorithm for extreme clustering," 2017. [Online]. Available: https://arxiv.org/abs/1704.01858

[7] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," SIGMOD Rec., vol. 29, no. 2, pp. 1–12, May 2000. [Online]. Available: http://doi.acm.org/10.1145/335191.335372

[8] R. Jin and G. Agrawal, “Frequent pattern mining in data streams,” in Data Streams. Springer, 2007, pp. 61–84.

[9] F. Ao, Y. Yan, J. Huang, and K. Huang, “Mining Maximal Frequent Itemsets in Data Streams Based on FP-Tree,” in Machine Learning and Data Mining in Pattern Recognition, P. Perner, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 479–489.

[10] Y. Yue-jin, L. Zhou-jun, and C. Huo-wang, “Efficiently mining of maximal frequent itemsets based on fp-tree.”

[11] M. R. Karim, M. Cochez, O. D. Beyan, C. F. Ahmed, and S. Decker, “Mining maximal frequent patterns in transactional databases and dynamic data streams: A spark-based approach,” Information Sciences, vol. 432, pp. 278–300, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S002002551731126X
