Types of Pattern-Mining Approaches with Spark

How to apply different types of pattern mining models with Spark — a PySpark guide to implementing frequent pattern mining

Pınar Ersoy
ANOLYTICS
4 min read · Dec 6, 2023


For the last decade, mining frequently occurring itemsets has often been among the first steps in breaking down an enormous dataset, and it remains an active research subject in big data and data mining. Most earlier work adopted an Apriori-like approach. However, generating candidate itemsets can be expensive, particularly when there are many long or similar patterns.

Data & Pattern Mining in Information Systems on a Schema (Owned by the author)

The FP-growth algorithm uses an extended prefix-tree structure, the FP-tree, to store compressed, essential information about frequent patterns, and mines the complete set of frequent patterns by pattern fragment growth. Its efficiency rests on three techniques, as detailed in the paper by Han et al.:

(i) A large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans
(ii) FP-tree-based mining adopts a pattern-fragment-growth method, which avoids the costly generation of large numbers of candidate sets
(iii) A partitioning-based, divide-and-conquer method decomposes the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space
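To make the terminology concrete, the following plain-Python sketch (an illustration with made-up toy transactions, not FP-growth itself) shows what "frequent itemsets" are: a brute-force count of every candidate subset against every transaction, which is precisely the repeated-scan cost that the FP-tree is designed to avoid.

```python
from itertools import combinations

# Toy transactions; an itemset is "frequent" if it appears in at least
# min_support (here 60%) of them. This brute-force approach rescans all
# transactions for every candidate -- the cost FP-growth eliminates.
transactions = [
    {10, 22, 37},
    {10, 22},
    {10, 37},
    {22, 37},
    {10, 22},
]
min_support = 0.6

items = sorted({i for t in transactions for i in t})
frequent = {}
for size in (1, 2, 3):
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count / len(transactions) >= min_support:
            frequent[candidate] = count

# frequent now maps each frequent itemset to its count,
# e.g. the pair (10, 22) occurs in 3 of the 5 transactions
```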

The FP-growth method is efficient and scalable for mining both long and short frequent patterns, and it is about an order of magnitude faster than the Apriori algorithm, as well as faster than several more recent frequent-pattern-mining methods.

1. FP-Growth (No Time-Sequenced)

Unlike Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix-tree (FP-tree) structure to encode transactions without explicitly generating candidate sets, which are usually expensive to produce. After this step, the frequent itemsets can be extracted from the FP-tree. Spark's machine learning library provides a parallel version of this approach, the Parallel FP-Growth algorithm, which is described in detail in the article by Li et al.

A sample script can be found below, starting from importing libraries to printing the output.

from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row

For trial purposes, a sample data frame can be formed with the script below. Note that FPGrowth expects each row to hold a single array of distinct items (in a column named items by default), not a sequence of itemsets:

dataFrame_fp = spark.createDataFrame([
    (0, [10, 22, 37]),
    (1, [10, 22]),
    (2, [10, 22, 37, 51]),
    (3, [10, 22, 51]),
    (4, [61]),
], ["id", "items"])

The FPGrowth estimator takes several parameters, including minSupport and minConfidence; any parameter left unset falls back to its default value. They can be assigned as in the following code snippet.

fp_growth_function = FPGrowth(minSupport=0.3, minConfidence=0.6)
fp_growth_model = fp_growth_function.fit(dataFrame_fp)
fp_growth_model.setPredictionCol("new_model_fpm")

Once the model is fitted, the frequent itemsets and association rules it discovered can be displayed with the following scripts:

fp_growth_model.freqItemsets.show()

fp_growth_model.associationRules.show()
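The minConfidence threshold used above filters association rules by their confidence. As a plain-Python illustration (not Spark's implementation, and using made-up toy support counts), confidence is derived directly from itemset supports: confidence(A ⇒ B) = support(A ∪ B) / support(A).

```python
# Toy support counts for the itemsets involved in a candidate rule.
support = {
    frozenset({10}): 4,
    frozenset({22}): 4,
    frozenset({10, 22}): 3,
}

def confidence(antecedent, consequent):
    # confidence(A => B) = support(A union B) / support(A)
    union = frozenset(antecedent) | frozenset(consequent)
    return support[union] / support[frozenset(antecedent)]

conf = confidence({10}, {22})  # 3 / 4 = 0.75, so the rule {10} => {22}
                               # survives a minConfidence of 0.6
```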

2. PrefixSpan (Time-Sequenced)

Sequential pattern mining is an important data-mining problem with broad applications. It is also a challenging one, since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences, as explained in the paper by Pei et al.
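To make the notion of a sequential pattern concrete before turning to Spark, here is a small plain-Python sketch (an illustration, not Spark's PrefixSpan algorithm): a pattern is an ordered list of itemsets, and it is contained in a data sequence if its itemsets appear, in order, as subsets of the sequence's itemsets. The toy sequences mirror the sample data frame used below.

```python
# Check whether `pattern` (a list of itemsets) occurs, in order, inside
# `sequence` (a list of itemsets); each pattern step must be a subset of
# some later element of the sequence.
def contains(sequence, pattern):
    pos = 0
    for step in pattern:
        while pos < len(sequence) and not step <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

sequences = [
    [{1, 2}, {3}],
    [{1}, {3, 2}, {1, 2}],
    [{1, 2, 3}, {2, 5}, {2, 6}],
    [{1, 2}, {5}],
    [{6}],
]

# The pattern "1, then later 2" is supported by 2 of the 5 sequences
pattern = [{1}, {2}]
support = sum(contains(s, pattern) for s in sequences)
```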

For the implementation, the required libraries can first be imported using the following script:

from pyspark.ml.fpm import PrefixSpan
from pyspark.sql import Row

After importing Spark MLlib’s PrefixSpan package, a sample data frame can be generated.

dataFrame = spark.createDataFrame([Row(sequence=[[1, 2], [3]]),
                                   Row(sequence=[[1], [3, 2], [1, 2]]),
                                   Row(sequence=[[1, 2, 3], [2, 5], [2, 6]]),
                                   Row(sequence=[[1, 2], [5]]),
                                   Row(sequence=[[6]])])

The PrefixSpan algorithm exposes several parameters, such as minSupport, maxPatternLength, and maxLocalProjDBSize; any parameter left unset falls back to its default value. They can be assigned as in the following code snippet.

prefix_span_function = PrefixSpan()
prefix_span_function.setMinSupport(0.1)
prefix_span_function.setMaxPatternLength(4)

When all the parameter assignments are completed, the function can be run on the created data frame to print the frequent sequential patterns it finds:

prefix_span_function.findFrequentSequentialPatterns(dataFrame).sort("sequence").show(truncate=False)

Conclusion

Spark MLlib supports several machine learning approaches, including frequent pattern mining through two distinct algorithms: the FPGrowth and PrefixSpan packages. The two methods differ in whether they take the ordering of events into account.

In the event of a lack of time-related information for the items or products in the dataset, you may use FPGrowth. On the other hand, if the data frame includes time dependency, it would be beneficial to use the PrefixSpan methodology since this information may expose any hidden patterns with respect to the timeline.

Questions and comments are highly appreciated!
