The PCY Algorithm and Its Friends

Dan Isaza
Weekly Data Science
13 min read · Jun 30, 2018


In previous posts, I’ve covered the Apriori Algorithm and hash functions in the context of data mining. If you’re unfamiliar with those concepts, I recommend taking a moment to read up on them.

(Cover photo: https://unsplash.com/photos/AWYI4-h3VnM)

The Apriori Algorithm finds frequent itemsets by making several passes over a dataset. In the first pass, individual items (sometimes called singletons) are counted, and those with high enough counts (or supports) are deemed to be frequent singletons. In the second pass over the data, only candidate pairs are counted. These pairs consist of two frequent singletons.

Most likely, the number of candidate pairs is considerably smaller than the number of possible pairs in the dataset. Even so, the number of candidate pairs can be quite high. This makes the second pass of the Apriori Algorithm very memory-intensive. In fact, it is often the most memory-intensive pass of the algorithm [1]. In these cases, reducing the memory requirements of the second pass would reduce the overall memory requirements for the algorithm.

The number of candidate pairs is closely related to the number of pairs that can be made using all of the frequent items. This is because any pair that consists of two frequent items is considered a candidate pair. That said, keep in mind that not all of these pairs will occur in the dataset. For example, even though {apple} and {banana} are both frequent items, it’s possible that the pair {apple, banana} will not appear in the dataset. Let’s call these pairs possible candidate pairs and reserve the term candidate pairs for the pairs that we actually count.

As a result, if we have N frequent items, the binomial coefficient N choose 2, which equals N(N-1)/2, gives us the number of possible candidate pairs. The plot below illustrates the number of possible candidate pairs as a function of the number of frequent items. We can see that the memory requirements grow quite quickly as the number of frequent items increases, since the number of possible candidate pairs skyrockets.

Again, keep in mind that some of these pairs may not necessarily show up in our dataset, but the number of pairs we’ll count (i.e. candidate pairs) is directly proportional to the number of possible candidate pairs.

[Plot: Possible Candidate Pairs vs. Frequent Items]

Here’s a similar graph, but with a logarithmic scale on the y-axis:

[Plot: the same data, with a logarithmic scale on the y-axis]
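
If you’d like to see just how quickly this blows up, here’s a quick back-of-the-envelope calculation in Python (an illustration only, using math.comb from Python 3.8+):

    from math import comb

    # Possible candidate pairs for N frequent items: N choose 2 = N * (N - 1) / 2
    for n in (100, 1_000, 10_000, 100_000):
        print(f"{n:>7} frequent items -> {comb(n, 2):>13,} possible candidate pairs")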

Relative to this, counting the individual items during the first pass doesn’t take a lot of memory, which means that there will be a lot of unused memory during the first pass of the Apriori Algorithm. Maybe it can be put to good use…

That’s exactly what Park, Chen, and Yu did in their paper An Effective Hash-Based Algorithm for Mining Association Rules; the algorithm is called PCY for short, after its authors [1]. It uses hashing during the first pass to reduce the number of candidate pairs that are considered in the second pass.

A Key Intuition

Imagine that instead of getting the counts of each item in the dataset, we counted the number of items that started with each letter of the alphabet.

If the number of items that started with the letter ‘a’ was less than our support threshold, we would know that none of the items that start with the letter ‘a’ are frequent. For example, if apples had been frequent, then the count of items starting with ‘a’ would have been above our threshold.

This is the key insight at the heart of the PCY Algorithm. Instead of applying it to single items, we apply it to pairs — since pairs cause our memory bottleneck. But the principle is the same.

(Note: In practice we use a modified hash table, not the first letters of items — but hopefully the example helps build an intuition for the algorithm)
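
Here’s that first-letter idea as a few lines of Python. It’s a toy (the item list and threshold are made up), but it shows how a coarse, aggregated count can rule out entire groups of items:

    from collections import Counter

    # Every item occurrence in the dataset, flattened (made-up data).
    items = ["apples", "avocados", "almonds", "almonds", "bacon", "bananas", "carrots"]
    support = 3

    # Count occurrences by first letter instead of by item.
    letter_counts = Counter(item[0] for item in items)  # {'a': 4, 'b': 2, 'c': 1}

    # Any letter whose total falls below the threshold rules out every item
    # that starts with that letter: none of them can possibly be frequent.
    ruled_out = {letter for letter, count in letter_counts.items() if count < support}
    print(ruled_out)  # {'b', 'c'}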

PCY with a Silly Hash Function

To understand how the PCY Algorithm works, we’ll run through it with a simplistic (and silly) hash function. Keep in mind that the PCY Algorithm is a slightly modified version of the Apriori Algorithm.

As we encounter each basket during the first pass, we keep track of the occurrences of each singleton. We also build pairs from the items in the basket, hash the pairs, and keep track of the number of times each hash value has appeared.

The mechanism we’ll use to keep track of these counts is a modified hash-table. Each bucket in the hash-table corresponds to one of the hashes and contains a single integer that represents the number of times something has hashed to that bucket.

Note that this is very different from keeping track of all of the pairs that have hashed to that bucket.

For this example, we’ll hash these pairs using a simplistic hash function.

Defining Our Hash Function

For each pair, we take note of the first letter in each item’s name and put them together in a sorted string. Then we keep a hash-table with the counts of all the occurrences of these sorted strings.
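
In Python, that silly hash function might look something like this (the function name is mine):

    def silly_hash(pair):
        """Map a pair of item names to the sorted string of their first letters."""
        a, b = pair
        return "".join(sorted([a[0], b[0]]))

    print(silly_hash(("apples", "bacon")))        # 'ab'
    print(silly_hash(("blueberries", "apples")))  # 'ab' -- the order of the pair doesn't matter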

To illustrate this, consider the baskets below.

Let s = 3 be the support cutoff for being considered ‘frequent’.

Our dataset is:
Apples, Avocados, Bacon, Blueberries, Carrots
Almonds, Avocados, Bacon, Bananas, Durian
Almonds, Bacon
Avocados, Bananas
Almonds

We read in the first basket and update our counts for items. This is the first basket we’re seeing, so each of the items will have a count of 1 once we process this basket.

Next, we generate all of the possible pairs from the items in this basket:

The basket is:
Apples, Avocados, Bacon, Blueberries, Carrots

The possible pairs are:
{apples, avocados}
{apples, bacon}
{apples, blueberries}
{apples, carrots}
{avocados, bacon}
{avocados, blueberries}

and so on…

For each of these pairs, we take the first letter in each item and put them together into a string, which we sort alphabetically. This is our hash function, which is a bit silly but hopefully will help build intuition.

Here are some pairs and their corresponding hashes:

{apples, avocados} → ‘aa’
{apples, bacon} → ‘ab’
{apples, blueberries} → ‘ab’

{avocados, bacon} → ‘ab’

and so on…

Next, we keep track of how many times each hash has appeared. For example, after the first basket, the hash ‘aa’ will have a count of 1, and the hash ‘ab’ will have a count of 4.

For example, the pair {apples, bacon} hashes to ‘ab’, so when we encounter it, we’ll go to the bucket in the hash-table that corresponds to ‘ab’ and increase the count there from 0 to 1. The pair {apples, blueberries} also hashes to ‘ab’, so it’ll cause us to increase the count in that bucket from 1 to 2.

Note that each of these pairs has only appeared once so far, but the count in this bucket is 2, since both pairs hash to the same bucket.


We repeat this for all of the baskets in the dataset, and at the end of the entire first pass through the dataset, we get the results shown below. (Remember, our support threshold for an itemset to be frequent is s=3).

Frequent Items:
Almonds, Avocados, Bacon

Counts of Pair Hashes:
‘aa’: 2
‘ab’: 10
‘ac’: 2
‘ad’: 2
‘bb’: 2
‘bc’: 2
‘bd’: 2
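
Putting the first pass together in Python, reusing silly_hash from above (a sketch of this example, not a production implementation):

    from collections import Counter
    from itertools import combinations

    baskets = [
        ["apples", "avocados", "bacon", "blueberries", "carrots"],
        ["almonds", "avocados", "bacon", "bananas", "durian"],
        ["almonds", "bacon"],
        ["avocados", "bananas"],
        ["almonds"],
    ]
    support = 3

    item_counts = Counter()
    bucket_counts = Counter()  # stands in for the modified hash-table

    for basket in baskets:
        item_counts.update(basket)                    # ordinary Apriori item counting
        for pair in combinations(sorted(basket), 2):  # every pair in the basket
            bucket_counts[silly_hash(pair)] += 1      # bump the pair's bucket

    frequent_items = {item for item, count in item_counts.items() if count >= support}

    print(sorted(frequent_items))  # ['almonds', 'avocados', 'bacon']
    print(dict(sorted(bucket_counts.items())))
    # {'aa': 2, 'ab': 10, 'ac': 2, 'ad': 2, 'bb': 2, 'bc': 2, 'bd': 2}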

What can we do with these hash counts?

You might notice that the hash ‘ab’ has a count above our support threshold. If we’re only looking at these bucket-counts, though, it’s hard for us to determine how we got to that count.

On the one hand, it could be that a single pair with the hash ‘ab’ appeared 10 times in the dataset. On the other hand, though, it could be that lots of different pairs all hashed to that bucket. Thus, the bucket could have a high count even if none of the pairs that hash to that bucket are frequent.

In contrast to this, the buckets with counts less than our support threshold are very important. Consider the bucket ‘ad’, for example, which has a count of 2. We don’t know whether a single pair that hashes to ‘ad’ appeared twice in the dataset or two different pairs that hash to this bucket each appeared once. But if any of the pairs that hash to this bucket were frequent, the bucket would have a count of 3 or more. That means that any pair that hashes to the bucket ‘ad’ must be an infrequent pair, since that bucket’s count is below our support threshold of 3.


For example, if the pair {apple, durian} had appeared 5 times in the dataset, then the bucket ‘ad’ would have a count of at least 5, since this pair hashes to the bucket ‘ad’. In other words, the bucket count represents the maximum possible number of times that any pair that hashes to that bucket can appear in the dataset.

This allows us to narrow our definition of a candidate pair during the second pass through the dataset, which significantly lowers the number of possible candidate pairs. In particular, we’ll require that a pair hashes to a frequent bucket in order to be a candidate pair, in addition to being made up of frequent items.
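
Continuing the Python sketch from the first pass, the narrowed second-pass check might look like this:

    frequent_buckets = {h for h, count in bucket_counts.items() if count >= support}

    def is_candidate(pair):
        """A pair is counted in the second pass only if both of its items are
        frequent AND the pair hashes to a frequent bucket."""
        a, b = pair
        return (a in frequent_items
                and b in frequent_items
                and silly_hash(pair) in frequent_buckets)

    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if is_candidate(pair):
                pair_counts[pair] += 1

    frequent_pairs = {p for p, c in pair_counts.items() if c >= support}

In this tiny example, {almonds, avocados} is made of two frequent items but hashes to the infrequent bucket ‘aa’, so it never gets counted; the only pairs counted are {almonds, bacon} and {avocados, bacon}, and neither reaches the support threshold.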

What Happens to the Hash-Table After the First Pass?

During the second pass, we only need to know whether a bucket in the hash-table is frequent; we don’t really care about the particular count. Since frequent vs. not frequent is binary, we can replace the count in each bucket with a single 1 or 0. That way, we reduce the memory footprint of each bucket from a 4-byte integer, which stored the count, to a 1-bit flag, which tells us whether that bucket is frequent. The resulting structure is often called a bitmap.
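
As a rough sketch, with a hash function that returns integer bucket indices (and made-up counts), the bitmap can be packed eight buckets to a byte:

    support = 3
    bucket_counts = [2, 10, 2, 2, 0, 2, 2, 2]   # illustrative per-bucket counts
    num_buckets = len(bucket_counts)

    # One bit per bucket: 1 if the bucket's count reached the support threshold.
    bitmap = bytearray((num_buckets + 7) // 8)
    for bucket, count in enumerate(bucket_counts):
        if count >= support:
            bitmap[bucket // 8] |= 1 << (bucket % 8)

    def bucket_is_frequent(bucket):
        return bool(bitmap[bucket // 8] & (1 << (bucket % 8)))

    print(bucket_is_frequent(1))  # True  (count 10 >= 3)
    print(bucket_is_frequent(0))  # False (count 2 < 3)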

Hold on a second…

If you’re watching closely, you may have noticed that in the small dataset provided here, keeping track of these hash counts barely helps us. The only extra pair the infrequent buckets filter out is {almonds, avocados}, which is made of frequent items but hashes to the infrequent bucket ‘aa’; every other pair they would eliminate is already ruled out by the fact that the items it contains are not frequent. This is because our dataset is artificially small.

With large datasets — and a large number of items sold in the store — the savings can be substantial. In practice, the deciding factor on whether PCY provides an improvement over Apriori is the percentage of possible candidate pairs that are eliminated by the hashing technique. We’ll take a closer look at this in a moment.

A Quick Note on Our Hash Function

One important thing to keep in mind here is that the hash function we’ve chosen is very contrived. Its biggest flaw is that the possible hash values are not equally likely to occur. (For a refresher on hash functions, read Hash Functions for Data Mining, which describes, towards the end of the post, some hash functions we might use in production.)

Trade-offs with Hash-Table Size

The main tradeoff faced when picking the size of the hash-table is between the memory footprint of the hash-table and the number of collisions that will be experienced.

Consider the case, for example, where the number of buckets in the hash-table is equal to the number of possible candidate pairs. In this case, the memory footprint of the hash-table is extremely large, and we may as well have just counted all of the candidate pairs.

On the other hand, if we make our hash-table extremely small — for example, a single bucket — we’ll experience a huge number of collisions, which is also not particularly useful.

The original paper explores hash-table sizes in more detail, and I encourage you to read it if you’re curious [1]. The online class Mining Massive Datasets, which has a corresponding textbook that is freely available through their website, also discusses the PCY algorithm in detail. The book claims that the break-even point for using the technique described in the PCY algorithm is when 2/3 of the possible candidate pairs are eliminated through the hashing process. I explore the math behind this claim in another post: Memory Usage of PCY vs. Apriori Algorithms

Friends of the PCY Algorithm: Multistage and Multihash

As the title of this post implies, there are several refinements that can be made to enhance the PCY algorithm — the two that we’ll explore here are the Multistage Algorithm and the Multihash Algorithm.

Both algorithms focus on applying the hash-based filtering from the PCY algorithm more than once in order to further reduce the number of possible candidate pairs.

The Multistage Algorithm

In the Multistage Algorithm, we start by conducting the first pass over the data just as we did in the PCY Algorithm. However, before moving on to counting candidate pairs, we conduct a second pass over the data and hash pairs into a second hash-table if and only if the following conditions are met:

  • The pair is made up of frequent items
  • The pair hashes to a frequent bucket in our first bitmap

But be careful: If we use the same hash function that we used to create the first hash-table, we won’t be gaining any new information. The second hash-table needs to use a hash function that is independent of the first hash-table’s. If we don’t introduce a new hash function, we’ll end up with the exact same hash-table as before, except that the infrequent buckets will have been removed.


Once the second pass through the data is complete, we create a bitmap from the second hash-table. Together, the two bitmaps that we’ve created will filter out a greater number of pairs than the first bitmap alone.

During the third pass, we count pairs that are made up of frequent items and hash to frequent buckets in both bitmaps. The added bitmap effectively reduces our number of false positives.
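
Here’s a hedged sketch of Multistage in Python, reusing baskets, support, and frequent_items from the PCY sketch above. The two hash functions here are toys standing in for real, independent hash functions, and all of the names are mine:

    from itertools import combinations

    NUM_BUCKETS = 50   # toy size; real tables would be far larger

    def h1(pair):
        a, b = sorted(pair)
        return (31 * sum(map(ord, a)) + sum(map(ord, b))) % NUM_BUCKETS

    def h2(pair):
        a, b = sorted(pair)
        return (17 * sum(map(ord, a)) + 7 * sum(map(ord, b))) % NUM_BUCKETS

    # Pass 1 (same as PCY): count bucket totals under h1, then build the first bitmap.
    counts1 = [0] * NUM_BUCKETS
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts1[h1(pair)] += 1
    bitmap1 = [1 if c >= support else 0 for c in counts1]

    # Pass 2 (Multistage): hash into a second table only the pairs that are made of
    # frequent items AND land in a frequent bucket under h1.
    counts2 = [0] * NUM_BUCKETS
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            if a in frequent_items and b in frequent_items and bitmap1[h1((a, b))]:
                counts2[h2((a, b))] += 1
    bitmap2 = [1 if c >= support else 0 for c in counts2]

    # Pass 3: count a pair only if it survives every filter.
    def is_multistage_candidate(pair):
        a, b = pair
        return (a in frequent_items and b in frequent_items
                and bitmap1[h1(pair)] and bitmap2[h2(pair)])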

Before counting candidate pairs, we could elect to make any number of additional hashing passes over the data. The goal would be to filter out more infrequent pairs and reduce the memory footprint of the algorithm.

However, keep in mind that the hash-tables and bitmaps also take up memory, and there will be a point at which it no longer makes sense to continue making passes over the data. Plus, we have to consider that each pass over the data takes time.


Can we hash things exclusively through the second (or nth) bitmap?

No, we can’t. The reason is that the second bitmap was constructed from only a subset of all possible pairs. Consider, for example, a pair that is made up of frequent items but does not hash to a frequent bucket in the first bitmap. Since it doesn’t hash to a frequent bucket in the first bitmap, we know it’s not a frequent pair. But consider what might happen if we put this pair through the second bitmap. The second bitmap is totally uncharted territory for it, since it played no role in the construction of this bitmap. We might put it through and find that — just by chance — it hashes to a frequent bucket. If we only hashed this pair through the second bitmap, we might accidentally think that it’s a candidate pair. Using both bitmaps ensures that we’ll filter it out.

If we keep track of the order in which the bitmaps were created, we can save ourselves a bit of time by checking each pair against the first bitmap before deciding whether to hash it through the second. That way, if the pair is filtered out as infrequent, we don’t waste time on subsequent bitmaps.

Remember: Bitmaps only provide information about the pairs that were used to construct them.

The Multihash Algorithm

Like the Multistage Algorithm, the Multihash Algorithm also uses multiple hash-tables to filter out infrequent pairs. However, rather than making additional passes over the data, this algorithm simply creates more than one hash-table during the first pass. These hash-tables turn into bitmaps, which are used to filter out infrequent pairs during the second pass.
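
A rough sketch of the idea, reusing the toy hash functions and names from the Multistage sketch above (again, these are stand-ins, not a production implementation):

    # Multihash: fill several hash-tables during the single first pass.
    # With a fixed memory budget, each table gets fewer buckets than PCY's one table.
    hash_fns = [h1, h2]
    tables = [[0] * NUM_BUCKETS for _ in hash_fns]

    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            for table, h in zip(tables, hash_fns):
                table[h(pair)] += 1

    bitmaps = [[1 if c >= support else 0 for c in table] for table in tables]

    # Second pass: a pair is a candidate only if every bitmap approves its bucket.
    def is_multihash_candidate(pair):
        a, b = pair
        return (a in frequent_items and b in frequent_items
                and all(bm[h(pair)] for bm, h in zip(bitmaps, hash_fns)))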

The number of hash-tables that we use can be tuned depending on the context. Using more hash-tables will — up to a point — mean that we’ll filter out more infrequent pairs.

Adding hash-tables requires making a choice — we can either increase the amount of memory that we’re using, or we can decide not to increase our memory usage and simply divide up the memory among the hash-tables.

Keep in mind that splitting memory among many hash-tables means that the hash-tables get smaller, resulting in more collisions. If we have too many collisions, we won’t be able to effectively filter out infrequent pairs.

When looking at trade-offs, considering the extreme cases usually helps build intuition. For example, consider the case where we have lots and lots of hash-tables, each of which has only one bucket. Each hash-table would then contain the total count of all pairs in the dataset — which provides no help.


In Conclusion

In Data Mining, which algorithms we use and how we tune their parameters is very context-dependent. In this post we’ve seen that using the PCY Algorithm will only reduce our memory footprint in certain situations. We’ve also seen that the number of passes and hash-tables used in the Multistage and Multihash algorithms is highly context-dependent.

Implement, Implement, Implement!

These context-driven decisions take practice. To gain a deeper understanding of them, I highly encourage you to take a shot at implementing these algorithms yourself.

If you’re looking for datasets with which to practice, I recommend:

Let me know what you thought of this post in the comments below.

And don’t forget to subscribe to Weekly Data Science to see more content like this.

