Memory usage of PCY vs. Apriori Algorithms

Dan Isaza
Weekly Data Science
4 min read · Jun 30, 2018

The free textbook Mining Massive Datasets claims that the first pass of the PCY Algorithm must eliminate at least 2/3 of the possible candidate pairs for its memory footprint to be lower than that of the Apriori Algorithm. Here, I’ll walk through that claim step by step to show you why it’s the case.


First, let’s consider the memory usage of the Apriori Algorithm during its second pass, which, according to the original PCY paper, is typically the memory bottleneck of the algorithm.

Memory Footprint of the Apriori Algorithm (2nd pass)

The Apriori Algorithm uses a triangular matrix to keep track of how many times each pair of items co-occurs in a basket. I’ve drawn out an example matrix and entered N/A in the spaces that we would not initialize.

Note that memory is not allocated for the bottom part of the matrix

Note that each row and column of the matrix corresponds to an item that is sold in the store. Let’s assume that we’re using a hashmap to keep track of which indices correspond to which items. All of our approaches will use this hashmap, so we’ll leave it out of our calculations.
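
To make that concrete, here’s a minimal sketch in Python of how the triangular matrix can be stored as a flat array, so memory is only allocated for the upper triangle. The indexing formula and variable names are just one common way to do it, not code from the book:

```python
def triangle_index(i, j, n):
    """Map an item pair (i, j) with i < j to a slot in a flat array.

    n is the number of frequent items. Only n * (n - 1) / 2 slots are
    needed in total, one for each unordered pair (the upper triangle).
    """
    assert 0 <= i < j < n
    return i * (2 * n - i - 1) // 2 + (j - i - 1)


n = 5                                    # hypothetical number of frequent items
pair_counts = [0] * (n * (n - 1) // 2)   # one counter per possible pair

# Record one occurrence of the pair (1, 3) in a basket:
pair_counts[triangle_index(1, 3, n)] += 1
```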

Finding the Memory Footprint of the Triangular Matrix

The memory footprint of the triangular matrix will be:

(Number of entries) x (Size of each entry)

Each entry in the matrix is an integer. Since we’re dealing with a large dataset, we should assume that counts can get very large — so let’s use a 4-byte int to keep track of these counts.

The number of entries in the matrix is equal to the number of pairs that we can make out of all the frequent items. Note that (n choose 2) gives us that number, where n is the number of frequent items.

Thus, the total memory footprint of the triangular matrix is:

(n choose 2) x 4 bytes
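
To get a feel for the scale, here’s a quick back-of-the-envelope calculation. The number of frequent items below is a made-up figure, purely for illustration:

```python
from math import comb

n = 100_000                  # hypothetical number of frequent items
entries = comb(n, 2)         # number of pairs = n choose 2
footprint = entries * 4      # 4 bytes per count

print(f"{entries:,} entries -> {footprint / 1e9:.1f} GB")
# 4,999,950,000 entries -> 20.0 GB
```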

Let’s compare this to the memory footprint of the PCY Algorithm.

Memory Footprint of PCY Algorithm

In the PCY Algorithm, infrequent pairs are eliminated using a hashing technique. It would be super convenient if all of these pairs happened to be at one end of the triangular matrix we just described. However, that’s not the case. Before running the first pass of the algorithm, there’s no way for us to know which pairs will be eliminated as infrequent.

Moreover, if we drew out our triangular matrix and removed the infrequent pairs, we’d see what looks like a triangular slice of Swiss cheese, with the missing entries scattered randomly throughout the matrix.

Unfortunately, there’s no way to avoid allocating memory for these infrequent pairs in the triangular matrix. Allocating the full triangular matrix anyway would give us exactly the same memory footprint as the Apriori Algorithm, so we have to use a different approach.

We can keep track of the counts of pairs using tuples that look like this:

(item, item, count)

where each item is a 4-byte int corresponding to one of the items sold in the store, as tracked in our previously mentioned hashmap. Here, the counts are 4-byte ints, just as they were in the triangular matrix. Thus, each of the pairs we count will take up a total of 12 bytes of memory.

So the total memory usage for counting pairs in the PCY Algorithm will be:

(Number of pairs counted) x (12 bytes)

Remember, we’re only going to be counting pairs that are made up of frequent items and that hash to frequent buckets. Let m represent the number of pairs we count in the second pass of PCY.
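
Putting that together, here’s a rough sketch of what the second pass of PCY might look like. The function and variable names are mine, and I’m assuming we already have the set of frequent items and the bitmap of frequent buckets from the first pass (using the same hash function as the first pass):

```python
from collections import defaultdict
from itertools import combinations

def pcy_second_pass(baskets, frequent_items, frequent_bucket, num_buckets):
    """Count candidate pairs on the second pass of PCY.

    frequent_items  -- set of item ids that met the support threshold (pass 1)
    frequent_bucket -- list of booleans, one per bucket, from pass 1
    """
    # Conceptually a collection of (item, item, count) triples.
    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for i, j in combinations(items, 2):
            # Only count the pair if it hashed to a frequent bucket on pass 1.
            if frequent_bucket[hash((i, j)) % num_buckets]:
                pair_counts[(i, j)] += 1
    return pair_counts
```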

So, what’s the break-even point?

Let’s set the two memory footprints equal to each other and solve for the break-even point.

Remember:

  • n is the number of frequent items
  • m is the number of pairs counted in the PCY algorithm

Using the memory usages outlined above, we can set up the equation:

(n choose 2) x 4 bytes = m x 12 bytes

Dividing both sides by 12 bytes and simplifying 4/12 to 1/3 yields:

(n choose 2) x (1/3) = m

Thus, we can see that the break-even point for memory usage is reached when the number of pairs we count in the PCY Algorithm is equal to one third of the number of pairs counted in the Apriori Algorithm.

This is equivalent to saying that the first pass of PCY needs to eliminate at least 2/3 of the pairs that Apriori would count in its second pass, which is exactly the claim made by the book. So we’ve verified the claim.
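
As a quick numeric sanity check of that algebra (with a made-up number of frequent items):

```python
from math import comb

n = 1_000                        # hypothetical number of frequent items
apriori_bytes = comb(n, 2) * 4   # triangular matrix: 4 bytes per pair
m = comb(n, 2) // 3              # PCY counts one third of the pairs
pcy_bytes = m * 12               # 12 bytes per (item, item, count) triple

print(apriori_bytes, pcy_bytes)  # 1998000 1998000 -- the footprints match
```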

Nice! 🎉

Want to Learn More?


Other Resources

I highly encourage you to read the original PCY paper, as well as check out Mining Massive Datasets.

Let me know what you thought of this post by leaving a comment below!

And don’t forget to subscribe if you want to see more content like this.
