[Technology] — Three Probabilistic Data Structures Every Big Data Engineer Should Know
Hadoop and Spark offer several techniques for reducing the cost of membership testing, counting, and other large-scale operations. While most of these techniques are already built into the frameworks, chances are you will want to take advantage of them directly to improve your query performance.
1. Bloom Filter
Bloom Filter is the ABC of probabilistic membership testing. It has been widely adopted by many search engines and query optimizers nowadays.
Problem: You want to find a row that may reside in one of many data segments (from a dataset) stored on disk, but fully scanning all the segments is too costly.
Idea: A Bloom filter can tell you, in O(1), whether the row may appear in a specific segment. If the answer is no (meaning the row definitely does not appear), you can skip searching that segment and move on to the next.
If your dataset is split into N segments, and B segments definitely do not contain the row, only N − B seek operations are needed. Given that each data segment is fairly large, you avoid a significant number of unnecessary seeks.
So how can BloomFilter do this?
It is a hash-table-like data structure that applies multiple hash functions to the same key, but unlike a hash table it does not store the actual key in a bucket: it simply marks as used, for each hash function, the bucket that the key maps to. On lookup, a "1" in every hashed position means "probably yes", while any "0" means "absolutely no".
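The marking-and-lookup scheme above can be sketched in a few lines. This is a minimal illustration, not a production filter; the class name, sizes, and the salted-SHA-256 trick for deriving k hash functions are my own choices for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _positions(self, key):
        # Derive k hash positions by salting one hash function with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        # Mark every bucket the key hashes to as used.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # All bits set -> "probably yes"; any bit clear -> "absolutely no".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))   # True: a Bloom filter has no false negatives
print(bf.might_contain("row-999"))  # almost certainly False (small false-positive chance)
```

Note that deletions are not supported: clearing a bit could erase evidence of other keys that hash to the same position.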
In practice: HBase provides Bloom filters to reduce the seek time of GET/SCAN operations. MapReduce joins can also take advantage of a Bloom filter to improve performance.
NOTE: Be mindful of the memory a Bloom filter occupies: a larger filter yields a lower false-positive rate (and thus fewer wasted seeks), at the cost of more memory.
2. Count-Min Sketch
Problem: For real-time analytics you often need item frequencies over a stream, but a complete frequency table (one counter for every unique element in the stream) is too large. Count-Min Sketch is an approximate counting algorithm that is much more space efficient, and it works reasonably well even if you don't know the number of unique elements in advance.
Idea: Hash each incoming item several different ways, and increment a counter for that item in several different places, one place for each hash.
To look up the count of an item, check all the places it hashed to and take the minimum count found in any of them, hoping that no more popular item collided with it in all of those locations.
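The increment-everywhere, take-the-minimum idea sketches out as follows. The table dimensions and the salted-hash trick are illustrative choices, not part of any particular library's API.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: d rows of w counters, one hash function per row."""

    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _col(self, row, item):
        # Each row uses its own (salted) hash function.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count=1):
        # Increment the item's counter in every row.
        for row in range(self.d):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # Minimum over rows: collisions only inflate counters, never deflate,
        # so the estimate is an upper bound on the true frequency.
        return min(self.table[row][self._col(row, item)] for row in range(self.d))

cms = CountMinSketch()
for page in ["/home", "/home", "/about", "/home"]:
    cms.add(page)
print(cms.estimate("/home"))  # at least 3; never underestimates
```

The sketch size is fixed (w × d counters) regardless of how many unique items flow through, which is exactly what makes it attractive for streams.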
In practice: Count-Min Sketch has proved to be a very useful technique for counting item frequencies with Spark Streaming. This means you can view approximate top page visits or impression counts in real time instead of in batches. I will write a sample with Spark Streaming counting in the near future.
3. HyperLogLog
HyperLogLog is an algorithm for the count-distinct problem: it approximates the number of distinct elements in a multiset.
Problem: Given a multiset containing a large number of distinct items (e.g. > 10⁸), every time a new item arrives you must update the count of distinct items in the set.
The easy way is to store all the distinct items in memory (e.g. in a hash table) and check whether each new item has already appeared. This is extremely inefficient in terms of memory once the number of distinct items scales to hundreds of millions.
HyperLogLog solves this problem with only a few kilobytes of memory, at an error rate of about 2%.
Idea: Let's start with a simpler problem: can you guess the total number of times a coin was flipped, given the longest run of heads you got? For example, if you spent some time flipping coins and your longest run was 2 heads, I can tell you haven't flipped much. If you got 100 heads in a row, you must have been flipping for quite a while. Generalizing, the length of the longest run of heads follows a distribution that depends on the number of times the coin was flipped.
The basic idea behind HyperLogLog is much the same: it hashes each item to a binary sequence and uses some of the bits to assign it to a bucket. The longest run of leading zeros observed among each bucket's hashes is used to estimate the count, and the per-bucket estimates are then averaged to reduce deviation.
However, an outlier hash value can badly skew the estimated count. HyperLogLog therefore combines the bucket estimates with a harmonic mean, which dampens the influence of extreme values and keeps the average stable.
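Putting the pieces together (buckets, leading-zero ranks, harmonic mean) gives a compact sketch. This is a simplified illustration: it uses the standard bias-correction constant but omits the small- and large-range corrections of the full algorithm, and the function name and parameters are my own.

```python
import hashlib

def hll_estimate(items, b=10):
    """Minimal HyperLogLog sketch: 2**b buckets, harmonic mean of 2**-rank."""
    m = 2 ** b
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        bucket = h & (m - 1)   # low b bits pick the bucket
        rest = h >> b          # remaining bits feed the rank
        # rank = position of the first 1-bit (run of zeros + 1)
        rank = 1
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        registers[bucket] = max(registers[bucket], rank)
    # Harmonic mean of 2**registers, scaled by the bias-correction constant.
    alpha = 0.7213 / (1 + 1.079 / m)
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha * m * m * z

est = hll_estimate(range(100_000))
print(int(est))  # roughly 100000, typically within a few percent
```

With b = 10 the sketch keeps only 1,024 small registers regardless of how many distinct items it sees, which is where the "few kilobytes" figure comes from.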
In practice: HyperLogLog is prevalent when it comes to counting the number of distinct items in real-time analytics. You can find a case study of Spark Streaming with HLL here.