Probabilistic Data Structures in the Big Data World (+ code)

Probabilistic data structures are becoming ever more important in the realm of big data and streaming applications. Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising (including a recent project of mine: Reelbid). Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with engines like MapReduce, Spark, Flink, etc. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to real-time use cases.

On the other hand, when one is interested only in simple additive metrics like total page views or average conversion price, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly.

Compared with error-free approaches, these algorithms use much less memory and have constant query time. They usually support union and intersection operations and can therefore be easily parallelized. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact — sometimes astonishingly compact — replacement for raw data in stream-based computing. I recently wrote an article on streaming algorithms, and this article is highly related to the same class of problems; or I should say: if sampling is not an option, then consider hashing and using such probabilistic data structures. As we move through our data, there are some metrics we are trying to compute; we can approximate those within some error bounds. Hence, there is a trade-off between accuracy/error and system resources.

Example: How many items are here? About 1,523,425, with 99% probability.

Wikipedia lists about 13 different algorithms. The five commonly used ones discussed in this article are: Bloom filter, HyperLogLog, Count-Min sketch, t-digest, and MinHash.

For these data structures, I’ll be using two text corpora containing random text. Let’s also clean, tokenize, and lowercase our corpora:
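The original corpora aren’t reproduced here, so as a stand-in, here’s a minimal sketch of the cleaning step using two short sample strings (the names `corpus_a`/`corpus_b` and the tokenizing regex are my own assumptions, not the notebook’s):

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

# Two tiny sample "corpora" standing in for the article's text files
corpus_a = "The quick brown fox jumps over the lazy dog."
corpus_b = "The lazy dog sleeps while the quick cat watches."

tokens_a = tokenize(corpus_a)
tokens_b = tokenize(corpus_b)
print(tokens_a[:4])   # ['the', 'quick', 'brown', 'fox']
```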

Let’s start with HyperLogLog, for counting distinct elements. I’ll install and use a Python implementation from GitHub. A HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit-pattern observation that, for a stream of randomly distributed numbers, if the maximum number of leading zero bits observed across the hashed values is k, the cardinality of the stream is very likely close to 2^k.

The main idea here is that when you have a very large collection of things, counting becomes a problem. In Python, integers have unlimited precision, so you can count really large sets as long as you have a lot of memory to use. What if you don't want to use up your application's memory? For example, imagine needing to compare the number of followers on Twitter. Do you really care whether Lady Gaga has 50,000,000 or 50,000,100 followers? The difference is noise anyway, so why waste lots of application memory? What if there are many celebrities to count and compare? Instead, approximate.
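The notebook installs a library for this; as a self-contained illustration, here’s a bare-bones HyperLogLog sketch of my own (simplified bias handling, names and parameters are mine — a production library adds more range corrections):

```python
import hashlib
import math

class SimpleHLL:
    """Minimal HyperLogLog sketch, for illustration only."""
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                  # 2^14 = 16384 registers
        self.registers = [0] * self.m
        # standard HLL estimator constant for large m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = SimpleHLL()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())   # roughly 100000, within a few percent
```

With p=14 the expected relative error is about 1.04/sqrt(16384) ≈ 0.8%, using a fixed 16K registers no matter how many items flow past.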

Next, let’s see the Bloom filter for set membership (pretty useful in NLP): testing whether something exists in a set or not. One interesting property of a Bloom filter is that false negatives are not possible (false positives are). So the main idea here is that we can load a set, then test elements to see if they are included in it. Here’s an example of finding common words between those two text corpora:
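A hedged sketch of the idea (the class and the two small word lists below are my own stand-ins for the notebook’s library and corpora): load one corpus’s words into the filter, then probe with the other’s.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

words_a = ["the", "quick", "brown", "fox", "jumps"]
words_b = ["the", "lazy", "dog", "jumps"]
bf = BloomFilter()
for w in words_a:
    bf.add(w)
common = [w for w in words_b if w in bf]
print(common)   # ['the', 'jumps'] (barring a rare false positive)
```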

Next, let’s see MinHash for similarity. Here we have a function that, given the word lists, computes the Jaccard similarity index for the two texts we have: 0.271484 estimated vs. 0.240602 actual. Not much similarity between the two texts.
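A minimal sketch of how that estimate is produced, assuming two small word sets in place of the article’s corpora (the function names and the 128-hash choice are mine): each of k seeded hash functions keeps its minimum over a set, and the fraction of matching minima estimates the Jaccard index.

```python
import hashlib

def minhash_signature(tokens, num_hashes=128):
    """One minimum per seeded hash function (simple k-hash MinHash;
    production code uses faster non-cryptographic hashes)."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in set(tokens)))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # fraction of hash slots where the two sets share the same minimum
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"the", "quick", "brown", "fox", "jumps"}
b = {"the", "lazy", "brown", "dog", "jumps"}
true_j = len(a & b) / len(a | b)          # 3/7 ~ 0.4286
est_j = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(true_j, est_j)
```

With 128 hashes the standard error of the estimate is roughly sqrt(J(1-J)/128), a few percent here.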

Next, let’s look at the Count-Min sketch. From Wikipedia:

The count–min sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table it uses only sub-linear space, at the expense of overcounting some events due to collisions.

It is somewhat similar to a Bloom filter. The main difference is that a Bloom filter represents a set as a bitmap, while a Count-Min sketch represents a multiset that keeps a frequency distribution summary:
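A bare-bones sketch of the structure (my own minimal version, with made-up width/depth parameters — real libraries derive w and d from target error bounds): d rows of w counters, increment one cell per row, and report the minimum across rows.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: d rows of w counters.
    Estimates never undercount the true frequency, only overcount."""
    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _positions(self, item):
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in self._positions(item):
            self.table[row][col] += count

    def estimate(self, item):
        # the minimum across rows has the least collision noise
        return min(self.table[row][col] for row, col in self._positions(item))

cms = CountMinSketch()
for word in ["to", "be", "or", "not", "to", "be"]:
    cms.add(word)
print(cms.estimate("to"))   # 2 (may overcount on collisions, never undercounts)
```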

Frequency summaries have many applications: keeping track of, say, the 95th percentile, leaderboards, top customers in e-commerce, fraud classifiers, etc. One practical example is language detection: comparing frequencies for the most commonly occurring words is a simple way to predict the language in which a document was written. Query time is O(k).

Streaming Quantiles

Imagine that you have a large stream of data. Perhaps you have an application that must measure that stream continually, where stopping to count simply isn’t an option. Suppose you’re monitoring e-commerce transactions, trying to detect credit card fraud. Some measures become important (e.g., average order price). Approximating metrics on a stream is a great way to build applications that are more robust and operate in real time. Here we’ll look at t-digest, which lets us measure the 5th, 50th, and 95th percentiles:
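The notebook uses a t-digest library for this; as a self-contained stand-in, here’s a crude buffered-centroid digest of my own (class name, parameters, and merge rule are all assumptions — a real t-digest sizes centroids by quantile so the tails stay sharp):

```python
import random

class SimpleDigest:
    """Crude stand-in for a t-digest: buffer values, periodically compress
    them into (mean, count) centroids capped at max_centroids."""
    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.centroids = []          # sorted list of (mean, count)
        self.buffer = []

    def add(self, x):
        self.buffer.append(x)
        if len(self.buffer) >= self.max_centroids:
            self._compress()

    def _compress(self):
        points = sorted(self.centroids + [(x, 1) for x in self.buffer])
        self.buffer = []
        # merge adjacent centroids (weighted means) until under the cap
        while len(points) > self.max_centroids:
            merged = []
            for i in range(0, len(points) - 1, 2):
                (m1, c1), (m2, c2) = points[i], points[i + 1]
                merged.append(((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2))
            if len(points) % 2:
                merged.append(points[-1])
            points = merged
        self.centroids = points

    def quantile(self, q):
        self._compress()
        total = sum(c for _, c in self.centroids)
        target, seen = q * total, 0
        for mean, count in self.centroids:
            seen += count
            if seen >= target:
                return mean
        return self.centroids[-1][0]

random.seed(42)
d = SimpleDigest()
for _ in range(50_000):
    d.add(random.random())
print([round(d.quantile(q), 2) for q in (0.05, 0.50, 0.95)])
```

On 50,000 uniform draws, the three reported quantiles land near 0.05, 0.50, and 0.95 while the digest itself only ever holds about a hundred centroids.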


Hopefully this post gave you an idea of probabilistic data structures and how extremely useful and memory-efficient they can be. These methods for approximation present a very different way of thinking about how to build applications and architectures for Big Data. For example, one immediate outcome is that we can query different time ranges of data by composing the hashes. Rather than having one SQL query for day-over-day metrics, then another SQL query for week-over-week metrics, simply compose the hashed approximations for the seven days of the week. That implies much less data being stored, and significantly less computation required for reporting.
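To make the composition idea concrete: merging HyperLogLog sketches is just an element-wise max over their register arrays, so a week’s cardinality falls out of the daily sketches with no raw data (the tiny register arrays below are made-up values for illustration — real sketches have thousands of registers):

```python
# Hypothetical HyperLogLog registers for two days of traffic.
day1 = [3, 0, 5, 1]
day2 = [2, 4, 1, 1]

# The union sketch is the element-wise max; counting it gives the
# combined distinct-visitor estimate across both days.
week = [max(a, b) for a, b in zip(day1, day2)]
print(week)   # [3, 4, 5, 1]
```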

“Hash, don’t sample” — Oscar Boykin (Twitter)


— You can find the entire code in 1 notebook here:

— Two of the better introductions to the math involved here are:

— For code in Scala, the Algebird library by Avi Bryant and Oscar Boykin provides an excellent framework for working with approximation algorithms.

— BlinkDB is a project that leverages a variety of approximation techniques for accelerating SQL queries on data at scale, and has been incorporated into Spark SQL.

— Also, Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman; associated with that, there’s a highly recommended biennial conference, MMDS.

— DZone Article on the same topic
