Published in

HyperBlogBlog

# Probabilistic Data Structure Use Cases

## It’s a challenge to work with big data and still get answers about it quickly.

You may have heard about probabilistic data structures which summarize data to help you get fast approximate answers, but it takes some reading to understand the types of questions they help you answer, the guarantees they give you, and the situations where they break down. This article explains a selection of probabilistic data structures to serve as a starting point for investigating probabilistic solutions to big data problems.

# Set Membership (AMQ)

Probabilistic set membership data structures help you determine if an element is in a set using just a small summary of the set, an operation known as an Approximate Membership Query (AMQ). Several structures only differentiate between “maybe in the set” vs “definitely not in the set” and are called filters. There are also structures that differentiate between “definitely in the set” and “maybe in the set”.

Filters are useful for performing pre-checks before expensive operations like database lookups. A filter produced from keys in a database table can tell you quickly if a key does not exist and no database read is required, saving time. Structures that identify when keys are definitely present can be used analogously for database writes — i.e. there is no need to write to a database a value that certainly already exists there — which is helpful when you need to keep databases in sync.

## Bloom Filter

A Bloom filter allows you to insert elements and check if an element has maybe or definitely not been inserted into the filter. It also supports merging and estimating cardinality. Deleting elements is accomplished using a variation (counting Bloom filters) or in some situations a second Bloom filter that stores deleted elements. As the filter fills, accuracy degrades i.e. the filter starts giving you “maybe in the set” answers for every element you check. The filter’s size is O(n) in its capacity.

Parameters:

• m: Size; bigger values consume more memory but increase capacity. Optimal m = -n*ln(p) / (ln(2)²) where n is your anticipated number of elements added and p is your acceptable false-positive rate.
• k: Number of hash functions used internally; affects memory-capacity trade-off. Optimal k = m/n * ln(2).

Operations:

• Insert: add an element to the filter
• Contains: check if an element is maybe in the set or definitely not in the set
• Remove: remove an element from the filter
• Merge: produce a filter which represents the union of two sets
• Cardinality: estimate the number of elements that have been inserted

## Cuckoo Filter

A cuckoo filter allows you to insert elements and check if an element has maybe or definitely not been inserted into the filter. Like Bloom filters, it also supports estimating cardinality and removing elements, but unlike Bloom filters, it does not support merging. As the filter fills, performance degrades rather than accuracy, and inserts start to possibly fail. The filter’s size is O(n) in its capacity. The best implementations of Cuckoo Filters slightly outperform the best implementations of Bloom Filters under specific circumstances.

Parameters:

• Size: bigger values consume more memory but increase capacity

Operations:

• Insert: add an element to the filter
• Remove: remove an element from the filter
• Contains: check if an element is maybe in the set or definitely not in the set
• Cardinality: estimate the number of elements that have been inserted

## Quotient Filter

A quotient filter allows you to insert elements and check if an element has maybe or definitely not been inserted into the filter. Quotient filters also support resizing in addition to estimating cardinality, removing elements, merging. As the filter fills, performance degrades rather than accuracy and the filter needs to be resized. The filter’s size at any point is O(n) in its capacity.

Parameters:

• Size: bigger values consume more memory but increase capacity; can be changed

Operations:

• Insert: add an element to the filter
• Contains: check if an element is maybe in the set or definitely not in the set
• Remove: remove an element from the filter
• Merge: produce a filter which represents the union of two sets
• Resize: change the size of the filter
• Cardinality: estimate the number of elements that have been inserted

# Key-Value Lookup

## Invertible Bloom Lookup Table

An Invertible Bloom lookup table is an associative data structure that allows you to insert key-value pairs and check if a key has definitely or maybe been inserted into the filter and, if the key is definitely in the filter, get its associated value. Because invertible Bloom lookup tables can check that a key definitely exists, they are distinguished from Bloom, cuckoo, and quotient filters even when not storing values with keys (a key-only invertible Bloom lookup table is just called an invertible Bloom filter).

Like the other filters we’ve discussed, Invertible Bloom lookup tables support insertions, lookups, removals, and cardinality estimates, but in addition they support listing the elements in the filter, which facilitates resizing and allows diffing. Like the Bloom filter, this filter’s size is O(n) in its capacity with accuracy degrading as the filter fills. Listing the elements is also subject to the accuracy degradation — a full filter will possibly not be able to list elements, or will just list a handful before failing, but a filter that’s over-filled and then has enough elements removed will still be able to list not-removed elements which is helpful for diffing large datasets using memory proportional to their diff.

Parameters:

• Size: bigger values consume more memory but increase capacity; can be changed

Operations:

• Insert: add an element to the filter
• Contains: check if an element is maybe in the set or definitely not in the set
• Remove: remove an element from the filter
• Merge: produce a filter which represents the union of two filters
• Resize: change the size of the filter
• List: enumerate all the elements in the filter
• Cardinality: estimate the number of elements that have been inserted

# Set Similarity

A number of applications rely on fast estimates of the degree of similarity between two sets. Torrent clients can send summaries of the set of parts of a file that they already have and their peers can use it to check if they have the missing pieces of that file. When trying to sync databases using memory proportional to their diff, one can use an appropriately-sized invertible Bloom lookup table, but the appropriate size for the lookup table needs to be estimated ahead of time.

## MinHash

MinHash (also known as the min-wise independent permutations locality sensitive hashing scheme) is a technique that estimates the ratio of the size of the intersection of two sets to the size of the union of those sets, which falls between 0 (0-size intersection, completely different sets) and 1 (union-sized intersection, completely identical sets). The premise of the technique is that we sample elements from both sets in a way that they would be the same if the sets were the same, then estimate the difference between the sets as the difference between the samples. MinHash is not accurate for estimating small differences between big sets.

Parameters:

• k: Number of hash functions used internally; increases size and accuracy. Think of this as the sample size.

Operations:

• Insert: add an element to the data structure
• Merge: produce a MinHash sketch which represents the union of two MinHash sketches
• Difference Cardinality: estimate the size of the difference between two sets

## Strata Estimator

The strata estimator is a technique that estimates the size of the difference of two sets in absolute terms, unlike MinHash which measures in relative terms. The strata estimator is larger and slower to use than MinHash but is accurate for small relative differences, which it’s designed for (it’s described in section 3.2 of a paper on set reconciliation). The premise of the technique is to store a collection of invertible Bloom lookup tables with increasingly large samples of one set, then remove from each what would be equivalent samples from the other set, and find which table had a small enough sample to be able to list all the not-removed elements in the filter without loss.

Operations:

• Insert: add an element to the data structure
• Delete: remove an element from the data structure (even if it doesn’t exist). Think of this as adding an element to the other set.
• Difference Cardinality: estimate the size of the difference between the set of inserted and the set of deleted elements

# Cardinality

While several statistical data structures mentioned so far support estimating the number of elements added, they do so with memory proportional to the number of elements. Specialized structures exist for counts with constant space requirements and accuracy that decreases with the number of elements counted but increases with the allocated size.

Probabilistic counting data structures often support merging to produce a structure that estimates the count of the union of their sets. To compute the size of the intersection of two sets, you can estimate their individual sizes, sum them, then subtract the size of their union. Computing the size of intersections of more than two sets is complicated with exponentially increasing time complexity and error compounding.

## HyperLogLog++

HyperLogLog++ allows you to count distinct elements in a probabilistic data structure, merge structures, and estimate distinct elements counted. HyperLogLog++ was developed at Google and builds on HyperLogLog, which builds on LogLog, which builds on the Flajolet–Martin algorithm, and it has accumulated optimizations at several levels along the way. The premise of the algorithm is to hash each element and estimate the distinct element count based on the largest number of trailing zeroes in any hash (a higher max zero count means more elements). Google has documentation with confidence intervals for its accuracy.

Operations:

• Insert: add an element to the data structure
• Merge: produce a structure which represents the union of two structures
• Cardinality: estimate the number of elements that have been inserted

## K-Minimum Values

K-minimum values, like HyperLogLog++, allows you to count distinct elements, merge structures, and estimate distinct elements counted. It’s less accurate than HyperLogLog++ but it supports an intersection-style merge in addition to a union-style merge. Estimates on intersections lose accuracy the smaller the intersection is, but are faster and more accurate than the best approaches using HyperLogLog++. The premise of the algorithm is to hash the elements of the set and estimate the distinct element count based on the smallest hashes (smaller smallest hashes means more elements). There’s also a variant that counts occurrences of each element for all your multiset counting needs.

Operations:

• Insert: add an element to the data structure
• Union Merge: produce a structure which represents the union of two structures (possibly increasing accuracy)
• Intersection Merge: produce a structure which represents the intersection of two structures (possibly decreasing accuracy)
• Cardinality: estimate the number of elements that have been inserted

# Counting

Algorithms in the previous section counted the total number of elements added, but there also exist algorithms which count the number of times each of the distinct elements was added.

## Count-Min Sketch

A count-min sketch allows you to insert elements, merge sketches, and count the number of insert of an element. Count-min sketches can provide an upper bound on the number of times an element was inserted as well as an unbiased estimate. The premise of the sketch is to hash elements a few ways, increment counters for each hash (other elements can increment the same counters, resulting in overestimation), then estimate the the number of times it was added as the minimum of the counters it incremented.

Operations:

• Insert: add an element to the data structure
• Merge: produce a structure which represents the union of two structures
• Cardinality: estimate the number of times a particular estimate was inserted

--

--

--