Probabilistic Data Structures Simplified

Anik
15 min read · Jun 11, 2023

Probabilistic data structures use randomized algorithms to provide approximate solutions to problems, with a defined level of error or probability. They are designed to optimize memory usage and improve query performance.

They are often used for large datasets where exact results are not required or are too expensive to compute, and approximate results with a certain error rate are acceptable.

Here are some popular types of probabilistic data structures, followed by their simplified implementations in JavaScript and some common use cases.

1. Bloom Filter: A Bloom filter is a space-efficient data structure used to test whether an element is a member of a set. It can have false positives but no false negatives.

class BloomFilter {
  constructor(size) {
    this.size = size;
    this.storage = {};
  }

  // Simple hash functions
  hash1(value) {
    return value % this.size;
  }

  hash2(value) {
    return (value * 7) % this.size;
  }

  // Add an element to the Bloom filter
  add(value) {
    this.storage[this.hash1(value)] = true;
    this.storage[this.hash2(value)] = true;
  }

  // Check if an element is in the Bloom filter
  contains(value) {
    return Boolean(
      this.storage[this.hash1(value)] && this.storage[this.hash2(value)]
    );
  }
}

// Usage example
const bloomFilter = new BloomFilter(100);
bloomFilter.add(10);
console.log(bloomFilter.contains(10)); // true
console.log(bloomFilter.contains(20)); // false (or true if there's a false positive)
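
The false-positive rate depends on how many bits the filter has, how many hash functions it uses, and how many elements have been added. As a rough guide, the standard formula for m bits, k hash functions, and n elements is (1 - e^(-kn/m))^k; a small helper (a sketch, not part of the class above) makes this concrete:

// Sketch: estimated false-positive rate for a Bloom filter with m bits,
// k hash functions, and n inserted elements.
function bloomFalsePositiveRate(m, k, n) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

console.log(bloomFalsePositiveRate(1000, 3, 100).toFixed(4)); // ~0.0174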

Pros: Very space-efficient, constant-time inserts and lookups, no false negatives.
Cons: False positives are possible, and a standard Bloom filter does not support deletion.

Use Cases:

  • Web caching: Web caching systems like Squid use Bloom filters to check if a requested URL is already cached, avoiding unnecessary disk lookups.
  • Database systems: Bloom filters can be used to speed up query processing in databases by quickly determining if a specific record exists in a large dataset without scanning the entire dataset.
  • Spell checking: Bloom filters can be used to store a dictionary of correctly spelled words, allowing for fast spell checking by querying the filter for a given word.

2. Count-Min Sketch: A Count-Min sketch is a probabilistic data structure used to estimate the frequency of elements in a data stream. It never undercounts, but hash collisions can cause it to overestimate.

class CountMinSketch {
  constructor(width, depth) {
    this.width = width;
    this.depth = depth;
    this.storage = {};
  }

  // Simple hash function
  hash(value, seed) {
    return (value * seed) % this.width;
  }

  // Increment the count of an element
  add(value) {
    for (let i = 0; i < this.depth; i++) {
      const j = this.hash(value, i + 1);
      this.storage[i] ??= {};
      this.storage[i][j] ? this.storage[i][j]++ : (this.storage[i][j] = 1);
    }
  }

  // Estimate the count of an element
  count(value) {
    let min = Infinity;
    for (let i = 0; i < this.depth; i++) {
      min = Math.min(min, this.storage[i]?.[this.hash(value, i + 1)] || 0);
    }
    return min;
  }
}

// Usage example
const countMinSketch = new CountMinSketch(100, 5);
countMinSketch.add(10);
console.log(countMinSketch.count(10)); // 1 (or higher if there's a collision)
console.log(countMinSketch.count(20)); // 0
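
The width and depth control the accuracy: with width ≈ e/ε and depth ≈ ln(1/δ), an estimate exceeds the true count by more than εN (N being the total number of items added) with probability at most δ. A small helper to pick dimensions from a target error, offered as a sketch rather than part of the class above:

// Sketch: derive Count-Min sketch dimensions from an error bound (epsilon)
// and a failure probability (delta), using the standard formulas.
function countMinDimensions(epsilon, delta) {
  return {
    width: Math.ceil(Math.E / epsilon),
    depth: Math.ceil(Math.log(1 / delta)),
  };
}

console.log(countMinDimensions(0.01, 0.01)); // { width: 272, depth: 5 }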

Pros: Efficient estimation of item frequencies, low memory footprint.
Cons: May have slight overestimation of frequencies, not suitable for precise frequency counting.

Use Cases:

  • Network traffic analysis: Count-Min sketches can be used to track the frequency of IP addresses or other network entities in a data stream, helping to identify heavy hitters or potential DDoS attacks.
  • Text analysis: Count-Min sketches can be used to estimate the frequency of words or phrases in a large text corpus, enabling applications like trending topic detection or word cloud generation.
  • Recommendation systems: Count-Min sketches can be used to track user-item interactions in a recommendation system, allowing for efficient computation of item popularity or user similarity.

3. HyperLogLog: HyperLogLog is a probabilistic data structure used for cardinality estimation. It provides an estimate of the number of distinct elements in a large dataset with a certain error rate.

class HyperLogLog {
  constructor(precision) {
    this.precision = precision; // Number of bits used to index the registers
    this.registers = new Uint8Array(2 ** precision); // Array of registers
  }

  add(element) {
    const hash = this.hash(element); // Compute the hash value of the element
    const index = this.getIndex(hash); // Compute the index in the registers array
    const rank = this.getRank(hash, this.precision); // Compute the rank (leading-zero count + 1) of the hash value
    if (rank > this.registers[index]) {
      // If the rank is greater than the current register value, update the register
      this.registers[index] = rank;
    }
  }

  count() {
    const alpha = this.getAlpha(this.precision); // Get the alpha correction factor based on the precision
    const registerCount = this.registers.length;
    const sum = this.registers.reduce((accumulator, register) => {
      // Compute the sum of 1/(2^register) for all registers
      return accumulator + 1 / 2 ** register;
    }, 0);
    const estimate = (alpha * registerCount ** 2) / sum; // Raw estimate: alpha * m^2 / sum

    // Apply small range corrections
    if (estimate <= 2.5 * registerCount) {
      const zeroCount = this.registers.filter(
        (register) => register === 0
      ).length;
      // If any registers are still zero, apply linear counting correction
      if (zeroCount !== 0) {
        return Math.round(this.linearCounting(registerCount, zeroCount));
      }
    }

    // Apply large range corrections
    if (estimate > (1 / 30) * 2 ** 32) {
      return Math.round(-(2 ** 32) * Math.log(1 - estimate / 2 ** 32));
    }

    return Math.round(estimate); // Return the rounded estimated cardinality
  }

  hash(element) {
    // Implement a suitable hash function (e.g., murmurhash3, xxHash)
    // Convert the element to a unique hash value
    // For simplicity, we use a simple string hash
    let hash = 0;
    for (let i = 0; i < element.length; i++) {
      hash = (hash << 5) - hash + element.charCodeAt(i);
      hash |= 0;
    }
    return hash;
  }

  getIndex(hash) {
    // Compute the index in the registers array based on the hash value
    return hash >>> (32 - this.precision);
  }

  getRank(hash, bitCount) {
    // Compute the rank of the remaining hash bits, considering the bitCount
    const mask = (1 << bitCount) - 1;
    return this.countLeadingZeros(hash & mask) + 1;
  }

  countLeadingZeros(num) {
    // Count the number of leading zeros in a 32-bit number
    if (num === 0) return 32;
    let count = 0;
    while ((num & 0x80000000) === 0) {
      num <<= 1;
      count++;
    }
    return count;
  }

  getAlpha(bitCount) {
    // Get the alpha correction factor based on the bitCount
    const alphaMap = {
      4: 0.673,
      5: 0.697,
      6: 0.709,
      7: 0.715,
      8: 0.718,
      9: 0.72,
      10: 0.721,
    };
    return alphaMap[bitCount] || 0.7213 / (1 + 1.079 / (1 << bitCount));
  }

  linearCounting(registerCount, zeroCount) {
    // Apply linear counting correction to estimate the cardinality
    return registerCount * Math.log(registerCount / zeroCount);
  }
}

// Create a new HyperLogLog with a precision of 10 bits
const hll = new HyperLogLog(10);

// Add elements to the HyperLogLog
hll.add('apple');
hll.add('orange');
hll.add('banana');
hll.add('apple');

// Estimate the cardinality
console.log('Estimated Cardinality:', hll.count()); // Output: Estimated Cardinality: 3
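
The precision parameter directly controls the trade-off between memory and accuracy: with 2^precision registers, the typical relative error is roughly 1.04 / sqrt(2^precision). A quick helper (a sketch, not part of the class above) to see this:

// Sketch: approximate standard error of a HyperLogLog estimate
// for a given precision (number of index bits).
function hllStandardError(precision) {
  return 1.04 / Math.sqrt(2 ** precision);
}

console.log((hllStandardError(10) * 100).toFixed(2) + '%'); // ~3.25%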

Pros: High accuracy in estimating cardinality, minimal memory requirement.
Cons: Cannot retrieve individual items, limited to cardinality estimation.

Use Cases:

  • Analytics: HyperLogLog can be used to estimate the number of unique visitors to a website or the number of distinct elements in a data stream, providing valuable insights for analytics and reporting.
  • Database systems: HyperLogLog can be used to estimate the cardinality of large datasets, helping to optimize query planning and execution in database systems.
  • Data mining: HyperLogLog can be used to estimate the number of distinct items in large datasets, enabling efficient processing of tasks like frequent item set-mining or clustering.

4. Cuckoo Filter: Cuckoo Filters are probabilistic data structures that provide an efficient solution for approximate set membership queries. They are similar to Bloom Filters but additionally support deletion and achieve a higher load factor, which makes them particularly useful where space efficiency, membership testing, and deletions are all essential.

class CuckooFilter {
  constructor(capacity, bucketSize, fingerprintSize) {
    this.capacity = capacity;
    this.bucketSize = bucketSize;
    this.fingerprintSize = fingerprintSize;
    this.buckets = new Array(capacity);

    // Initialize buckets
    for (let i = 0; i < capacity; i++) {
      this.buckets[i] = new Array(bucketSize).fill(null);
    }
  }

  hash(value) {
    const hash = require('crypto').createHash('sha256');
    const input = typeof value === 'string' ? value : value.toString();
    hash.update(input);
    // Use the first 8 hex digits so the result fits in 32 bits and the
    // bitwise operations below behave as intended
    return parseInt(hash.digest('hex').slice(0, 8), 16);
  }

  // Compute the fingerprint of an element using a hash function
  computeFingerprint(element) {
    // Convert element to string
    const str = String(element);

    // Use a hash function to compute the fingerprint
    // Here, we simply take the lower 'fingerprintSize' bits of the hash value
    const hash = this.hash(str);
    const fingerprint = hash & ((1 << this.fingerprintSize) - 1);

    return fingerprint;
  }

  // Get the fingerprint and the two candidate bucket indices for an element
  getFingerprintAndIndex(element) {
    const fingerprint = this.computeFingerprint(element);
    const index1 = this.hash(element) % this.capacity;
    // '>>> 0' keeps the XOR result non-negative before the modulo
    const index2 = ((index1 ^ this.hash(fingerprint)) >>> 0) % this.capacity;

    return { fingerprint, index1, index2 };
  }

  // Insert an element into the Cuckoo Filter
  insert(element) {
    const { fingerprint, index1, index2 } =
      this.getFingerprintAndIndex(element);

    // Try to insert in the first bucket
    for (let i = 0; i < this.bucketSize; i++) {
      if (!this.buckets[index1][i]) {
        this.buckets[index1][i] = fingerprint;
        return true; // Insertion successful
      }
    }

    // Try to insert in the second bucket
    for (let i = 0; i < this.bucketSize; i++) {
      if (!this.buckets[index2][i]) {
        this.buckets[index2][i] = fingerprint;
        return true; // Insertion successful
      }
    }

    // Perform eviction if both buckets are full: pick one random slot,
    // remember its fingerprint, and place the new fingerprint there
    const evictionIndex = Math.random() < 0.5 ? index1 : index2;
    const slot = Math.floor(Math.random() * this.bucketSize);
    const evictedFingerprint = this.buckets[evictionIndex][slot];
    this.buckets[evictionIndex][slot] = fingerprint;

    // Retry inserting the evicted fingerprint (a simplification: a full
    // implementation would relocate it to its alternate bucket with a
    // bounded number of kicks)
    return this.insert(evictedFingerprint);
  }

  // Check if an element may be present in the Cuckoo Filter
  contains(element) {
    const { fingerprint, index1, index2 } =
      this.getFingerprintAndIndex(element);

    // Check in the first bucket
    for (let i = 0; i < this.bucketSize; i++) {
      if (this.buckets[index1][i] === fingerprint) {
        return true; // Element may be present
      }
    }

    // Check in the second bucket
    for (let i = 0; i < this.bucketSize; i++) {
      if (this.buckets[index2][i] === fingerprint) {
        return true; // Element may be present
      }
    }

    return false; // Element is definitely not present
  }
}

// Usage example:
const cuckooFilter = new CuckooFilter(1000, 4, 8); // Capacity: 1000, Bucket size: 4, Fingerprint size: 8

cuckooFilter.insert('apple');
cuckooFilter.insert('banana');
cuckooFilter.insert('cherry');

console.log(cuckooFilter.contains('apple')); // Output: true
console.log(cuckooFilter.contains('grape')); // Output: false
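
Deletion is one of the main advantages mentioned above, but it is omitted from the simplified class. A minimal sketch of how it could look, assuming the same fingerprint and bucket scheme (only safe for elements that were actually inserted):

// Hypothetical extension of the CuckooFilter above: remove an element by
// clearing its fingerprint from whichever candidate bucket holds it.
CuckooFilter.prototype.remove = function (element) {
  const { fingerprint, index1, index2 } = this.getFingerprintAndIndex(element);

  for (const index of [index1, index2]) {
    for (let i = 0; i < this.bucketSize; i++) {
      if (this.buckets[index][i] === fingerprint) {
        this.buckets[index][i] = null; // Free the slot
        return true; // Deletion successful
      }
    }
  }

  return false; // Fingerprint not found in either bucket
};

cuckooFilter.remove('banana');
console.log(cuckooFilter.contains('banana')); // false (unless another element shares the fingerprint)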

Pros: Efficient membership testing, compact representation.
Cons: Fingerprints introduce a small false-positive rate, and insertions may require evictions (and can fail) as the filter approaches capacity.

Use Cases:

  • Network routers and switches: Cuckoo Filters are used to efficiently store and query large sets of network flow information, such as source/destination IP addresses, port numbers, or protocols. They provide a compact data structure for fast lookup and filtering of network flows.
  • Web caches and content delivery networks (CDNs): Cuckoo Filters are used to store the URLs or content signatures of web pages or files in a cache or CDN. They help quickly determine if a requested resource is already cached, avoiding unnecessary retrieval from the origin server.
  • Anti-malware and spam filters: Cuckoo Filters are utilized in malware detection systems and email spam filters. They allow for fast matching and identification of known malicious URLs, file hashes, or email signatures, enabling efficient detection and prevention of malicious content.

5. Quotient Filter: A Quotient Filter is a probabilistic data structure similar to a Bloom Filter, but with some advantages. It provides a space-efficient representation of a set of elements and can efficiently test whether an element is a member of a set. The main advantage of a Quotient Filter over a Bloom Filter is that it supports deletions and has better cache performance due to its locality of reference.

class QuotientFilter {
  constructor(size) {
    this.size = size;
    this.table = new Array(size).fill(null);
  }

  // Hash function to generate a hash value for the input
  hash(value) {
    const hash = require('crypto').createHash('sha256');
    const input = typeof value === 'string' ? value : value.toString();
    hash.update(input);
    return parseInt(hash.digest('hex'), 16);
  }

  // Insert an element into the Quotient Filter
  insert(element) {
    const hashValue = this.hash(element);
    const quotient = hashValue % this.size;
    const remainder = Math.floor(hashValue / this.size);

    if (this.table[quotient] === null) {
      this.table[quotient] = [remainder];
    } else {
      this.table[quotient].push(remainder);
    }
  }

  // Check if an element is in the Quotient Filter
  contains(element) {
    const hashValue = this.hash(element);
    const quotient = hashValue % this.size;
    const remainder = Math.floor(hashValue / this.size);

    if (this.table[quotient] === null) {
      return false;
    } else {
      return this.table[quotient].includes(remainder);
    }
  }
}

// Usage example
const quotientFilter = new QuotientFilter(100);
quotientFilter.insert('banana');
console.log(quotientFilter.contains('banana')); // true
console.log(quotientFilter.contains('apple')); // false (or true if there's a false positive)
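
The deletion support mentioned above is not implemented in this simplified version; a minimal sketch of what it could look like, assuming the same quotient/remainder scheme:

// Hypothetical extension of the QuotientFilter above: remove one stored
// occurrence of an element from its bucket.
QuotientFilter.prototype.remove = function (element) {
  const hashValue = this.hash(element);
  const quotient = hashValue % this.size;
  const remainder = Math.floor(hashValue / this.size);

  const bucket = this.table[quotient];
  if (bucket === null) return false;

  const position = bucket.indexOf(remainder);
  if (position === -1) return false;

  bucket.splice(position, 1); // Drop one stored remainder
  if (bucket.length === 0) this.table[quotient] = null;
  return true;
};

quotientFilter.remove('banana');
console.log(quotientFilter.contains('banana')); // false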

Pros: Fast membership testing, memory-efficient.
Cons: Performance degrades as the filter fills up, possibility of false positives.

Use Cases:

  • Network Routing: Quotient Filters can be used in routers to store a large number of network routes efficiently. They allow for fast routing table lookups by determining whether a given IP address belongs to a particular route.
  • Caching: Quotient Filters can be used in caching systems to quickly determine if a particular object or resource is present in the cache. This helps in reducing expensive disk or database lookups by caching frequently accessed items.
  • Duplicate Detection: Quotient Filters can be used to identify duplicate elements in a large dataset. By inserting elements into the filter, you can quickly check if an incoming element is a duplicate, enabling efficient deduplication processes.

6. T-Digest: A T-Digest is an approximate data structure that trades accuracy for a low memory footprint. It maintains a compact approximation of the underlying distribution of the data, allowing efficient and accurate estimation of percentiles, and is particularly useful for large datasets or real-time analysis where traditional sorting-based approaches become impractical.

class TDigest {
  constructor(compression) {
    this.compression = compression;
    this.centroids = [];
  }

  add(value, weight) {
    // Step 1: Find the nearest centroid to the given value
    const nearestIndex = this.findNearestCentroid(value);

    if (nearestIndex !== -1) {
      const nearest = this.centroids[nearestIndex];
      // Step 2: Merge into the nearest centroid only if that keeps it within
      // the compression bound; otherwise start a new centroid
      if (
        Math.abs(nearest.mean - value) * (nearest.weight + weight) <=
        this.compression
      ) {
        nearest.add(value, weight);
      } else {
        this.centroids.push(new Centroid(value, weight));
      }
    } else {
      // Step 3: Create the first centroid if the digest is empty
      this.centroids.push(new Centroid(value, weight));
    }

    // Step 4: Merge nearby centroids to maintain compression
    this.mergeCentroids();
  }

  findNearestCentroid(value) {
    let nearestCentroid = -1;
    let minDistance = Infinity;

    for (let i = 0; i < this.centroids.length; i++) {
      const distance = Math.abs(this.centroids[i].mean - value);

      if (distance < minDistance) {
        minDistance = distance;
        nearestCentroid = i;
      }
    }

    return nearestCentroid;
  }

  mergeCentroids() {
    this.centroids.sort((a, b) => a.mean - b.mean);

    let i = 0;
    while (i < this.centroids.length - 1) {
      const curr = this.centroids[i];
      const next = this.centroids[i + 1];

      const combinedWeight = curr.weight + next.weight;
      const combinedMean =
        (curr.mean * curr.weight + next.mean * next.weight) / combinedWeight;

      if ((next.mean - curr.mean) * combinedWeight <= this.compression) {
        curr.mean = combinedMean;
        curr.weight = combinedWeight;
        this.centroids.splice(i + 1, 1);
      } else {
        i++;
      }
    }
  }

  estimateQuantile(quantile) {
    const desiredWeight = quantile * this.getTotalWeight();
    let cumulativeWeight = 0;

    for (const centroid of this.centroids) {
      cumulativeWeight += centroid.weight;

      if (cumulativeWeight >= desiredWeight) {
        return centroid.mean;
      }
    }

    return this.centroids[this.centroids.length - 1].mean;
  }

  getTotalWeight() {
    let totalWeight = 0;

    for (const centroid of this.centroids) {
      totalWeight += centroid.weight;
    }

    return totalWeight;
  }
}

class Centroid {
  constructor(mean, weight) {
    this.mean = mean;
    this.weight = weight;
  }

  add(value, weight) {
    const combinedWeight = this.weight + weight;
    const combinedMean =
      (this.mean * this.weight + value * weight) / combinedWeight;

    this.mean = combinedMean;
    this.weight = combinedWeight;
  }
}

// Create a new T-Digest with a compression factor of 100
const tDigest = new TDigest(100);

// Add data points to the T-Digest
tDigest.add(10, 1);
tDigest.add(20, 2);
tDigest.add(30, 3);
tDigest.add(40, 4);
tDigest.add(50, 5);

// Estimate the 80th percentile
const estimatedQuantile = tDigest.estimateQuantile(0.8);
console.log('Estimated 80th percentile:', estimatedQuantile); // ~45.56
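
The same digest can be queried at any quantile; a quick usage sketch (printed values are approximations whose exact output depends on how the centroids were merged):

// Query additional quantiles from the same digest
[0.25, 0.5, 0.95].forEach((q) => {
  console.log(`Estimated ${q * 100}th percentile:`, tDigest.estimateQuantile(q));
});

// The compression parameter controls how aggressively centroids are merged;
// fewer centroids means less memory but coarser quantile estimates
console.log('Centroids kept:', tDigest.centroids.length);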

Pros: Accurate estimation of percentiles and quantiles, memory-efficient.
Cons: Updates are slower than simple counters, and results are approximations whose accuracy depends on the compression parameter.

Use Cases:

  • Real-time analytics: T-Digest is efficient for estimating percentiles in real-time data analysis scenarios. For example, in monitoring systems, it can be used to track the distribution of response times, allowing developers to quickly identify anomalies or performance issues.
  • A/B testing: When conducting A/B tests on large-scale systems, T-Digest can be used to estimate percentiles of various metrics, such as conversion rates or click-through rates. It enables efficient and accurate analysis of experimental results, even with large sample sizes.
  • Machine learning: T-Digest can be applied in machine learning pipelines to estimate quantiles of features or predictions. It is useful in scenarios where a large dataset needs to be summarized efficiently, such as outlier detection, data preprocessing, or constructing histograms for model interpretation.

7. Lossy Counting: Lossy Counting is a probabilistic data structure designed to estimate the frequencies of items in a data stream while using limited memory by periodically removing low-count elements from the frequency table. This makes it suitable for identifying elements in large and continuously evolving datasets whose frequency count exceeds a user-given threshold.

class LossyCounting {
  constructor(epsilon) {
    this.epsilon = epsilon; // Maximum error tolerance
    this.bucketSize = Math.ceil(1 / epsilon); // Size of each frequency bucket
    this.buckets = new Map(); // Map to store item-frequency pairs
    this.totalItems = 0; // Total number of items processed
    this.currentBucketIndex = 0; // Index of the current bucket
  }

  add(item) {
    this.totalItems++;

    if (this.buckets.has(item)) {
      // Increment the frequency of existing item in the current bucket
      const count = this.buckets.get(item);
      this.buckets.set(item, count + 1);
    } else {
      // Add the item to the current bucket with a frequency of 1
      this.buckets.set(item, 1);
    }

    // Check if it is time to remove infrequent items from the data structure
    if (this.totalItems % this.bucketSize === 0) {
      this.removeInfrequentItems();
    }
  }

  removeInfrequentItems() {
    for (const [item, count] of this.buckets.entries()) {
      if (count <= this.currentBucketIndex) {
        this.buckets.delete(item);
      }
    }

    this.currentBucketIndex++;
  }

  estimateFrequency(item) {
    if (this.buckets.has(item)) {
      return this.buckets.get(item);
    }

    return 0;
  }
}

// Create a new Lossy Counting data structure with epsilon of 0.1
const lossyCounting = new LossyCounting(0.1);

const dataStream = ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'A', 'C', 'A'];

// Add items to the data structure
dataStream.forEach((element) => lossyCounting.add(element));

// Estimate the frequency of items
console.log("Frequency of 'A':", lossyCounting.estimateFrequency('A')); // 6
console.log("Frequency of 'B':", lossyCounting.estimateFrequency('B')); // 2
console.log("Frequency of 'C':", lossyCounting.estimateFrequency('C')); // 2
console.log("Frequency of 'D':", lossyCounting.estimateFrequency('D')); // 0

Pros: Efficient estimation of item frequencies, bounded memory usage.
Cons: May have higher error rates for low-frequency items, not suitable for precise frequency counting.

Use Cases:

  • Web analytics: Lossy Counting is useful for estimating the frequency of web page visits or clickstream events. It allows for real-time monitoring and analysis of user interactions, helping to identify popular pages, detect anomalies, or perform traffic analysis.
  • Network traffic analysis: Lossy Counting can be applied to network traffic monitoring to estimate the frequencies of different types of network packets or protocols. It enables efficient traffic analysis and anomaly detection, helping to identify patterns or potential security threats.
  • Market basket analysis: In retail or e-commerce settings, Lossy Counting can estimate item frequencies in customer transaction data. It enables the identification of frequently co-occurring items, such as items commonly purchased together, helping businesses to optimize inventory management, suggest product recommendations, or design targeted marketing campaigns.

8. Skip List: A Skip List is a probabilistic data structure that allows for efficient searching, insertion, and deletion operations in logarithmic time complexity. It is designed as an alternative to balanced binary search trees, providing a simpler implementation while maintaining similar performance guarantees.

class SkipListNode {
  constructor(value, level) {
    this.value = value; // Node value
    this.forward = new Array(level + 1); // Array of references to next nodes
  }
}

class SkipList {
  constructor() {
    this.head = new SkipListNode(null, 0); // Head node
    this.maxLevel = 0; // Maximum level of the Skip List
  }

  randomLevel() {
    let level = 0;
    while (Math.random() < 0.5 && level < this.maxLevel + 1) {
      level++;
    }
    return level;
  }

  insert(value) {
    const update = new Array(this.maxLevel + 1); // Array to track update pointers
    let currentNode = this.head;

    for (let i = this.maxLevel; i >= 0; i--) {
      while (currentNode.forward[i] && currentNode.forward[i].value < value) {
        currentNode = currentNode.forward[i];
      }
      update[i] = currentNode;
    }

    currentNode = currentNode.forward[0];

    if (!currentNode || currentNode.value !== value) {
      const newNodeLevel = this.randomLevel();
      if (newNodeLevel > this.maxLevel) {
        for (let i = this.maxLevel + 1; i <= newNodeLevel; i++) {
          update[i] = this.head;
        }
        this.maxLevel = newNodeLevel;
      }

      const newNode = new SkipListNode(value, newNodeLevel);

      for (let i = 0; i <= newNodeLevel; i++) {
        newNode.forward[i] = update[i].forward[i];
        update[i].forward[i] = newNode;
      }
    }
  }

  search(value) {
    let currentNode = this.head;

    for (let i = this.maxLevel; i >= 0; i--) {
      while (currentNode.forward[i] && currentNode.forward[i].value <= value) {
        if (currentNode.forward[i].value === value) {
          return true; // Value found
        }
        currentNode = currentNode.forward[i];
      }
    }

    return false; // Value not found
  }

  remove(value) {
    const update = new Array(this.maxLevel + 1); // Array to track update pointers
    let currentNode = this.head;

    for (let i = this.maxLevel; i >= 0; i--) {
      while (currentNode.forward[i] && currentNode.forward[i].value < value) {
        currentNode = currentNode.forward[i];
      }
      update[i] = currentNode;
    }

    currentNode = currentNode.forward[0];

    if (currentNode && currentNode.value === value) {
      for (let i = 0; i <= this.maxLevel; i++) {
        if (update[i].forward[i] !== currentNode) {
          break;
        }
        update[i].forward[i] = currentNode.forward[i];
      }

      while (this.maxLevel > 0 && this.head.forward[this.maxLevel] === null) {
        this.maxLevel--;
      }
    }
  }
}

// Create a new Skip List
const skipList = new SkipList();

const dataStream = [10, 20, 15, 5];

// Insert values into the Skip List
dataStream.forEach((element) => skipList.insert(element));

// Search for a value in the Skip List
console.log('Search for 15:', skipList.search(15)); // Output: true

// Remove a value from the Skip List
skipList.remove(20);

// Search for the removed value
console.log('Search for 20:', skipList.search(20)); // Output: false

Pros: Efficient insertion, deletion, and search operations, simple implementation.
Cons: Increased memory usage compared to other probabilistic data structures.

Use Cases:

  • In-memory databases: Skip Lists are commonly used as the underlying data structure in in-memory databases to provide efficient indexing and searching capabilities. They allow for fast lookups, insertions, and deletions, making them suitable for high-performance database systems.
  • Concurrent data structures: Skip Lists can be adapted to support concurrent operations by applying synchronization techniques. They offer good concurrency control due to their inherent hierarchical structure, making them valuable for concurrent data structures like concurrent sets, maps, and priority queues.
  • Range queries: Skip Lists efficiently support range queries, where you need to retrieve all elements within a specific range. By leveraging the hierarchical nature of Skip Lists, you can efficiently identify and iterate over elements within a given range, enabling efficient range-based operations in various applications such as database systems and search engines. A sketch of such a range query follows this list.
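
The SkipList class above does not include a range method; a minimal sketch of one, assuming the same node layout (descend to level 0 just before the lower bound, then walk forward):

// Hypothetical extension of the SkipList above: collect all values in [low, high]
SkipList.prototype.range = function (low, high) {
  let currentNode = this.head;

  // Descend while staying strictly below the lower bound
  for (let i = this.maxLevel; i >= 0; i--) {
    while (currentNode.forward[i] && currentNode.forward[i].value < low) {
      currentNode = currentNode.forward[i];
    }
  }

  // Walk the bottom level and collect values until the upper bound is passed
  const result = [];
  currentNode = currentNode.forward[0];
  while (currentNode && currentNode.value <= high) {
    result.push(currentNode.value);
    currentNode = currentNode.forward[0];
  }
  return result;
};

console.log(skipList.range(5, 15)); // [ 5, 10, 15 ] with the values inserted earlier (20 was removed)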

Note: The code snippets provided here are simplified implementations for educational purposes and may not be as optimized or performant as production-ready implementations.
