Demystifying Bloom Filters with Real-Life Examples

Abhishek Ranjan
5 min readMay 31, 2023

--

Photo by JJ Ying on Unsplash

Hey there! Let’s chat about a nifty data structure that’s the secret weapon of the tech world — the Bloom filter. If it sounds a bit mystic or even complicated, don’t fret. This article is your one-stop guide to understanding Bloom filters, how they work, and the magic they do in real-world applications. We’ll keep things casual, friendly, and clear!

What is a Bloom Filter?

First up, a quick definition. A Bloom filter is a space-efficient probabilistic data structure that’s used to test whether an element is a member of a set. In everyday terms, it’s like a super memory that knows a lot without taking up too much space!

When you ask a Bloom filter if it remembers a specific item, it will answer either “probably yes” or “definitely no.” If it says “probably yes,” there’s a slight chance that it could be wrong. But if it says “definitely no,” it’s totally confident about the answer.

How Does a Bloom Filter Work?

Now that we’re clear on what a Bloom filter is, let’s look at how it works.

At its core, a Bloom filter involves:

  1. Bit array: A sequence of 1s and 0s (bits) that the Bloom filter uses to remember stuff. Initially, all bits are set to zero.
  2. Hash functions: These are like Bloom filter’s superpowers. They take the item you want to remember and spit out positions in the bit array. Depending on the Bloom filter, it might use one or more hash functions.

Here’s a visual walkthrough:

The process of adding items is simple. You take your item, pass it through the hash functions, which in turn give out positions in the bit array. You then set the bits at these positions to 1. When you want to check if an item is in the set, you pass the item through the same hash functions. If all bits at the output positions are 1, the Bloom filter says “probably yes.” If any bit is 0, it confidently declares “definitely no.”

Bloom filters, however, can result in false positives. This happens when the bits for a queried item are all set to 1 by coincidence, even though the item was never actually added to the filter. Nevertheless, there won’t be any false negatives; if the item was added, the Bloom filter will surely remember it!

Bloom Filters in Real Life

Alright, let’s explore some real-life applications of Bloom filters. You’d be surprised to know that they’re quite popular!

1. Web Browsers: Google Chrome

Google Chrome uses a Bloom filter to protect you from harmful URLs. It goes something like this:

Whenever you try to access a URL, Google Chrome first checks with a local Bloom filter in your browser. This filter is filled with a list of hashed malicious URLs. If the Bloom filter returns “probably yes,” Chrome reaches out to Google’s servers for further verification. If the Bloom filter says “definitely no,” Chrome doesn’t bother with the server check, saving you time and preserving server resources.

The following code will show you the structure of a Bloom filter and how you can use it to check the presence of an element.

Please note that this is a basic example for educational purposes and may not cover all edge cases or optimizations. Also, the case of Google Chrome involves a list of malicious URLs that is constantly updated and managed by Google’s servers. This example will just illustrate the process of adding URLs to the Bloom filter and then checking for their existence.

import java.util.BitSet;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BloomFilter {
private BitSet bitSet;
private int bitSetSize;
private int numOfHashFunctions;

public BloomFilter(int size, int numOfHashFunctions) {
this.bitSetSize = size;
this.numOfHashFunctions = numOfHashFunctions;
this.bitSet = new BitSet(bitSetSize);
}

// Add element to Bloom Filter
public void add(String url) throws NoSuchAlgorithmException {
for (int i = 0; i < numOfHashFunctions; i++) {
int hashCode = getHash(url, i);
bitSet.set(Math.abs(hashCode % bitSetSize));
}
}

// Check if element is present in Bloom Filter
public boolean mightContain(String url) throws NoSuchAlgorithmException {
for (int i = 0; i < numOfHashFunctions; i++) {
int hashCode = getHash(url, i);
if (!bitSet.get(Math.abs(hashCode % bitSetSize))) {
return false;
}
}
return true;
}

// Computes the i-th hash function for the given URL
private int getHash(String url, int i) throws NoSuchAlgorithmException {
MessageDigest md5 = MessageDigest.getInstance("MD5");
md5.update(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(i).array());
md5.update(url.getBytes());
byte[] digest = md5.digest();
int hash = ByteBuffer.wrap(digest).getInt();
return hash;
}
}

Here is how you can use this Bloom filter:

public static void main(String[] args) throws NoSuchAlgorithmException {
BloomFilter bloomFilter = new BloomFilter(1000000, 3);

// Add some URLs to the Bloom filter
bloomFilter.add("http://example1.com");
bloomFilter.add("http://example2.com");

// Check if URLs are present in the Bloom filter
System.out.println(bloomFilter.mightContain("http://example1.com")); // Outputs: true
System.out.println(bloomFilter.mightContain("http://example3.com")); // Outputs: false
}

In this code, we’re using a Java BitSet to represent the bit array and Java's MessageDigest class to create hash functions. We're creating multiple hash functions by incorporating the index (i.e., the i parameter in getHash method) into our hashing mechanism. This changes the salt of the hash function, effectively giving us multiple hash functions.

Remember, this is a very simple example, and the real use case for Google Chrome would involve many optimizations and considerations for updates, scaling, and security.

2. Databases: Apache Cassandra and HBase

Big data systems like Apache Cassandra and HBase use Bloom filters for quick data retrieval. Here’s how it works:

Instead of doing a slow disk read every time they need to check if specific data exists, these databases use a Bloom filter for a quick first check. The Bloom filter has a smaller memory footprint, which makes this operation much faster. If the Bloom filter returns “probably yes,” the system may perform a disk read to confirm. But if the filter says “definitely no,” the system avoids the disk read, saving a lot of time.

Conclusion

I hope this dive into the world of Bloom filters has made them less of a mystery. They’re an incredible tool, especially in the era of big data, where efficient, time-saving structures like Bloom filters become increasingly important.

These data structures might seem simple, but their applications are vast, and they’re instrumental in many domains. So next time you come across the term Bloom filter, you can confidently say, “Yeah, I know what that is!”

🔗 Connect with me on LinkedIn!

I hope you found this article helpful! If you’re interested in learning more and staying up-to-date with my latest insights and articles, don’t hesitate to connect with me on LinkedIn.

Let’s grow our networks, engage in meaningful discussions, and share our experiences in the world of software development and beyond. Looking forward to connecting with you! 😊

Follow me on LinkedIn ➡️

--

--