Goodbye to Slow Searches: The Bloom Filter Breakthrough! 🔍

Published in

tech@iiit-gwalior

6 min read · Mar 3, 2024

Imagine a big social media company with billions of users looking for a unique username. Every new signup comes with a flood of creative suggestions, from the classic “username123” to the more imaginative “CosmicDancingLlama.” But how can the platform ensure these usernames are truly unique amidst a sea of existing accounts? The answer lies in a clever data structure called a Bloom filter, a digital tool that helps identify “maybes” with surprising speed and efficiency.

The classic way to handle this search problem is linear search, which walks through the list of usernames one by one and compares each entry with the requested name. At this scale, its linear time complexity is a nightmare. Another commonly used algorithm is binary search, which brings the time complexity down to logarithmic but requires the usernames to be kept in sorted order. Operating at such a large scale calls for something more efficient.
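To make the two classic approaches concrete, here is a minimal Python sketch. The function names and the tiny username list are made up for illustration, and the binary search relies on the standard library's bisect module.

```python
from bisect import bisect_left

def linear_contains(usernames, name):
    """O(n): compare against every stored username until a match is found."""
    for existing in usernames:
        if existing == name:
            return True
    return False

def binary_contains(sorted_usernames, name):
    """O(log n): halve the search range each step; the list must be sorted."""
    i = bisect_left(sorted_usernames, name)
    return i < len(sorted_usernames) and sorted_usernames[i] == name

taken = sorted(["Bookworm", "CatLover", "username123"])
print(linear_contains(taken, "CatLover"))  # True
print(binary_contains(taken, "TechFan"))   # False
```

Both versions still need the full list of usernames in memory (and, for binary search, kept sorted), which is the cost a Bloom filter tries to avoid.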

Enter Bloom Filters, a space-efficient probabilistic data structure that tests whether an element is a member of a set. Think of a Bloom Filter as a function that answers whether an element is present in a set with a FIRM NO or a PROBABLY YES.

The Bloom Filter always answers as a “FIRM NO” or a “PROBABLY YES.”

How does Bloom Filter work?

Now, let’s dive into the workings of a Bloom Filter. It is built mainly from two building blocks:

Bit Array: This is a large array of individual bits, typically represented as 0s and 1s. It serves as the storage space for the Bloom filter.
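Hashing Functions: Each is a mathematical function that takes an input (like a username) and deterministically produces an output (a hash). The output is not guaranteed to be unique: different inputs can occasionally produce the same hash, which is called a collision. A Bloom filter uses several such functions, and their hashes determine which bits in the bit array are set when adding an element and inspected when checking one.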
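The two building blocks can be sketched in a few lines of Python. This is only an illustration under some assumptions: the sizes M and K are picked arbitrarily, and the several hash functions are simulated by prefixing a seed to the input before hashing it with SHA-256 from the standard library. Positions here are 0-based.

```python
import hashlib

M = 10  # number of bits in the array (assumed for illustration)
K = 3   # number of hash functions (assumed for illustration)

bits = [0] * M  # the bit array, all zeros to begin with

def bit_positions(item, m=M, k=K):
    """Return k positions in the range 0..m-1 for the given string."""
    positions = []
    for seed in range(k):
        # Prefixing a seed makes each round behave like a different hash function.
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions

print(bit_positions("CatLover"))
```

The same input always yields the same positions, which is what lets a later lookup inspect exactly the bits that an earlier insert set.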

Let’s take an example to understand it further. Continuing with the real-world example of a social media platform, let’s build a bit array with 10 bits, each bit initially set to 0, and use three hashing functions.

Inserting Usernames:

Let’s add two usernames: “CatLover” and “Bookworm”. For each username:

Each of the three hashing functions generates a code (a number between 1 and 10 in this example).
Each code corresponds to a specific bit in the bit array.
We turn all three corresponding bits to 1 (on).

“CatLover”: Hash functions give codes 3, 5, and 8. Bits 3, 5, and 8 in the array are turned to 1.

“Bookworm”: Codes are 2, 7, and 9. Bits 2, 7, and 9 are turned to 1.

The final array would look like this:

Position: 1 2 3 4 5 6 7 8 9 10
Bit:      0 1 1 0 1 0 1 1 1 0
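The insertion steps can be replayed in a short Python sketch. The codes are copied straight from the example rather than computed, so the dictionary below is a stand-in for the three hashing functions.

```python
# Codes copied from the example; real codes would come from the hash functions.
EXAMPLE_CODES = {
    "CatLover": [3, 5, 8],
    "Bookworm": [2, 7, 9],
}

bits = [0] * 10  # positions 1..10 in the text map to indexes 0..9 here

def insert(name):
    for code in EXAMPLE_CODES[name]:
        bits[code - 1] = 1  # shift from 1-based position to 0-based index

insert("CatLover")
insert("Bookworm")
print(bits)  # [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
```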

The Lookup:

For the lookup process, someone wants to create a new account with the username: “TechFan”.

The hashing functions will generate codes for it: 1, 5, and 6.
In the bit array, bit 5 is set (by “CatLover”), but bits 1 and 6 are still 0. A username that had been added would have all three of its bits set, so “TechFan” is definitely not in the set and can be taken up as a username.

Another user wants to take up the username: “Programmer”. We shall again go with the same process:

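The hashing functions will generate codes for it: 2, 5 and 9.
With the existing array, we find that all three bits (2, 5 and 9) are already set, so the filter answers PROBABLY YES: “Programmer” might be present in the set. In fact it was never added; bits 2 and 9 were set by “Bookworm” and bit 5 by “CatLover”. Going by the filter alone, the name would be treated as taken.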
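The two lookups can be replayed the same way. As before, this is a sketch with the codes hard-coded from the example instead of computed by real hash functions.

```python
# Bit array after inserting "CatLover" (3, 5, 8) and "Bookworm" (2, 7, 9).
bits = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]

# Codes copied from the example; real codes would come from the hash functions.
LOOKUP_CODES = {
    "TechFan": [1, 5, 6],
    "Programmer": [2, 5, 9],
}

def lookup(name):
    """Answer PROBABLY YES only if every bit for the name is set."""
    if all(bits[code - 1] == 1 for code in LOOKUP_CODES[name]):
        return "PROBABLY YES"
    return "FIRM NO"

print(lookup("TechFan"))     # FIRM NO (bits 1 and 6 are 0)
print(lookup("Programmer"))  # PROBABLY YES (bits 2, 5 and 9 are all 1)
```

A single 0 bit is enough for a FIRM NO, which is why the check can stop at the first unset bit.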

From this example, we can see the case of “false positives”. This shows the probabilistic nature of bloom filters. Since multiple words can hash to the same bit, a set bit doesn’t guarantee the element exists, only that it might. We might need to consult the actual username database for confirmation. Nevertheless, there won’t be any false negatives; if the item was added, the Bloom filter will surely remember it!
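Putting the pieces together, here is a minimal Bloom filter sketch in Python with the database confirmation step. Several things are assumptions made for illustration: the class and method names, the default sizes, the seeded SHA-256 trick for simulating multiple hash functions, and the plain set standing in for the username database. A production filter would more likely use a faster non-cryptographic hash and a packed bit array.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, kept simple for readability

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

database = {"CatLover", "Bookworm"}  # stand-in for the real username store
bloom = BloomFilter()
for name in database:
    bloom.add(name)

def is_taken(name):
    if not bloom.might_contain(name):
        return False          # FIRM NO: skip the database entirely
    return name in database   # PROBABLY YES: confirm against the source of truth

print(is_taken("CatLover"))  # True
print(is_taken("TechFan"))   # False
```

Because every added name sets all of its bits, might_contain can never return False for a name that was added; the database check only exists to weed out false positives.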

Limitations of Bloom Filters

While Bloom filters offer a compelling combination of speed and space efficiency, they come with a few limitations:

False positives: The inherent trade-off for speed is the possibility of encountering false positives. This means the filter might indicate the presence of an element (like a username) even though it doesn’t exist.

When the false-positive rate gets as high as 80% to 90% (usually when almost all the bits of the Bloom filter are set), the filter is resized to a larger bit array. Because the filter does not store the items themselves, resizing means re-hashing every previously inserted item from the original data source and inserting it into the new array. A suitable initial filter size should therefore be chosen up front to minimize these scenarios.
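Choosing the initial size can be guided by the standard approximation for the false-positive probability, p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions and n the number of inserted items. The Python sketch below applies it along with the usual sizing formulas; the function names are made up for illustration, and the results are estimates that assume well-behaved hash functions.

```python
import math

def false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

def recommended_size(num_items, target_rate):
    """Return (num_bits, num_hashes) for an expected item count and target rate."""
    # m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2
    num_bits = math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes

m, k = recommended_size(1_000_000, 0.01)
print(m, k)
print(false_positive_rate(m, k, 1_000_000))
```

For one million usernames and a 1% target, this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.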

No deletions: Once an element is added to the Bloom filter, it cannot be removed in its basic implementation. This can be problematic when data needs to be updated or deleted. An improved variant of Bloom filters, termed “Counting Bloom Filters,” was introduced to address this issue. They replace each bit with a small counter, which supports the deletion operation but increases the space requirements of the data structure.
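A counting Bloom filter can be sketched by swapping each bit for a small counter. The Python below is an illustrative version under the same assumptions as before (made-up class name, seeded SHA-256 in place of real hash functions, a plain list of integers as counters).

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, num_counters=1024, num_hashes=3):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters  # a small integer per slot instead of one bit

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def might_contain(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item):
        # Only remove items that were really added: removing a false positive
        # would decrement counters that belong to other items.
        if not self.might_contain(item):
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True

cbf = CountingBloomFilter()
cbf.add("CatLover")
cbf.add("Bookworm")
cbf.remove("CatLover")
print(cbf.might_contain("Bookworm"))  # True
print(cbf.might_contain("CatLover"))
```

The extra space comes from the counters: even a 4-bit counter per slot makes the structure four times larger than a plain bit array of the same length.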

Another optimised Bloom Filter implementation is the “Deletable Bloom Filter”. You can read more about it in the paper. In this implementation, another bit array keeps track of bits (regions) where collisions happen, and while deleting an item, if a bit has been set only because of that item (no collisions), that bit can be set to ‘0’.
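The idea can be sketched roughly as follows in Python. This is a simplified illustration of the description above, not the paper's exact layout: the class name, the region count and the seeded SHA-256 hashing are all assumptions, and the collision bitmap is kept as a separate array.

```python
import hashlib

class DeletableBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3, num_regions=32):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.num_regions = num_regions
        self.bits = bytearray(num_bits)
        self.collided = bytearray(num_regions)  # 1 once a collision is seen in a region

    def _positions(self, item):
        positions = set()  # a set, so an item never collides with itself
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def _region(self, pos):
        return pos * self.num_regions // self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            if self.bits[pos]:
                self.collided[self._region(pos)] = 1  # bit already set: record the collision
            else:
                self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def remove(self, item):
        """Reset the item's bits in collision-free regions; True if any bit was reset."""
        if not self.might_contain(item):
            return False
        removed = False
        for pos in self._positions(item):
            if not self.collided[self._region(pos)]:
                self.bits[pos] = 0
                removed = True
        return removed

dbf = DeletableBloomFilter()
dbf.add("CatLover")
dbf.add("Bookworm")
dbf.remove("CatLover")
print(dbf.might_contain("Bookworm"))  # True
```

An item is only deletable if at least one of its bits sits in a collision-free region, so the fuller the filter gets, the fewer items can actually be removed.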

Limited information retrieval: Bloom filters only tell you whether something might be present, not what it is. This means you must consult another source (like a database) to retrieve the specific data.
Dependence on hash functions: The effectiveness of a Bloom filter heavily relies on the quality of its hash functions. If the functions are poorly designed and create more collisions (multiple elements mapping to the same bit), the filter can become unreliable and experience more false positives.

Despite these limitations, Bloom filters remain valuable tools for handling large datasets where speed and efficiency are crucial.

Applications of Bloom Filter

Despite their limitations, Bloom filters find applications in various fields where space and speed matter and “false positives” are tolerable:

1. Personalized Recommendations on Medium: To ensure you encounter new and engaging content, Medium utilizes Bloom filters. These filters act as a memory bank, keeping track of the articles you’ve already seen. This allows the platform to filter out previously viewed posts, presenting you with a fresh selection of content tailored to your interests.

2. Streamlining Stories on Quora: Quora, an online knowledge-sharing platform, leverages Bloom filters to ensure you don’t encounter the same stories repeatedly. They implement a shared Bloom filter, acting like a collective memory for the entire feed. This filter efficiently identifies and removes previously seen stories from your feed, saving valuable time and preventing information overload.

3. Securing Browsing with Chrome: Although the approach is no longer in use, Google Chrome previously employed Bloom filters as a line of defence. These filters checked website addresses (URLs) against a known list of malicious websites. Identifying potential threats quickly helped protect users from harmful content and safeguarded their online security.

4. Optimizing Database Queries: Behind the scenes of various major databases, like Google Bigtable, Apache HBase, Apache Cassandra, and even PostgreSQL, Bloom filters play a crucial role in reducing the number of times the database has to physically read its storage to look for data. These filters act like intelligent gatekeepers, efficiently checking whether the information you seek might exist in the database before initiating a full-fledged search. This significantly improves query performance and streamlines the process of retrieving relevant data.

Bloom Filters help you reduce unnecessary disk access. This makes them amazing for applications with lots of data. | Credits: https://medium.com/geekculture/a-quick-introduction-to-bloom-filters-eeba404b13a2
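The gatekeeper role can be sketched with a toy key-value store in Python. Everything here is hypothetical: TinyStore is not how any of the databases above are implemented, a dict stands in for slow on-disk storage, and the filter uses seeded SHA-256 in place of real hash functions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

class TinyStore:
    """Hypothetical key-value store; a dict stands in for slow on-disk storage."""
    def __init__(self):
        self.disk = {}
        self.filter = BloomFilter()
        self.disk_reads = 0

    def put(self, key, value):
        self.disk[key] = value
        self.filter.add(key)

    def get(self, key):
        if not self.filter.might_contain(key):
            return None        # FIRM NO: answer without touching the disk
        self.disk_reads += 1   # PROBABLY YES: pay for the real lookup
        return self.disk.get(key)

store = TinyStore()
store.put("user:1", "CatLover")
print(store.get("user:1"))  # CatLover
print(store.disk_reads)     # 1
```

Lookups for keys that were never stored usually return None without incrementing disk_reads; only the occasional false positive costs a wasted read.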

By understanding the power of “maybes,” Bloom filters offer a valuable solution for various platforms, enabling them to manage information efficiently, personalize user experience, and ultimately, shape how we interact with the vast online world.