A Beginner’s Guide to Bloom Filter
How to efficiently check if a username is registered?
Given a username on a user sign-up page, how do we tell if it has already been registered?
While querying an indexed database helps, it’s slow and incurs network calls.
To speed things up, we can cache the list of registered usernames in a key-value store like Redis.
However, that implies caching millions of records and doubling our memory footprint.
How can we do better in this seemingly-trivial problem?
The bloom filter might be the answer, let’s check it out!
What is a Bloom Filter?
A bloom filter answers one simple question,
Does an element exist in a given set?
A bloom filter is a probabilistic data structure. Given the question above, it outputs either of the following answers
- Probably Yes
- 100% No
In other words, a false positive is LIKELY, while a false negative will NEVER occur.
And its biggest advantage is, it does this in CONSTANT time and space.
How does it work?
A bloom filter consists of two components
- An N-size bit array
- Several hashing functions
Initialisation
It is first initialised as an N-size bit array with all its bits set to zero. Let’s assume the array’s length to be 10 for now.
Adding an item
Adding an item is trivial
- The item said “tiger”, is hashed using a hashing function
- The generated hash is mod by the array length to obtain a bounded index
- The index of the bit array is then set to 1
Checking if an item exists
Similar to adding an item, we hash the element using a hashing function and mod it to obtain a bounded index.
The output is evaluated as follows,
- If the index value of the bit array is 0, the item is NOT in the set.
- Otherwise, the item is PROBABLY in the set
An item in a bloom filter can NEVER be removed since multiple items might share the same index.
Storing a bloom filter
Instead of storing the bloom filter as an array, we can convert its bit representation into a decimal number.
For instance, we can convert an array containing 10011
into 19 and store it in a cache.
If the list doesn’t change very often, the server can send the decimal number to the client, allowing validation to be done on the client side.
Can we do better?
If the hashing function outputs the index 1 for both “tiger” and “cow”, checking if “cow” is in the set yields the answer Yes even if it’s not.
We can reduce the chance of false positives via the following solutions.
- Increase the length of the array
- Increase the number of hashing functions
Increasing the length of the array reduces the chance of collision.
Instead of one index, we can obtain multiple indexes using several hashes.
When adding an item, all obtained indexes will be set to 1.
An item is claimed to be probably in the set, only if ALL the indexes are set to 1.
Leveraging these methods, we could significantly lower the probability of false positives.
Applications
Let’s take a look at some real-life examples.
Check if a username exists in a user sign-up flow
- When a username is created, the username is added to a bloom filter stored in a key-value store.
- When a user keys in a username on a user sign-up page, the server first queries the bloom filter.
- If the username is NOT in the bloom filter, the server instantly returns an error to the client.
- Otherwise, the server queries and cross-checks in the database.
Check if a user has read a Medium article
- Medium maintains a bloom filter for each user.
- Before recommending an article, Medium checks if the article ID exists in the user’s bloom filter.
- Articles that are certainly NOT in the bloom filter are recommended to the user.
Check if a URL is malicious in Chrome
- When accessing an URL, Chrome first validates if the URL is part of a malicious list.
- Instead of querying the Google server each time, Google builds a bloom filter using a predetermined malicious list and sends it to the browser.
- The browser hashes the URL and cross-checks with the bloom filter before accessing the website.
Closing
While there might be false positives, a bloom filter is handy when we want to tell if an item is definitely not in a list.
It can be used as a first layer of filtering due to its efficiency in both time and space.
I hope you find this helpful, and I will see you at the next one!