A Beginner’s Guide to Bloom Filter

How to efficiently check if a username is registered?

Jason Ngan

Published in

CodeX

5 min readNov 25, 2022

Given a username on a user sign-up page, how do we tell if it has already been registered?

While querying an indexed database helps, it’s slow and incurs network calls.

To speed things up, we can cache the list of registered usernames in a key-value store like Redis.

However, that implies caching millions of records and doubling our memory footprint.

How can we do better in this seemingly-trivial problem?

The bloom filter might be the answer, let’s check it out!

What is a Bloom Filter?

A bloom filter checks if an item is in a set

A bloom filter answers one simple question,

Does an element exist in a given set?

A bloom filter is a probabilistic data structure. Given the question above, it outputs either of the following answers

Probably Yes
100% No

In other words, a false positive is LIKELY, while a false negative will NEVER occur.

And its biggest advantage is, it does this in CONSTANT time and space.

How does it work?

A bloom filter consists of two components

An N-size bit array
Several hashing functions

Initialisation

It is first initialised as an N-size bit array with all its bits set to zero. Let’s assume the array’s length to be 10 for now.

Adding an item

An item is hashed and mod to obtain a bounded index

Adding an item is trivial

The item said “tiger”, is hashed using a hashing function
The generated hash is mod by the array length to obtain a bounded index
The index of the bit array is then set to 1

Checking if an item exists

If index is set to 1, item is PROBABLY in the set. Otherwise, it’s CERTAINLY NOT in the set.

Similar to adding an item, we hash the element using a hashing function and mod it to obtain a bounded index.

The output is evaluated as follows,

If the index value of the bit array is 0, the item is NOT in the set.
Otherwise, the item is PROBABLY in the set

An item in a bloom filter can NEVER be removed since multiple items might share the same index.

Storing a bloom filter

Instead of storing the bloom filter as an array, we can convert its bit representation into a decimal number.

For instance, we can convert an array containing 10011 into 19 and store it in a cache.

If the list doesn’t change very often, the server can send the decimal number to the client, allowing validation to be done on the client side.

Can we do better?

If the hashing function outputs the index 1 for both “tiger” and “cow”, checking if “cow” is in the set yields the answer Yes even if it’s not.

We can reduce the chance of false positives via the following solutions.

Increase the length of the array
Increase the number of hashing functions

Increasing the length of the array reduces the chance of collision.

Instead of one index, we can obtain multiple indexes using several hashes.

When adding an item, all obtained indexes will be set to 1.

An item is claimed to be probably in the set, only if ALL the indexes are set to 1.

Leveraging these methods, we could significantly lower the probability of false positives.

Applications

Let’s take a look at some real-life examples.

Check if a username exists in a user sign-up flow

When a username is created, the username is added to a bloom filter stored in a key-value store.
When a user keys in a username on a user sign-up page, the server first queries the bloom filter.
If the username is NOT in the bloom filter, the server instantly returns an error to the client.
Otherwise, the server queries and cross-checks in the database.

Check if a user has read a Medium article

Medium maintains a bloom filter for each user.
Before recommending an article, Medium checks if the article ID exists in the user’s bloom filter.
Articles that are certainly NOT in the bloom filter are recommended to the user.

Check if a URL is malicious in Chrome

When accessing an URL, Chrome first validates if the URL is part of a malicious list.
Instead of querying the Google server each time, Google builds a bloom filter using a predetermined malicious list and sends it to the browser.
The browser hashes the URL and cross-checks with the bloom filter before accessing the website.

Closing

While there might be false positives, a bloom filter is handy when we want to tell if an item is definitely not in a list.

It can be used as a first layer of filtering due to its efficiency in both time and space.

I hope you find this helpful, and I will see you at the next one!