Bloom filter

Efficient way to filter out

Patryk Szlagowski
5 min readApr 1, 2024

Bloom filter is a data structure that efficiently solves the problem of set membership queries, often used in scenarios where memory is constrained or access to a disk is slow. Bloom filter structure helps us to answer the question: “is the element present in the set?”. But the trick is that we cannot answer “100% yes”, but “100% no”. It gives us the advantage to filter out elements that are not present in the set immediately.

Bloom filter is a structure that allows to answer with full certainty that the element does not exist in the set.

In this blog post, we’ll dive into the inner workings of Bloom filter using a simple example.

TL;DR how bloom filter works

How bloom filter looks like?

Usually, bloom filter is a set of characters that are the output from hashing function, but for this example, we will simplify it.

Let’s make an assumption: let’s define the set of characters we want to use — whole latin alphabet with the indexes for each letter:

a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25|

Now, we need to initialize our empty bloom filter filled up with 0 — it will represent that “every bit that is equivalent of letter in alphabet is not present in the bloom filter”

| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |

Now, having empty bloom filter, let’s add some words to it. Imagine, we want to hold the information about car brands we have in the garage (BMW, Mercedes and Audi).

Firstly, we need to find out what are the indexes of each letter. In this example, we are case insensitive:

index("BMW") = [1, 12, 22]
index("Mercedes") = [12, 4, 17, 2, 4, 3, 4, 18]
index("Audi") = [0, 20, 3, 8]

We are almost there! Last step of creating the bloom filter is to fill up it with “1” on the index that corresponds the ones we just found:

[0, 1, 2, 3, 4, 8, 12, 17, 18, 20, 22]

So the bloom filter will look like:

| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |

We just created our first bloom filter that holds the information! But what we can do with it now?

Searching in the Bloom filter

As I said at the very beginning, Bloom filter is the structure that allows us to answer to the question “is an element A present in the bloom filter?”.

bloomfilter(["BMW", "Mercedes", "Audi"]) = "11111000100010000110101000"

Try to find some values. For this purpose, we need to do the same steps of creating Bloom filter for each:

bloomfilter("Opel") = "00001000000100110000000000"
bloomfilter("Volvo") = "00000000000100100000010000"
bloomfilter("Seat") = "10001000000000000110000000"

Last thing is to compare those with the original one. The easiest way is to perform AND operation between each. If the result will be 26 of “0”, it means that the needle is not present in the stack.

For every bit of haystack bloom filter (HBF), we compare corresponding bit from the needle bloom filter (NBF). If particular NBF bit equals 1 and corresponding HBF bit equals 1, we stop the comparison because the record MIGHT BE present in the bloom filter. If not, continue comparing for each bit until the end of structure. If no “1” bit comes up from the comparison, the answer is “needle is not present in the haystack”

bloomfilter(["BMW", "Mercedes", "Audi"]) = "11111000100010000110101000"
bloomfilter("Opel") = "00001000000100110000000000"
bloomfilter("Volvo") = "00000000000100100000010000"
bloomfilter("Seat") = "10001000000000000110000000"

Opel
00001000000100110000000000 AND
11111000100010000110101000 =
00001000000000000000000000

result contains one “1” because we found entry for “e” in the stack.

Let’s do the same for Volvo and Seat

Volvo
00000000000100100000010000 AND
11111000100010000110101000 =
00000000000000000000000000

Seat
10001000000000000110000000 AND
11111000100010000110101000 =
10001000000000000110000000

At the result, for Volvo, is full of zeros, no “ones” — it means that we didn’t find Volvo in the base Bloom filter.

Where to use Bloom filters?

Bloom filters are suitable solutions for a initial filtering out problem. If we want to find some examples, there are just a few of them:

  • Fruit flies use a modified version of Bloom filters to detect novelty of odors, with additional features including similarity of novel odor to that of previously experienced examples, and time elapsed since previous experience of the same odor.
  • The servers of Akamai Technologies, a content delivery provider, use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.
  • Google Bigtable, Apache HBase, Apache Cassandra, ScyllaDB and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • The Google Chrome web browser previously used a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).

Summary

We just learned what is and how to use Bloom filters. This article tackled the subject. In real-life scenario, we could model our bloom filter differently in order to store values in a smart way.

In the next article, we will use those information and utilize bloom filters functionality in order to verify existing of the anonymized personal data. Stay tuned!

--

--