What the heck is a Bloom Filter anyway

Satvik Nema
5 min readJun 1, 2024

--

If you work in the software industry, you must have heard the term “bloom filter” at least once while going through any LLD/HLD concept. Lets demystify the data structure.

A Bloom filter is a probabilistic data structure which can efficiently tell whether an element is present in the set or not.

  1. when it says no, there’s a guarantee that the element is not in the set
  2. when it says yes, the element maybe present in the set.

Under the hood, the traditional bloom filter uses a bit array (only storing 1s and 0s) and h hash functions.

To add an element, say “hello”, it runs the input through the h hash functions. These h hash outputs are then mod by the size of the bit array to get values between 0 and h-1. These index positions are then set to 1 on the array.

When a query comes in, say for “world”, it again finds out the corresponding positions on the bit array and checks if all of them are 1 .

If yes, it returns true, false otherwise.

Example

So why is the “yes” probabilistic and “no” is a guarantee? Consider an example with h=3 and bit array with 20 bits and the words “Hello” and “World” both already in the filter. It might look like this:

Now for a guaranteed “no”, lets say query comes in for “John”

As the bit position10 is not set, we return false, even through other 2 bits were set. Here we can assert that even if one of the bit position has not been set, there’s a 100% guarantee that this element has not been inserted in the filter.

Now coming to a probabilistic “yes”, lets say query comes in for “Doe”

Bit positions 2 , 11 and 18 are all set. So the filter returns “yes”. But it is just by chance “Doe” hashed to all 1s. If you look closely, the bits 2 and 18 were set by “Hello” and 11 was set by “World”.

Hence even though the filter returns “yes”, the word “Doe” is not actually present in the set.

Wondering how would you implement this in your favourite programming language? Wrote a separate article which talks about implementing all of this in Java.

Time for some math

How to measure this probabilistic behaviour?

Assuming our bit array size is b bits, using h hash functions. Probability of a bit being set after using 1 hash function is

(1)

Probability of a bit NOT being set after using 1 hash function is

(2)

Hence the probability of a bit set to 0 after running through h hash functions and n insertions will be.

(3)

So AFTER n insertions, each time with h hash functions, the probability that the bit is wrongfully set to 1 will be 1 minus this value:

(4)

Note: this is different from a bit being rightfully set to 1 after n.k iterations, which is simply:

(5)

Equation (5) represents the probability that a particular bit is set to 1. While equation (4) represents the probability that a bit which was supposed to be 0 after h.n iterations is still 1 . Hence (4) will help us in finding out the error rate.

For the bloom filter to return true, the hashes of all h hash functions should resolve to a 1 . Hence the probability that the bloom filter wrongfully returns true will be:

(6)

(6) denotes the error rate for a bloom filter.

To find optimal number of hash function h, we find where the minima occurs for E while taking n and b as constants.

Not going into too much mathematical details, h turns out to be:

(7)

Substituting (7) in equation (6), we get

(8)

Refer this proof and this blog for detailed explanation for the derivations (7) and (8)

So with expected number of insertions n and the error threshold which can be tolerated E we can determine how large a bit array we should take b and how many hash functions should be used h .

Why should I use a bloom filter if a set does all of this without any probabilistic uncertainty? One word answer: SPACE. The main advantage of using this structure is it does NOT store the elements itself (while a set does). So the memory footprint reduces dramatically, speeding up computations.

This stackoverflow thread will convince you

Note we cannot remove an item from a bloom filter. It never forgets. If we do need this use case, we might need to use a “counting” version of a bloom filter, were instead of a bit, we store an entire integer to count how many times an element has been seen and increment it on addition, decrement it on removal. So the bit array becomes an integer array.

Applications

  1. Used in LSM based key value stores. Greatly helps in ‘early pruning’ for a search if a particular key is not available in the particular memtable. May save yourself scanning from 100s of files
  2. When websites suggest you not to use a password, a set of these commonly used passwords are stored on a bloom filter which quickly tells them if it already has been used. False positive is not an issue, as the user might simply choose another password
  3. Content delivery networks like Akamai use bloom filters to prevent “one hit wonders”. These are web pages which are queried once and never again. You wouldn't want these stored on caches. So before adding an entry onto the cache, you check if the bloom filter already has it, i.e, you’ve seen it twice.

References

  1. https://people.eecs.berkeley.edu/~daw/teaching/cs170-s03/Notes/lecture10.pd
  2. https://www.youtube.com/watch?v=V3pzxngeLqw
  3. https://medium.com/@humberto521336/bloom-filters-mathematical-proof-8aa2e5d7b06b
  4. https://stackoverflow.com/questions/4282375/what-is-the-advantage-to-using-bloom-filters
  5. https://en.wikipedia.org/wiki/Counting_Bloom_filter
  6. https://medium.com/@satviknema/implementing-a-bloom-filter-in-java-from-scratch-878976f3e530

--

--