Introduction to Bloom Filter

Divyansh Chowdhary
8 min readJan 19, 2020

Before we talk about what a Bloom Filter is and how it actually works let us discuss why do we even need a bloom filter. Let’s assume we have implemented a service that generates a unique number every time it is called , and we would like to know, whether the number has already been generated or not.

One way to check that, would be to maintain a list of all generated numbers to check for membership. But we know that would be a bad idea due to O(n) lookup time. So what we do instead, we use a set (or a hash-table) in memory that allows us do quick lookups and test for membership of a number. If it is already present, we generate a new number. Now this works fine as long as our hash-table is small enough so that it can/ reside in the memory. Consider the case for AADHAR for example - clearly there is no way a hash table for a billion plus numbers can reside in main memory. We can surely use the disk for storing and querying but since that is significantly slower compared to accessing the main memory we are not going to consider that case for now.

So, how do we tackle this situation? Is there a data structure that can be stored in main memory and still hold vast amount of data? This is where bloom filters come in.

What is Bloom Filter?

According to Wikipedia, A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not — in other words, a query returns either “possibly in set” or “definitely not in set.”

Let’s break the definition into simpler part, so as to understand it better. Bloom filters are exclusively used to test whether a given element is in set or not (answer “yes” or “no” question). However, they have a very powerful property which allows them to make trade-off between space and false-positive rate when it comes to membership existence. Bloom filters trade exactness for this efficiency meaning that there are false-positives — i.e. elements that are not a part of set but are claimed to be part of the set, thus, it is called probabilistic data structure.

It’s ok, if it is not clear yet. Let’s dive into how it works for a better understanding.

Algorithm

The bloom filter essentially consists of a bit-vector or bit-list(a list containing only either 0 or 1-bit value) of length m, initially all values set to 0, as shown below.

An empty bit-vector of size 10 (m=10)

Now to make entries to this we require k number of hash functions. When we want to add an item x in the filter, we feed it to k different hash functions and set the bits to ‘1’ at the resulting positions.

These k hash function generate k index values at which entries to the bloom filter will be made. Say for e.g. we take three hash functions H1, H2 and H3.

H1(geeks) % 10 = 1
H2(geeks) % 10 = 7
H3(geeks) % 10 = 4
Bits set corresponding to indices calculated by H1, H2 and H3

For another entry in the the set the same procedure is followed:

H1(nerd) % 10 = 3
H2(nerd) % 10 = 4
H3(nerd) % 10 = 5

Since, 4th bit is already set, we don’t set it again.

That was about inserting elements in the filter. In order to check membership the same procedure is repeated in the reverse order. We calculate respective hashes using H1, H2 and H3 ; and check if all these indices are set to 1 in the bit array. If all the bits are set then we can say that element is probably present. If any of the bit at these indices are 0 then it is definitely not present — if it were, then all the bits would have been set to 1 when it was inserted.

False Positives

Now, let us understand why we say, that an element is “probably present”. Suppose we are looking for an element “cat” in the set which we know was not entered in the filter. We take Hash from three functions for this element.

H1(cat) % 10 = 1
H2(cat) % 10 = 7
H3(cat) % 10 = 3

All the indices that the hash functions output are already set in the bit vector. 1 and 7 were set when we entered “geeks” and 3 when we entered “nerd”, but we know that “cat” was never entered in the filter. This results in the false positive result. Since all the bits are set the output will say the element is probably present in the set. So an application which does not make critical decisions based on this output such as checking if a username is present in the database, bloom filters can implement this feature.

You can go here and get an interactive demo, on how bloom filter works.

Optimal Value

So, let’s see how we can reduce the number of false positives returned.

If after reading the above you are thinking to yourself that we just need to reduce the number of collisions to reduce the rate of false positives then you are right. A simple improvement in the above example is to use more hash functions and have a large bit array. If instead of 10 bits we had 100 bits and instead of just 3 hash functions, we had a few more, the probability that a non-existing value to hash to all the set bits would have reduced, thereby, reducing the rate of false positives.

I won’t be going too much deep behind the maths of value, but if you are interested, you can go to here.

Drawbacks and possible solutions

As with everything, this also comes with some cons with it.

The size of the Bloom Filter

The size of the Bloom Filters need to be known a priori based on the number of items that you are going to insert. This is not so great if you do not know or cannot approximate the number of items. You could put an arbitrarily large size, but that would be a waste in terms of space which we are trying to optimize in the very first place and the reason why we adopt to choose Bloom Filter. This could be fixed to create a bloom filter dynamic to the list of items that you want to fit, but depending on the application, this may not be always possible. There is a variant called Scalable Bloom Filter which dynamically adjusts its size for different number of items. This could mitigate some of its shortcomings.

Cannot give the items that you inserted

Bloom Filter cannot produce a list of items that are inserted, you could only check if an item is in it, but never get the full item list because of hash collisions and hash functions. This is due to arguably the most significant advantage over other data structures; its space efficiency which comes with this disadvantage.

Removing an element

Removing an element from the Bloom Filter is not possible, you cannot undo an insertion operation as hash results for different items can be indexed in the same position.

If you want to do undo inserts, we can use a variation of bloom filter called Counting bloom filter. The idea is simple. Instead of storing a single bit of values, we will store an integer value and our bit vector will then be an integer vector. This will increase the size and costs more space to gives us the Removal functionality. Instead of just marking a bit value to ‘1’ when inserting a value, we will increment the integer value by 1. To check if an element exists, check if the corresponding indexes after hashing the element is greater than 0. To remove the element, calculate the indexes after hashing the element, and reduce the corresponding count by 1 of the corresponding indexes.

Another way, is to reconstruct the Bloom Filter from the start excluding a single item. Both methods involve an overhead and not straightforward.

Another alternative for bloom filter which handles deletion also, is Cuckoo filter. You can play with it here.

Applications

Bloom filters reduce the unnecessary calls to the database/disk for checking the membership of an element.

Bloom filter are used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times. Bloom filter decisions are much faster. However some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without the Bloom filter.

Some of the projects that use Bloom filters:

  • The servers of Akamai Technologies, a content delivery provider, use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.
  • Google Bigtable, Apache HBase and Apache Cassandra and PostgreSQL use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).
  • The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.
  • Medium uses bloom filters for recommending post to users by filtering post which have been seen by user.
  • Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.

That’s it!

Hope you were able to get a gist of this beautiful data structure.Thank you for staying till here!!!

Feel free to reach out…

Linkedin: https://www.linkedin.com/in/divyansh-chowdhary/

--

--

Divyansh Chowdhary

Developer | Interested in Distributed System | Looking for meaning of Life.