Bloom filters : An Introduction
While signing up on a website you would have sometimes seen a message — username is already taken. Considering the website might have millions of users, how the website is able to validate this within a matter of a few seconds? One of the solutions to this problem is Bloom filter.
What is Bloom filter
It is a space efficient probabilistic data structure used to test whether an element is a member of the set. One could argue the use of other data structures such as hashsets to check set membership, however the Bloom filter does the same in a space and time efficient manner.
The tradeoff for this efficiency is that a bloom filter can give out false positives i.e it may say that the element is member of the set but in reality it isn’t.
A bloom filter will never give false negatives.
We will discuss more on this in the upcoming section.
How Bloom filter works
A Bloom filter is a bit array of length m. Initially, all the bits of the array are turned OFF i.e 0. The figure below shows a bloom filter of size 10 (index 1 to 10) with all the bits set to 0.
Adding an element to the Bloom filter
1. We take k hash functions. The item to be added is hashed using these k functions. The output is denoted by h₁(item), h₂(item),…hₖ(item).
2. The modulo(%) m (bloom filter size) of each output is calculated to identify k positions in the bloom filter array.
3. For all the k positions in the bloom filter, the bit is set to 1.
Let k = 2 (hash functions)
We want to add the word cat to bloom filter
Say,
h₁(“cat”) % 10 = 4
h₂(“cat”) % 10 = 7
The bits at the positions 4 and 7 will be turned to 1
Let’s say we want to add another word dog to the bloom filter
Say,
h₁(“dog”) % 10 = 2
h₂(“dog”) % 10 = 7
The bits at positions 2 and 7(already 1) will be turned to 1
Finding an element in the Bloom filter
1. Use the same hash functions to hash the value to be searched
2. Find the mod of the output with bloom filter size to get k positions in the bloom filter array.
3. Verify if the bits in those positions are 1.
Let’s verify the membership of the word dog in the bloom filter.
h₁(“dog”) % 10 = 2
h₂(“dog”) % 10 = 7
The bits at positions 2 and 7 are both set to 1, hence we can say that the word dog is a member of the bloom filter.
Let’s try to check if the word parrot is the present in the bloom filter
Say,
h₁(“parrot”) % 10 = 2
h₂(“parrot”) % 10 = 9
The bit at position 2 is 1 but the bit at position 9 is 0. Hence, the parrot is not present in the bloom filter.
False positives in the Bloom filter
It is possible that the hash function output(s) for certain elements might come out to be 1 even if the element is not a part of the bloom filter. This is demonstrated below:
Let’s try to see memership of the word camel
Say,
h₁(“camel”) % 10 = 4
h₂(“camel”) % 10 = 7
Even though camel is not a member of the bloom filter, but still it says otherwise since the bits at positions 4 and 7 are already 1. Here the bloom filter is giving false positives due to hash collisions.
Reducing false positives
To reduce false positives we need to decrease the probability of hash collisions. This can be done via the following:
1. Use a Bloom filter of larger size. Larger the size of bloom filter, more space it takes.
2. Use multiple hash functions. Large number of hash functions might impact execution times.
Limitations of Bloom filter
- The basic implementation of Bloom filter doesn’t support delete operation
- The false positive rate can be reduced but can never be made 0.
Applications of Bloom filter
- Fraud detection : as an initial layer of security banks apply bloom filters to check if transaction is of unknown behaviour eg: transaction done by a user from a completely new location or using a different phone, etc
- Databases : used by many databases for doing text search without doing costly disk operations
- Security: used to detect weak passwords, mallicious URL’s etc
Feel free to connect with me on LinkedIn for any queries.