Bloom Filter: A Probabilistic Search Approach in Data Structures

Published in

Techspace

7 min readAug 12, 2019

Let’s start with discussing a Problem

You want to create a Gmail account so you typed your name and got an annoying message “Your username is already taken”, And now to get passed through this you decided to concatenate it with your birth year still got the same message and now you are really frustrated and after trying some more strings you got the username and account.

Ever wonder, how the Gmail knows in an instant whether the username you typed is available or not?

After reading this you will surely understand this.

The Seductive Business Logic of Algorithms | Data Driven Investor

Certain machine behaviors never cease to amaze me. I'm astounded by their ability to learn from their accomplishments…

www.datadriveninvestor.com

But, before diving into our main topic, First, let’s make sure that you know all the prerequisites.

1. Bit Array

This is an example of bit array in which the size of array is 10 and all the bits is set to 0.

If you want to keep track of the numbers from a given range, then you want to use Bit Array. As soon as you got a number, just make the bit of that index equals to 1(or set). And, if in future you want to check whether you have seen that number just go to that index which is equal to your number and see the value if it is 1 then you have previously seen it otherwise not.

This is all you need to know about Bit Array to understand Bloom Filters.

2. Hashing

Hashing is the transformation of strings(in our case) into a usually shorter length number that represents this particular string.

There are many hash functions like SHA-1, murmer3, FNV series hash, Jenkins Hash, etc. but we are going to choose the hash function which is very fast and uniformly distributed across its values because we want to validate a username fast(no one wants to wait for minutes just to see if he/she can take this username).

So, for this purpose, we are choosing murmer3 as already implemented as an external python library called mmh3.

for example, I want to find my name’s hash code:

Let me explain this, mmh3.hash is a hash function that returns a 32-bit signed integer hash code with the first argument as a string which we want to transform and the second argument is the seed which is used to get different hash code every time we run this command. You might ask, isn’t we want to get the same hash code every time we enter one string?

Yes, we want to get same hash code every time we enter the same string but in Bloom filter, we want different hash codes by varying seed’s values for the same string so as to reduce the number of false positives. In our case, false positive means that when you entered a string to validate, it gives you the message “username already taken” in spite of the fact that it is not taken yet.

If we enter same string with same seed value then we always get same hash code.

As you can see:

We are getting same hash code= -822763081 every time as seed value is always equal to 1.

But, later you will see that as we change the seed values with everything remains the same then we will get different hash codes.

Now, that you have learned all prerequisites, Let’s dive straight into Bloom Filters.

Bloom Filter

It is a probabilistic type data structure that is used to test whether a given string of characters is present in the set or not. Since it is a probabilistic Data structures it might give you false-positive results. Luckily, we can reduce this probability to our needs since no one wants to change their favorite username to second most favorite username for no reason.

Implementation of Bloom Filter:

First, we need to make a bit array with size m all index values set to 0.

We need k hash functions to calculate hashes for a given string. We can do this with just 1 hash function more hash functions will increase the accuracy of the prediction.

But, we don’t want to take many hash functions since it will increase the latency of the process.

This is a trade-off between accuracy and processing and we need to find a sweet spot for our data structure according to where you are using it.

Let’s say we have to find the hash codes for the string “bloom”:

here, I have taken 3 hash functions with 3 different seed values but with same string

From the image, we can see that values are very long so to reduce the size to make it fit for our bit array we can use modulo function with the value of size of the bit array.

As soon as we get our required numbers [3, 8, 9], we set the corresponding bits to 1.

As, the process goes on and on, the bits get set to 1.

Now, it’s time to check whether a given string is in set or not.

We check for the string “cat” :

We get hash codes=[1, 4, 8] for the string “cat”

Now, we see if the bits at position 1, 4 and 8 all set to 1 if it does(here, It is clearly not but in case if it does) then,

“ The string probably be present, with high probability”

The reason it doesn’t sound sure is that when almost all the bits already set to 1, there is a probability that all the 3 indexes might be 1 due to different previously entered strings.

To state my point more clearly, suppose we get to position 1 from string 1, position 4 from string 2 and position 8 from string 3 but cat was not previously entered. It is the bad luck for the string “cat” that its hash codes happen to be the collection of previously typed hashes of different strings.

But, It can assure you one thing that there will be no false negatives. It means that if you entered a username that is certainly present in the set then it will never let you get through that username. And, it is because of the fact that if a username is already been taken by someone then all the bits were already set to 1 corresponding to that string. Now, if you check those indexes you will always get the same result(i.e. username is already been taken).

We have seen the working of Bloom Filters, Now it’s time to delve into Mathematics behind its accuracy and maintenance:

Remember from the definition of Bloom Filters, I told you that we can configure the probability of false positives in our data structures, now is the time to make that clear.

Probability of False Positives:

Notations: k= number of hash functions, m is the size of the bit array, n is the number of strings inserted.

The probability that a certain bit is not set to 1 by any of the hash functions is:

After inserting n strings to our set, the probability that a certain bit is still set to 0 is:

Now, the probability that a certain bit is set to 1 is:

Finally, for all the bits set to 1 by all the hash functions is:

The number of hash functions to be taken, to minimize the false positive probability, if m and n taken as constant is:

If the expected number of strings and desired false positive probability is known, then we can optimize the size of the bit array(m):

Real World Use cases:

If you have written a story, then you might have noticed that before publishing the story Medium asks us to tag the story with 5 strings. The reason behind the tagging is this itself. Medium uses Bloom Filters to recommend stories to its members.
Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
Google chrome uses Bloom Filters to identify malicious websites by their URLs.
Bitcoin uses Bloom Filters to speed up wallet synchronization.