Bloom Filter: A Probabilistic Data Structure

Eyitayo Ogunbiyi
9 min read · Apr 22, 2019


Recently, I was trying to implement a new feature on a REST API I’m working on. The feature is a familiar one we see while trying to register on social media platforms:

To check whether a username is available or taken.

This might seem like a fairly trivial endpoint to write. I was working with Django, so I wrote something along the lines of this:

User.objects.filter(username=body['username']).exists()

It worked. Pretty easy. A client could now just send a POST request to this endpoint with any username and find out whether that username was available. However, I started to wonder how Django handled this under the hood. Whenever I use language or framework features, I always endeavor to ask myself: “How would you implement this yourself, from scratch?”. The problem we are trying to solve here can be simply put as: given a set of items, check whether a particular item exists within that set.

There are a couple of ways we can think about solving this problem. One technique would be a simple linear search through the collection, terminating whenever we find the element. This runs in O(N), and if our method ever gets used by applications where the set to be searched is large, it might become significantly slow.

We can improve our performance by using binary search, provided our data is sorted. This reduces our complexity by a huge factor, taking it down to O(log N).
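As a quick illustration, the binary-search version of the check can be sketched in Python using the standard library’s bisect module (the function name and the username list here are just made up for this example):

```python
import bisect

def username_taken(sorted_usernames, username):
    # Binary search: find the insertion point for the username,
    # then check whether the element at that point is a match.
    i = bisect.bisect_left(sorted_usernames, username)
    return i < len(sorted_usernames) and sorted_usernames[i] == username

# A hypothetical, already-sorted list of taken usernames.
taken = sorted(["alice", "bob", "carol"])
print(username_taken(taken, "bob"))   # True
print(username_taken(taken, "dave"))  # False
```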

Much better, but the ultimate question still remains:

“Can we do better?”

This question brings us to the main topic of this post: Bloom filters.

What are Bloom Filters?

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. — Wikipedia

From that definition, we can get a basic idea of what a bloom filter is, and how it can possibly help us solve the problem I started this article with. However, we still need to dig a bit deeper into bloom filters to understand how they help us achieve our goal. Firstly, what makes it a ‘probabilistic data structure’?

A probabilistic data structure is one which can’t provide a definite answer to your query. Instead, it gives you an answer with a reasonable degree of certainty. In the case where we have a bloom filter storing our user data (remember that bloom filters can be used to test the presence of an element within a set), if we ask the bloom filter whether a particular username has been taken, we get an answer extremely quickly.

The probabilistic characteristic comes into play because there’s a small chance of false positives: when a username is not actually taken, our bloom filter might say it exists. Now, is this always a bad thing? It really depends on our use case. Software engineering is all about making the right tradeoffs. We get a lot of savings in memory when using a bloom filter, at the expense of losing 100% certainty on queries.

In this scenario, if we tell a user that the username they’re about to use is taken, there’s a small chance our bloom filter is wrong, but this is something we can live with. Users probably care more about your application being fast than about their username choices. However, with bloom filters, we are 100% certain that we will never have a false negative. Anytime our user bloom filter tells us that a username is not taken, we are certain that it’s not taken. This is very important, as usernames are usually unique and we don’t want two users with the exact same username. That would be bad.

In summary, the key features of a bloom filter are:

  1. It’s pretty fast (queries usually take O(k) time; we’ll find out what k is shortly)
  2. It’s lightweight. We can condense a lot of data into a much smaller space using this data structure (it has a space complexity of O(m))
  3. False positives might occur
  4. False negatives never occur

Now let’s look at how the bloom filter does its thing.

How do bloom filters work?

Under the hood, the main data structure powering the bloom filter is a bit array (or bit vector), with all bits initially set to 0. We need to make this bit array of size m.

The initial state of our bloom filter with all bits as 0

A good question is how to choose an appropriate value of m. We’ll see a good way to select this value, based on some other parameters, shortly.

Great. Now we have our bit array down, and the next question you probably have is how we can add values to this bloom filter. We use multiple hash functions to do this (multiple hash functions are used to add a single value). A hash function can be thought of as a function which maps data of any size to a fixed-size output. This means that whether I put a single character or a really long string into a hash function, the output I get is the same length. Another desirable property of hash functions, in application to bloom filters, is that they should be fast to compute.

Hash functions are integral to the way bloom filters work; in fact, the value k we mentioned under the key features of bloom filters is simply the number of hash functions the filter uses.

Whenever we want to add an element, we simply run the element through each of our k hash functions. Whatever we get as output from a hash function might be greater than m (remember that m is the size of our bloom filter, thus our indices should be less than m to avoid overrunning the bit array), so we can keep the hash in our desired range by doing:

idx = hash_function(username) % m
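As a concrete sketch, one common trick is to derive several hash functions from a single one by mixing in a seed. Here’s an illustrative version in Python using md5 from the standard library (chosen for availability, not speed; the function name index_for is made up for this example):

```python
import hashlib

def index_for(username, seed, m):
    # Simulate k independent hash functions by mixing a seed into the
    # input before hashing, then reduce the hash modulo m to get an
    # index into the bit array.
    h = hashlib.md5(f"{seed}:{username}".encode()).hexdigest()
    return int(h, 16) % m

m = 10  # size of the bit array
print([index_for("geeks", seed, m) for seed in range(3)])  # three indices in [0, m)
```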

I found these images that really help in getting the point across on how the strings get mapped to indices in the array.

Image source: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

We can assume that we are using 3 hash functions (k = 3) and that they all take ‘geeks’ as their input. After performing the modulo (%) operation on each result, the input string gets mapped to three indices: 1, 4 and 7. We then simply set the bits at these indices to 1.

To keep adding items, we just keep repeating the process:

  1. Run whatever input you want to store through your k hash functions.
  2. Perform the modulo operation on each hash function’s output to obtain your indices.
  3. Set the bits held at those indices to 1.

Image source: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

On a new input, we get indices 3, 4 and 5. Even though bit 4 was marked previously, we leave it as 1.
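The add steps above can be sketched in a few lines of Python. This is purely illustrative: it simulates the k hash functions by mixing a seed into Python’s built-in hash() (which is randomized per process for strings, so it’s not something you’d use in production):

```python
def make_bloom(m):
    return [0] * m  # bit array of size m, all bits 0

def add(bits, item, k):
    m = len(bits)
    for seed in range(k):
        # Simulate k hash functions by mixing a seed into Python's
        # built-in hash() (illustrative only), then reduce modulo m.
        idx = hash((seed, item)) % m
        bits[idx] = 1  # bits that are already set simply stay 1

bits = make_bloom(10)
add(bits, "geeks", k=3)
print(bits)  # at most 3 bits are now set to 1
```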

To test whether a username (or, generally, any item) has been added to our bloom filter, we use each of our k hash functions to hash the username, and from each output hash we get the corresponding index. We then check our bit array to see if the values at those indices are all 1. If we still have a value of 0 at any of these indices, we can say with complete confidence that the username we just tested is not present in the set.

However, if all of the indices hold a value of 1, we can only say: “the username probably exists in the set”. The reason for this uncertainty boils down to the fact that bloom filters are probabilistic in nature. In cases where we are uncertain, we can then actually go and check on our servers. Since false positives rarely occur, we’ve drastically reduced the number of trips we have to make to our server.
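A minimal membership test matching this description can be sketched as follows (again simulating the k hash functions with Python’s built-in hash(), purely for illustration). Note that an item we actually added is always reported present, which is the no-false-negatives guarantee:

```python
def might_contain(bits, item, k):
    m = len(bits)
    # Re-derive the same k indices used when adding; if any bit is 0,
    # the item was definitely never added.
    return all(bits[hash((seed, item)) % m] for seed in range(k))

# Tiny end-to-end check with a 16-bit filter and k = 3:
# set the bits for "alice" exactly as an add() would.
bits = [0] * 16
for seed in range(3):
    bits[hash((seed, "alice")) % 16] = 1

print(might_contain(bits, "alice", 3))  # True: all of alice's bits are set
```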

It is possible that a new username we are testing gives a false positive because the indices in the bit array have already been marked by other inputs.

Take a look at the following image, where we are trying to test whether the string “cat” exists in our bloom filter.

Image source: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

While testing, we get 1, 3 and 7 as the corresponding indices for the string. However, these indices have already been marked by previous inputs. This is a perfect illustration of how a bloom filter can give us a false positive. We have to keep the false positive rate of a bloom filter low; if every test returns a false positive, we lose any benefit we could have gotten from the bloom filter.

We can keep the false positive rate in check by carefully setting the values of m (the size of our bit array) and k (the number of hash functions).

Keeping False Positives In Check

There’s a mathematical formula that describes exactly how to calculate the probability of a false positive. This probability depends on:

  1. The size of the bit array (m)
  2. The number of hash functions used (k)
  3. The number of elements held by your bloom filter (n)

For a filter with these parameters, the false positive probability p is approximately:

p ≈ (1 − e^(−kn/m))^k

You can get the full derivation here -> Wikipedia

With some further mathematical manipulation, we can get expressions for the optimal m and k:

m = −(n ln p) / (ln 2)²

k = (m / n) ln 2

These formulae can guide us when implementing a bloom filter for a specific application. If we have an idea of the number of elements (n) that will be stored by our bloom filter, we can set a value for the false positive rate p that meets our requirements, then simply plug into these equations to get optimal values for k and m.
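Those standard formulas translate directly into a few lines of Python (the function name optimal_params is just illustrative):

```python
import math

def optimal_params(n, p):
    # Optimal bit-array size m and hash-function count k for storing
    # n items with a target false-positive rate p, using the standard
    # Bloom filter formulas: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

# e.g. one million usernames with a 1% false-positive target:
m, k = optimal_params(1_000_000, 0.01)
print(m, k)  # roughly 9.6 million bits (~1.2 MB) and 7 hash functions
```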

We also want to ensure that the k hashing algorithms we use are reasonably fast to compute, since the speed of insertion and testing depends directly on how fast we can compute all k hashes. Even though algorithms like SHA-256 and other cryptographic hashes are popular, we tend to use faster, non-cryptographic hashing algorithms when computing indices for bloom filter applications. Some of the hashing algorithms used are Fowler–Noll–Vo (FNV) and MurmurHash.
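For a flavor of what such a non-cryptographic hash looks like, here is a sketch of 32-bit FNV-1a, which fits in a few lines (constants are the published FNV offset basis and prime):

```python
def fnv1a_32(data: bytes) -> int:
    # FNV-1a, 32-bit: XOR each byte into the hash, then multiply by
    # the FNV prime, keeping the result within 32 bits. Simple and
    # fast, which is why hashes like this suit bloom filters.
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) % 2**32  # FNV prime, truncated to 32 bits
    return h

print(fnv1a_32(b"geeks"))
```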

Conclusion

I think bloom filters are an excellent data structure, even though I haven’t used one directly in a production environment. Companies operating at large scale definitely need to reduce the number of times they have to hit their servers, especially when it can be easily avoided with a bloom filter.

PostgreSQL, Cassandra and Google BigTable use bloom filters under the hood to avoid looking up non-existent rows or columns.

Another application is right here on Medium: Medium uses bloom filters to avoid re-recommending articles to users.

I’ll be writing a basic implementation of a bloom filter and putting it up on my GitHub soon. Aside from the theoretical parts of data structures, implementations are also great for solidifying concepts.

I hope you’ve been able to learn a few things from this article :)

Clap it if you did 😀

You can check out my GitHub: tayoogunbiyi

Or follow me on Twitter: eyitayoo_
