Bloom Filters

Sandeep Verma
5 min readSep 29, 2019

Introduction

Google Chrome warning user when the user visits a malicious website

Have you ever come across such a warning from Google while visiting some websites? If you did, then you have already seen the application of bloom filters.
You also might have faced the following error on the sign-up page?

Username already exists or week password error

So what exactly is bloom filter and how does Google leverage it in its application?

Bloom filters in detail

List, tree, sets, and hashtables are well-known data structures that are designed to store data in such a manner so as to perform certain kinds of operations efficiently.

Bloom filters are the probabilistic data structure designed to tell us rapidly whether a value ‘probably’ exists in a data store. Moreover, it’s also quite memory efficient. We are giving up certainty at the cost of a faster answer. Bloom filters can give false positive answers(i.e. an element exists in datastore, but in fact, it doesn’t) but it never gives false negative answers (It never proclaims that a value is not in the datastore, even though it exists). The later feature of bloom filters is the reason why many applications employ them heavily.

One imminent question that might be coming to your mind would be why bloom filters and why not hashtables or hashmaps?

Hashmaps are indeed a good data structure to answer whether an element exists in the set or not. But it comes at a cost. If there are 1M unique elements in data storage, then we are required to store all of them to answer whether particular elements exist or not. However, with bloom filters, we can answer the same query in a memory-efficient way (Though it will be a probabilistic answer)

The base data structure employed by a Bloom filter is a Bit Vector of n bits. To demonstrate this here is a bit vector of 10 bits (with all elements as zero) :

Credits for Image: GeeksforGeeks

Now each element we will pass through some m number of fast hash functions. hash1(x), hash2(x), hash3(x)..hashm(x)

hash1(geeks)% n =1

hash2(geeks)%n =4

hash2(geeks)%n =7

So ‘geeks’ results in bits set at places 1, 4 and 7

Credits for Image: GeeksforGeeks

Similarly, for“nerd”, hashes are as following :

hash1(nerd)% n =3

hash2(nerd)%n =5

hash2(nerd)%n =1

we set the bits at indices 3, 5 and 4 to 1. Those bits which are already set to 1, are left as it is.

Credits for Image: GeeksforGeeks

Now let’s say we get a call to answer whether ‘cat’ exists in datastore or not.

hash1(cat)% n =1

hash2( cat)%n =3

hash2(cat)%n =7

Credits for Image: GeeksforGeeks

We check the bits at the index of 1, 3 and 7. Since all bits are set to 1, the bloom filter says ‘cat’ is ‘probably present’ in the data store (False positive result). Even though the cat was never added to datastore.

Instead of getting 1, 3 and 7 as the answer from hash function for ‘cat’, if we would have gotten 2, 3 and 7, then we would have answered with 100% certainty that cat doesn’t exist in the datastore(As bit is not set at index 2). This way bloom filters never gives false negative answers (Thus can be used to tell with certainty that element doesn’t exist in the data store)

Hence, if we search for a key and any of the hashed index is not set to ‘0’, then the value is definitely not in the data store. If all hashed indexes are set to 1, then key ‘may’ be present in the datastore (This is due to the fact that many different keys can make the same index set)

Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter and only upon a hit, a full check of the URL is performed

Gmail uses bloom filters to check whether an username is already taken or not from list of billions of username. With a good bloom filter implementation Gmail reduces false positives to below 1%. Thus it prevents 99% networks calls to actual datastore

Gmail also employs bloom filters to check whether password provided by user is already there in the huge set of weak passwords.Thus, presents an error to user when user tries to set a weak password

Other applications of bloom filters :

  • Databases using LSM trees, like HBase, Cassandra, and BigTable uses bloom filters to reduce searching into different segments (on disks) for the non-existent rows or columns
  • Recommendation engine of Medium uses bloom filters to filters out articles already seen by a user
  • Quora filters out stories already seen by people using bloom filters

How to implement bloom filters :

To implement a bloom filter you will require some kind of bitmap and some efficient hash functions. For the hash function, we can use MD5 algorithms which generates fairly long hash and then extract a few bits of smaller hash values.

We can also use a cryptographic hash function which provides stability and guarantee.

However, MD5 is quite slow for hash calculation and the cryptographic hash function is expensive. In real-world we use non-cryptographic hash functions that provide good performance. Faster hash function implementations are murmur, the fnv series of hashes, Jenkins hashes, and HashMix.

A simple bloom filter implementation in java can be found here.

Don’t forget to clap if you liked this article!

References :

--

--