Nerd For Tech
Published in

Nerd For Tech

Bloom filter: No means NO!, Yes means a Maybe.

Liked the title?

I could not think of a simpler way to put probabilistic behavior of a Bloom filter. In technical terms, Bloom filter is a

  • Space efficient
  • Probabilistic
  • set membership checking
  • data structure

Lets go one by one.

Set membership Checking: Bloom filter tells us whether an item is in the set of items. It does so in constant space and time O(k) where k is number of hash function. It does not need to store the entire set itself.

Probabilistic: It may or may not guarantee item’s presence. But it can guarantee its absence.

More on this later.

Space efficient: It does not store the data itself. It uses bit array to do so. E.g. you can represent 10 million records with around 40MB of space.

Data structure: We all know what a data structure is. :)

A Bloom filter uses an array of bits, initially all set to zero. Bloom filter utilizes hash functions to set the bit value of the bit-array. Hash functions used, need to be fast and independent like murmur and FNV. Many implementation have MD5, HashMix functions as well.

When we add an key in bloom filter, the key is hashed by multiple function and then corresponding output values are used to set the bit at that position to 1. While we are checking for existence of key, we again Hash the key and check whether all bits(given by hash functions) are set to 1 or not. If any of the bits are set to 0, it means key is not present.

Lets do a quick walk-through of Bloom filter using murmur2 and FNV hashing function and see how we get a false positive and No false negative. We create a 32-bit Array all elements set to 0.

We pass the string “amitsinghrathore” to murmur 2 and FNV functions which generate indexes where the bit needs to be set one. In this case 8, 16. Similarly we add “amit” and it returns the index 12 and 0. So we set the bit to 1 there.

We added few more values.

amitsinghrathore → [8,16], amit → [12,0], amitsingh → [1,31], amitrathore → [4,13], singh → [6,28], rathore → [24,6], amitkumar → [7,4])

We end up with the following array.

Now lets check what will happen if we look for a non-existing string.

arathore → [15, 19] If we look at [15] and [19] in the array we will find values [0,0]. Since we get a zero in the list Bloom filter can say with confidence that this name does not exit.

Lets check akumar.

akumar → [24,6]. In the array both 24th and 6th position contains 1. [1,1] results in bloom filter saying it maybe present, although we never added this key/name. This is called false positive. This results due to collision in the hash function.

Probability of false positive in Bloom Filter is, P = (1- (1–1/m)^kn)^k.

Where n is the number of elements stored in the array. m is the size of array and k is number of hash functions used.

Using the concepts of definite no, many application use Bloom filter to check for non existence and reduce disk read overheads. One such example is Cassandra. It uses bloom filter so that it can check whether it needs to scan the SSTables (which are on disk) or not. If bloom filter says it cant find the key then application is sure that it does not need to scan the SSTables. This reduces the disk scan operation overhead.

Bloom filter memory requirement, number of hash functions and false positive probability are dependent on each other.

m/n = -1.44 *log2(Probability of false positive)

To try out various combinations on them try this.

Links:

Python Code:

pip3 install bloom-filterfrom bloom_filter import BloomFilterblm_fltr = BloomFilter(max_elements=1000, error_rate=0.1)
assert "user1" in blm_fltr is False
blm_fltr.add('user1')
assert "user1" in blm_fltr is True
blm_fltr.add('user2')

Hope this helps.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store