Bloom filter: No means NO!, Yes means a Maybe.
Liked the title?
I could not think of a simpler way to put probabilistic behavior of a Bloom filter. In technical terms, Bloom filter is a
- Space efficient
- set membership checking
- data structure
Lets go one by one.
Set membership Checking: Bloom filter tells us whether an item is in the set of items. It does so in constant space and time O(k) where k is number of hash function. It does not need to store the entire set itself.
Probabilistic: It may or may not guarantee item’s presence. But it can guarantee its absence.
Space efficient: It does not store the data itself. It uses bit array to do so. E.g. you can represent 10 million records with around 40MB of space.
Data structure: We all know what a data structure is. :)
A Bloom filter uses an array of bits, initially all set to zero. Bloom filter utilizes hash functions to set the bit value of the bit-array. Hash functions used, need to be fast and independent like murmur and FNV. Many implementation have MD5, HashMix functions as well.
When we add an key in bloom filter, the key is hashed by multiple function and then corresponding output values are used to set the bit at that position to 1. While we are checking for existence of key, we again Hash the key and check whether all bits(given by hash functions) are set to 1 or not. If any of the bits are set to 0, it means key is not present.
Lets do a quick walk-through of Bloom filter using murmur2 and FNV hashing function and see how we get a false positive and No false negative. We create a 32-bit Array all elements set to 0.
We pass the string “amitsinghrathore” to murmur 2 and FNV functions which generate indexes where the bit needs to be set one. In this case 8, 16. Similarly we add “amit” and it returns the index 12 and 0. So we set the bit to 1 there.
We added few more values.
amitsinghrathore → [8,16], amit → [12,0], amitsingh → [1,31], amitrathore → [4,13], singh → [6,28], rathore → [24,6], amitkumar → [7,4])
We end up with the following array.
Now lets check what will happen if we look for a non-existing string.
arathore → [15, 19] If we look at  and  in the array we will find values [0,0]. Since we get a zero in the list Bloom filter can say with confidence that this name does not exit.
Lets check akumar.
akumar → [24,6]. In the array both 24th and 6th position contains 1. [1,1] results in bloom filter saying it maybe present, although we never added this key/name. This is called false positive. This results due to collision in the hash function.
Probability of false positive in Bloom Filter is, P = (1- (1–1/m)^kn)^k.
Where n is the number of elements stored in the array. m is the size of array and k is number of hash functions used.
Using the concepts of definite no, many application use Bloom filter to check for non existence and reduce disk read overheads. One such example is Cassandra. It uses bloom filter so that it can check whether it needs to scan the SSTables (which are on disk) or not. If bloom filter says it cant find the key then application is sure that it does not need to scan the SSTables. This reduces the disk scan operation overhead.
Bloom filter memory requirement, number of hash functions and false positive probability are dependent on each other.
m/n = -1.44 *log2(Probability of false positive)
To try out various combinations on them try this.
Bloom filter calculator
Calculate the optimal size for your bloom filter, see how many items a given filter can hold, or just admire the curvy…
A Bloom filter is used to create a probabilistic guess on whether an item is in a data structure, and was created…
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising…
pip3 install bloom-filterfrom bloom_filter import BloomFilterblm_fltr = BloomFilter(max_elements=1000, error_rate=0.1)
assert "user1" in blm_fltr is False
assert "user1" in blm_fltr is True
Hope this helps.