Bloom filters

Dhruv
Programming Club, IIT Kanpur
Dec 22, 2022

Let us begin by imagining a situation: this blog becomes really popular! So popular that thousands of users at a time send queries to fetch it. A single centralised server may then run into latency issues. That's where proxy servers come into the picture.

Proxies act as a gateway between the user and the source server, storing (or caching) the server's resources. When a user requests a resource, the proxy checks whether it holds a recent cached copy and, if so, delivers it immediately. If not, the resource is retrieved from the origin web server, and if the proxy receives many requests for the same resource, it caches it as well.

This blog isn’t about proxy servers.

How does a proxy that caches millions of queries know that a client's query is not present in the cache, so it can forward the request to the ISP, and do so in O(1) time and as space-efficiently as possible?

To solve this, a special data structure called a Bloom filter is used. Its primary use case is to test whether an element (a query) belongs to a set (millions of cached queries).

Sounds like something that could easily be done with hashmaps, BSTs, or even tries if we are dealing with strings!

Let's take a look at some use cases where Bloom filters can be put to work, to get a clear understanding of our requirements:

  1. Weak password detection: A list of weak passwords can be maintained, and the user can be warned while creating or updating a password in case of a match, to prevent attacks such as dictionary attacks. A password not in the list is at least not a known weak one.
  2. Safe browsing: Browsers maintain a very long list of malicious URLs. Every time we try to fetch a new URL, we need to verify that it does not belong to the list.
  3. Duplicate usernames: While creating a new account on social media, we often have to pick a different username because the one we wanted is already taken.
  4. Databases: Disk lookups for non-existent rows or columns are costly and need to be avoided; doing so considerably improves the performance of database query operations.
  5. Tracking user traffic: Checking whether a visitor has ever visited before (is it a new or an existing user?).
  6. Proxy caching: There are multiple proxies in a network. When a user accesses a website, a proxy intercepts the request and responds on behalf of the origin server. All proxies cache resources of some websites, but they don't necessarily cache the same resources. A proxy hashes the URLs of its cached resources and maintains a record of them. All proxies then share their URL records among themselves, in effect announcing "I have cached these websites".
    Consider a request for 'A.com'. The GET request goes to the nearest proxy, P1, which checks its cache record. If P1 doesn't have the cache, it looks for a proxy in the network that has cached 'A.com'. If no proxy has it, the request is sent to the ISP. This scales well when there are millions of identical requests. The catch is that the cached internet history is large, and we want our lookups to be fast.
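The proxy-caching scenario above can be sketched in a few lines. This is a hypothetical illustration, not a real proxy protocol: the `Proxy` class, the toy parameters `M` and `K`, and the use of salted SHA-256 as the hash family are all assumptions made for the sketch. Each proxy advertises a small bit array instead of its full URL list, and peers consult those arrays before going to the ISP.

```python
# Hypothetical sketch: each proxy advertises a Bloom-filter-style bit array of
# its cached URLs instead of the full URL list. Toy sizes; not a real protocol.
import hashlib

M, K = 1000, 4  # bit-array size and number of hash functions (illustrative)

def positions(url):
    # Derive K array positions by re-hashing the URL with K different salts.
    return [int(hashlib.sha256(f"{i}:{url}".encode()).hexdigest(), 16) % M
            for i in range(K)]

class Proxy:
    def __init__(self, cached_urls):
        self.filter = [0] * M
        for url in cached_urls:
            for p in positions(url):
                self.filter[p] = 1

    def might_have(self, url):
        # False means "definitely not cached"; True means "probably cached".
        return all(self.filter[p] for p in positions(url))

p1 = Proxy(["A.com", "B.com"])
p2 = Proxy(["C.com"])

# A proxy without the resource scans its peers' filters before asking the ISP.
candidates = [p for p in (p1, p2) if p.might_have("A.com")]
```

Sharing a 1000-bit array per proxy is far cheaper than broadcasting every cached URL, which is exactly the space argument developed below.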

Notice the general pattern above: we are given a large pool of data, which may be a list of billions of users or passwords, and we are only concerned with whether some element exists in that pool or not.

Hashmaps offer lookups in a very convenient O(1) time but are less space-efficient, and a strong hash function is needed to minimise collisions; in practice they don't scale well when the data is immense.

Let’s see how bloom filters overcome this.

Working of a Bloom filter:

Start by considering an array of m bits, all set to 0. We have k hash functions (each producing a uniform random distribution), each of which hashes an element to one of the indices of the array (i.e. 0 to m−1).

To add an element, we feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), we feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted.

For example, say we have 3 hash functions; they map each element of the set {x, y, z} to three indices each. When we query for 'w', at least one of w's positions is not set, hence 'w' is definitely not present.

Now consider querying for 'u', which is not in our set but happens to have all of its bit positions set. The query gives a 'false positive' in this case.

The above example makes one thing clear about a simple Bloom filter:

  1. If a query response is negative, then the element is definitely not in the set.
  2. In the case of a positive response, we can't be sure: it may be a false positive.

That’s why bloom filters are known as probabilistic data structures.

Now, on to why Bloom filters are preferred over hashtables for certain tasks:

Our primary need is to verify, quickly and space-efficiently, whether something does not exist in a very large pool of data.

  1. Space utilisation: Hashtables store the elements themselves in a bucket array, whereas Bloom filters use a bit array that only records whether elements exist (it is just an array of 0s and 1s). They fit well for tasks concerned only with verifying the absence of an element. The number of bits in the filter is up to us.
  2. Collision handling: In hash tables we must minimise collisions, so we need a strong hash function (or several), or we must resolve collisions with probing. A Bloom filter uses many hash functions but has no collisions to handle, which can make it computationally lighter.
  3. A simple implementation of Bloom filters in Python (using the mmh3 MurmurHash3 package):

import mmh3  # MurmurHash3 bindings; pip install mmh3

class BloomFilter:
    def __init__(self, m, k):
        # Initialise a bit array of size m and fix the number of
        # hash functions, k.
        self.m = m
        self.k = k
        self.bit_array = [0] * m

    def add(self, item):
        # Use k different seeds of MurmurHash3 as our k hash functions.
        for i in range(self.k):
            index = mmh3.hash(item, i) % self.m
            self.bit_array[index] = 1

    def query(self, item):
        for i in range(self.k):
            index = mmh3.hash(item, i) % self.m
            if self.bit_array[index] == 0:
                return False  # definitely not in the set
        return True  # probably in the set

if __name__ == '__main__':
    bloom = BloomFilter(100, 10)
    data = ['java', 'cpp', 'javascript', 'kotlin', 'swift', 'rust', 'ruby',
            'agile', 'algorithm', 'application', 'adaptive', 'bootstrap',
            'backend', 'github', 'http', 'jQuery', 'python', 'ssl', 'version']
    for word in data:
        bloom.add(word)

    test_data = ['swift', 'version', 'emaad', 'scala']
    for word in test_data:
        if bloom.query(word):
            if word in data:
                print('{} is in the bloom filter, as expected.'.format(word))
            else:
                print('{} is not in the entered data, but the bloom filter says otherwise: false positive!'.format(word))
        else:
            if word not in data:
                print('{} is in neither the entered data nor the bloom filter, as desired.'.format(word))

Probability of false positives:

To improve our filter, we want to reduce the false positives.

Some intuition first:

  1. If m is really large, the set bits are spread out enough that the hashes of a non-existent element are unlikely to all land on set bits, decreasing false positives.
  2. Conversely, if we have only one bucket, every element hashes to that one bucket and the chance of error becomes 100%.
  3. Too many hash functions with a limited m will increase the number of bits set per element, raising the risk of false positives. A single hash function, on the other hand, leads to too many collisions, especially while querying: a false positive is triggered only if all of the inspected bit positions are set, and the number of positions inspected equals the number of hash functions.
The false positive probability comes out to be approximately:

p ≈ (1 − e^(−kn/m))^k

You may refer to this video for the derivation.

Setting m = 1 gives a false positive probability of 1, whereas letting m tend to infinity drives the probability to 0.
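The effect of m on the formula above is easy to check numerically. The sketch below evaluates p = (1 − e^(−kn/m))^k for a few bit-array sizes with n and k held fixed; the particular values of n, k, and m are chosen just for illustration.

```python
# Evaluate the false-positive probability p = (1 - e^(-k*n/m))^k
# for a few bit-array sizes, holding n (elements) and k (hashes) fixed.
import math

def false_positive_rate(m, n, k):
    return (1 - math.exp(-k * n / m)) ** k

n, k = 1000, 7
for m in (2000, 5000, 10000, 20000):
    print(f"m={m}: p ~ {false_positive_rate(m, n, k):.4f}")
```

As expected, p falls monotonically as m grows: spreading the same set bits over a larger array makes an all-bits-set coincidence rarer.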

To optimise k, we need to minimise the above expression. Differentiating it with respect to k, we find the expression for the optimal k:

k = (m/n) ln 2

where 'n' is the number of elements in our set.

Now that we have the relation for the number of hash functions we should use, how do we decide on m for our filter? A general convention is to keep at most 70-80 per cent of the bits set. By choosing our desired false positive probability, we can find m.
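Putting the two relations together gives a small sizing helper. Solving the false-positive formula for m at the optimal k yields the standard result m = −n·ln(p)/(ln 2)², from which k = (m/n)·ln 2 follows; the function name `size_filter` and the example numbers are my own.

```python
# Size a filter from n (expected elements) and p (target false-positive rate):
#   m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
import math

def size_filter(n, p):
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = size_filter(1_000_000, 0.01)  # 1M elements, 1% false-positive target
print(m, k)  # roughly 9.6 million bits (about 1.2 MB) and 7 hash functions
```

Note how cheap this is: about 9.6 bits per element for a 1% error rate, regardless of how long the stored URLs or passwords actually are.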

One major issue we have not addressed in this blog is that simple Bloom filters lack a delete operation! This matters for use case 6, proxy caching: caches are updated frequently, and data that is not accessed often should be deleted, but we can't delete by simply clearing a bit, as it may also be the hash of some other element. To solve this, counting Bloom filters are used.
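A counting Bloom filter replaces each bit with a small counter: add increments the counters, delete decrements them, and a position counts as "set" while its counter is positive. A minimal sketch, again with salted SHA-256 standing in for the hash family (the class and method names are illustrative):

```python
# Sketch of a counting Bloom filter: each slot holds a counter instead of a
# bit, so remove() can undo exactly what add() did.
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _positions(self, item):
        # k positions derived from salted SHA-256 (stand-in hash family).
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
                % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item):
        # Only safe for items previously added; otherwise counters underflow.
        for p in self._positions(item):
            self.counts[p] -= 1

    def query(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))

cbf = CountingBloomFilter(1000, 4)
cbf.add("A.com")
cbf.remove("A.com")   # now query("A.com") is definitely False again
```

The price is space: each slot needs a few bits for its counter (4 bits is a common choice) instead of one, and removing an item that was never added silently corrupts the filter.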

In conclusion, the most significant advantage of Bloom filters over other data structures such as self-balancing trees, tries, and hashmaps is space utilisation: the others store each element in its entirety, which can require anywhere from a few bits to several bytes per element. Hence Bloom filters are used to effectively speed up answers in key-value storage systems. If this blog was recommended to you by Medium itself, Medium may well have used a Bloom filter to make sure you don't get recommended a blog you've already read.

References:

Wikipedia
https://www.youtube.com/watch?v=bgzUdBVr5tE

Authors: Dhruva Singh Sachan
Editor: Parinay Chauhan
