Understanding Bloom Filter Part-I

Abhishek Jha
Analytics Vidhya

--

This is a two-part article. In the first part we will try to develop the intuition behind the Bloom Filter. In the second part we will use the learnings from the first part to build the Bloom Filter data structure.

Motivation

In fairly complex distributed systems we encounter problems that can be reduced to checking whether an element exists in a set. A Bloom Filter is a data structure that helps solve exactly this problem.
Here is one such application of Bloom Filters, from Wikipedia.
Akamai noticed that roughly 75% of web objects were downloaded only once. They decided that it would be advantageous to cache an object the second time it was downloaded, rather than the first, and they used a Bloom Filter to detect that second download. Because of the Bloom Filter, the cache check can be done without having to store the contents, or even hashes of the contents.

Use case for this article

Let's define a sample use case for the purpose of this article.

A user will enter a proposed password and our system should quickly respond whether that proposed password is acceptable or not.

The first thought could be to use a RegEx to conform passwords to certain rules. This can help give some structure to passwords, but there could still be unacceptable passwords that are too easy to guess.
For example, these might be words that appear in the dictionary of a language. So for the sake of this article we can assume that there is a defined list of passwords which are not acceptable.

Hence the requirement translates to maintaining a database of unacceptable passwords and quickly checking whether a proposed password is in that database or not.

Why unacceptable passwords? Why not maintain acceptable passwords?
The universe of possible passwords is too huge to maintain. For example, if we simply look at passwords as strings of length N, then this set has size 52^N (taking only uppercase and lowercase English letters).
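To get a sense of how quickly 52^N blows up, here is a quick back-of-the-envelope calculation in Python (the lengths 8, 10 and 12 are arbitrary values chosen purely for illustration):

```python
# Rough size of the universe of purely alphabetic passwords of length N,
# using only the 52 upper- and lower-case English letters.
for n in (8, 10, 12):
    print(f"N = {n:2d}: 52^N ≈ {52 ** n:.2e}")
```

Even for short passwords the universe runs into tens of trillions of strings, so maintaining the set of acceptable passwords is clearly off the table.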

Formalising Requirements

Let's formalise the requirement.
U represents the universe of possible passwords.
S ⊂ U and represents the set of unacceptable passwords.
Let's assume X is a proposed password in this setting.

So, to answer whether X is an unacceptable password, we have to answer the query: for X ∈ U, does X ∈ S?

Now we want to build a data structure or hashing scheme, which answers these queries quickly.

The intuition behind Bloom Filter

Traditional Approach

Let's first look at how a traditional hashing scheme, known as chain hashing, works in this setting.
In order to maintain the set S, we're going to use a hash table H of size n. In chain hashing, H is an array of n linked lists, and H[i] is the linked list of those elements (unacceptable passwords) whose hash value is exactly i.
We're going to use a hash function h(X) which maps elements of U to indices of H. We'll assume that h(X) maps to a random index, and moreover that this random mapping is independent of all other hashes. So where h(X) maps to is independent of where any other element of the universe maps to.

Now, to insert an element into the subset S, we simply compute its hash value index and add the element to the linked list at that index.

To do a query, we simply go to the element's hash value index and look through the linked list there to check whether the element is present or not.
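Here is a minimal sketch of chain hashing in Python. The class name, the use of Python's built-in hash, and the fixed table size are assumptions made purely for illustration; the scheme only requires some hash function h(X) mapping passwords to table indices.

```python
class ChainHashTable:
    """Chain hashing: table[i] is the chain of all elements whose hash value is i."""

    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]  # n buckets, each list playing the role of a linked list

    def _index(self, x):
        # Stand-in for the random hash function h(X) from the article.
        return hash(x) % self.n

    def insert(self, x):
        # Add x to the chain at its hash index.
        self.table[self._index(x)].append(x)

    def contains(self, x):
        # Walk the chain at x's hash index; time is proportional to the chain length.
        return x in self.table[self._index(x)]


# Usage: maintain the set S of unacceptable passwords.
bad_passwords = ChainHashTable(n=1024)
bad_passwords.insert("password123")
print(bad_passwords.contains("password123"))  # True
print(bad_passwords.contains("correcthorse"))  # False, since it was never inserted
```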

Now, if you think of the hash table indices as bins and the elements of S as balls, then what this hash function is doing is assigning these balls to random bins. We will use this analogy later to figure out the probability of the maximum number of balls in a single bin, a.k.a. the max load.

Now it will be useful to have a little bit of notation before we move on.
This set U is huge, and we’ll denote its size by capital N.
The size of the hash table is n.
We'll use m to denote the size of the database S that we're maintaining.
And typically our hash table is at least as large as the database we're maintaining. To summarise:

|U| = N >> |H| = n ≥ |S| = m

So n is at least m, and our goal of course is to keep the hash table not much larger than m.

Time Analysis

Let’s look at the query time.
How long does it take us to answer a query of the form: is X in the subset S?
In order to answer this query, we look at the hash table at index i = h(X), that is at H[i], and then we go through that entire linked list to check whether X is in it.
So the query time is proportional to the size of this linked list.

What’s the size of this linked list?
Remember the balls-into-bins analogy: this is similar to finding the max load. Roughly, the max load is O(log(n)) with high probability. A tighter bound exists, but for simplicity we have chosen a loose one. You can read more about it here. Of course, in the worst case it might be O(n), but that's an unlikely event.

The time it takes us to answer a query is proportional to the load at the hash value. With high probability the max load is O(log(n)), which means the query time is O(log(n)) with high probability. Now, when n is huge, O(log(n)) might be too slow for us.
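To make this concrete, here is a small simulation, a sketch under the assumption that hash values behave like uniformly random bin choices, which throws n balls (elements) into n bins (table indices) and reports the maximum load, i.e. the longest chain:

```python
import random

def max_load_single_choice(n, trials=20):
    """Throw n balls into n random bins; return the max load averaged over several trials."""
    total = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(n):
            loads[random.randrange(n)] += 1  # each ball picks one random bin
        total += max(loads)
    return total / trials

for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: average max load ≈ {max_load_single_choice(n):.1f}")
```

The max load grows slowly with n, consistent with the logarithmic-style bound above, but it is clearly more than a constant.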

Improving Query Time

So how can we achieve faster query time?
Well, one way is to increase the size of our hash table. To decrease the max load from O(log(n)) to O(1), we can increase the size of the hash table from O(m) to O(m²): with m² bins and only m elements, a birthday-paradox argument shows that with constant probability no two elements collide at all, so every chain has constant length.
Now, that's quite a large price to pay in space.

So let’s see if there are simpler ways to achieve reductions in the query time.

The power of two choices

What we're going to do now is, instead of using a single hash function, choose a pair of hash functions h1 and h2, each of which maps elements of U (our possible passwords) into our hash table of size n. We will again assume that these hash functions are random, so each element X in the universe of possible passwords maps to a random index of the hash table: h1(X) is random, h2(X) is random, and these are independent of each other and of the other hash values.

The first question is, how do we insert an element (a possible password) into our dictionary of unacceptable passwords?
Let's assume that in our hash table we also maintain the size of the linked list at each index. The first thing we do is compute the two hash values h1(X) and h2(X). Then we determine which of the two indices is less loaded by comparing their sizes, add X to the linked list at that index, and increment that list's size. All of this can be done in O(1) time per insertion.

The next question is, how do we query whether an element Y (a proposed password) is in our dictionary of unacceptable passwords?
We start the same way as an insertion: we compute the two hash values h1(Y) and h2(Y). These are the two possible locations for Y, and we have no way of knowing which of the two it might be in, if it is there at all.
So we check the linked lists at both h1(Y) and h2(Y) and look for Y in each of them. If it's in either of these linked lists then we know that Y is in the dictionary; otherwise Y was never inserted into the dictionary of unacceptable passwords.
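Below is a minimal Python sketch of this two-choice scheme. Simulating the two hash functions by salting SHA-256 is an illustrative assumption, not part of the scheme; any pair of independent hash functions into the table would do.

```python
import hashlib

class TwoChoiceHashTable:
    """Power of two choices: each element joins the less loaded of its two candidate chains."""

    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]
        self.sizes = [0] * n  # load (chain length) at each index

    def _h(self, x, salt):
        # Simulated hash functions h1 (salt=1) and h2 (salt=2).
        digest = hashlib.sha256(f"{salt}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.n

    def insert(self, x):
        i1, i2 = self._h(x, 1), self._h(x, 2)
        i = i1 if self.sizes[i1] <= self.sizes[i2] else i2  # pick the less loaded bucket
        self.table[i].append(x)
        self.sizes[i] += 1

    def contains(self, y):
        # y could live at either candidate index, so check both chains.
        return y in self.table[self._h(y, 1)] or y in self.table[self._h(y, 2)]


bad_passwords = TwoChoiceHashTable(n=1024)
bad_passwords.insert("qwerty")
print(bad_passwords.contains("qwerty"))     # True
print(bad_passwords.contains("tr0ub4dor"))  # False, since it was never inserted
```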

So how long does it take to do a query ?

The query time now depends on the loads at h1(Y) and h2(Y).
So if we have an upper bound on the maximum load, then the query time is at most twice the maximum load. Now if m equals n, i.e. the size of our dictionary of unacceptable passwords and the size of our hash table are the same,
then the query time is going to be O(log(log(n))). So just by changing from one hash function to a pair of hash functions, our query time drops dramatically from O(log(n)) to O(log(log(n))), and there is no extra cost in terms of space. This is a substantial gain because log log n is quite small even for very large n; it is almost like O(1), a very small quantity.

After seeing this result, you might say, "Well, why stop at two random hash functions? Let's choose three random hash functions and maybe we'll get log log log n."

Well, it turns out that the big gain is from one to two, and after that there's not much further gain. In particular, if you choose d random hash functions (for d at least 2) and assign each element to the least loaded of its d candidate linked lists, then the max load is going to be O(log log(n)/log(d)). So the improvement as d grows is very small.
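To see this diminishing return empirically, here is a short simulation, again under the assumption that hash values behave like independent uniform choices, comparing the max load when each of n balls joins the least loaded of d randomly chosen bins:

```python
import random

def max_load(n, d, trials=10):
    """Average max load when each of n balls joins the least loaded of d random bins."""
    total = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(n):
            best = min((random.randrange(n) for _ in range(d)), key=lambda i: loads[i])
            loads[best] += 1
        total += max(loads)
    return total / trials

n = 100_000
for d in (1, 2, 3):
    print(f"d = {d}: average max load ≈ {max_load(n, d):.1f}")
```

The drop from d = 1 to d = 2 is dramatic, while going from d = 2 to d = 3 barely changes the result, which matches the O(log log(n)/log(d)) bound.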

Now we can finally describe Bloom Filters, which is what the second part of this article does.
