Understanding Bloom filters - Part II

Abhishek Jha
Analytics Vidhya

--

We’re going to describe a new data structure that has faster queries.
In the traditional hashing scheme that we previously described,
the query time was O(log(n)) for the simple scheme or O(log(log(n))) for the more advanced scheme which used the power of two choices.

Here we’re going to achieve query time O(1). So constant query time, and this is guaranteed. Recall that the earlier query times were probabilistic statements; in the worst case they could be O(n).

This data structure will be very simple, and it will use less space than before.
There are no linked lists or anything like that; it’s just a simple binary array. So there are a lot of benefits: it’s simpler, uses less space, and has faster queries.

Sounds too good, what’s the catch?

Now there must be some cost for this simplicity and this faster time.
Well, this scheme is not always correct. Occasionally, there are false positives, and this happens with some probability that we’ll analyse.

What exactly do we mean by false positive?
We have an element X which is not in our dictionary of unacceptable passwords. So this is an acceptable password, but our algorithm occasionally says: yes, this X is in the dictionary.

In this setting false positives are acceptable. Why?
Because we have an acceptable password but we say that the password is unacceptable, that it’s in our dictionary. So somebody types in a password and we say, no, that’s not allowed, when ideally it should have been allowed.
So then the user has to enter a new password.
But in exchange for these false positives we have guaranteed query time. So we quickly answer the question of whether a password is acceptable or not.

A false negative, on the other hand, would have been a big cost and is unacceptable in this setting. When we have an unacceptable password, we definitely want to say it’s unacceptable.

So in this setting it’s reasonable to have false positives with some small probability that we’ll try to bound. In other settings it may be unacceptable to have false positives, in which case Bloom filters might be a bad idea.
So this is not a universal scheme. You have to look at your setting and determine whether the benefit of a simpler and faster scheme is worth the cost of having false positives.
Is it acceptable to have false positives with some small probability?

Requirements

What are the basic operations that our data structure is going to support?
First operation is insert X.
So given a possible password X, we want to add this password into our dictionary of unacceptable passwords.

The second operation is a query on X,
If this proposed password is in our dictionary of unacceptable passwords, then we’re always going to output YES, so we’re always correct in this case.
The problem is when this proposed password is not in our dictionary, i.e., it is an acceptable password. Then we usually output NO, and we have to bound what we mean by usually.
But occasionally, we’re going to output YES. So when this proposed password is acceptable, occasionally we’re going to have a false positive and say YES it is in the dictionary of unacceptable passwords, so this password is not allowed.
So we have false positives with some small rate and we have to bound that rate and see what it looks like.

Understood the catch. Let’s define the data structure

The basic data structure is simply a binary array H of size n. That’s the whole data structure.

We’re going to start off by setting H to all zeros. So all of the n bits are set to zero. As before, we’re going to use a random hash function which maps elements of the universe of possible passwords into our hash table of size n.

How do we insert an element X of possible password into our dictionary of unacceptable passwords?
First off we compute its hash value, then we set the bit in this array to 1 at that hash value. So we compute h(X) and we set H[h(X)] = 1.
Now it might already be 1, in which case we’re not doing anything.
So the bits only change from zeros to ones. We never change them back from ones to zeros.
That’s one of the limitations on this data structure. There is no easy way to implement deletions, because we never change bits from ones to zeros.

Now how do we do a query?
How do we check whether an element X is in our dictionary S?
We compute the hash value of X, and we check the array.
If the bit at this hash value is 1, then we output yes. We believe it is in the dictionary S.
If it’s 0, then we’re guaranteed that no it’s not in the dictionary.
Because if it’s zero that means we definitely did not insert it.
If it’s 1, then we think, we might have inserted it but we’re not sure.
Some other element might have been inserted at that hash value, and we have no way of checking whether it’s X that was inserted at the hash value.
Because we’re not maintaining a linked list at this point.
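Putting the insert and query operations together, here is a minimal sketch of this single-hash version in Python. The class name and the SHA-256-based hash are just illustrative choices standing in for the random hash function h assumed above:

```python
import hashlib

class SimpleBloom:
    """Single-hash version: just a binary array H of size n."""

    def __init__(self, n):
        self.n = n
        self.H = [0] * n                      # all n bits start at zero

    def _h(self, x):
        # Illustrative hash: map x into {0, ..., n-1} via SHA-256.
        return int(hashlib.sha256(x.encode()).hexdigest(), 16) % self.n

    def insert(self, x):
        self.H[self._h(x)] = 1                # bits only ever go from 0 to 1

    def query(self, x):
        # 1 means "maybe in S" (possible false positive); 0 means definitely not in S.
        return self.H[self._h(x)] == 1
```

Note that there is no delete operation, matching the limitation described above.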

Why do these false positives happen?

We have some element X which we do a query on. It’s not in our dictionary of unacceptable passwords, but there is some other element Y which is in our dictionary, and these two elements X and Y have the same hash value, h(X) = h(Y). When we inserted Y into our dictionary, we set the bit at this position to 1. Then when we do the query on X, this bit is already 1. So as far as we know, X might be in our dictionary, and we have to output yes because it might be there. But in fact the answer is no, because X was not inserted; Y was inserted with the same hash value. That’s how false positives arise.

Sounds like an issue

How can we improve it?
Well we can try to use our power of two choices idea that we used before in our traditional hashing scheme. So what are we going to do?
Well instead of using one hash function, we’re going to use two hash functions.
Now in the traditional scheme, there was a big gain from going from one hash function to two, but going from two to three or three to four was not much of a gain. Here, however, the setting is slightly different: there is a big gain going from one to two, and even from two to three there might be a gain. It’s not clear how many hash functions to use, so we’re going to try to optimize the choice of the number of hash functions.
So we’re going to allow, instead of two hash functions, k hash functions.
So we want to generalize this scheme to allow k hash functions and then we’re going to go back and figure out what is the optimal choice of k, the number of hash functions.

Let’s look at the more robust setting where we allow k hash functions, and how do we modify this data structure to accommodate k hash functions.

So now we have k hash functions instead of just one, h1, h2, up to hk.
We’re going to initialize our hash table H to all zeroes.
So all of the bits, the n bits of H are set to zero.

How do we add an element X?
Previously we computed this hash value h(X) and we set that bit to 1.
What are we going to do now?
Now we compute the k hash values and we set all of those k bits to 1.
So we iterate through the hash functions 1 through k, compute each hash value, and set that bit to 1. It might already be 1, as before, but we only ever change bits from 0 to 1, just like before.

How do we check whether an element X was inserted into our dictionary S?
We compute its k hash values and check whether all of those k bits are set to 1. If all of those k bits are set to 1, then our best guess is that X was inserted into the database S. If any of them is still 0, then we’re guaranteed that X was not inserted into the database.
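Here is how the generalised version might look in Python: a sketch that simulates the k hash functions h1, ..., hk by salting SHA-256 with the index i (again, just an illustrative stand-in for truly random hash functions):

```python
import hashlib

class BloomFilter:
    """Bloom filter: a binary array H of size n with k hash functions."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.H = [0] * n                      # all n bits start at zero

    def _hashes(self, x):
        # Illustrative stand-in for h1, ..., hk: salt SHA-256 with the index i.
        for i in range(self.k):
            yield int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % self.n

    def insert(self, x):
        for b in self._hashes(x):
            self.H[b] = 1                     # set all k bits to 1

    def query(self, x):
        # YES only if all k bits are 1; any 0 means x was definitely never inserted.
        return all(self.H[b] == 1 for b in self._hashes(x))
```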

Let’s take a look at the correctness of this algorithm for our queries.
Suppose X was inserted into our database S, and we do a query on X. What do we output?
Well, when we inserted X into the database, we set all of these k bits to one.
So when we do a query, we’re guaranteed that all of these bits are set to one, and so we’re going to output Yes, because none of the bits ever change from ones to zeros.
Bits only change from zeros to one. It’s a one directional process. So if X was inserted into the database, when we do a query on X, we always output Yes. It is in the database.

Now, suppose X was not inserted into the database and we do a query on X.
Sometimes, we might say yes, we believe it’s in the database.
In which case, we get a false positive. We falsely say that, yes, it’s in the database.

How can this occur?
This can occur if all of the k bits for X were set to one by other insertions.
Consider the ith hash value of X. For this bit to be one, there must be some element Z which was inserted into the database S and one of whose k hash values lands exactly on this bit.

Which of the k bits for Z?
Let’s say the jth one. So the jth hash value of Z matches the ith hash value of X.
In other words, h_j(Z) = h_i(X).
This means that when Z was inserted into the database, we set this bit, the one matching the ith hash value of X, to one.
And if this is true for every one of the k bits of X, so that all k bits of X were set to one by other insertions, then we’re going to get a false positive on X.
So this scheme has this extra robustness or redundancy.
In order to get a false positive, we need all of these k bits to be set to one by some other insertions, whereas the previous scheme only had one bit which we were checking.
Now we have k bits which need to be set to one in order to get a false positive.
So it seems like things improve as k increases. But in fact there’s an optimal choice of k.

If k gets too large, the false positive rate starts to shoot up again. Why is that?
Well, if k is huge, then for every insertion you’re setting k bits to one.
That means each of the elements in S has many bits set, many choices of j, so it’s more likely that one of them matches up with one of the k bits of X.
So if k is too large, every insertion is setting too many bits to one. And if k is too small, when we do the query on X, we’re checking too few bits.
So there’s some optimal choice of k, not too large and not too small.

What we want to do now is more precisely analyze these false positives.
What’s the probability of a false positive?
We want to look at it as a function of k and then we can figure out the optimal choice of k in order to minimise the false positive rate.
And then we can compare and see what that false positive rate looks like to see whether this is a good data structure to use.

False Positive Analysis

As before, let m, denote the size of our database or dictionary that we’re maintaining.
And let n denote the size of our hash table.
Now, presumably, n, the size of our hash table is going to be at least the size of the database that we’re maintaining.
So, the important parameter is going to be the ratio of these sizes.
So, let c = n/m denote this ratio, the size of the hash table compared to the size of the database. c is going to be at least one.
And our goal is to try to get the smallest c possible.
The size of our hash table is c times m.

Now, for an element X which is not in our database, let’s look at the probability of a false positive for this X.

In order for this to occur, we need that the k bits h1(X), h2(X), ..., hk(X) were all set to one.
If all of these bits are one but X was never inserted into the database, then we’ll get a false positive.
So, let’s try to analyze this probability that all this k bits are set to one.
Let’s first look at a simpler problem, for a specific bit b, which ranges between 0 and n-1.

What’s the probability that this specific bit is set to one?
It will be slightly easier to look at the complementary event that this specific bit is still set to zero. So the probability that this specific bit is one is

Pr(H[b] = 1) = 1 - Pr(H[b] = 0)

Now, to analyze the probability that it’s still set to zero, what we have to do is to check that all of the insertions miss this one bit.

Now, we have m insertions. In our simple hashing scheme, these insertions correspond to throwing a ball into a bin. So, this corresponds to throwing m balls into bins. But notice for each insertion, we have k hash values that we look at. And we set k of these values to 1. So, each insertion corresponds to k balls. Therefore we’re throwing m times k balls into n bins.

In order for this bit to still be zero we need that all these m times k balls miss this specific bin, b.
So this probability that this bit is zero is equivalent to the probability that all m times k balls miss this specific bin.
For one ball, the probability that it misses the specific bin b is 1 - 1/n. Since the balls are thrown independently, the probability that all m times k balls miss bin b is

Pr(H[b] = 0) = (1 - 1/n)^(mk)

Now, this expression is not very complicated, but it will be much more convenient for us to have a slightly simpler form. Let’s manipulate it to get one.

Let’s look at the Taylor series of the exponential function:

e^(-a) = 1 - a + a^2/2! - a^3/3! + ...

This is an infinite series, and the higher-order terms shrink quickly when a is small. As a goes to zero, e^(-a) is well approximated by 1 - a, so 1 - a ≈ e^(-a) is a good approximation when a is sufficiently small. In our case a = 1/n, so this corresponds to n being sufficiently large. Let’s use this approximation to simplify our analysis of the false positive rate.

So now we have a very simple expression for the probability that a specific bit is zero:

Pr(H[b] = 0) = (1 - 1/n)^(mk) ≈ e^(-mk/n) = e^(-k/c)

where the last step uses n = cm.
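As a quick numerical sanity check on this approximation, here is a small sketch with illustrative values n = 1000, m = 100 and k = 5 (so c = 10):

```python
import math

n, m, k = 1000, 100, 5            # illustrative sizes: table, database, hash functions
c = n / m                         # ratio of table size to database size

exact = (1 - 1 / n) ** (m * k)    # Pr[bit stays 0]: all m*k balls miss this bin
approx = math.exp(-k / c)         # the approximation e^(-mk/n) = e^(-k/c)

print(exact, approx)              # roughly 0.6064 vs 0.6065, very close for large n
```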

So what’s the probability that one of these specific bits is set to one?
It’s one minus the probability that it is set to zero:

Pr(H[b] = 1) ≈ 1 - e^(-k/c)

And we want all k specific bits to be set to one, so the probability of that is this quantity raised to the power k.

To summarise, the false positive probability as a function of k is

f(k) = (1 - e^(-k/c))^k

This expression is the false positive probability. It’s not very convenient right now because we still have this k in it. Can we simplify this by eliminating k?

Can we figure out what is the optimal k in order to minimize this false positive probability?
Recall our intuition from before: we want k to be not too small, because if k is small, then when we do a query we’re checking too few bits.
But if k is too big, then when we do an insertion we’re setting too many bits to 1. So there’s some middle ground, and we want to figure out the optimal choice of k in order to minimize this false positive probability.

So what are we going to do?
We’re going to take the derivative of this function, set it equal to zero, and find the optimal choice of k that minimizes this expression.

Where does that optimum occur?
It happens at k = c ln 2. That’s c times the natural log of two.
Let’s plug this choice of k back into the expression. Since e^(-k/c) = e^(-ln 2) = 1/2, we get

f(c ln 2) = (1/2)^(c ln 2) ≈ (0.6185)^c

Recall that c is the ratio of the size of our hash table to the size of our database.
Now we have a simple expression for the false positive probability.
So if you tell me how large a hash table you’re willing to maintain, I can tell you what the false positive probability is.
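As a sanity check on this claim, here is a small sketch (using c = 10 as an illustrative ratio) that evaluates f(k) over integer values of k and confirms the minimum sits near k = c ln 2, with value roughly (0.6185)^c:

```python
import math

c = 10                                     # hash table is 10 times the database size

def f(k, c):
    """False positive probability (1 - e^(-k/c))^k."""
    return (1 - math.exp(-k / c)) ** k

best_k = min(range(1, 50), key=lambda k: f(k, c))
print(best_k, c * math.log(2))             # 7 vs ~6.93: optimum is near c*ln(2)
print(f(best_k, c), 0.6185 ** c)           # both roughly 0.008
```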

Examples with naive scheme

Let’s now look at some specific examples to see how this performs.
Let’s suppose we did the naive scheme, where k = 1.
We didn’t use the optimal choice of k; we just used a single hash function.
And let’s look at the cases where the hash table is 10 times or 100 times larger than the database, so c = 10 or c = 100.
To analyze this case where k = 1, we go back to our expression for f(k) and get

f(1) = 1 - e^(-1/c)

which is about 9.5% for c = 10 and about 1% for c = 100.
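These figures come straight from plugging k = 1 and c = 10 or c = 100 into f(k); a quick sketch of the computation:

```python
import math

for c in (10, 100):
    fp = 1 - math.exp(-1 / c)     # f(1): false positive rate with a single hash function
    print(c, round(fp, 4))        # c=10 -> ~0.0952, c=100 -> ~0.01
```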

Examples with optimal scheme

Now, suppose we use the optimal choice of k = c ln 2. Then the false positive probability is approximately (0.6185)^c, which for c = 10 is already below 1%.

It’s very reasonable to consider a hash table which is 100 times bigger, since it is just a binary string.
Now the false positive probability is about 1.3 times 10 to the minus 21.
The key thing is that this is exponential in c. So, taking c = 100, it’s tiny, a really minuscule probability. And if this is not small enough for you,
you can go to c = 200 or 300 and get a really, really tiny probability of a false positive.
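The same computation with the optimal k = c ln 2 shows how quickly the rate falls off as c grows; a sketch using the (1/2)^(c ln 2) form:

```python
import math

for c in (10, 100, 200):
    fp = 0.5 ** (c * math.log(2))   # f(c ln 2) = (1/2)^(c ln 2), roughly (0.6185)^c
    print(c, fp)                    # ~8e-3, ~1e-21, ~2e-42: exponentially small in c
```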

So, if you’re willing to accept a very small probability of a false positive,
then you have this very simple data structure, which just corresponds to maintaining a binary string. It’s very simple to maintain, it has very fast query times, and the false positive probability is very small.

The downside of this data structure is that occasionally you might get a false positive, and it doesn’t easily allow for deletions from the database. There are, however, modifications with heuristics that allow deletions; these are called Counting Bloom Filters.
