Filtering data with bloom filters and Rust
Filtering large data sets efficiently is a common problem. Think about the following scenario: we have a large collection of text documents, and we want to see which ones are present in our database and which ones are not.
One way to do this would be to simply run a query on the database and find which documents are missing. But this approach potentially puts a lot of load on the database.
Consider the size of the problem: what if we have one million documents to check, and only five of them actually end up being found? We would still have run a massive select query against the database just to find those five.
A naive solution
If we have control over the way the documents are added to the database, one potential solution could be to also add the documents to a set, perhaps backed by a hashtable, while adding them to the database.
This way, when we later run queries against the database, we could first consult the hashset and filter out the documents which are not found in there. Our previous query would now whittle down to just the five documents out of a million which were found in the set (and therefore present in the database).
However, we now run into another problem: the set is going to be quite large, and it is going to live entirely in memory. In effect, we have recreated a subset of the columns of the database in memory.
This could be very expensive in terms of memory used, depending on the number and size of documents.
Hashing, part 2
One other solution, also based on hashing, could be to maintain an array of bits, all of which are initialised to 0. While adding a document, we calculate its hash. Most hash functions return a large integer as the result. We could somehow map this integer to one of the indices in the bit array, and set that index to 1.
Now when we want to run the query against this data structure for a given document, we simply hash it again, calculate the index in the exact same way we did when adding it, and check if that bit is set or not. If it is set, this means that the document has been added earlier. If the bit is unset, we have not added this document before.
We now know which documents to filter out before hitting our database with a select query, and we also do not need to store the entire documents in memory: we can get away with simply storing a bit array, which is quite small in comparison.
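To make this concrete, here is a minimal sketch of such a single-hash filter in Rust (the standard library hasher and a plain bool vector are arbitrary choices made for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A naive presence filter: one hash function, one bit per element.
struct NaiveFilter {
    bits: Vec<bool>, // a plain bool vector stands in for a packed bit array
}

impl NaiveFilter {
    fn new(size: usize) -> Self {
        NaiveFilter { bits: vec![false; size] }
    }

    /// Hash the item and reduce the hash modulo the array size to get an index.
    fn index_of(&self, item: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.bits.len()
    }

    /// Mark the item as seen by setting its bit.
    fn add(&mut self, item: &str) {
        let idx = self.index_of(item);
        self.bits[idx] = true;
    }

    /// `false` means "definitely never added"; `true` means "probably added".
    fn contains(&self, item: &str) -> bool {
        self.bits[self.index_of(item)]
    }
}
```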
While this approach seems to solve our problems, there are a couple of issues here:
Collisions from the hashing algorithm
Hash functions may have collisions. While the chances of this happening are low and depend on the hashing algorithm used, if the problem size is large enough, we may run into this. Let’s look at what would happen in case of a collision.
Let us assume that the strings “sodium carbonate” and “sodium chloride” both result in a hash of 1221. “sodium carbonate” is added to our database, which results in the index corresponding to 1221 being set. Now if we query the data structure later for “sodium chloride” we first hash it, which gives us 1221, which leads us to an index which is set, although we never added this string to our database.
It leaks through the filter. We will call this a false positive.
Even if the chances of a hash collision (and a resulting false positive) are low, the next problem we face is much more prevalent.
Collisions from index resolution
Remember when we said earlier that we would somehow have to map the result of the hash function (which is a large integer, referred to here as r) to an index in our bit array (whose size is referred to here as N)?
Since the result of the hash is very large and the bit array is not going to be large enough to simply use the result of the hash function as an index, we need to map the result to an index.
This is quite often done using the modulo operation, which gives us the remainder of dividing r by N. This value will fall between 0 and N-1 depending on the exact value of r.
This is where a lot of collisions can happen. Assume that we have five strings; hashing them gives us the numbers 507, 107, 57, 9214007 and 657. Now assume that our backing array has a size of 50.
All of these numbers will result in the index 7, when run through a modulo 50 operation. If we add any one of these five strings to our data structure, the other four will appear as false positives because that bit at index 7 is now set.
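A quick check with the numbers above confirms that they all collapse onto the same slot:

```rust
fn main() {
    let hashes: [u64; 5] = [507, 107, 57, 9_214_007, 657];
    let n: u64 = 50; // size of the backing bit array
    for h in hashes {
        println!("{} % {} = {}", h, n, h % n); // every line prints an index of 7
    }
}
```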
This sort of collision is going to happen quite frequently, compared to the hash collisions we looked at earlier.
Bloom filters
Bloom filters are a probabilistic data structure which reduce the false positives described earlier to quite a low number using a simple improvement on the data structure we just looked at.
The fundamental draw of a bloom filter is that its negative answers (element not in set) carry a guarantee of correctness, while its positive answers carry a very high (but not complete) degree of certainty.
When adding an element, instead of using one hash function, we use k hash functions. Then for each of those hash functions, we set the bit corresponding to the result of the hash function for that element in the bit array. This results in k bits being set in the array per element instead of just one.
When querying for an element, we simply calculate the k hashes, compute the indices from them, and check each of the indices. If even one of the indices is not set, we can be sure that we did not add this element to the filter earlier.
However, if all bits are set, the element may have been added to the filter earlier. We will look at the reason for this uncertainty quite soon, but first, let us look at the collisions which were a major problem with the previous solution.
Because the bits are now spread out over the array, chances of a collision are drastically reduced.
The probability of all hash functions producing the same hash for an element, as well as the possibility of the modulo operator reducing these results to the same indices in the array, is quite small.
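A rough sketch of the core add and query logic, where the k hash functions are simulated by seeding the standard library hasher with the function's index (just one of several possible ways to obtain k hashes):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    k: usize, // number of hash functions
}

impl BloomFilter {
    fn new(size: usize, k: usize) -> Self {
        BloomFilter { bits: vec![false; size], k }
    }

    /// Compute the k indices for an item; seeding the hasher with `i`
    /// stands in for having k independent hash functions.
    fn indices(&self, item: &str) -> Vec<usize> {
        (0..self.k)
            .map(|i| {
                let mut hasher = DefaultHasher::new();
                i.hash(&mut hasher);
                item.hash(&mut hasher);
                (hasher.finish() as usize) % self.bits.len()
            })
            .collect()
    }

    fn add(&mut self, item: &str) {
        for idx in self.indices(item) {
            self.bits[idx] = true;
        }
    }

    /// If any of the k bits is unset, the item was definitely never added;
    /// if all of them are set, it was probably added.
    fn contains(&self, item: &str) -> bool {
        self.indices(item).iter().all(|&idx| self.bits[idx])
    }
}
```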
Now however we have another collision to contend with, which will be caused by a union of the indices set by other elements. Let us look at an example, where k is 3, which means we have three hash functions and three output indices per element:
- sodium chloride results in indices 101, 4 and 268
- sodium carbonate results in indices 995, 4 and 54
- benzene results in indices 268, 1022 and 9
- carbon dioxide results in indices 101, 88 and 345
If the last three elements are added to the filter, sodium chloride will be a false positive: between them, the other three elements set bits 4, 268 and 101, which are exactly the bits sodium chloride requires.
This rate of false positives can be mitigated somewhat by tuning the number of hash functions as well as the size of the bit array.
Tuning the filter
There are a few ways we can tune our bloom filter. This involves trade-offs which are driven by the rate of false positives we can accept, and the memory we are prepared to dedicate.
The following values can be tuned:
- the size of the backing bit array — the wider the array, the lower the chances of a false positive. On the other hand, this increases the memory usage of the filter.
- the number of hashing functions — applying more hash functions to an input sets more bits, spread out over the bit array. With only one hash function, the bloom filter effectively behaves like the naive implementation described earlier and suffers from the same index collisions. As we increase the number of hashes, the rate of false positives goes down, but the cost of both adding an element and querying for one goes up (a rough sizing sketch follows this list).
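For a rough sense of how these two knobs interact, the widely used bloom filter sizing formulas relate the expected number of elements n and the acceptable false positive rate p to the array size m and the number of hash functions k. A small sketch (the concrete numbers in the comment are only approximate):

```rust
/// Rough sizing based on the standard formulas: m = -n * ln(p) / (ln 2)^2 bits
/// and k = (m / n) * ln 2 hash functions, for n expected elements and a target
/// false positive rate p.
fn size_filter(n: usize, p: f64) -> (usize, usize) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as usize;
    let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as usize;
    (m, k)
}

fn main() {
    // For one million elements and a 1% false positive rate this suggests
    // roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.
    let (bits, hashes) = size_filter(1_000_000, 0.01);
    println!("bits: {}, hash functions: {}", bits, hashes);
}
```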
Downsides
There are certain downsides to a bloom filter. The biggest one is that we cannot delete an element from the filter. If we think about how a delete would work, we would take the element to be deleted, calculate the k hashes, and un-set all of the bits on those indices.
If we look at an example:
- sodium chloride sets the bits 11, 98 and 202
- benzene sets the bits 204, 98 and 887
When deleting benzene we unset its three bits. But now when we check for sodium chloride, bit 98 is unset, and we come to the incorrect conclusion that sodium chloride is not in the filter. Our guarantee is broken!
One possible solution could be to keep counters instead of bits: increment the counters at the indices when adding an element, decrement them when removing one, and declare that an element is missing only when one of its counters is 0. In the above example, after adding both elements the counter at index 98 will be at 2. Removing benzene will bring it down to 1, and thus the test for sodium chloride will still pass.
However, we are now forced to use an integer array instead of a bit array, which means a size increase of at least eight times the original space requirement (even with single-byte counters).
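A sketch of what such a counting variant might look like, assuming the k indices for an element are computed the same way as before and using 8-bit counters as an arbitrary choice:

```rust
/// A counting variant: each slot holds a small counter instead of a single bit,
/// so elements can be removed by decrementing.
struct CountingFilter {
    counters: Vec<u8>, // even 8-bit counters are 8x the size of a bit array
}

impl CountingFilter {
    /// `indices` are the k slots computed for the element, as before.
    fn add(&mut self, indices: &[usize]) {
        for &idx in indices {
            self.counters[idx] = self.counters[idx].saturating_add(1);
        }
    }

    fn remove(&mut self, indices: &[usize]) {
        for &idx in indices {
            self.counters[idx] = self.counters[idx].saturating_sub(1);
        }
    }

    /// An element can only be present if every one of its counters is non-zero.
    fn contains(&self, indices: &[usize]) -> bool {
        indices.iter().all(|&idx| self.counters[idx] > 0)
    }
}
```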
A Rust implementation
Rust is a systems programming language which provides comprehensive features to ensure memory safety as well as a rich set of tools for developing programs. An implementation of the bloom filter described above is available on GitHub.
This is a bloom filter behind a minimal REST API, allowing a user to add strings to the filter as well as query the filter with a list of strings.
I went through a couple of optimisations to make the bloom filter more efficient, which are described in some detail below:
Hashing
Using a single hash algorithm to generate the k indices: I set k to 40 for my testing, and since calculating the hash of a string 40 times is quite expensive, I used the following workaround:
- calculate the hash of the string. I use metro hash from the fasthash library for hashing.
- calculate the hash of the hash produced in step 1: the hash of step 1 is an integer, but we can convert its hexadecimal representation to a string in order to hash it.
- start with the hash of step 1, and keep adding the hash of step 2 to it, k times. Each addition gives us an index which (modulo N) is the bit to set. Since our goal is just to spread the bits around for a given string in a reproducible fashion, this strategy works for our use case and is faster than hashing the string 40 times (a sketch of this follows below).
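A rough sketch of this scheme, with the standard library hasher standing in for metro hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_str(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

/// Derive k bit indices from two hashes instead of hashing the string k times.
fn indices_for(item: &str, k: u64, n: u64) -> Vec<u64> {
    let h1 = hash_str(item);                 // step 1: hash the string
    let h2 = hash_str(&format!("{:x}", h1)); // step 2: hash the hex form of that hash
    (1..=k)
        .map(|i| h1.wrapping_add(h2.wrapping_mul(i)) % n) // step 3: h1 + i*h2, reduced mod n
        .collect()
}
```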
Modulo operation
The modulo operator involves division, which is quite slow. In fact, when I ran a profiler on the test code, I found that the program spent more than half of its time doing division.
Here we can make use of the fact that we control N (the size of the bit array): a mod N, where N is a power of 2, is simply a & (N-1), which is faster than division and, by extension, modulo. So we always set N to a power of 2 and do a bitwise AND instead of a modulo.
This gives us some gains in terms of speed.
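The replacement itself is tiny; a small sketch:

```rust
/// When n is a power of two, hash % n and hash & (n - 1) give the same result,
/// and the bitwise AND avoids the division hidden inside the modulo.
fn bit_index(hash: u64, n: u64) -> u64 {
    debug_assert!(n.is_power_of_two());
    hash & (n - 1)
}
```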
Structure of the application
The application is quite barebones: at startup a new bloom filter is created, and handles to it are passed to the request handlers.
Since we use actix, which uses a factory-based approach to create handlers, we have to create an Arc, an atomically reference-counted pointer owning our heap-allocated filter object and enabling shared access to it.
The factory-based approach means that the function we supply to actix is used to create a handler each time a request is served. If we simply passed the filter object to the function, each request would be served by a new copy of the filter, hence the ref-counting approach.
Clones of the Arc (which all point to the same filter object on the heap) are passed around to the request handlers, and a mutex is used to protect the filter from concurrent access. The Arc increases the ref-count on each clone operation, and each request works on the same filter object.
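The actix wiring itself is omitted here, but the underlying sharing pattern looks roughly like the following, with spawned threads standing in for concurrent request handlers:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct BloomFilter {/* bits, hash count, ... */}

impl BloomFilter {
    fn add(&mut self, _item: &str) { /* set the k bits */ }
}

fn main() {
    // One heap-allocated filter, shared behind an atomic reference count and a mutex.
    let filter = Arc::new(Mutex::new(BloomFilter {}));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            // Each handler gets a clone of the Arc; the clone bumps the
            // reference count but still points at the same filter.
            let filter = Arc::clone(&filter);
            thread::spawn(move || {
                filter.lock().unwrap().add(&format!("document-{}", i));
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```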
The REST API allows uploading a list of strings, which are added to the filter, and querying the filter with a list of strings, which returns a payload containing the matching and missing strings.
The server can be run simply by cloning the project and running cargo run from the root of the project.