Use of Probabilistic Data Structures: Bloom filter

Luciano Sabença
Published in Sinch Blog · 6 min read · May 13, 2020

One of the most important topics in computing is, for sure, the study of data structures.

Choosing an adequate data structure for an algorithm can yield optimal performance, just as choosing an inadequate one can make the algorithm far slower than it needs to be.

One of the most recently discussed topics in this area is probabilistic data structures.

There are several definitions of what they are. A particularly good and generic one is that probabilistic data structures are data structures that, unlike traditional ones, give only approximate answers when queried.

This approximate result is correct with some probability X, hence the name. The main reason this topic has gained momentum in recent years is the incredible increase in the amount of data available today, which makes processing and storage huge challenges.

That is why, given the large volumes of data we handle here at Wavy, probabilistic structures are so effective for us.

Bloom Filter

One of the most famous probabilistic data structures is the Bloom filter. Created by Burton Howard Bloom in 1970, it answers a single question: given a key, might it be part of the set represented by this Bloom filter?

Bloom filters are extremely memory-efficient, able to store on the order of 100 million keys in less than 200MB of memory.

As a disadvantage, they can return false positives: a query can report that a key may belong to the set when it is not actually part of it.

Although there are false positives, a Bloom filter will never return a false negative, that is, it will never say that a key is not part of the set when, in fact, it is.

The big catch behind Bloom filters is how they encode the existence of keys. Suppose you would like to know whether an integer belongs to a set. The simple and obvious solution would be to store all the numbers that belong to the set in memory, perhaps in a Set or a Hash Map.

The problem is that each long number occupies, in the vast majority of languages, 8 bytes, that is, 64 bits. A Bloom filter, in turn, does not represent the number using all 64 bits of its normal representation; it uses much less. Normally, 10 bits per key are more than enough for an extremely low false-positive rate.
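As a rough sanity check on these numbers, the standard Bloom filter sizing formulas let us compute bits per key from a target false-positive rate (the figures below are illustrative, not the exact parameters used in production):

```python
import math

def bloom_parameters(n_keys: int, fp_rate: float):
    """Return (bits per key, optimal number of hash functions)
    from the standard Bloom filter sizing formulas."""
    # m = -n * ln(p) / (ln 2)^2  -> total bits for n keys at false-positive rate p
    m = -n_keys * math.log(fp_rate) / (math.log(2) ** 2)
    # k = (m / n) * ln 2         -> optimal number of hash functions
    k = (m / n_keys) * math.log(2)
    return m / n_keys, k

bits_per_key, k = bloom_parameters(100_000_000, 0.01)
print(f"{bits_per_key:.1f} bits/key, k = {k:.1f}")  # ~9.6 bits/key, k ~ 7
```

At roughly 9.6 bits per key, 100 million keys need about 120MB, comfortably under the 200MB figure mentioned above.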

It performs this encoding through hash functions applied to the key.

Each hash function generates a value that is used as an index into a bit vector. For each of these indexes, the bit at that position is set to 1.

The query applies the same procedure: if all the bits at the positions generated by the hash functions are set to 1, the key may exist in the set; if any of the bits is set to 0, the key definitely does not exist in the set.
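The insert and query procedures described above can be sketched in a few lines of Python. This is a minimal illustration, not production code; it derives its k indexes from two halves of a SHA-256 digest (the Kirsch–Mitzenmacher double-hashing trick) rather than k independent hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit vector, all zeros

    def _indexes(self, key: str):
        # Derive k indexes from two 64-bit hashes (double hashing)
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        # Set the bit at each of the k positions
        for i in self._indexes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key: str) -> bool:
        # True  -> key *may* be in the set (false positives possible)
        # False -> key is definitely not in the set
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))
```

Because `add` and `might_contain` compute the same indexes for the same key, an inserted key is always reported as possibly present, which is exactly the no-false-negatives guarantee.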

(Figure: Bloom filter illustration by David Eppstein, originally made for a talk at WADS 2007)

Bloom filters in their traditional version, unlike most data structures, do not support deleting elements: once an element has been added to the set, it will remain there forever.

This is not a big problem: keys that are no longer useful will merely generate false positives, which is completely acceptable; and if the false-positive rate becomes too high, the filter can simply be regenerated from scratch, re-inserting only the keys that are still part of the set.

The use of Bloom Filter at Wavy

One of the uses we make of Bloom filters at Wavy is in SMS sending.

One of the crucial steps in this flow is to identify which mobile operator serves the destination of the message. Part of that step is checking whether the number has been ported from its original operator to another one.

Basically, this information is represented as a list with all ported numbers from all the countries in which we operate, and we need to consult it for each one of the millions of messages we send each day.

The first solution we tried for this problem was to store the ported numbers in memory, in a map from the number, represented by a Long, to its operator, represented by an Integer.

However, over time and with the growth of our operations, we began to have serious performance and cost problems due to the high memory usage and high startup time of the solution: we needed to load all ported numbers (several tens of millions of values) every time the system started.

We needed another solution and we thought that using a Bloom filter could help us with this!

So, instead of storing all the numbers in the application’s memory, we moved them to an external database and used the Bloom filter as a first-pass filter to find out whether there was a chance that a number had been ported.

This solution gave us important advantages: latency stayed extremely low for the vast majority of cases (roughly two orders of magnitude lower than a database-only solution), and the application now loads only the Bloom filter, whose memory usage is small. As the database, we chose Redis.

Redis is a non-relational database that acts as a persistent key-value store, which fits our problem perfectly: the key is the phone number and the value is the operator to which the number was ported.

In addition to the data model fitting our case perfectly, Redis gives us another advantage: since it keeps all data in memory, its operations are extremely fast, so latency remains low even in the cases where we do need to consult it.
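The lookup flow can be sketched as follows. Here a plain dict stands in for Redis (in production a client such as redis-py would play that role), the `TinyBloom` class is a compact illustrative filter, and the phone numbers and operator ids are hypothetical:

```python
import hashlib

class TinyBloom:
    """Compact illustrative Bloom filter (double hashing over SHA-256)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(self.m // 8)

    def _idx(self, key: str):
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for i in self._idx(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._idx(key))

def lookup_ported_operator(number: str, bloom: TinyBloom, store: dict):
    """Return the operator id a number was ported to, or None.
    `store` stands in for Redis; in production it would be a Redis client."""
    if not bloom.might_contain(number):
        return None  # definitely not ported: no database round-trip at all
    # Possibly ported (could be a false positive): confirm with the database
    return store.get(number)

# Hypothetical data: phone number -> operator id
ported = {"5511999990000": 3}
bloom = TinyBloom()
for number in ported:
    bloom.add(number)

print(lookup_ported_operator("5511999990000", bloom, ported))  # 3
print(lookup_ported_operator("5511988880000", bloom, ported))  # None
```

The key property of the design is that a negative answer from the filter is authoritative, so the vast majority of numbers never touch the database; only the rare "maybe" answers pay for a round-trip.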

A simplified diagram of our architecture can be found in the figure below:

Results

We obtained great results with the application of the Bloom filter in our architecture. In comparison with our first solution, we saved about 2.5GB of memory per instance of our sending systems, totaling about 100GB of memory saved.

The startup time of our platforms has dropped from around 300s to less than 15s, since we started loading only the Bloom filter during the load process.

Finally, we queried Redis for only about 12% of the messages we sent, and of those queries only 5% were caused by false positives, that is, numbers for which the Bloom filter said the number might have been ported but the Redis lookup revealed it had not.

The table below shows an example of this data from one of our instances, since the beginning of the Bloom Filter implementation:

Conclusions

One of the most important lessons from this case is that it reminded us, once again, how useful good data structures are.

It is also worth remembering that — in addition to the probabilistic data structures that solved our problem so well — there are several types of structures for other, more specific problems.

To name a few examples, there are CRDTs (Conflict-free Replicated Data Types, which are gradually becoming relevant for avoiding conflicts in distributed systems) and persistent data structures (like those present in some programming languages, especially functional ones such as Clojure and Scala).

Luciano Sabença
Software Developer @Movile; Bachelor of Computer Science, University of Campinas (UNICAMP)