Invertible Bloom Lookup Table

Joonmo Yang
Published in CodeChain
May 6, 2018

Horizontal scalability is one of the most important issues in modern software development. If your program could handle more work simply by spinning up more instances of itself, life would be much easier. But as you have probably experienced, this kind of scalability is extremely hard to achieve in the real world. Almost every aspect of a modern computer system gets tangled up in solving it, and things get even more complex when you are targeting a very large system.

The root cause of the difficulty in parallelization is that most of a program's important operations are tightly coupled to its underlying state. In other words, each instance of the program has its own state, possibly different from the others', so the instances might not behave identically. This is why many people pay attention to data synchronization schemes, which ensure that every node has an identical internal state.

Let’s assume that Alice and Bob want to share a list of items, for example the amounts of fruit left in their storage. Alice knows that they have 3 apples, 10 grapes, and 7 melons. Bob knows that 10 grapes, 7 melons, and 2 lemons are left. In a naive algorithm, Alice would tell Bob about all the fruits she knows of, and Bob would tell Alice about the ones she doesn’t know, (lemon, 2) in this example. This is quite inefficient: Alice has to send 3 items, while all they really need to exchange is 1 item, the size of the difference between their internal states. It would be far more efficient if they could send only the difference between them, not the full list of fruits. This is where the IBLT (Invertible Bloom Lookup Table) comes in.

Invertible Bloom Lookup Table

The basic idea of the IBLT is quite similar to a counting bloom filter, but it adds some extra functionality that is really useful for the data synchronization problem. Features of the IBLT include:

  • Space proportional to the difference between the two sets
  • Stores both the key and the value of each item

An IBLT maintains count, key_sum, val_sum, and hash_sum in each table row. On each insertion, count is incremented by 1, and key_sum and val_sum are xor-ed with the current key and value. hash_sum exists for fault tolerance and is xor-ed with the hash of the current key. Of course, like any other bloom filter derivative, the key is hashed with multiple hash functions, so each key/value pair is inserted at multiple indices.
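In code, a row and the insertion step can be sketched like this. This is a minimal Python sketch: the sha256-based index and checksum functions, and the table size and hash count, are illustrative choices, not from any particular paper.

```python
import hashlib

K, SIZE = 4, 12  # hash count and row count; illustrative choices

def indices(key: bytes):
    # Partitioned hashing: hash function i owns its own SIZE // K rows,
    # so one key never lands on the same row twice.
    sub = SIZE // K
    return [i * sub
            + int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:4], "big") % sub
            for i in range(K)]

def checksum(key: bytes) -> int:
    # Extra hash of the key, kept in hash_sum for fault tolerance.
    return int.from_bytes(hashlib.sha256(b"chk" + key).digest()[:4], "big")

table = [[0, 0, 0, 0] for _ in range(SIZE)]  # count, key_sum, val_sum, hash_sum

def insert(key: bytes, val: int):
    k = int.from_bytes(key, "big")
    for i in indices(key):
        table[i][0] += 1              # count
        table[i][1] ^= k              # key_sum
        table[i][2] ^= val            # val_sum
        table[i][3] ^= checksum(key)  # hash_sum

insert(b"apple", 3)  # touches K distinct rows
```

Partitioning the table per hash function is one common way to guarantee the K indices are distinct; a single hash modulo SIZE works too, with slightly messier bookkeeping.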

Let’s try to build an IBLT from Alice’s fruit list, with a table size of 12 and a hash count of 4. If apple hashes to indices 0, 1, 4, and 11, each of those four rows gets a count of 1, with the key, value, and key-hash of (apple, 3) xor-ed into its sums.

Next, grape hashes to 1, 3, 4, and 6, so those rows absorb (grape, 10) the same way; rows 1 and 4 now have a count of 2.

And melon hashes to 0, 2, 4, and 9, bringing row 4 up to a count of 3.
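The three insertions can be replayed in a few lines of Python, using the example indices from the walkthrough above (the crc32 checksum is a stand-in for the key hash):

```python
import zlib

# The walkthrough's example indices; any hash scheme works the same way.
FRUITS = {b"apple": (3, [0, 1, 4, 11]),
          b"grape": (10, [1, 3, 4, 6]),
          b"melon": (7, [0, 2, 4, 9])}

table = [[0, 0, 0, 0] for _ in range(12)]  # count, key_sum, val_sum, hash_sum
for key, (val, idxs) in FRUITS.items():
    k = int.from_bytes(key, "big")
    for i in idxs:
        table[i][0] += 1               # count
        table[i][1] ^= k               # key_sum
        table[i][2] ^= val             # val_sum
        table[i][3] ^= zlib.crc32(key) # hash_sum (placeholder checksum)

# Row 4 was hit by all three fruits, so its count is 3 and its key_sum is
# apple ^ grape ^ melon. Rows 2, 3, 6, 9, and 11 were each hit exactly once,
# so they still hold a single fruit's key and value untouched.
```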

Now, some rows have a count of 1. If we’ve done only insertions, their key_sum and val_sum still hold the original key and value of a single item. We call these pure rows, and they can be safely removed from the table: from the key and value of a pure row, we can recalculate the indices where the item was inserted and peel it out of the table. By peeling out pure rows one by one, we can recover the original list of items from the IBLT.
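A minimal sketch of this peeling loop, assuming an illustrative sha256-based hashing scheme and a table deliberately larger than the item count, so that a pure row is almost always available:

```python
import hashlib

K, SIZE = 4, 32  # oversized table so peeling almost surely succeeds

def indices(key: bytes):
    sub = SIZE // K  # each hash function owns its own block of rows
    return [i * sub
            + int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:4], "big") % sub
            for i in range(K)]

def checksum(key: bytes) -> int:
    return int.from_bytes(hashlib.sha256(b"chk" + key).digest()[:4], "big")

table = [[0, 0, 0, 0] for _ in range(SIZE)]  # count, key_sum, val_sum, hash_sum

def insert(key: bytes, val: int):
    k = int.from_bytes(key, "big")
    for i in indices(key):
        table[i][0] += 1
        table[i][1] ^= k
        table[i][2] ^= val
        table[i][3] ^= checksum(key)

def recover() -> dict:
    # Repeatedly peel pure rows (count 1, checksum matches) until no row
    # can be peeled anymore. Destroys the table in the process.
    items = {}
    progress = True
    while progress:
        progress = False
        for row in table:
            count, key_sum, val_sum, hash_sum = row
            if count != 1:
                continue
            key = key_sum.to_bytes((key_sum.bit_length() + 7) // 8, "big")
            if checksum(key) != hash_sum:
                continue  # count 1 by accident, not a pure row
            items[key] = val_sum
            for i in indices(key):  # peel the item from every row it touched
                table[i][0] -= 1
                table[i][1] ^= key_sum
                table[i][2] ^= val_sum
                table[i][3] ^= hash_sum
            progress = True
    return items

for k, v in [(b"apple", 3), (b"grape", 10), (b"melon", 7)]:
    insert(k, v)
recovered = recover()  # holds all three fruits, with overwhelming probability
```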

As you may have noticed, to decode a constructed IBLT there must be at least one pure row at every step of the decoding. To achieve this, the IBLT needs more space than the number of items it contains, so it is far less efficient than a simple lookup table if we use it to hold all of the items. But the fascinating thing is that you can subtract two IBLTs.

Alice’s IBLT, called X, and Bob’s IBLT, called Y, have (grape, 10) and (melon, 7) in common. In other words, at each target index, both tables have the xor-ed keys, values, and hashes of (grape, 10) and (melon, 7) in their key_sum, val_sum, and hash_sum. This means that if we subtract each count of X from Y, and xor each key_sum, val_sum, and hash_sum of X with Y, the common elements cancel out. The resulting IBLT, called Z, contains only the difference between X and Y (in mathematical terms, (X−Y)∪(Y−X)).
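Subtraction itself is just a row-wise operation. The sketch below replays it with the walkthrough's example indices, plus assumed indices for Bob's lemon (2, 5, 7, 10; these are hypothetical, any hash scheme behaves the same):

```python
import zlib

INDICES = {b"apple": [0, 1, 4, 11], b"grape": [1, 3, 4, 6],
           b"melon": [0, 2, 4, 9],  b"lemon": [2, 5, 7, 10]}

def build(items: dict):
    t = [[0, 0, 0, 0] for _ in range(12)]  # count, key_sum, val_sum, hash_sum
    for key, val in items.items():
        k = int.from_bytes(key, "big")
        for i in INDICES[key]:
            t[i][0] += 1
            t[i][1] ^= k
            t[i][2] ^= val
            t[i][3] ^= zlib.crc32(key)  # stand-in key hash
    return t

def subtract(a, b):
    # Row-wise a - b: counts subtract; the xor-sums cancel shared items.
    return [[ca - cb, ka ^ kb, va ^ vb, ha ^ hb]
            for (ca, ka, va, ha), (cb, kb, vb, hb) in zip(a, b)]

X = build({b"apple": 3, b"grape": 10, b"melon": 7})  # Alice
Y = build({b"grape": 10, b"melon": 7, b"lemon": 2})  # Bob
Z = subtract(Y, X)  # subtract X's counts from Y's, as in the text

# Rows touched only by grape or melon (e.g. row 3) cancel to all zeros;
# lemon's rows carry count +1 and apple's rows carry count -1.
```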

But there is a problem: rows with a count of 1 might not be pure anymore. Since the IBLT uses multiple hash functions to generate indices, and some items can have overlapping indices, a row might have ended up with a count of 1 just “accidentally”. To handle this case, we also check that the hash of key_sum equals hash_sum. And since we did a subtraction, pure rows for items that were in X but not in Y now have a count of -1. So to decode the subtracted IBLT, we look for rows with a count of 1 or -1, check that the hash of key_sum equals hash_sum, and peel them from the IBLT.
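Putting it all together, here is a hedged end-to-end sketch (illustrative sha256 hashing, oversized table) that builds Alice's and Bob's tables, subtracts them, and decodes the rows with count +1 (Bob-only) or -1 (Alice-only), following the sign convention above:

```python
import hashlib

K, SIZE = 4, 32  # hash count and table size; oversized so decoding almost surely succeeds

def indices(key: bytes):
    sub = SIZE // K  # partitioned hashing: distinct rows per key
    return [i * sub
            + int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:4], "big") % sub
            for i in range(K)]

def checksum(key: bytes) -> int:
    return int.from_bytes(hashlib.sha256(b"chk" + key).digest()[:4], "big")

def build(items: dict):
    t = [[0, 0, 0, 0] for _ in range(SIZE)]  # count, key_sum, val_sum, hash_sum
    for key, val in items.items():
        k = int.from_bytes(key, "big")
        for i in indices(key):
            t[i][0] += 1
            t[i][1] ^= k
            t[i][2] ^= val
            t[i][3] ^= checksum(key)
    return t

def subtract(a, b):
    # Row-wise a - b: counts subtract, the three xor-sums cancel shared items.
    return [[ca - cb, ka ^ kb, va ^ vb, ha ^ hb]
            for (ca, ka, va, ha), (cb, kb, vb, hb) in zip(a, b)]

def decode(z):
    # Peel rows with count +1 or -1 whose checksum matches; repeat until
    # nothing more can be peeled. With Z = Y - X, +1 means "only in Y".
    only_x, only_y = {}, {}
    progress = True
    while progress:
        progress = False
        for row in z:
            count, key_sum, val_sum, hash_sum = row
            if count not in (1, -1):
                continue
            key = key_sum.to_bytes((key_sum.bit_length() + 7) // 8, "big")
            if checksum(key) != hash_sum:
                continue  # count of +-1 by accident, not a pure row
            (only_y if count == 1 else only_x)[key] = val_sum
            for i in indices(key):  # peel the item from every row it touched
                z[i][0] -= count
                z[i][1] ^= key_sum
                z[i][2] ^= val_sum
                z[i][3] ^= hash_sum
            progress = True
    return only_x, only_y

X = build({b"apple": 3, b"grape": 10, b"melon": 7})  # Alice
Y = build({b"grape": 10, b"melon": 7, b"lemon": 2})  # Bob
only_x, only_y = decode(subtract(Y, X))
```

After decoding, `only_x` holds Alice's side of the difference and `only_y` Bob's, which is exactly the one-item-each exchange the fruit example called for.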

It might seem weird at first glance, and some might wonder whether it really works as expected. There are already some papers about this, reporting that over 90% of cases were successfully decoded with a table size of 50 and a set difference of 30. According to the paper, a hash count of 3~4 seems good enough, and with a table 1.5x the size of the set difference, 99% of cases can be handled successfully.

The IBLT is a relatively young data structure, and there are not many use cases of it in the wild yet. However, some people are already trying to utilize it, for example in Bitcoin block propagation. The IBLT has a lot of potential and a number of advantages over other conventional techniques. Maybe it would be worth considering for your next system design.
