Written by Kelvin Fichter, Nate Rush
We spent some time at the IC3-Ethereum Boot Camp hacking on Ethereum’s Bloom filters. This article goes over the result of that work, and some of the interesting things we learned. The code used for these experiments is available on GitHub here.
Quick FYI, this article assumes you know a little bit about bit arrays, hash functions, and Ethereum!
Let’s first explain what Bloom filters actually do. A Bloom filter is a probabilistic, space-efficient data structure used for fast checks of set membership. That probably doesn’t mean much to you yet, and so let’s explore how bloom filters might be used.
Imagine that we have some large set of data, and we want to be able to quickly test if some element is currently in that set. The naive way of checking might be to query the set to see if our element is in there. That’s probably fine if our data set is relatively small. Unfortunately, if our data set is really big, this search might take a while. Luckily, we have tricks to speed things up!
A Bloom filter is one of these tricks. The basic idea behind the Bloom filter is to hash each new element that goes into the data set, take certain bits from this hash, and then use those bits to fill in parts of a fixed-size bit array (e.g. set certain bits to 1). This bit array is called a bloom filter.
Later, when we want to check if an element is in the set, we simply hash the element and check that the right bits are1 in the bloom filter. If at least one of the bits is 0, then the element definitely isn’t in our data set! If all of the bits are 1, then the element might be in the data set, but we need to actually query the database to be sure. So we might have false positives, but we’ll never have false negatives. This can greatly reduce the number of database queries we have to make.
Bloom Filters in Ethereum
Bloom filters are used in Ethereum to minimize the number of block queries that clients need to make. Imagine you want to find every time you’ve transferred some token X. The naive way of searching for this information would be looking through every past historical transaction; this would take a very long time! Luckily, the ERC20 standard requires that contracts fire off an event (or “log”) whenever tokens are transferred. These events are then stored in a Bloom filter that is part of each block.
From here, finding all your token transfers becomes much easier. You can first to take the event of interest, hash it, and figure out which bits to look at in the block-level Bloom filters. If any of these bits are set to 0, then the event must not have been generated in any transaction in this block. If all of the bits are 1s, then you can check the rest of the block for possible transactions.
Blooming the Blooms
At the IC3-Ethereum Boot Camp this summer, we realized it might be possible to completely fill up the block-level Bloom filter. Remember that certain bits of a Bloom filter are set to 1 based on the hash of the elements that go into the filter. If we can carefully pick which elements to insert, we could set every bit in the bloom filter set to 1.
The Ethereum Bloom filter implementation is currently 2048 bits and sets 3 bits in the filter based on the first few bits of the hash of the item. This means we can fill the entire filter by generating only 683 (ceil of 2048/3) events! A smart contract can generate as many events as it wants, so this can definitely work.
The next challenge was figuring out how to make this as efficient as possible. We first thought we might have to create a smart contract with a bunch of different events and fire off all of those events in a single function call. Unfortunately, that would’ve made deploying the contract prohibitively expensive.
Then we looked into how Solidity events actually work under the hood — they actually just execute a LOGX instruction! We realized that it wasn’t even necessary to actually define any events and trigger them manually, we could just log a bunch of junk. The whole contract took up only a few lines of code.
Next up was figuring out what data to put in each log to fill the Bloom filter correctly. This basically just required grinding out 3-byte values, hashing them, checking what bits they would flip in the filter, and throwing out any preimages that overlapped with something else. 682 events get us to 2046 bits, so the last event will necessarily overlap with at least one other bit.
Searching for the right input data gets exponentially more difficult as you get closer and closer to the full filter. Our laptops could only complete the search after 10–20 minutes, so we cheated a little and lowered the overlap requirement for the last few events. This trick resulted in a huge speedup (mining the required data in 10–20 seconds) at the cost of only ~10 more events.
LOGX instructions cost 375 gas per instruction, plus 8 gas per byte of topic data. Calldata is another 68 gas per byte. With the base transaction fee of 21,000 gas, we’re looking at a theoretical limit of a little over 330,000 gas to fill the whole block. Our method usually lands at about 515,000 gas, so we’re far from the maximal limit (probably because of loop overhead).
A full Bloom filter is really annoying for Ethereum clients. Anyone who wants to check for the presence of some event in a block would have a 100% false positive rate, which triggers a database query. Database queries are slow, so this could theoretically have a performance impact on the network. At the very least, a full Bloom filter is a useless Bloom filter, and it’s an extra 2048 (pointless) bits that need to be transmitted along with each block.
Fortunately, this attack isn’t cheap. At current gas prices, you’d need to spend $0.20 — $0.30 per block, every block, to keep filters full. We’ve explored some client code and determined that the impact is probably quite low, but it seemed useful to explore.
There are a few ways to mitigate this attack. The EVM could simply charge a little more for the LOGX instruction to make up for the potential database query. We could also agree to use a larger Bloom filter to (linearly) increase the cost to attack. However, this solution would also up the Ethereum block size.
Or we could just ignore this entirely (for now). It doesn’t seem like breaking behavior right now, though that might change in the future. It’s definitely something to keep an eye on! People have been talking about Ethereum’s bloom filters a little bit recently, so maybe we’ll start to see more conversation or even alternative designs.