For the last few weeks (since 3rd of December 2017), I have been obsessed with the idea of optimising Ethereum client (in particular go-ethereum, but I am sure almost everything can be carried over to other implementations). I believe this is an urgent demand, and it will not go away even with introduction of Casper.
Majority of optimisations are related to handling of Ethereum state. The main goal of optimisations is to reduce the time it takes for blocks to propagate through the network, which will allow further increase in gas limit without corresponding increase of the uncle rate.
Now I feel I need to publish this, before I get deeper and deeper and start neglecting other aspects of my life. I need to publish this to get rid of that obsession.
This is my plan. I have forked go-ethereum repository, and started making changes. When the changes are more-less polished, I will try to make into a working client. But the intent is, of course, to get at least some of these changes ported back into the go-ethereum. Now I will list the optimisations. I am planning to further expand the points with more details. I will also add preamble describing the current layout of the state in the go-ethereum node.
- Reuse of stateDB object, and specifically its storage tries (one per account) between the blocks. Currently, the stateDB object is re-created before each new block gets processed. There is a cache of the account trie, which allows new StateDB object to pick up the old trie, but there is no cache of storage trie. Therefore, storage tries come into processing of every block cold. I have implemented this change in my version in these commits: this and this.
- Change the trie node caching from generation-based approach to node-count based approach. Currently, by default 120 so-called generations of the trie nodes are stored. Generation advances whenever the state is committed into the database (after block processing). Each node carries the generation mark, which is brought in line with the current generation whenever the node is touched. Nodes with generation mark smaller than “current generation — 120” are evicted from the cache. The change is to have generation advance every time a node gets touched. That means we won’t keep certain number of generations, but instead certain number of nodes. This parameter (number of nodes to keep) will be much easier to calibrate than the number of generations, so more memory can be given to the trie node cache. I have implemented this change (with some other stuff) in this commit.
- Change the way the trie nodes are stored in the database, specifically what constitutes the key. Currently, each node is stored as one record in LevelDB. The key is the 32-byte hash of the node, and the value is the RLP serialisation of the node. Nodes can be of 3 types: full nodes (array of 17 elements, 16 for each nibble, and 17-th for the value), short nodes (key-value pair), and value nodes (leaves of the trie). Value nodes exist only theoretically, but not actually in the database, because their existence requires having 2 SHA3 images that only differ in the last 4 bits. Practically, all trie leaves are contained in short nodes. When a trie branch is “cold” (not expanded), in order to reach a certain leaf or sub-branch, a few levels of the trie need to be expanded (4 levels on average). Each expansion requires a lookup in the LevelDB. The worst thing about it is that these lookups have to be sequential, because the key for the deeper level is contained on the level right above it. The change is to have LevelDB keys not being hashes of the values, but instead based on the paths from the root to the branch of leaf. The key would be constructed as follows: <trie prefix>|<path>|<block mask>. <trie prefix> allows differentiating tries from each other. In my implementation, account trie has “AT” prefix in all its keys, and each storage trie has the address of the account as prefix. <Path> is a compact representation of path from the trie root to the branch or leaf. <block mask> is bitwise-inverted value of block number in which the entry was created. This trick allows finding the nodes that were added prior to a certain block using LevelDB’s range queries (LevelDB stores records in sorted order, so the range queries are efficient). I have implemented this change, but it is spread over many commits.
- Handling of chain re-orgs. After change (3), in order to support chain re-orgs, we need to be able to remove all nodes belonging to the certain range of blocks, before we apply the new blocks. This could be easily achieved by storing some additional info in the database. Each time we store a trie node, we also add a record with a key <block number>|<counter> and value being the key of the node. When we need to clean up, we use range query <block number>* to find all nodes to clean up. I have not implemented this change yet.
- State reads to require only 1 database lookup. With the change (3), and a small addition, where for each short node, we also write out the value node separately, with the full path, we can read values from the state without expanding the trie at all. I have implemented this change in this commit, but it is a bit messy. Subsequent profiling sessions showed that state reading is now negligible and does not even appear in the report.
- Pre-fetch trie expansion lookups for state modifications, in parallel. Because of change (3), trie expansions are not dependent on each other, and can be performed in parallel. I have implemented this (spread over many commits) and currently profiling and tuning it. It is currently not clear whether this in fact is an improvement. It looks like LevelDB serialises concurrent reads, so this could only work with a different database (perhaps BoltDB or BadgerDB?)
- Pre-fetch based on transactions in the mempool. Looking at the current mempool, nodes can estimate what transactions can be included into the next block. And, I conjecture, in most cases, running them against the current state and pre-expanding some areas of the tries will speed up the processing of the next block, when it does come. I have not impemented this change yet.
- More efficient memory usage for storing the trie. After change (2), the next observation that needs to be made (I have made it in the past and need to re-verify it for Ethereum tries) that most of the trie nodes stored in memory will end up being full nodes with 2 occupied and 15 non-occupied nodes. Because they are stored as array of 17 elements, each element being a reference (8 bytes), it ends up being 136 bytes per node. If this type of node does indeed turn out to be dominant, we can store it in 17 bytes (4 bits for the index of the first non-empty child, 4 bit bits for the index of the second non-empty child, and 16 bytes for children).
- More efficient storage of trie nodes in the database. I have not done the analysis, but I suspect that there is quite a lot of repetition in the trie nodes stored in the database. In particular, certain trie nodes, once touched, but then never used, will be repeatedly referenced from the sibling nodes. Once I have done analysis on this, I can suggest some ways to improve this.