Looking back at the Ethereum 1x workshop 26–28.01.2019 (part 1)

We held a small event at the Stanford University campus. The event was kindly hosted by Stanford, supported by the Ethereum Foundation and ConsenSys, and benefited from the participation of a number of Ethereum researchers, client developers, infrastructure engineers, dApp developers, and enthusiasts.

Throughout these three days, we stayed focused on exploring the risks to the longevity of Ethereum 1.0 from a technological point of view. Apart from the risks, we discussed various mitigations that would reduce these risks or delay their adverse effects.

What makes these tasks both hugely rewarding and extremely challenging is the existence of the network, the blockchain, the cryptocurrency, and the ecosystem around the protocol and the software that we are looking to change.

I will start describing what we learnt during the workshop, piece by piece, so this will take multiple parts. Perhaps afterwards I will merge it into one paper.

Problems with large (and growing) state

Failing snapshot sync

Imagine the node at the bottom of the following picture is performing a snapshot sync from the node at the top. We are assuming the “fast sync” mechanism here. The parts of the state trie that have been synced are highlighted in green.

Next, more of the state is synced, but the top node has moved on and pruned the state corresponding to the first block header:

Next, block 1 is removed from the picture, and the state at block 2 is now pruned:

And so on:

The state at block 5 is getting closer to the point of being pruned:

Finally, the state at block 5 gets pruned, and the snapshot sync fails, with some of the leaves of the state trie missing because they are no longer available on the source node after being pruned:

I first discovered this during a conversation with the Trinity team (they are building a Python client for Ethereum 1.0 and 2.0). Although the Trinity client fully implements the Ethereum protocol, its nodes have trouble performing the snapshot sync.
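
To make the race concrete, here is a back-of-the-envelope sketch in Go. It is not any client’s actual code, and all numbers are invented for illustration: the point is simply that if fetching the remaining trie nodes takes longer than the source node takes to advance past its pruning window, the state root the sync was anchored to disappears.

```go
// A toy model of the race between a fast-syncing node and a pruning peer.
// Real fast sync can move its pivot to a newer block, which is exactly the
// kind of "rebasing" discussed under mitigations below.
package main

import "fmt"

func main() {
	const (
		pruneWindow     = 120   // blocks of state history the peer keeps (assumed)
		blockTime       = 15.0  // average seconds per block (assumed)
		trieNodesToSync = 400e6 // remaining trie nodes to download (assumed)
		nodesPerSecond  = 30e3  // effective download-and-verify rate (assumed)
	)

	downloadTime := trieNodesToSync / nodesPerSecond // seconds needed to finish the download
	pruneTime := pruneWindow * blockTime             // seconds until the pivot state is pruned

	fmt.Printf("time to finish sync: %.0f s, time until pivot is pruned: %.0f s\n",
		downloadTime, pruneTime)
	if downloadTime > pruneTime {
		fmt.Println("sync fails: the pivot state is pruned before all leaves are fetched")
	} else {
		fmt.Println("sync completes against the chosen pivot")
	}
}
```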

The illustration above is, of course, a simplification for the sake of initial understanding. The reality is more nuanced:

  1. The Merkle tree is not binary but 16-ary
  2. Sync is happening from multiple peers at once
  3. The illustration is based on the idea of Geth’s fast sync. Parity’s warp sync works in a different way: it does not download the Merkle tree from root to leaves, but rather only the leaves, organised in chunks
  4. Geth’s current code does not prune state history once it is committed to the database; rather, it prunes while that history is still in memory. The default is to prune histories older than 120 blocks, but this can be changed via the “trie-cache-gens” command line option. However, increasing this option also increases the memory footprint of the node. In addition, Geth commits state histories to the database every now and then (by default, whenever the importing of canonical blocks exceeds 5 minutes), and these histories are not pruned later on.
  5. Parity’s pruning is more consistent with the illustration above: it does prune old state histories even after they have been committed (see the availability sketch after this list).
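
Tying points 4 and 5 together, here is a simplified availability model, my own sketch rather than Geth’s actual code: a peer can still serve the state for a block if the block is within its in-memory history window, or if that particular state happened to be flushed to the database at one of the periodic commits.

```go
// A simplified model of when a given block's state is still retrievable from
// a Geth-like peer (an assumption for illustration, not the real implementation).
package main

import "fmt"

// stateAvailable reports whether a peer whose chain head is `head` can still
// serve the state of `block`.
func stateAvailable(block, head, trieCacheGens uint64, committed map[uint64]bool) bool {
	if head-block < trieCacheGens {
		return true // still within the in-memory history window
	}
	return committed[block] // pruned unless it was committed to the database
}

func main() {
	committed := map[uint64]bool{7_000_000: true} // hypothetical commit point
	fmt.Println(stateAvailable(7_000_050, 7_000_100, 120, committed)) // true: within window
	fmt.Println(stateAvailable(7_000_000, 7_001_000, 120, committed)) // true: committed to disk
	fmt.Println(stateAvailable(7_000_500, 7_001_000, 120, committed)) // false: pruned
}
```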

Possible mitigations:

  1. Ask node operators to keep state history for longer. This consumes more disk space in the case of Parity, and can also increase the memory footprint in the case of Geth.
  2. Change Geth’s code to commit the snapshot not every 5 minutes, but at predictable block numbers (for example, block numbers that are multiples of 20), as sketched after this list.
  3. Develop and deploy snapshot sync mechanisms that do not suffer from this failure and are able to “gracefully” rebase a failed snapshot onto a newer version (with the help of the peers). I know such initiatives already exist in Geth, Parity, Turbo-Geth and Trinity, and I expect them to be accelerated from now on. On day 3 of the event, I gave a presentation about sync mechanisms (present and future), and I should soon be able to write about it.
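
For mitigation 2, the change amounts to making the commit decision a pure function of the block number rather than of wall-clock time, so a syncing node knows in advance which historical states every peer will still have. A hypothetical sketch follows; the interval of 20 blocks is only the example from the list above, not an agreed value.

```go
// Sketch of deterministic, height-based commit points (hypothetical).
package main

import "fmt"

const commitInterval = 20 // hypothetical commit interval in blocks

// shouldCommit reports whether the state at this block height should be
// committed to the database rather than pruned from memory.
func shouldCommit(blockNumber uint64) bool {
	return blockNumber%commitInterval == 0
}

func main() {
	for _, n := range []uint64{7_000_019, 7_000_020, 7_000_021} {
		fmt.Printf("block %d: commit=%v\n", n, shouldCommit(n))
	}
}
```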

Further investigations:

As you can hopefully see from the above description, the snapshot sync failure is not a straightforward phenomenon. It depends not just on bandwidth and pruning thresholds, but also on the composition of peers, their uptime, and so on. We will be using the toolset of whiteblock.io (part of the Emulation/Simulation working group) to study the properties of the sync failure by means of repeatable experiments. Once it is clear that we understand the phenomenon well enough, we can use the simulation framework to model the failures into the future.

Solutions:

Reduction of the state size, or at least of the rate of its growth, should help limit the problem and allow client software optimisations (such as advanced sync mechanisms) to catch up.

Duration of snapshot sync

Even if the previous problem is mitigated, it still takes longer and longer to perform a snapshot sync for a new node joining the Ethereum network.

Possible mitigations:

  1. Using the advanced syncing mechanisms mentioned above, prioritise syncing of certain parts of the state. For example, one can always prioritise syncing the parts that are accessed by the incoming blocks.
  2. Use techniques to compress the snapshots. For example, part of the design of Gastoken, which is one of the largest sources of newly created contracts, was to make the state created by Gastoken highly compressible. Another idea is to use some data blob (like an enumeration of all addresses) that can be downloaded ahead of time and enables further compression of the state (like replacing addresses with their indices within the enumeration), as sketched after this list.
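
To illustrate the second idea, here is a toy sketch of my own (not Gastoken’s design or any client’s actual scheme): if both sides share a pre-downloaded enumeration of known addresses, a snapshot can ship 4-byte indices instead of 20-byte addresses. A real format would also need to tag which of the two forms follows each entry.

```go
// Toy address-enumeration compression: replace 20-byte addresses with 4-byte
// indices into a shared, pre-downloaded table (hypothetical scheme).
package main

import (
	"encoding/binary"
	"fmt"
)

// compressAddress returns the 4-byte index of addr in the shared enumeration,
// falling back to the full 20-byte address if it is not in the table.
func compressAddress(addr [20]byte, index map[[20]byte]uint32) []byte {
	if i, ok := index[addr]; ok {
		buf := make([]byte, 4)
		binary.BigEndian.PutUint32(buf, i)
		return buf
	}
	return addr[:]
}

func main() {
	var alice [20]byte
	copy(alice[:], "example-address-0001")  // placeholder 20-byte address
	index := map[[20]byte]uint32{alice: 42} // hypothetical shared enumeration
	fmt.Printf("encoded in %d bytes instead of 20\n", len(compressAddress(alice, index)))
}
```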

Further investigations:

We would like to check the hypothesis that the sync time grows linearly with the state size, and, if so, establish the coefficient of such a linear function.
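
If such measurements were available, the coefficient could be estimated with a simple least-squares fit through the origin. The sketch below uses made-up numbers purely to show the calculation.

```go
// Estimate k in the model time ≈ k * size from hypothetical measurements.
package main

import "fmt"

func main() {
	sizesGB := []float64{40, 60, 80, 100}   // hypothetical state sizes in GB
	timesH := []float64{4.1, 6.3, 8.2, 9.9} // hypothetical sync durations in hours

	var num, den float64
	for i := range sizesGB {
		num += sizesGB[i] * timesH[i]
		den += sizesGB[i] * sizesGB[i]
	}
	k := num / den // hours of sync time per GB of state, if the linear model holds
	fmt.Printf("estimated coefficient: %.3f h/GB\n", k)
}
```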

Solutions:

Reduction of the state size, or at least of the rate of its growth, should help limit the problem and allow client software optimisations (such as advanced sync mechanisms) to catch up.

Slower block sealing

To appear in part 2 or later

Slower processing of transactions reading from the state

To appear in part 2 or later

Block gas limit increase and the State fees (formerly known as State rent) share initial steps

To appear in part 2 or later

Stateless contract pattern is discouraged by the current gas schedule

To appear in part 2 or later

eWASM interpreters could be a sensible first change even though gas cost might not be practical in the beginning

To appear in part 2 or later

Chain pruning will become more relevant as we start constraining the state growth

To appear in part 2 or later

Ethereum protocol changes do not need to take a year to be prepared

To appear in part 2 or later