Ethereum Sharding inside a Supercomputer

Published in Coinmonks

7 min read · Apr 14, 2019

Humanity has used modeling and simulation for centuries to understand and predict the behavior of complex systems. From mathematical models predicting projectile trajectories to generals recreating battles on a map, simulation is one of the best tools for exploring multiple alternatives while searching for the best solution to a problem. In the last few decades, computer simulations have considerably increased the complexity of the models that we can study, and cutting-edge technology, like supercomputers, takes the art of simulation to the next level.

This is one of the directions in which the Ethereum Foundation decided to invest some effort during the first wave of Ethereum grants, released one year ago. The Barcelona Supercomputing Center (BSC) proposed to develop a simulator of a sharded blockchain that could run inside a supercomputer. More specifically, the idea was to implement a partial version of the sharding specification designed by the Ethereum Foundation researchers, and to simulate the sharded blockchain inside the Marenostrum supercomputer at BSC. Marenostrum has 3456 computing nodes, each with two Intel Xeon Platinum processors of 24 cores each. This adds up to 165,888 cores. The machine also has 390 terabytes of memory and delivers a peak performance of 11.15 petaflops.
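As a quick sanity check, the total core count follows directly from the node and processor figures quoted above. The variable names below are only illustrative:

```python
# Figures quoted in the article for the Marenostrum supercomputer.
NODES = 3456
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 24

total_cores = NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR
print(total_cores)  # 165888
```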

*Figure 1 — Marenostrum 4 Supercomputer at BSC*

Objectives of the Simulations

Ethereum has a testnet on which many tests are routinely performed before large changes are rolled out to the live network. One could therefore ask: wouldn’t the testnet be enough for testing sharding as well? Is there really a need for more complex simulations? There are several reasons why running simulations is beneficial for large endeavors such as blockchain scalability. They can be summarized as follows:

In a simulator, it is much easier to set parameters such as the peer-to-peer (p2p) network topology, the network latency, the network connectivity, the number of slow peers and the number of malicious peers, and to execute hundreds of different scenarios with a large combination of different parameters.
Simulators allow us to run crazy scenarios that push parameters past the limits at which the system could (and most likely will) completely collapse. Scenarios such as predictive randomness advantage or an extremely low validator count, among other critical cases, can be executed and evaluated much faster with a simulator.
Testing features such as finality and validator slashing in real time can be painfully slow, given that some of those time periods take hours or even days. Simulations, on the other hand, can accelerate time and run orders of magnitude faster than wall-clock time. This feature can be handy when testing some of the long-term consequences of the protocol.
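The last point, accelerating time, is typically achieved with a discrete-event loop: instead of sleeping, the simulator jumps its virtual clock straight to the next scheduled event. The following is a minimal sketch of that general technique, not the code of the BSC simulator. The function name and the event names are hypothetical.

```python
import heapq

def run_simulation(events, horizon):
    """Process (time, name) events in time order, up to `horizon` simulated seconds.

    The virtual clock jumps from event to event, so a simulated day costs
    only as much wall-clock time as it takes to handle its events.
    """
    queue = list(events)
    heapq.heapify(queue)
    clock = 0.0
    processed = []
    while queue and queue[0][0] <= horizon:
        clock, name = heapq.heappop(queue)  # advance virtual time instantly
        processed.append((clock, name))
    return clock, processed

ONE_DAY = 24 * 3600.0
# Hypothetical events, chosen only to illustrate long waiting periods.
events = [
    (ONE_DAY, "slashing_window_closes"),
    (6.0, "beacon_block"),
    (3600.0, "finality_checkpoint"),
]
clock, processed = run_simulation(events, horizon=2 * ONE_DAY)
print(clock)                            # 86400.0
print([name for _, name in processed])
```

A full simulated day passes here without any real waiting, because nothing in the loop depends on the wall clock.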

All this is not to say that simulators are much better than the testnet, or that the testnet is unnecessary. The testnet provides many ways to test things, sometimes much more accurately than the controlled setup of a simulator. Neither method is better than the other; they are in fact complementary. The Ethereum Foundation knows this, and that is why efforts are being made in both directions.

Current Simulation Results

The Ethereum Foundation researchers have explored multiple ways to implement sharding, and the specification continues to evolve today. These different avenues have been explored in two meetings attended by BSC researchers as well as all the major Ethereum 2.0 clients: a first meeting in Taipei and a second one in Berlin. A few additional meetings were also held during DevCon4 in Prague. A substantial part of the protocol has evolved over this year and, thanks to the continuous effort of the researchers, we seem to be closer to a definitive version of what is expected to be Phase 0 of Ethereum 2.0. Phase 0 involves the deployment of the Beacon Chain, the heartbeat of Ethereum 2.0, which is also responsible for a critical part of implementing sharding in a secure way: randomness.

*Figure 2 — Number of Peers for each node in the simulated P2P Network*

The first objective of the BSC team was to implement a simulator of the Ethereum blockchain that includes the Beacon Chain. After several months of work, the first Proof of Concept has been implemented and the first simulations have been run in the supercomputer. The current prototype not only runs the simulation but also performs a full post-processing of the results and produces a report that can be analyzed with any web browser. The report provides detailed information and statistics such as the number of peers for each node in the P2P network (Figure 2). For this particular simulation, the red line shows that the nodes have 8 peers on average, with a minimum of 4 peers and a maximum of 14 peers. Simulations with fewer peers per node were also executed and showed networks with lower connectivity, leading to higher block propagation latency and therefore to higher uncle rates. These results are consistent with the behavior one would expect from a low-connectivity P2P network.
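To give an idea of how peer statistics like these can be obtained, here is a small sketch that builds a random topology and reports the minimum, average and maximum number of peers. It assumes each node dials a fixed number of random peers and that links are bidirectional. This is an illustrative assumption, not necessarily how the simulator builds its network, and the function names are hypothetical.

```python
import random
from statistics import mean

def build_topology(num_nodes, dials_per_node, seed=0):
    """Each node dials `dials_per_node` distinct random peers; links are two-way."""
    rng = random.Random(seed)
    peers = {node: set() for node in range(num_nodes)}
    for node in range(num_nodes):
        others = [n for n in range(num_nodes) if n != node]
        for target in rng.sample(others, dials_per_node):
            peers[node].add(target)
            peers[target].add(node)
    return peers

def degree_stats(peers):
    """Return (minimum, average, maximum) number of peers per node."""
    degrees = [len(p) for p in peers.values()]
    return min(degrees), mean(degrees), max(degrees)

peers = build_topology(num_nodes=200, dials_per_node=4)
lowest, average, highest = degree_stats(peers)
print(lowest, average, highest)
```

With 4 outgoing dials per node, every node ends up with at least 4 peers and the average can be at most 8, since each dial adds one link shared by two nodes.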

In addition, simulations were run to imitate the behavior of the current Ethereum 1.0 network. For instance, the Ethereum 1.0 network statistics show that the average block time is about 16 seconds, but not all blocks take 16 seconds to be produced: some take under 5 seconds while others approach a minute, as can be observed in Figure 3.a. The same behavior was replicated with the simulator running in the Marenostrum supercomputer (see Figure 3.b). That particular simulation had an average block time slightly over 15 seconds, while a few blocks took over 40 seconds and some others under 5 seconds, showing the same distribution observed in the Ethereum 1.0 live network.
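This spread of block times is what one would expect from Proof of Work, where finding a block is close to a memoryless process. The sketch below assumes exponentially distributed block intervals with a 15 second mean. That is the usual idealization of PoW mining and an assumption made here, not a description of the simulator's internals. The function name is hypothetical.

```python
import random
from statistics import mean

def sample_block_times(mean_seconds, count, seed=0):
    """Draw block intervals from an exponential distribution (idealized PoW)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_seconds) for _ in range(count)]

times = sample_block_times(mean_seconds=15.0, count=10_000)
average_time = mean(times)
share_fast = sum(t < 5 for t in times) / len(times)   # blocks under 5 seconds
share_slow = sum(t > 40 for t in times) / len(times)  # blocks over 40 seconds
print(average_time, share_fast, share_slow)
```

Under this model, the expected share of blocks under 5 seconds is 1 - exp(-5/15), about 0.28, and the expected share over 40 seconds is exp(-40/15), about 0.07. An average near 15 seconds therefore naturally coexists with both very fast and very slow blocks.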

The Proof of Work (PoW) chain is not the main feature researchers want to study; their interest is rather the behavior of the beacon chain under different conditions. The simulator therefore also runs the beacon chain in parallel with the PoW chain. In the beacon chain, a beacon block is produced at each slot, that is to say every 6 seconds according to the current specification (see Figure 4). The simulator provides a comprehensive report that includes the detailed list of blocks mined in the main chain, the beacon chain blocks, and the full list of uncle blocks with their hash, parent and miner. Each node produces a detailed report showing the number of messages sent and received at each time step, which allows us to study the network traffic and the flooding patterns of the network. Furthermore, each node logs in detail the main events observed during the simulation and writes the full log, which is available for the researchers to study. An example of it can be found here.
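The slot timing described above can be made concrete with a few lines of arithmetic. This is a minimal sketch assuming the 6 second slot duration quoted in the article. The function names and the notion of a genesis timestamp as an integer number of seconds are illustrative assumptions.

```python
SECONDS_PER_SLOT = 6  # value quoted in the article for the specification at the time

def slot_at(timestamp, genesis_time):
    """Return the slot number that contains `timestamp` (both in whole seconds)."""
    if timestamp < genesis_time:
        raise ValueError("timestamp is before genesis")
    return (timestamp - genesis_time) // SECONDS_PER_SLOT

def slot_start(slot, genesis_time):
    """Return the time, in seconds, at which `slot` begins."""
    return genesis_time + slot * SECONDS_PER_SLOT

print(slot_at(60, genesis_time=0))      # 10
print(slot_start(10, genesis_time=100)) # 160
```

One minute of simulated time therefore corresponds to ten beacon chain slots, each with at most one expected beacon block.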

An Error showing Correct Behavior

During the development of the Beacon Chain protocol in the simulator, an implementation error was made: given the random protocol for choosing one validator for each epoch, it was assumed that beacon chain uncle blocks could not exist. The simulations started producing very strange results, in some cases causing nodes to stall and stop progressing on the chain. This was due to an exponential explosion in the number of messages, which rendered the affected nodes completely unable to progress. Further analysis showed that this issue only occurred in the presence of uncle blocks in the beacon chain, which can appear under high latency or other network conditions that render the protocol temporarily unstable. The issue was solved by correctly handling uncle blocks in the beacon chain. It was later discussed with the Ethereum Foundation researchers in one of the Ethereum 2.0 conference calls, and they confirmed that uncle blocks can occur on the beacon chain and that they have to be properly handled, as proposed in the specification.
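The article does not give the details of the bug, so the following is not a reconstruction of it. It is only a generic illustration of how quickly message counts can blow up in a flooding network when nodes keep forwarding items they have already handled, compared with the case where each node forwards an item once. The function name and the tiny five-node network are hypothetical.

```python
def count_flood_messages(peers, origin, max_hops, deduplicate):
    """Count messages sent when `origin` floods one item for `max_hops` rounds."""
    seen = {origin}
    frontier = [origin]
    sent = 0
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for peer in peers[node]:
                sent += 1
                if deduplicate:
                    if peer in seen:
                        continue  # peer already has the item and will not forward it
                    seen.add(peer)
                next_frontier.append(peer)
        frontier = next_frontier
    return sent

# Fully connected network of 5 nodes: every node has 4 peers.
peers = {n: [m for m in range(5) if m != n] for n in range(5)}
with_dedup = count_flood_messages(peers, origin=0, max_hops=6, deduplicate=True)
without_dedup = count_flood_messages(peers, origin=0, max_hops=6, deduplicate=False)
print(with_dedup, without_dedup)  # 20 5460
```

Without deduplication, the number of messages is multiplied by 4 at every hop, which is the kind of exponential growth that can leave nodes too busy to make progress.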

While this was not really a specification error but rather an implementation one, such experiences provide useful feedback to client implementers, and they show the extreme behavior that can be observed when inadvertent errors are made or when attacks occur. The current prototype is still evolving together with the Ethereum 2.0 specification, and it still requires significant work before becoming a fully sharded blockchain simulator. Nonetheless, having a simulator that runs in a fully distributed system (a supercomputer with thousands of compute nodes) is an important milestone, and researchers hope that it will help them study complex cases in the next phases of the sharding implementation roadmap.

Future Steps

The BSC team continues to work towards the most accurate sharded blockchain simulator that can run in a distributed setting at large scale. Ideally, the simulator should be able to run thousands of nodes and perform different types of attacks to test the robustness and reliability of the Ethereum sharding specification. The current Proof of Concept is called shardSim and it is available as open source on GitHub. In the coming months, the researchers hope to implement different network topologies as well as protocols from other blockchains, which would allow them to compare different strategies at large scale.

Written by Leonardo Bautista Gomez