Is Ethereum 2.0 overlooking communication costs?
Serenity is a term used to describe a series of potential updates to Ethereum, which attempt to address scalability with minimal impact on decentralization and security. In simpler terms, the goal is to make Ethereum a faster and better environment for smart contracts and DApps.
Ethereum tries to overcome the energy waste and scalability limitations of the original Proof of Work (PoW) solution by rethinking its entire architecture, using Proof of Stake (PoS) and Sharding as main pillars. The new Ethereum architecture is a multi-layered system, which is meant to be integrated — and run in parallel — with the existing PoW chain until PoW can be deprecated and replaced completely.
While reviewing the available Ethereum 2.0 documentation, we noticed several unaddressed issues and limitations. This article examines Serenity’s design rationale, highlights some of the identified issues, offers solutions, and finally contrasts Serenity with our approach at Elrond.
Making sense of Ethereum 2.0
Actors in Serenity
There are several actors in Serenity: beacon nodes that do most of the work in the system, and staking nodes (collectively called validators) which can be selected as proposers or attesters for the shard chains. Validators register to the network through a Smart Contract maintained on the PoW chain, by staking 32 ETH.
Most of the work in the system seems to fall on the beacon nodes, which have the following responsibilities:
- Maintain the list of active, queued and exited validators, by monitoring the registration smart contract in the main chain (PoW).
- Generate randomness, or act as the randomness source, through RANDAO/VDF (Verifiable Delay Function).
- Shuffle the validators into attester committees, both for the beacon chain and for the shards, and allocate slots for block proposers. The shuffling is done using the randomness source from the beacon chain.
- Process beacon blocks, and build the beacon chain with cross-links from each of the shards and with links to PoW blocks. A cross-link is metadata for a block in a shard chain, and can be viewed simplistically as a block header with aggregated signatures and shard information.
- Run the per slot consensus and the finality gadget (Casper FFG).
The proposers are shuffled in cycles by the beacon nodes, each cycle having 64 slots (could be more), with one proposer allocated for each slot. In its allocated slot, a proposer builds a block with transactions from its shard and cross-links for the beacon chain, and propagates it (through a p2p network) to the validators in the shard. The attester committee then attests (using BLS multi-signatures) to the block it approves of, and if the block has enough signatures, it can be added as a cross-link in a beacon chain block. The beacon chain block is in turn attested by a committee of attesters.
The attesters are shuffled into different committees for regular shards or for the beacon chain, according to the random source from the beacon nodes, and need to attest (sign) the blocks they approve of. Prysmatic Labs’ implementation considerations suggest that each validator node (proposer or attester) will be connected directly to one or more beacon nodes and will communicate with them through RPC (gRPC). Connection and communication inside the shard should work through p2p.
Justin Drake suggests in a post that proposers and attesters are gradually shuffled every epoch, so that liveness is not affected. This can greatly improve the liveness and decrease the communication overhead.
Ethereum is building the randomness source based on RANDAO and VDF. RANDAO will be used as a bootstrapping mechanism until the ASICs running the VDF are available (not before 2020). After the ASICs are integrated with the beacon chain, RANDAO will be used to generate the input for the VDF. If the VDF is broken, the system can fall back to RANDAO for randomness generation.
RANDAO is a commit-reveal mechanism. Each validator commits to a 32-byte value R, the outermost layer of a hash onion R = H(H(H(…H(secret)…))) whose innermost value is a 32-byte secret. Each time the validator proposes a block, it reveals a preimage H^-1(R) by unwrapping one layer of the onion. The revealed value is then XORed with the previous beacon chain RANDAO result and serves as local, biasable entropy for the beacon chain, to be built upon by subsequent block proposers. In its slot, each block proposer can choose to either reveal or skip, but skipping also means forfeiting the rewards for proposing a block.
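The commit-reveal flow described above can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function H and a four-layer onion, both chosen only for illustration; the helper names are hypothetical and this is not Serenity's actual implementation.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Hash function H (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(data).digest()

def build_onion(secret: bytes, layers: int) -> list:
    """Return [secret, H(secret), H(H(secret)), ...]; the last item is the commitment R."""
    onion = [secret]
    for _ in range(layers):
        onion.append(h(onion[-1]))
    return onion

def verify_reveal(commitment: bytes, reveal: bytes) -> bool:
    """A reveal is valid if hashing it once yields the current commitment."""
    return h(reveal) == commitment

def mix(previous_randao: bytes, reveal: bytes) -> bytes:
    """XOR the revealed layer into the previous RANDAO value."""
    return bytes(a ^ b for a, b in zip(previous_randao, reveal))

# Hypothetical validator: commits once, then reveals one layer per proposed block.
secret = h(b"hypothetical validator secret")
onion = build_onion(secret, layers=4)
commitment = onion[-1]
randao = bytes(32)  # placeholder for the previous beacon chain RANDAO result

for reveal in reversed(onion[:-1]):
    assert verify_reveal(commitment, reveal)
    randao = mix(randao, reveal)
    commitment = reveal  # the revealed layer becomes the next commitment
```

Because the last proposer can see the current value before deciding whether to reveal or skip, the output stays biasable, which is the weakness the VDF is meant to remove.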
The VDF is used to generate an unbiasable random number: it takes a 32-byte input and produces a 32-byte output that takes time T to compute but is orders of magnitude faster to verify. When the estimated time T is longer than the time required to commit to its input, the function becomes unbiasable. Ethereum is planning to use ASICs specialized in computing the VDF, probably running an algorithm based on Sloth++ with STARKs.
The Serenity sharding implementation assumes that each shard manages separate accounts and is processing different transactions, each building their own blockchain, meaning that it has both state and transaction sharding. It supports a large number of shards (up to 1024) that process different transactions in parallel, and a beacon chain shard summarizing the blocks produced by every shard in the system. The number of shards at a time can vary so the sharding solution is adaptive/dynamic.
The nodes forming the beacon shard are specialized nodes with higher hardware requirements (storage, bandwidth, processing power). Each of them is connected directly to validator nodes from one or multiple shards, and also keeps these nodes synchronized to the latest blocks in the beacon chain. Validators inside shards are connected through the p2p network. This achieves network sharding by limiting communication over the network to interested parties. Beacon nodes will propagate the attested beacon blocks on the canonical chain to all beacon nodes and proposers in the network, while the propagation of a proposed block for shard X will only be done inside shard X.
In Serenity, the timeline is split into epochs, each epoch being further split into slots. In each slot (estimated at 8 seconds) there is a different proposer. It is not clear yet how many slots will be in an epoch, but some computations have been done with 64 (Prysmatic’s consideration), 128 or 1024 (Justin) slots, which would translate into epoch durations of approximately 8.5, 17 and 136.5 minutes, respectively.
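The epoch durations follow directly from multiplying the slot count by the 8-second slot time. A short sketch of that arithmetic (the function name is ours, for illustration only):

```python
SLOT_SECONDS = 8  # estimated slot duration quoted in the text

def epoch_minutes(slots_per_epoch: int, slot_seconds: int = SLOT_SECONDS) -> float:
    """Epoch duration in minutes for a given number of slots."""
    return slots_per_epoch * slot_seconds / 60

for slots in (64, 128, 1024):
    print(f"{slots} slots -> {epoch_minutes(slots):.1f} minutes")
# 64 slots -> 8.5 minutes
# 128 slots -> 17.1 minutes
# 1024 slots -> 136.5 minutes
```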
In a sequence of slots, the ordering of proposers and the members of attester committees for each shard are known minutes before the assignment and come from the beacon chain. They are computed using a reshuffling algorithm based on the beacon chain randomness source. Moreover, the same attester may be allocated to multiple shards at the same time, depending on the amount of stake, which increases the hardware requirements for the attester as it will have to synchronize and process more data.
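To make the idea of a seed-driven assignment concrete, here is a simplified sketch. It is not the Serenity shuffling algorithm: it uses Python's `random.Random` seeded from a 32-byte value, gives every validator exactly one seat (ignoring stake-based multiple seats), and all names and sizes are hypothetical. The point is only that anyone holding the same seed derives the same committees and proposer order in advance.

```python
import random

def assign_epoch(validators, seed: bytes, shard_count: int, slots_per_epoch: int):
    """Deterministically derive committees and per-slot proposers from a seed."""
    rng = random.Random(int.from_bytes(seed, "big"))
    shuffled = sorted(validators)  # canonical order before shuffling
    rng.shuffle(shuffled)
    # Deal the shuffled validators round-robin into one committee per shard.
    committees = {s: shuffled[s::shard_count] for s in range(shard_count)}
    # Pick one proposer per slot from each shard's committee.
    proposers = {
        s: [rng.choice(committees[s]) for _ in range(slots_per_epoch)]
        for s in range(shard_count)
    }
    return committees, proposers

validators = [f"validator-{i}" for i in range(32)]
seed = bytes.fromhex("ab" * 32)  # stand-in for the beacon chain randomness
committees, proposers = assign_epoch(validators, seed, shard_count=4, slots_per_epoch=64)
```

The determinism is what makes the schedule known ahead of time, which is convenient for synchronization but also what exposes proposers for the whole epoch.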
Trusted or trustless?
The shard validators (proposers, attesters) need to be connected to some beacon node to synchronize, while communication between validator nodes inside a shard is done through a p2p network.
In the case of cross-shard transactions, actions need to be taken in multiple shards. The communication required for this could be managed through the beacon nodes, but this would become a huge overhead for the beacon nodes, which already have many responsibilities. It is also not clear how the beacon nodes are incentivized to operate, as the staking scheme seems to target the validators, so the network could end up with a small number of beacon nodes, each connected to a large number of validators.
The implementation choice of having validators communicate with their beacon nodes directly through RPC implies that validators place some trust in the beacon nodes. All information that is fed to validators can be controlled or censored by the beacon nodes, and if the number of such nodes is small (especially when hardware requirements are high and incentives are insufficient), this could lead to some form of centralization, and possibly to security vulnerabilities which need further analysis.
Communication and storage cost
The communication and storage cost seems to be high on the beacon chain nodes, especially when there is a high number of shards, due to the requirement to synchronize multiple changing validators through direct connections, and (possibly) facilitate cross shard operations. There is also a limited time for synchronization, with only minutes before assignment of attesters, which further increases the requirements for bandwidth in order to keep up with all the communication.
The high availability of the beacon nodes is also a must as there will be multiple attester nodes requiring the data in due time, more so for clients that connect to a single beacon node.
Attesters also have high communication needs in order to validate blocks from possibly multiple shards simultaneously (if they have seats in more than one shard), while required to synchronize the shards’ state in a limited time (in the order of minutes, which may be enough for checking data availability but not enough for full checks, that require more state information).
Depending on the bandwidth of the validators and beacon nodes, the number of shards each beacon node maintains data for, the number of validators the beacon node is connected to and also the number of slots in an epoch, there could be an impact on the network liveness.
Research is ongoing on data availability checks, so that attesters do not need to synchronize all data: they would only check that the data is available and then rely on fraud proofs for validation. This is done mostly to enable light clients, e.g. mobile phones, to act as attesters, but there still needs to be a majority of nodes that do full checks, execute transactions (executors) in each shard and create the fraud proofs for the light clients.
In this case the problem of communication and storage overhead still stands. The optimization only allows other types of devices to act as attesters without heavy hardware requirements, but the security of the network must not rely only on such devices.
How feasible are mobile phones as validators?
Due to all the processing required and especially the availability assumptions and communication cost, not all types of device can fill the role of a beacon node. For attesters, there is active research for checking data availability instead of downloading it, that could enable light clients such as mobile phones to fill this role.
There is still the problem of online availability in the case of mobile phones as light clients, because of the intermittent connection to the internet, drain of battery due to data transfers and processing, etc. The unreliable accessibility could affect the protocol when these devices get selected into attester committees and are not responsive. In this situation the mobile phones don’t seem a good fit for this role and could, instead of helping the protocol with the linear scalability coming from sharding, adversely affect its functionality due to their unpredictable and intermittent availability.
The documentation mentions that attesters and proposers are randomly shuffled in each epoch, according to the randomness source from the beacon chain. Because the shuffling is done only once per epoch, there are more attack vectors (such as DDoS). When the shuffling happens, the communication cost is not trivial: within a limited time, each validator needs to download the required data for every new shard in which it has an attester seat, before it can attest to blocks. This could again prove counterproductive with mobile phones, which may not all have high bandwidth and cannot sustain the load for long, as intensive processing and communication could drain the battery fast.
Leaving aside the responsiveness (high online availability) requirements, if smartphones were to act as attesters, the sheer number of such devices that could be targeted as light clients means we could end up in a situation where the majority of the nodes in the network are light clients. Even though it can be argued that the security light nodes provide is almost at the same level as that of full nodes, they still require the fraud proofs from fully validating nodes. It is not clear enough how the security of a shard or the network can be affected when the number of light nodes is much greater than the number of regular nodes.
As a conclusion, mobile phones could still be used in blockchains but not as validators. They need to fill a different role, where the intermittent availability for the majority of such devices is not an issue, where they can work with lower bandwidth and intensive computations are not required.
There is a high probability of forks appearing in Serenity shards, so a fork choice rule is needed to decide the canonical chain when this happens. Serenity uses the Latest Message Driven (LMD) fork choice rule, where the chain starting from genesis with the largest subset of unique attester votes is selected as canonical. Finalization is then done by Casper FFG in the beacon chain, which adds cross-links for the blocks that already have 2/3 attestations (for them or for their descendants).
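A simplified sketch of a latest-message-driven fork choice follows. It assumes each attester's latest vote counts once, that a vote for a block also supports all of its ancestors, and that ties are broken by block name; the block and attester names are hypothetical, and the real rule has more detail than shown here.

```python
def lmd_head(parents: dict, latest_votes: dict, genesis: str = "genesis") -> str:
    """Walk from genesis, at each fork following the child whose subtree
    is supported by the most attesters (one latest vote per attester)."""
    children = {}
    for block, parent in parents.items():
        children.setdefault(parent, []).append(block)

    def is_ancestor(ancestor, block):
        # True if `ancestor` is `block` itself or lies on its path to genesis.
        while block is not None:
            if block == ancestor:
                return True
            block = parents.get(block)
        return False

    def support(block):
        return sum(1 for voted in latest_votes.values() if is_ancestor(block, voted))

    head = genesis
    while children.get(head):
        # sorted() makes tie-breaking deterministic (lowest name wins).
        head = max(sorted(children[head]), key=support)
    return head

# Hypothetical fork: two branches A and B off genesis.
parents = {"A": "genesis", "B": "genesis", "A1": "A", "B1": "B", "B2": "B"}
latest_votes = {"v1": "A1", "v2": "B1", "v3": "B2", "v4": "A1", "v5": "B", "v6": "B2"}
print(lmd_head(parents, latest_votes))  # B2
```

Here branch B is backed by four attesters against two for branch A, and within B the block B2 has two votes against one for B1.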
How does Elrond compare?
One of the most important considerations when building a state sharding solution for blockchains is the communication cost. It is critical to have it as optimized as possible in order to be truly scalable.
Validators shuffling cost
Regarding the way validators are assigned to shards, Elrond shuffles validators randomly, but in order to optimize communication and storage cost, only up to 1/3 of the validators in each shard are shuffled at the end of each epoch. Because the allocation of nodes to shards is also buffered, there is actually no liveness penalty. Serenity could achieve the same effect of continuous liveness with gradual reshuffling of older validators, if the number of slots per cycle/epoch (time until next reshuffle) is large enough.
Serenity defines the duration of one epoch as a fixed number of slots considered for proposer assignment, each lasting for 8 seconds. There is one proposer assigned per slot in each shard, and the configuration of these slots is known for the entire epoch in advance, so proposers are D-globally predictable, where D is the number of slots in the epoch. The same also holds for the attesters, although the time they are exposed could be shorter than for the block proposers. Elrond randomly samples the validators and the block proposer from the shard nodes every round. This way the time the validators are exposed is limited to one round (5 seconds), so they are only 1-globally predictable, which increases the security of Elrond against bribing and DDoS.
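The partial reshuffle can be sketched as follows. This is an illustration under simplifying assumptions, not Elrond's actual algorithm: the seed stands in for the protocol's randomness, moved nodes are redistributed round-robin (so a node may occasionally land back in its own shard), and the buffering of node allocation is not modeled.

```python
import random

def reshuffle_epoch(shards, seed):
    """Move up to 1/3 of each shard's validators; the rest stay in place."""
    rng = random.Random(seed)
    staying, moving = [], []
    for members in shards:
        members = list(members)
        rng.shuffle(members)
        k = len(members) // 3          # at most one third leaves the shard
        moving.extend(members[:k])
        staying.append(members[k:])
    rng.shuffle(moving)
    # Redistribute the moved validators round-robin across all shards.
    for i, node in enumerate(moving):
        staying[i % len(staying)].append(node)
    return staying

# Hypothetical network: 3 shards with 9 validators each.
shards = [[f"s{s}-n{i}" for i in range(9)] for s in range(3)]
new_shards = reshuffle_epoch(shards, seed=7)
```

Since at least two thirds of every shard already hold that shard's state, only the moved nodes need to synchronize, which is where the communication and storage saving comes from.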
Staking role in consensus group selection
Elrond selects validators for consensus groups based on stake and rating. A higher stake can give a validator more chances of being selected into a consensus group, but never more than one seat at a time. There is also a maximum stake beyond which the chances no longer increase. This requirement could encourage the creation of extra nodes to secure and scale the network. Another reason for it is again to optimize communication and storage, as each validator/machine can only be a validator in one shard at a time. Serenity allows multiple seats per validator, which may cause a validator to be selected as attester in multiple shards at the same time, with penalties on communication and storage (requirements proportional to the number of seats).
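A minimal sketch of per-round, stake-weighted selection with a stake cap and at most one seat per validator is shown below. It is an illustration only: rating is ignored, the round seed stands in for the protocol's randomness, and the stake values and cap are hypothetical.

```python
import random

def select_consensus_group(stakes: dict, group_size: int, round_seed: int, max_stake: int):
    """Weighted sampling without replacement: stake raises the chance of
    selection up to max_stake, but each validator gets at most one seat."""
    rng = random.Random(round_seed)
    pool = {node: min(stake, max_stake) for node, stake in stakes.items()}
    group = []
    for _ in range(min(group_size, len(pool))):
        nodes = sorted(pool)
        weights = [pool[n] for n in nodes]
        chosen = rng.choices(nodes, weights=weights, k=1)[0]
        group.append(chosen)
        del pool[chosen]               # one seat per validator
    return group

# Hypothetical stakes; "c" is capped at max_stake and weighs the same as 5000.
stakes = {"a": 2500, "b": 2500, "c": 10000, "d": 4000, "e": 2500}
group = select_consensus_group(stakes, group_size=3, round_seed=42, max_stake=5000)
```

Because the seed changes every round, the group is only known one round ahead, in contrast to a schedule fixed for a whole epoch.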
Nodes specialization. Is it a good thing?
Some parts of the Serenity sharding model are similar to Elrond’s proposal. In Elrond there is one shard, the metachain, maintaining the notarization of block hashes from all shards, which looks similar to what the cross-links are doing. In Serenity there is currently no clear incentive to run a beacon node, and beacon nodes have much higher requirements, so running one may not be very attractive for users. In Elrond, however, any node that stakes can act as a validator/block proposer for the metachain or shards. This can improve security, as it will always be possible to assign a sufficient number of nodes to the metachain.
While in Serenity there are some trust assumptions and higher hardware requirements for the beacon nodes, as they need to feed the data to their directly connected validator nodes, in Elrond any validator node can fill any role: metachain node or shard node, validator or proposer. Elrond’s p2p network topology holds both intra-shard connections and inter-shard connections, allowing Elrond to run in a minimal trust setup.
The target clients for validators in Elrond are consumer-grade computers. There is also the intention of adding another role that could be filled by smartphones, but this is still under research, so it is left out of this analysis.
Both Serenity and Elrond implement an adaptive solution including transaction, network and state sharding. Serenity has no clear model for ensuring the communication cost is optimized when new shards are formed or removed. This case is the most communication intensive for the network, as a lot of nodes would need to update their information, transfer the state related to the accounts managed by the new shards to the nodes allocated to the new shards, etc.
One approach taken into consideration by Serenity is to wait until the number of registering nodes is enough for doubling the shards, before changing the shard count, in order to minimize such network reorganizations. The drawback of this approach is that, after doubling the shards a number of times, it will require a long time and a large number of nodes to increase (double) the number of shards yet again. In this case in order to include the new nodes into the slot assignment model, it would be required that the number of slots per epoch is not constant, but can vary depending on the number of nodes in the system.
Another option is that the newly registered validators need to wait until the number of shards increases and they can be accommodated. But doing this when a large number of nodes is required to double the shards would mean that nodes lock their stake but are put to wait and cannot earn anything for an indefinite time. This would chase away potential new nodes from the start and detrimentally affect the growth of the network.
With Elrond we have a clear model of how the network reacts to the cases of adding or removing shards. We are using shard splits for adding shards and shard merges for removing shards. The communication cost of splitting one shard is virtually zero, and the cost of the merge can be optimized as well (for more details on how this is done, see our whitepaper). The network reacts fast to the registration of new nodes, which can already be added to one of the available shards and can start processing after one epoch of synchronization.
Moreover, because in Elrond the communication cost of shard splits is zero, there is no reason to delay adding new shards. We can leverage the linear scalability sharding can provide, so shard splits can occur much faster than in Serenity, whenever a new shard can be created. For example, if we need at least N nodes per shard, then whenever we have X*N*(1 + 1/10) nodes we could already form X shards through splitting, and if we go below X*N nodes we do a shard merge.
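The split and merge thresholds from the example above can be written as a small decision function. This is one sketch of our reading of that rule, with a hypothetical minimum of 400 nodes per shard; integer arithmetic is used so the 10% margin is compared exactly.

```python
def next_shard_count(nodes: int, shards: int, n_min: int) -> int:
    """Decide the shard count for the next epoch.

    Merge when the network drops below shards * n_min nodes; split when it
    reaches (shards + 1) * n_min * (1 + 1/10) nodes; otherwise keep the count.
    """
    if shards > 1 and nodes < shards * n_min:
        return shards - 1
    if nodes * 10 >= (shards + 1) * n_min * 11:
        return shards + 1
    return shards

N_MIN = 400  # hypothetical minimum number of nodes per shard

print(next_shard_count(1319, 2, N_MIN))  # 2: just below the split threshold of 1320
print(next_shard_count(1320, 2, N_MIN))  # 3: enough nodes for three shards plus margin
print(next_shard_count(1200, 3, N_MIN))  # 3: still at the minimum, no merge
print(next_shard_count(1199, 3, N_MIN))  # 2: below 3 * 400, merge
```

The 10% margin between the split and merge thresholds acts as hysteresis, so a few nodes joining or leaving around the boundary do not trigger repeated splits and merges.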
The same shard split and merge model Elrond uses could as well be used in Serenity to improve how fast the network scales. Even if the number of slots is not dynamically adjusted, the waiting time for registering nodes can be considerably decreased.
Serenity is still using the PoW chain, where the registration smart contract resides, meaning that even though it also implements PoS chains, as a whole it is still wasting energy for the time being. Elrond is fully PoS and therefore energy efficient.
In Serenity forks are likely to happen in the shard chains, while in the beacon chain the blocks are finalized with a high probability (for finality there is still a dependency on the PoW blocks referenced by the beacon blocks). Forking in the shard chains could add more complexity for cross shard transactions, which are the majority in a sharded system.
Elrond uses a pBFT-like consensus in conjunction with a Schnorr multi-signature scheme. The pBFT is run on a randomly sampled consensus group smaller than the shard. The shard size is chosen sufficiently large to provide both reasonably fast processing and a probabilistically infeasible (extremely long) expected time to critical failure, i.e. a consensus group in which the ratio of malicious nodes is above 2/3, which is also the only forking condition in the system.
Our system is able to recover naturally from consensus stalls, and even though a critical failure is extremely improbable, there is also a mechanism in place that allows the system to recover from it, with a bit of overhead.
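The probability of such a critical failure in a single round can be estimated with a hypergeometric tail: the chance that a group sampled without replacement from the shard contains more than 2/3 malicious nodes. The sketch below assumes uniform sampling (stake and rating weights are ignored), and the shard size, group size and malicious count in the example are hypothetical.

```python
from math import comb

def critical_failure_probability(shard_size: int, malicious: int, group_size: int) -> float:
    """P(more than 2/3 of a uniformly sampled consensus group is malicious)."""
    threshold = (2 * group_size) // 3 + 1      # smallest count strictly above 2/3
    total = comb(shard_size, group_size)
    bad = sum(
        comb(malicious, k) * comb(shard_size - malicious, group_size - k)
        for k in range(threshold, min(group_size, malicious) + 1)
    )
    return bad / total

# Hypothetical shard: 400 nodes, 25% malicious, consensus group of 63.
p = critical_failure_probability(shard_size=400, malicious=100, group_size=63)
print(f"per-round failure probability: {p:.3e}")
```

Dividing the round duration by this probability gives a rough expected time to critical failure, which is the quantity the shard and group sizes are tuned against.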
This article brings to light the importance of reducing communication costs when designing a viable scalable solution and contests trade-offs which could eventually lead to centralization issues.
We have outlined a brief analysis of Ethereum 2.0’s proposal and have discussed several issues that could have serious subsequent effects if not addressed properly. We’ve contrasted this with Elrond’s overall approach and have made several suggestions that could improve the Serenity proposal.