Eth2 Mainnet Incident Retrospective

Written by the Prysmatic Labs Team

Raul Jordan
Apr 28
https://beaconcha.in/epoch/32302

Incident Summary

At epoch 32302, the beacon chain started missing a large number of block proposals. Prysm was the likely suspect, since Prysm nodes make up a large portion of Eth2 operators. After a short while, we were able to reproduce the error locally, and the problem pointed to a known issue regarding eth1 data voting and validator deposits. While this issue had been reported to us in the past, we could not reproduce the bug at the time and considered it an isolated incident. The issue had never been widespread on any testnet or on mainnet. We believe this was the first time it caused a failure in block proposals.

Impact

Preliminary data indicates that during the first incident, the average loss was 122950 gwei per affected validator ($0.30 USD at today’s prices). When the incident recurred within 24 hours of the first, the estimated losses were around $0.22 per affected validator.
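
As a rough check on the first figure, the sketch below converts the quoted gwei loss to ETH and back-calculates the ETH price that the quoted dollar amount implies. The only inputs are the two numbers quoted in the text and the fact that 1 ETH is 10^9 gwei. The function name is ours, and the implied price is an inference from those two numbers, not a figure from the incident data.

```go
package main

import "fmt"

// 1 ETH = 1,000,000,000 gwei.
const gweiPerEth = 1e9

// gweiToEth converts an amount in gwei to ETH.
func gweiToEth(gwei uint64) float64 {
	return float64(gwei) / gweiPerEth
}

func main() {
	const avgLossGwei = 122950 // average loss per affected validator, as quoted in the text
	const quotedUSD = 0.30     // dollar value quoted in the text

	eth := gweiToEth(avgLossGwei)
	fmt.Printf("%.8f ETH lost per affected validator\n", eth)
	// Back-calculated, so treat it as an estimate of the price used, nothing more.
	fmt.Printf("implied price: about %.0f USD per ETH\n", quotedUSD/eth)
}
```

In other words, 122950 gwei is 0.00012295 ETH, a very small fraction of a validator's 32 ETH deposit.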

  • Beacon chain continued to finalize with no gaps in finality
  • Participation was still high (84.8%, at the lowest point)
  • Most validators missed 2 or 3 attestations, regardless of client type
  • This was not likely a malicious attack

Clarifying Questions

Should this incident reduce confidence in Eth2?

No. There was no consensus failure, and the impact and scope of the incident were small at the scale of current Eth2 mainnet ($0.30 in USD terms for every affected validator at the end of the first instance of the problem). Eth2 has been extremely robust, with very high validator participation numbers since genesis, and has never skipped finality for any epoch. From our perspective, the network’s ability to return to perfect operation after the incident was resolved instead increases confidence in the resilience of the blockchain.

Should this incident reduce confidence in the Prysmatic Labs team?

The way we reacted to and resolved this incident was completely different from the previous incident in the Eth2 testnet described here. This time, we dismissed misinformation right away, quantified the impact, and enumerated clear steps for stakers to take while awaiting resolution. Moreover, we waited until we were fully certain of a solution before encouraging all stakers to upgrade. It is important to note that because Prysm is the majority software for stakers, any bugs that occur are amplified further.

Summary of the Root Cause

Eth2 is loosely coupled to the ETH1 chain, depending on it only for validator deposit verification. That is, the Eth2 proof-of-stake chain can continue even if validators are voting on junk data. The only thing that will fail is the onboarding of new validator deposits until the chain votes on the correct ETH1 data once again. This “voting” is done in “voting periods”, which are set to periods of 64 epochs on mainnet today (approximately 6.8 hours).
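
The 6.8 hour figure follows directly from the mainnet timing parameters. A minimal sketch of the arithmetic, assuming the standard mainnet values of 12 seconds per slot and 32 slots per epoch (the constant and function names here are ours, not Prysm's):

```go
package main

import "fmt"

// Mainnet timing parameters. The 64-epoch voting period is stated in the
// text; the slot and epoch lengths are the standard mainnet values.
const (
	secondsPerSlot            = 12
	slotsPerEpoch             = 32
	epochsPerEth1VotingPeriod = 64
)

// votingPeriodSeconds returns the length of one eth1 data voting period in seconds.
func votingPeriodSeconds() int {
	return epochsPerEth1VotingPeriod * slotsPerEpoch * secondsPerSlot
}

func main() {
	s := votingPeriodSeconds()
	fmt.Printf("%d slots, %d seconds, %.1f hours\n",
		epochsPerEth1VotingPeriod*slotsPerEpoch, s, float64(s)/3600)
}
```

One voting period therefore spans 2048 slots, which is why a bad vote can affect proposals for several hours before the period resets.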

Event Timeline

Warning, technical details ahead! Feel free to skip to the next section to read about the resolution and lessons learned.

Block proposals failing

Epoch 32302 starts having issues with missed proposals.

Investigation showing Prysm voting on weird, corrupt-looking eth1 deposit tree root

We noticed Prysm nodes were voting on an unexpected deposit tree root. This root is used to verify the integrity of deposits against the validator deposit contract, which lives on the Ethereum proof of work chain. The initial block proposer’s historical information from public explorers (not mentioned to safeguard identity) led us to infer this was not an attempted attack.

Process of elimination

Initial suspicion centered on how Prysm handles eth1 data voting in the validator’s block proposal code path. In particular, there were a few possibilities we tried to eliminate:

  1. Is our fetching of deposit log information and eth1 info messed up or non-deterministic?
  2. Perhaps something is wrong with our deposit tree?

Plausible root cause

From a previous incident in the Eth2 testnet, we learned that merely having confidence in a root cause is not enough. We need to have 100% confidence before announcing a resolution to our users when the stakes are high. At the 28 hour mark, we sat down and asked: “What is it that we still do not know? What questions can we ask to get us closer to a root cause?” It turns out we knew that:

  1. Our code paths for retrieving ETH1 data from an ETH1 node were neither flawed nor returning improper data

What we still did not know:

  1. Why the issue was reproducible on some nodes but not others
  2. Why Prysm nodes had an “off-by-one” error in determining the number of deposits in blocks

Fixing the Problem

To answer these questions, we looked at the code path that initializes our deposit tree. A caching layer had been added early on so that stakers would not have to download all validator deposit logs every time they start their node. More recently, we also added the ability to start Prysm from a genesis state embedded in the client itself. When filling up the cache, a bad assumption about our deposit tree caused corruption of information:

The culprit

Resolution

Root cause summary

  • Prysm persists eth1data on disk to prevent users having to request the validator deposit contract logs every time their process restarts
  • If a node restarts and has eth1data on disk, we initialize our deposit cache from this data, but due to a discrepancy between how our sparse merkle tree helper package works, and the code path for initializing this cache from data on disk, we would skip inserting the 0th deposit into the tree, leading to invalid deposit tree roots. This code path only affects nodes that have not had their database since genesis, and has since been fixed.
  • Prysm nodes implement the eth1data voting algorithm from the official specification, which “goes with the majority”. However, Prysm was not fully implementing some validity conditions for that algorithm. Prysm nodes voted with the majority eth1data vote that referenced an existing block root, which could lead to Prysm nodes agreeing on a deposit tree hash produced by a node with a corrupt deposit tree, as deposits were not verified.
  • As a large portion of nodes in the network are Prysm nodes, the snowball effect of voting with the majority on a corrupt deposit root grew into a critical problem as Prysm nodes were then unable to produce blocks for a period of time on mainnet
  • Once the eth1data voting period resets, Prysm nodes go back to proposing blocks correctly, until the bug is encountered again in a future period.
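
The two failure modes in the list above can be illustrated with a small, self-contained sketch. This is not Prysm's code: `merkleRoot` is a deliberately simplified binary Merkle tree rather than Prysm's sparse merkle tree helper or the deposit contract's tree, and `vote`, `pickVote`, and the `checkRoot` flag are hypothetical names standing in for the missing validity condition. It only aims to show that dropping the 0th leaf yields a different root, and that a majority rule which checks only the block hash can adopt that root.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot computes a simple binary Merkle root over the leaves,
// duplicating the last node on odd-sized levels (simplified stand-in).
func merkleRoot(leaves [][]byte) [32]byte {
	if len(leaves) == 0 {
		return [32]byte{}
	}
	level := make([][32]byte, len(leaves))
	for i, l := range leaves {
		level[i] = sha256.Sum256(l)
	}
	for len(level) > 1 {
		if len(level)%2 == 1 {
			level = append(level, level[len(level)-1])
		}
		next := make([][32]byte, 0, len(level)/2)
		for i := 0; i < len(level); i += 2 {
			var pair [64]byte
			copy(pair[:32], level[i][:])
			copy(pair[32:], level[i+1][:])
			next = append(next, sha256.Sum256(pair[:]))
		}
		level = next
	}
	return level[0]
}

// vote is a hypothetical, reduced form of an eth1data vote.
type vote struct {
	blockHash   string
	depositRoot [32]byte
}

// pickVote returns the most frequent vote among those whose block hash is
// known locally. When checkRoot is true, a vote must also match the deposit
// root this node computed itself for that block.
func pickVote(votes []vote, local map[string][32]byte, checkRoot bool) (vote, bool) {
	counts := make(map[vote]int)
	var best vote
	found := false
	for _, v := range votes {
		root, known := local[v.blockHash]
		if !known || (checkRoot && root != v.depositRoot) {
			continue
		}
		counts[v]++
		if !found || counts[v] > counts[best] {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	deposits := [][]byte{[]byte("deposit-0"), []byte("deposit-1"), []byte("deposit-2")}
	good := merkleRoot(deposits)
	bad := merkleRoot(deposits[1:]) // restore path that skips the 0th deposit
	fmt.Println("roots equal:", good == bad)

	local := map[string][32]byte{"0xabc": good}
	votes := []vote{{"0xabc", bad}, {"0xabc", bad}, {"0xabc", good}}
	lenient, _ := pickVote(votes, local, false)
	strict, _ := pickVote(votes, local, true)
	fmt.Println("lenient follows corrupt root:", lenient.depositRoot == bad)
	fmt.Println("strict keeps local root:", strict.depositRoot == good)
}
```

With the root check disabled, two nodes carrying the corrupt tree are enough to outvote one healthy node in this toy example. With the check enabled, their votes are discarded before counting.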

Solution

At 05:00 UTC on Sunday the 25th, we released a fix to the problem after many grueling hours of uncertainty. We have total certainty in the resolution and are confident the problem will not arise again in Eth2 once nodes have upgraded.

Lessons Learned

Confidence in our resolution and careful external communication was critical during the incident

When we suffered the Medalla testnet incident for Eth2, we learned a serious lesson on the value of good communication. Every public comment and the precision of the language used can have serious effects on outcomes of incidents. In the testnet problem, we believed an immediate resolution was to tell everyone via our public channels to “restart your nodes”. This rash judgment led to a majority of the network going offline and then scrambling to find good peers in a sea of bad ones to synchronize the chain. Moreover, we were quick to issue a software upgrade hotfix without having 100% confidence it would resolve the problem. This led to more chaos in the system, and led to concerns from node operators about a resolution.

Patience and calm allowed for an expedient resolution

Our team has learned many lessons over the last few years of building Eth2 about remaining calm in the face of adversity. We believe that remaining calm, communicating status reports frequently, and ensuring the team feels supported and receives positive feedback during the resolution process are critical. If we take our time to gather as much evidence as possible and work carefully with our users, we will succeed at resolving the problem. More importantly, we took time at the start of the incident to quantify the impact, to ease concerns from stakers and limit misinformation. This lesson is incredibly important when operating in high stress situations with little sleep. Take your time, fix it properly, and avoid making the problem worse at all costs.

Eth2 Testnets do not represent mainnet scenarios

At Prysm, we conduct extensive testing and monitoring of Prysm pre-production release candidates in public Eth2 testnets. The Prater and Pyrmont testnets are a great way for users to test their setup before joining Eth2 mainnet, and they let client teams test their clients at a larger scale than mainnet. However, these testnets assume a near perfect split between 4 production Eth2 clients, such that no client has an obvious majority share of the validators. Unfortunately, this may not shine light on bugs that only manifest in a single client, nor does it represent a scenario in which one particular client holds a majority. Going forward, Prysmatic Labs will be operating an internal test network that more closely resembles a mainnet environment, or at least an environment where Prysm represents more than 50% of the network.

Takeaways for Stakers

Why Stake With Prysm

https://launchpad.ethereum.org
  • We are not only going to improve this experience, but double down on making Prysm many times better than it is today and even easier for stakers to use, including a revamped web interface
  • Prysm is going to be doubling down on R&D efforts, providing pivotal features and improvements ahead of the eth1 <> eth2 merge
  • We believe healthy competition is a strong incentive to keep making ETH proof of stake more accessible and therefore more secure, as all client teams keep improving their software
  • Our team is committed to the highest professional standards required to resolve and tend to issues stakers might encounter. We believe we are prepared to handle anything that comes our way and reassure our community that we treat our stakers’ experience as highest priority
  • Finally, there are many important features in our pipeline that we believe will make Prysm an even more attractive software to use when participating in Eth2, and we will never stop iterating on this goal
  • Prysm has some very advanced optimizations for validator profitability that are not yet enabled by default for all stakers. We are confident Prysm stakers will see top-tier profitability upon the release of these features to all

Revisiting the client diversity conversation

A common theme we have heard since the start of Eth2 is the concept of client diversity. Eth2 is a distributed system with many people around the world participating as validators. Different people use different software to participate in the consensus of the blockchain, and if one particular piece of software has a serious problem, the impact is less severe when there is a fair distribution of client implementations running the network.

https://github.com/leobago/BSC-ETH2/tree/master/armiarma
