Eth2 Mainnet Incident Retrospective
At epoch 32302, the beacon chain started missing a large number of block proposals. Prysm was the likely suspect, since Prysm nodes represent a large portion of Eth2 operators. After a short while, we were able to reproduce the error locally, and the problem pointed to a known issue regarding eth1 data voting and validator deposits. While this issue had been reported to us in the past, we could not reproduce the bug at the time and considered it an isolated incident. The issue had never been widespread on any testnet or on mainnet. We believe this was the first time it caused a failure in block proposals.
During a period of 18 epochs, nearly all Prysm Beacon Nodes were unable to produce new blocks. Epoch 32320 started behaving normally and the incident was considered to be over at that time. Around 24 hours later, the incident manifested itself again with similar impact.
A formal post-mortem report on the incident is published here as a companion to this post. This retrospective details the event timeline, root cause analysis, and takeaways for stakers and participants in Eth2.
Preliminary data indicates that during the first incident, the average loss was 122950 gwei per affected validator ($0.30 USD at today’s prices). During the second occurrence, within 24 hours of the first, estimated losses were around $0.22 per affected validator.
- No validators were slashed
- Beacon chain continued to finalize with no gaps in finality
- Participation was still high (84.8%, at the lowest point)
- Most validators missed 2 or 3 attestations, regardless of client type
- This was not likely a malicious attack
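As a rough sanity check on the per-validator loss quoted above, the sketch below converts gwei to ETH and derives the ETH price implied by the $0.30 figure. The function names are illustrative, and the implied price comes only from the two numbers in this post, not from market data.

```python
GWEI_PER_ETH = 10**9  # 1 ETH = 1,000,000,000 gwei


def gwei_to_eth(gwei: int) -> float:
    """Convert an amount in gwei to ETH."""
    return gwei / GWEI_PER_ETH


def implied_eth_price(loss_gwei: int, loss_usd: float) -> float:
    """ETH/USD price implied by a loss quoted in both gwei and USD."""
    return loss_usd / gwei_to_eth(loss_gwei)


loss_eth = gwei_to_eth(122_950)            # average loss in the first incident
price = implied_eth_price(122_950, 0.30)   # price implied by the $0.30 figure
print(f"{loss_eth:.8f} ETH per validator, implied price ~${price:,.0f} per ETH")
```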
After an all-hands-on-deck period of approximately 30 hours from the entire team, we diagnosed the root cause and deployed a fix for all Prysm nodes on Sunday morning at 6:00 AM UTC. The incident happened one final time before all nodes had upgraded. After giving node operators enough time to upgrade, the incident has not been observed again, and evidence points to the problem being fully resolved.
Should this incident reduce confidence in Eth2?
No. There was no consensus failure, and the impact and scope of the incident were small at the scale of current Eth2 mainnet ($0.30 in USD terms for every validator at the end of the first instance of the problem). Eth2 has been extremely robust, with very high validator participation numbers since genesis, and has never skipped finality for any epoch. From our perspective, the network’s ability to return to perfect operation after the incident was resolved instead increases confidence in the resilience of the blockchain.
Should this incident reduce confidence in the Prysmatic Labs team?
The way we reacted to and resolved this incident was completely different from the previous incident in the Eth2 testnet described here. This time, we dismissed misinformation right away, quantified impact, and enumerated clear steps for stakers to take while awaiting resolution. Moreover, we waited until we were fully certain of a solution before encouraging all stakers to upgrade. It is important to note that because Prysm is the majority software for stakers, any bugs that occur are amplified further.
A key part of the job of a core developer is to bound complexity. A distributed system such as Eth2 has so many variables, and every team works as hard as possible to control what they can. Bugs are inevitable in software like this, and yes, we made a mistake. It’s unfortunate that Prysm had this bug, but we hope that we demonstrated to all stakers our motivation and ability to resolve it while balancing speed and accuracy.
Summary of the Root Cause
Eth2 is loosely coupled to the ETH1 chain, depending on it only for validator deposit verification. That is, the Eth2 proof-of-stake chain can continue even if validators are voting on junk data. The only thing that will fail is the onboarding of new validator deposits until the chain votes on the correct ETH1 data once again. This “voting” is done in “voting periods”, which are set to periods of 64 epochs on mainnet today (approximately 6.8 hours).
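The length of a voting period can be checked with a few lines of arithmetic. This is a minimal sketch assuming the mainnet configuration values of 12 seconds per slot and 32 slots per epoch, which are not stated in this post; the helper name is illustrative.

```python
SECONDS_PER_SLOT = 12               # assumed mainnet value
SLOTS_PER_EPOCH = 32                # assumed mainnet value
EPOCHS_PER_ETH1_VOTING_PERIOD = 64  # from the text above


def epochs_to_seconds(epochs: int) -> int:
    """Wall-clock duration of a number of epochs, in seconds."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT


voting_period_hours = epochs_to_seconds(EPOCHS_PER_ETH1_VOTING_PERIOD) / 3600
print(f"voting period: {voting_period_hours:.1f} hours")
```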
The way voting works is a simple majority rule, and the Eth2 specification for validators explains how this should work. Unfortunately, Prysm’s implementation of “voting with the majority” was missing some validation. In this incident, a bug within Prysm led a block proposer to create a completely invalid ETH1 deposit tree root, and other Prysm nodes were the first to see it. They then voted on it, as Prysm was following a simple “voting with the majority” rule without explicit validation.
The effect of all Prysm nodes then “snowballing” into voting on the invalid information left block proposers unable to include blocks with deposits in the chain, as the proposal would fail because the deposits did not verify against those nodes’ idea of the ETH1 deposit tree root. The incident resolved itself at the end of a voting period, but would recur if left unaddressed.
The actual root cause of the “corrupted” ETH1 deposit tree root was a bug in the cache initialization for deposits, affecting only a subset of beacon nodes using Prysm. This led those nodes to produce bad deposit tree roots, which other Prysm nodes then voted on, causing the incident.
Warning, technical details ahead! Feel free to skip to the next section to read about the resolution and lessons learned.
Block proposals failing
Epoch 32302 starts having issues with missed proposals.
Nishant notifies the team and an all hands meeting is called. We are then able to reproduce via a local, mainnet beacon node and begin our investigation.
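To place the epoch numbers on a wall clock, the sketch below converts an epoch to a UTC timestamp. It assumes a mainnet genesis time of 1606824023 (2020-12-01 12:00:23 UTC) and 384 seconds per epoch; neither value appears in this post, so treat both as assumptions to verify against the chain configuration.

```python
from datetime import datetime, timezone

MAINNET_GENESIS_TIME = 1_606_824_023  # assumed: 2020-12-01 12:00:23 UTC
SECONDS_PER_EPOCH = 32 * 12           # assumed: 32 slots of 12 seconds each


def epoch_start_utc(epoch: int) -> datetime:
    """UTC time at which the given epoch begins."""
    timestamp = MAINNET_GENESIS_TIME + epoch * SECONDS_PER_EPOCH
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# First affected epoch and the epoch at which proposals recovered.
for epoch in (32302, 32320):
    print(epoch, epoch_start_utc(epoch).isoformat())

incident_minutes = (32320 - 32302) * SECONDS_PER_EPOCH / 60
print(f"first incident lasted about {incident_minutes:.0f} minutes")
```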
Investigation showing Prysm voting on weird, corrupt-looking eth1 deposit tree root
We noticed Prysm nodes were voting on a weird deposit tree root, which is used to verify the integrity of deposits with respect to the validator deposit contract which lives in the Ethereum proof of work chain. The initial block proposer’s historical information from public explorers (not mentioned to safeguard identity) led us to infer this was not an attempted attack.
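To illustrate why a wrong deposit tree root breaks deposit verification, here is a minimal sketch of Merkle branch verification in the style of the Eth2 specification. The toy tree below has depth 2 and made-up leaves; the real deposit tree is much deeper and also mixes in the deposit count, so this is an illustration rather than Prysm’s verification code.

```python
from hashlib import sha256


def hash_pair(left: bytes, right: bytes) -> bytes:
    """Hash two 32-byte nodes into their parent node."""
    return sha256(left + right).digest()


def is_valid_merkle_branch(leaf: bytes, branch: list, depth: int,
                           index: int, root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path, then compare."""
    value = leaf
    for i in range(depth):
        if (index >> i) & 1:          # current node is a right child
            value = hash_pair(branch[i], value)
        else:                         # current node is a left child
            value = hash_pair(value, branch[i])
    return value == root


# Toy depth-2 tree with four made-up leaves.
leaves = [sha256(bytes([i])).digest() for i in range(4)]
n01 = hash_pair(leaves[0], leaves[1])
n23 = hash_pair(leaves[2], leaves[3])
root = hash_pair(n01, n23)

branch_for_leaf_2 = [leaves[3], n01]  # siblings from the leaf up to the root
corrupt_root = sha256(b"corrupt").digest()

print(is_valid_merkle_branch(leaves[2], branch_for_leaf_2, 2, 2, root))
print(is_valid_merkle_branch(leaves[2], branch_for_leaf_2, 2, 2, corrupt_root))
```

The same proof that verifies against the honest root fails against any other root, which is why a node that has adopted a corrupt root cannot include otherwise valid deposits.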
Process of elimination
Initial suspicions came around how Prysm handles eth1 data voting in the validator proposal’s code path. In particular, there were a few questions we tried to eliminate:
- Are we having an issue with packing deposits into blocks?
- Is our fetching of deposit log information and eth1 info messed up or non-deterministic?
- Perhaps something is wrong with our deposit tree?
Over the next 16 hours or so, we would spend a lot of combined effort diagnosing potential flaws. We combed over lines of code, attempted to reproduce failures via unit tests, and tried a myriad of approaches. Even when we had a potential solution, we were nervous about releasing it due to a lack of confidence.
Plausible root cause
From a previous incident we saw in the Eth2 testnet, we learned that having some confidence in a root cause is not enough. We need to have 100% confidence before announcing a resolution to our users when the stakes are high. At the 28 hour mark, we sat down and asked: “What is it that we still do not know? What questions can we ask to get us closer to a root cause?” It turns out we knew that:
- Our sparse merkle tree implementation does not have a serious bug, as it matches Lighthouse and Protolambda’s zrnt implementation of Eth2 using deposits from mainnet and the Prater testnet
- Our code paths for retrieving ETH1 data from an ETH1 node were not flawed and were not returning improper data
What we didn’t know was:
- How the “corrupted” deposit tree root came about
- Why the issue is reproducible in some nodes vs. others
- Why Prysm nodes had an “off-by-one” error in determining the number of deposits in blocks
Fixing the Problem
To answer these questions, we looked at the code path that initializes our deposit tree. It turns out that a caching layer was added early on so that stakers would not have to download all validator deposit logs every time they start their node. Moreover, we recently added the ability to start Prysm from a genesis state embedded in the client itself. When filling up the cache, a bad assumption about our deposit tree caused corruption of information.
It turns out that if our deposit tree is empty, len(items) will always return 1. This means that we were setting our lastReceivedMerkleIndex value to 0 when we should actually have been setting it to -1. As a result, some Prysm nodes in that code path would skip inserting the 0th deposit into the tree. The rest of our codebase accounts for this quirk of our deposit tree implementation, but this code path did not.
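The off-by-one can be illustrated with a small, self-contained sketch. Everything here is hypothetical: ToyDepositTree, replay, and the two index helpers are stand-ins written for this post’s explanation and are not Prysm’s actual code. The only behavior taken from the description above is that an empty tree still reports one item.

```python
ZERO_LEAF = b"\x00" * 32


class ToyDepositTree:
    """Hypothetical stand-in: an empty tree still holds one placeholder leaf."""

    def __init__(self) -> None:
        self.items = [ZERO_LEAF]
        self.empty = True

    def insert(self, leaf: bytes) -> None:
        if self.empty:
            self.items[0] = leaf      # overwrite the placeholder
            self.empty = False
        else:
            self.items.append(leaf)


def last_index_buggy(tree: ToyDepositTree) -> int:
    # Returns 0 for an empty tree, as if deposit 0 had already been received.
    return len(tree.items) - 1


def last_index_fixed(tree: ToyDepositTree) -> int:
    # An empty tree has received nothing, so the last index is -1.
    return -1 if tree.empty else len(tree.items) - 1


def replay(tree: ToyDepositTree, deposits: list, last_received: int) -> list:
    """Insert only deposits newer than last_received; return their indices."""
    inserted = []
    for index, leaf in enumerate(deposits):
        if index > last_received:
            tree.insert(leaf)
            inserted.append(index)
    return inserted


deposits = [bytes([i + 1]) * 32 for i in range(3)]  # made-up leaves
buggy = replay(ToyDepositTree(), deposits, last_index_buggy(ToyDepositTree()))
fixed = replay(ToyDepositTree(), deposits, last_index_fixed(ToyDepositTree()))
print("buggy inserts:", buggy)
print("fixed inserts:", fixed)
```

With the buggy index, the replay starts at deposit 1 and the 0th deposit never reaches the tree.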
To test this hypothesis, we tried to replicate the code path as closely as possible using the test fixtures provided to us by Protolambda. We had a hunch that we were skipping insertion of the 0th deposit into the tree. Sure enough, we were able to find the corrupt deposit tree root that started the whole incident in a reproducible test! We then added conditions around that code path to prevent the condition from arising again, and were finally ready to ship with certainty.
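The effect of a skipped 0th deposit on the root can be shown with a naive deposit-root computation. This is a sketch, not Prysm’s trie package or the fixtures mentioned above: it assumes a depth-32 SHA-256 tree padded with zero subtrees and a final mix-in of the deposit count, and it uses made-up leaves instead of real deposit data.

```python
from hashlib import sha256

TREE_DEPTH = 32  # assumed depth of the deposit tree


def hash_pair(left: bytes, right: bytes) -> bytes:
    return sha256(left + right).digest()


# ZERO_HASHES[h] is the root of an all-zero subtree of height h.
ZERO_HASHES = [b"\x00" * 32]
for _ in range(TREE_DEPTH):
    ZERO_HASHES.append(hash_pair(ZERO_HASHES[-1], ZERO_HASHES[-1]))


def deposit_root(leaves: list) -> bytes:
    """Root of the padded tree, mixed with the little-endian leaf count."""
    layer = list(leaves)
    for height in range(TREE_DEPTH):
        if len(layer) % 2 == 1:
            layer.append(ZERO_HASHES[height])  # pad with an empty subtree
        layer = [hash_pair(layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
    root = layer[0] if layer else ZERO_HASHES[TREE_DEPTH]
    return hash_pair(root, len(leaves).to_bytes(32, "little"))


# Made-up leaves standing in for deposit data roots.
deposits = [sha256(f"deposit-{i}".encode()).digest() for i in range(4)]

correct_root = deposit_root(deposits)
corrupt_root = deposit_root(deposits[1:])  # 0th deposit skipped

print("roots match:", correct_root == corrupt_root)
```

Dropping a single leaf shifts every later leaf and changes the count, so the resulting root has nothing in common with the one honest nodes compute.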
Root cause summary
- Prysm persists eth1data on disk to prevent users having to request the validator deposit contract logs every time their process restarts
- If a node restarts and has eth1data on disk, we initialize our deposit cache from this data. Due to a discrepancy between how our sparse merkle tree helper package works and the code path for initializing this cache from data on disk, we would skip inserting the 0th deposit into the tree, leading to invalid deposit tree roots. This code path only affects nodes that have not had their database since genesis, and has since been fixed.
- Prysm nodes implement the eth1data voting algorithm from the official specification, which “goes with the majority”. However, Prysm was not fully implementing some validity conditions for that algorithm. Prysm nodes voted with the majority eth1data vote that references an existing block root, which could lead to Prysm nodes agreeing on a deposit tree hash produced by a node with a corrupt deposit tree, as deposits were not verified.
- As a large portion of nodes in the network are Prysm nodes, the snowball effect of voting with the majority on a corrupt deposit root grew into a critical problem as Prysm nodes were then unable to produce blocks for a period of time on mainnet
- Once the eth1data voting period resets, Prysm nodes go back to proposing blocks correctly, until the bug is encountered again in a future period.
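The difference between following the majority blindly and following it with validation can be sketched as below. This is a simplified illustration loosely modeled on the specification’s eth1 voting rule, not Prysm’s implementation: the known_good set is a hypothetical stand-in for eth1 data the node has derived itself, and the unchecked variant omits the block root check that Prysm did perform.

```python
from collections import Counter
from typing import NamedTuple


class Eth1Data(NamedTuple):
    deposit_root: bytes
    deposit_count: int
    block_hash: bytes


def majority_vote_unchecked(votes: list):
    """Follow the most common vote without validating it."""
    return Counter(votes).most_common(1)[0][0] if votes else None


def majority_vote_validated(votes: list, known_good: set, default: Eth1Data):
    """Count only votes matching eth1 data this node derived itself."""
    valid = [vote for vote in votes if vote in known_good]
    if not valid:
        return default
    counts = Counter(valid)
    # Most frequent valid vote wins; ties go to the earliest vote seen.
    return max(valid, key=lambda vote: (counts[vote], -valid.index(vote)))


good = Eth1Data(b"\x11" * 32, 100, b"\xaa" * 32)
corrupt = Eth1Data(b"\xff" * 32, 100, b"\xaa" * 32)  # same block, bad root
votes = [corrupt, corrupt, corrupt, good, good]

print(majority_vote_unchecked(votes) == corrupt)
print(majority_vote_validated(votes, {good}, default=good) == good)
```

In the validated variant a corrupt root can never win, however many nodes vote for it, because it is filtered out before counting.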
At 05:00 UTC on Sunday the 25th, we released a fix to the problem after many grueling hours of uncertainty. We have total certainty in the resolution and are confident the problem will never arise again in Eth2 after nodes have upgraded.
Confidence in our resolution and careful external communication was critical during the incident
When we suffered the Medalla testnet incident for Eth2, we learned a serious lesson on the value of good communication. Every public comment and the precision of the language used can have serious effects on outcomes of incidents. In the testnet problem, we believed an immediate resolution was to tell everyone via our public channels to “restart your nodes”. This rash judgment led to a majority of the network going offline and then scrambling to find good peers in a sea of bad ones to synchronize the chain. Moreover, we were quick to issue a software upgrade hotfix without having 100% confidence it would resolve the problem. This led to more chaos in the system, and led to concerns from node operators about a resolution.
In contrast, throughout the entirety of this new incident on mainnet, we were deliberate and precise with communication. Additionally, we did not issue a hotfix until we had 100% confidence in the root cause and a solution to the problem.
Patience and calm allowed for an expedient resolution
Our team has learned many lessons over the last few years of building Eth2 about remaining calm in the face of adversity. We believe remaining calm, communicating status reports frequently, and ensuring the team feels supported and receives positive feedback during the resolution process are critical. We believed that if we took our time to gather as much evidence as possible and worked carefully with our users, we would succeed at resolving the problem. More importantly, we took time at the start of the incident to quantify the impact, to ease concerns from stakers and limit misinformation. This lesson is incredibly important when operating in high stress situations with little sleep. Take your time, fix it properly, and avoid making the problem worse at all costs.
Eth2 Testnets do not represent mainnet scenarios
At Prysm, we conduct extensive testing and monitoring of Prysm pre-production release candidates in public Eth2 testnets. The Prater and Pyrmont testnets are a great way for users to test their setup before joining the mainnet Eth2, and client teams are able to test their clients at a larger scale than mainnet. However, these testnets assume a near perfect split between 4 production Eth2 clients such that no client has an obvious majority share of the validators. Unfortunately, this may not shed light on bugs that only manifest in a single client, nor represent a scenario in which one particular client is the majority client. Going forward, Prysmatic Labs will be operating an internal test network that more closely resembles a mainnet environment, or at least an environment where Prysm represents more than 50% of the network.
Furthermore, we recommend that other client teams add such an environment to their own internal testing, so they can understand potential issues in their own client in the event that it becomes the majority client.
Takeaways for Stakers
Why Stake With Prysm
People choose to run Prysm because from day one, our team has focused on making the experience of participating in ETH staking easier for them. Time and time again, we have spoken to our users, and the reason many choose a client is not because of micro-optimizations nor relatively small differences in profitability to other software, but because we have made their experiences easy, well-documented, and provided crucial help to all our community members along the way. Eth2 is scary for newcomers, and staking is full of uncertainty and risks. As a team, our mission is to have users know we are there for them, and know they will get support no matter how small their questions are. In particular, we have been focused on the average staker that may not be technical enough with the command line, and may not know what a UNIX operating system is.
Moving forward, here is what you can expect from our team:
- Improve correctness around implementation of specification conditions, ensuring assumptions and validity conditions are always accounted for and questioned before any code is written
- We are not only going to improve this experience, but double down on making Prysm many times better than it is today and even easier for stakers to participate with, including a revamped web interface
- Prysm is going to be doubling down on R&D efforts, providing pivotal features and improvements ahead of the eth1 <> eth2 merge
- We believe healthy competition is a strong incentive to keep making ETH proof of stake more accessible and therefore more secure, as all client teams keep improving their software
- Our team is committed to the highest professional standards required to resolve and tend to issues stakers might encounter. We believe we are prepared to handle anything that comes our way and reassure our community that we treat our stakers’ experience as highest priority
- Finally, there are many important features in our pipeline that we believe will make Prysm even more attractive software to use when participating in Eth2, and we will never stop iterating on this goal
- Prysm has some very advanced optimizations for validator profitability that are not yet enabled by default for all stakers. We are confident Prysm stakers will see top-tier profitability upon the release of these features to all
Revisiting the client diversity conversation
A common theme we have heard since the start of Eth2 is the concept of client diversity. Eth2 is a distributed system with many people around the world participating as validators. Different people use different software to participate in the consensus of the blockchain, and if one particular piece of software has a serious problem, the impact will be less severe if there is a fair distribution of client implementations running the network.
A data analysis result from Leonardo Bautista-Gomez back in January showed Prysm nodes made up around 65% of the network, and this incident showed that Prysm validators still comprise a majority today.
We recommend you objectively look at what each client offers (its software, its community, and its resilience) and decide which software, and which team behind it, is best suited to your needs. If a certain Eth2 client is lacking something that is important to you, and that is what is preventing you from using it, then we highly recommend filing a feature request. At Prysmatic Labs, we will keep focusing on helping you as a participant in Ethereum and pushing the boundaries of what’s possible for blockchain software.
Join us on Discord if you want to chat or have any questions about this post.
- Communication on the incident https://www.reddit.com/r/ethstaker/comments/mxpz57/regarding_the_recent_beacon_chain_incident/
- Post-mortem report https://docs.google.com/document/d/1nJr6_bd-UnLBxvhT8lcRYdAZr69QdVQ3zJNUr3LgW-0/edit?usp=sharing
- Medalla testnet incident https://medium.com/prysmatic-labs/eth2-medalla-testnet-incident-f7fbc3cc934a