The eth2 public testnet, Medalla, spiraled into a series of cascading failures this past weekend, exposing several vulnerabilities and process faults in how best to handle critical scenarios. It started when 6 different time servers returned bad responses, throwing off most nodes running our Prysm client at the same time, and our team rushed to push a fix. That fix contained a critical flaw which removed all the features necessary for our nodes to function. The result was a series of network partitions: with everyone synchronizing the chain at the same time, nodes were unable to find a healthy peer. Medalla had a very eventful weekend and offered us the greatest learning experience to prevent this from happening again, especially on mainnet. This post covers a full summary of the incident, its consequences, lessons learned, and concrete action plans moving forward before a mainnet launch for eth2.
Did the eth2 testnet fail because of relying on cloudflare.com’s time servers? Does eth2 need to rely on a single point of failure for timestamping?
No. We were leveraging roughtime cloud servers to give users feedback that their system time might be off, and dynamically adjusting their time based on the responses of these servers as a courtesy; this was not necessary at all and turned out to be problematic. Relying on a single point of failure for something as important as timestamping in eth2 is indeed a security risk, and it is unnecessary. Starting from this incident, we will rely on system time only. If a validator's time is off, we can tell them, but we will not forcefully change it. Other eth2 client implementations only use system time, and we will too.
Is the testnet dead?
No. As long as it is possible to run a node and as long as validators can validate, the testnet can always go back to being fully operational. At the moment, client implementation teams have hardened their chain sync code to allow for smooth running of nodes, which will boost validator participation and allow the chain to finalize again. We still have hope. Participation has now climbed from 0–5% to 40%. The chain needs > 66% to finalize.
How does this affect the mainnet release? Is there a new delay?
We believe this incident does not inherently affect the launch date. The Prysmatic Labs team recommends the eth2 launch schedule continue with no delay. The incident from this weekend was a good stress test for many clients and actually checks off a few requirements on the launch checklist. While the launch date has not been set, we believe the expected launch target of 2 to 3 months from Medalla genesis is still an ideal timeline. There will be a public checklist of requirements for an eth2 launch, and this Medalla incident will definitely add new items to the list regarding client resilience, security, and proper releases. That's as much information as we have today.
Timeline of events
First signs of trouble
Almost immediately after the incident started, users noticed their Prysm node reporting that their clock was off and that they were seeing blocks from the future. At the time, Prysm was using the roughtime protocol to adjust the client clock automatically by querying a set of roughtime servers and determining the appropriate clock offset. The roughtime protocol works by querying a set of servers in sequence with a chain of signed requests/responses; the client then takes the average of these responses to adjust its clock.
Notice something off about this? The 6th server in the list is reporting a time 24 hours ahead of the other 5 servers. When the roughtime client took the average of these results, it thought that the correct time was Fri, 14 Aug 2020 22:20:23 GMT. One of the key components of the roughtime protocol is accountability: given that there is a signed "clockchain" of sorts, we could prove that the ticktock server misbehaved and report the issue effectively. Unfortunately, Prysm did not log the signed chain, and we do not have it stored anywhere from the time of the event. A quick look at the logs showed that ticktock was off by 24 hours, which explains the 4 hour offset across 6 servers. We got in contact with the maintainers of the roughtime project from Cloudflare, and they have since determined how to make their client more robust in case of server faults:
Client robustness to misbehaving server · Issue #22 · cloudflare/roughtime
Right now if a server misbehaves it can lead to the client averaging in the cheating offset, leading to high offsets…
Midpoints (timestamp) returned around the time of the incident
Emergency fix update (alpha.21)
Although we didn't yet know that one of the roughtime servers was reporting a time 24 hours in the future, we knew that something was wrong with roughtime and that it needed to be disabled immediately. Rather than deleting the roughtime code entirely, we modified it to require a runtime flag to adjust the clock, instead of adjusting it automatically by default. The network was in a rough state and we wanted to act fast. We decided to push an "emergency release" and ask everyone to update to the new code immediately. However, right before we did this, the roughtime servers recovered.
The mass slashing event
In hindsight we should have seen this coming, but it still shocked everyone as it happened. At approximately 2AM UTC, every validator that was active during the roughtime incident was now at risk of being slashed! It became quite the carnage, with over 3000 slashing events broadcast in a short amount of time and all of our internal validators slashed. We had not configured local slashing protection in time for our own internal validators while we were busy improving the user experience for testnet users.
Slashing protection in action
Fortunately, Prysm nodes ship by default with a simple slashing protection mechanism that keeps track of attestations and blocks validators produce to prevent them from signing the same messages again. For many users, this saved them from catastrophe!
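The idea behind that protection can be sketched in a few lines. This is a simplified illustration, not Prysm's actual implementation (which also has to handle surround votes and persistent storage): remember the highest slot each validator has signed for, and refuse to sign anything at or below it.

```go
package main

import "fmt"

// protector is a minimal local slashing-protection sketch: it tracks the
// highest slot each validator index has signed a message for.
type protector struct {
	highestSignedSlot map[uint64]uint64 // validator index -> slot
}

func newProtector() *protector {
	return &protector{highestSignedSlot: make(map[uint64]uint64)}
}

// safeToSign returns true and records the slot only if signing cannot
// conflict with a message this validator has already signed.
func (p *protector) safeToSign(validator, slot uint64) bool {
	if prev, ok := p.highestSignedSlot[validator]; ok && slot <= prev {
		return false // signing again at or below a signed slot risks slashing
	}
	p.highestSignedSlot[validator] = slot
	return true
}

func main() {
	p := newProtector()
	fmt.Println(p.safeToSign(7, 100)) // true: first signature at slot 100
	fmt.Println(p.safeToSign(7, 100)) // false: same slot again is refused
	fmt.Println(p.safeToSign(7, 101)) // true: strictly higher slot is fine
}
```

The key property is that the check is local: even if the node's view of the chain (or its clock) is wrong, the validator simply declines to sign rather than emitting a second, conflicting message.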
The real problem: bug discovered in alpha.21
Worried by the urgency of the original problem, we didn't think too much about all the implications of a potential fix, and focused more on quickly releasing it than on carefully checking whether it would break anything else in our nodes. Our teammate Nishant was the first to point out that release alpha.21 was critically flawed. The network could have recovered on its own if we had not acted at all.
In releasing this fix, we accidentally removed the initialization of all the critical features our eth2 beacon node needs to function, making the problem infinitely worse. After announcing the release to everyone in our Discord server and on Twitter, stakers quickly started updating their nodes, which is when we realized just how badly we had screwed up. Even worse, the roughtime servers had fully recovered by then, which would likely have fixed the issues in the network had we not acted so swiftly.
Rollback and syncing troubles
After realizing the scope of the mistake, we immediately recommended that users roll back to a previous release, now that the roughtime issue had been resolved. This ended up being a really rough move, as the network had become badly partitioned and users were confused about why there were so many updates in such a short span of time. With most nodes down for a while, and users restarting their nodes often to fetch updates, it seems almost everyone in the network was trying to sync, making it impossible to reach the chain head. It also became really difficult for nodes to resolve forks: with so many bad peers in the network, good peers were needles in a haystack. On top of that, resource consumption was climbing through the roof. Other client implementations were seeing massive memory consumption, and Prysm nodes were suffering from significant CPU usage, which didn't help when trying to resolve forks.
The incident exposed several key, flawed assumptions our node was making about handling forked blocks in the event of a highly partitioned chain. We were not handling several code paths where there could be multiple blocks in a given slot, which caused our nodes to often get stuck. Moreover, even after we resolved these issues, nodes could not resync to the chain head if they had fallen behind. It is easy to write code that assumes chain stability, but having it function equally well amid bad peers, forks, and network partitions is another beast altogether. Our sync logic was not robust enough to handle these scenarios; a change in assumptions, which our teammate Victor Farazdagi implemented quickly, resolved this. We have since pushed a fix that has led Prysm nodes to sync to the chain head and remain in sync! At the time of writing, most nodes are updating to this version, and at this point we just need more validators to come online and start attesting to and proposing blocks with their synced beacon nodes. For next steps, we are monitoring chain participation and getting in touch with as many individuals as possible who run validators to understand whether they still run into issues.
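The flawed assumption and its fix can be sketched as follows. This is illustrative only, not Prysm's real data structures: keying blocks by slot silently drops one side of a fork, while keying by block root and indexing roots per slot keeps every competing block available to sync and fork choice.

```go
package main

import "fmt"

// block is a stand-in for a beacon block: its slot and its hash tree root.
type block struct {
	slot uint64
	root [32]byte
}

// store indexes blocks by root, so multiple blocks at the same slot
// (a fork) can coexist, with a per-slot index over their roots.
type store struct {
	byRoot      map[[32]byte]block
	rootsBySlot map[uint64][][32]byte
}

func newStore() *store {
	return &store{
		byRoot:      make(map[[32]byte]block),
		rootsBySlot: make(map[uint64][][32]byte),
	}
}

// save records a block unless an identical root was already seen.
func (s *store) save(b block) {
	if _, seen := s.byRoot[b.root]; seen {
		return
	}
	s.byRoot[b.root] = b
	s.rootsBySlot[b.slot] = append(s.rootsBySlot[b.slot], b.root)
}

func main() {
	s := newStore()
	a := block{slot: 42, root: [32]byte{1}}
	b := block{slot: 42, root: [32]byte{2}} // competing block, same slot
	s.save(a)
	s.save(b)
	fmt.Println(len(s.rootsBySlot[42])) // 2: both sides of the fork survive
}
```

A `map[slot]block` version of this store would have kept only one of the two blocks, which is exactly the kind of single-block-per-slot assumption that got nodes stuck during the partition.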
If you are running Prysm, you can download the latest version from our releases page https://github.com/prysmaticlabs/prysm/releases or follow our detailed instructions from our documentation portal here https://docs.prylabs.network/docs/install/install-with-script. We will keep updating on the status via our Discord as the situation progresses.
Don’t rush to merge in fixes
This entire incident could have been avoided if we had not rushed to fix the roughtime bug. The reason we got into this state in the first place was a faulty pull request we merged which reverted all critical features needed for our nodes to function.
With a sense of urgency from seeing all of our client's nodes in the network suffering, we wanted to ease users' concerns as fast as possible. Although the fix was originally created by an outside contributor, it was our fault that it was not reviewed with the utmost care. A single line of code unset all global configurations for our nodes: https://github.com/prysmaticlabs/prysm/pull/6898/files#diff-fb86a5d3c2b85d3e68cad741d5957c29L263. Moving forward, every release candidate created in the middle of a difficult situation or crisis needs to be:
- Reviewed by the entire team + someone external to the team such as an Ethereum researcher
- Tested in a staging environment for a certain period of time, either the Prysm eth2 attack net or a local testnet that reproduces the same bug users are experiencing
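The class of bug behind that single line is worth illustrating. In a hypothetical, simplified sketch (not Prysm's actual code), node behavior is driven by a package-level feature configuration, so one stray re-initialization with a zero value silently turns every feature off for the whole process:

```go
package main

import "fmt"

// featureConfig is a stand-in for a global feature-flag configuration.
type featureConfig struct {
	enableSync      bool
	enableAttesting bool
}

// cfg is package-level state: every subsystem reads from it.
var cfg = &featureConfig{}

// Init replaces the global configuration wholesale.
func Init(c *featureConfig) { cfg = c }

func main() {
	// Startup correctly enables the features the node needs.
	Init(&featureConfig{enableSync: true, enableAttesting: true})

	// The offending pattern: a later call re-initializes with a
	// zero-value config, wiping out everything set above.
	Init(&featureConfig{})

	fmt.Println(cfg.enableSync, cfg.enableAttesting) // false false
}
```

Because nothing fails loudly, the node compiles, starts, and appears healthy while every critical feature is disabled, which is why such a change can slip through a hurried review and only surfaces once nodes misbehave in the network.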
Our team typically uses the practice of canary deployments, which run newly merged pull requests side-by-side with production deployments to understand how they perform relative to a baseline over a period of time. However, given that every node was unhealthy, including those we run internally, there was no way to run a canary against a baseline. We rushed to fix the issue and publicized the release to users as soon as we had it, not realizing it was completely broken. This will not happen moving forward, and we have learned this costly lesson. Even with validator balances decreasing and the chain not finalizing, we need 100% confidence in fixes released during such tumultuous periods.
Careful external communication regarding updating nodes in periods of instability is critical
Another mistake during the incident was our external communication around updating nodes. Given that 90% or more of nodes were having critical issues, and were far from synced to the head of the chain, telling everyone "hey, quickly update your nodes!" led to absolute chaos. Nodes also have a default cap of 30 max peers, which, for people who did not know how to raise that cap, meant almost all their peers were bad or probably also trying to sync. Not everyone has notifications enabled for our Discord announcements, and people in different time zones may have been away when we asked everyone to update. Clear communication, and understanding the state of the network and the implications of asking everyone to update their nodes, are among the main lessons we learned from this incident.
Make migrations to other eth2 clients seamless and well-documented for users
One of eth2's main talking points is its decentralized development, with 5 different, independent teams building client implementations of the protocol. In the Medalla public testnet, we had 5 clients participating at genesis. Although not every client had the same readiness status, many had improved significantly since the last testnet experiments. At the time of the incident, over 65% of the network was running our Prysm client, which compounded the network catastrophe once all Prysm nodes went down. The general idea of network resilience is being able to easily switch between clients in the event of a single client having a critical bug. Unfortunately, the release of the Medalla testnet coincided with teams working on standardizing how their clients manage validator keys, which means none of us were 100% prepared with documentation for migrating between Prysm and Lighthouse, for example. This is a high-priority action item moving forward, and something we'll be adding to our public documentation portal. Users should be able to easily switch clients whenever they wish, while abiding by security best practices which we need to publicly announce.
Important takeaways for stakers
This was the best thing to happen to a testnet
It would have been truly terrifying if the Medalla public testnet had run uninterrupted, with perfect performance right up to mainnet, and this bug had then occurred with real money at stake once eth2 launched. In terms of worst-case scenarios for a blockchain, a bug in the client running the majority of the chain that takes all of its nodes offline is indeed a nightmare, and it manifested itself in Medalla. Knowing what to do in this situation, and being equipped to migrate to a different client if needed, is really important for participating stakers.
The risk of eth2 phase 0
Eth2 phase 0 is a highly ambitious project that poses technical risks for those joining at genesis. It is a complete revamp of the Ethereum protocol, introducing Proof of Stake, which will make its debut following the Casper FFG (friendly finality gadget) research that has been ongoing for several years. Although the client implementations have matured significantly and will come out of this incident much stronger than before, eth2 is an experiment that can have serious consequences for those joining without understanding the risks. Had this happened on mainnet, with several days without finality, people would have lost millions if not tens of millions of dollars in collective penalties, which would be extremely painful for everyone, including the most ardent supporters. We want to make it clear that these risks are very real, and technical risk is impossible to eliminate. The eth2 launchpad contains a section on the very real risks of early adoption, and we encourage you to think carefully about whether staking on eth2 is right for you.
The risk of client dominance
At the time of writing, our Prysm client runs more than 78% of publicly accessible eth2 beacon nodes.
Before the incident, that number was around 62%, which is still unreasonably high. Eth2 development is built around multiple, independent implementations of the protocol, allowing users to switch between them in times of crisis or when a bug is found in a single implementation. The current issues in the Medalla testnet are a consequence of a single client disproportionately running the network, and we believe this should change. Other client teams are making strides in updating their documentation, improving their resilience, and making it easier for people to run them. A lot of stakers have picked our client because it has been easy to set up in their personal or cloud environments, and we are humbled by this. However, it is good practice to try other implementations and to practice switching between them as needed, especially for stakers operating many validators in situations like these.
How to keep your nodes updated
Convincing people to update their nodes is the eternal bane of client implementation teams. As networks become more decentralized, there isn't a single channel through which we can notify all node operators to update their nodes to fix critical bugs. The best teams can do is rely on their own forms of communication, such as their Discord servers, GitHub release pages, or Twitter accounts. Even after telling everyone "hey, update your nodes!", operators might be in different time zones and therefore offline, or simply might not use social media. Relying on everyone updating at the same time can also spell disaster: if nodes have been offline for a long time and then everyone tries to update at once, a large fraction of the network ends up syncing simultaneously, flooding it with bad peers. Having a more detailed release strategy, and ensuring "official" channels are known to people, is critical. Moreover, large holders of ETH and people running many nodes should have an easy channel to communicate with our dev team. We want to work more closely with operators to ensure they know when they should update, and to discuss with them the changelog items relevant to them in every update.
What if this happens in mainnet?
Needless to say, this sort of situation is one of the worst-case scenarios for mainnet. Although this series of events was the best thing to happen to the testnet, giving us a taste of how to resolve a network catastrophe, it cannot happen the way it did when there is real money at stake. Even with security audits, careful code reviews, and staging, the reality is there will be attacks on the network, in the same way eth1 was attacked and DDoS'd years ago, and this is something we need to prepare for. The folks behind eth1 are battle-hardened and have accumulated a great deal of knowledge about appropriate responses to catastrophes. This testnet gave us the following lessons and requirements for client teams before we deploy to mainnet:
- Have checklists for everything, including release candidates, staging, external communications, monitoring
- Have clear instructions for users to migrate between eth2 clients as needed
- Have a step-by-step guidebook followed by an eth2 client response squad. Prysmatic Labs has its own internal playbook, but coordinating a central one between eth2 clients would ease a lot of concerns
- Have a detailed plan for communicating with stakeholders, node operators, and regular users regarding updating their nodes in critical periods
This scenario truly changed how we approach eth2 development from now on. We have always understood the high stakes of this project, but are now much more equipped to understand how to react in times of crisis, how to keep our cool, and what NOT to do when many are relying on our client to stake.
In conclusion, the Medalla eth2 testnet suffered cascading failures due to bad decisions on our part in responding to a problem affecting many nodes at once. The testnet did not get into this state because of roughtime alone, or because of a central point of failure, but through a series of events that culminated in various network partitions. This is the best possible thing that could have happened on a testnet, and all eth2 client teams will now be far better prepared to avoid any scenario of this kind on mainnet. We will focus on process, security, and appropriate responses to improve the resilience of eth2, working together with all of the client implementation teams toward these goals.