Let’s nuke-proof Ethereum [1/2]
“There were millions of Gros Michel bananas across multiple countries. What could possibly go wrong?” — people before 1950
I promise, Ethereum will come later… But first, here’s a story about bananas… Once upon a time, there was Gros Michel, a big, beautiful, tasty banana better than any other banana. Naturally, it dominated the market. Other strains existed too, but to a significantly lesser extent. Bananas are generally cultivated in tropical countries and are an integral part of their culture.
Bananas are not the most resilient plants; storms, floods, droughts, etc., can cause plantations to fail. #rekt. However, we consumers never feel much of this apart from some normal price movements. Why? Given the scale of banana cultivation, most of these disasters are relatively small. Bad management takes a plantation down? No one feels it. Bad weather takes a country’s worth of bananas down? Not cool, but probably still okay.
So, what could possibly go wrong? Well, around 1950, Panama disease, a devastating fungal infection, wreaked havoc on banana plantations. (Full disclosure: I’m not a botanist.) But TL;DR, it was particularly adept at ruining the day for Gros Michel bananas and proved notoriously difficult to eradicate. As a result, by 2024, most of us are munching on Cavendish bananas, which, while resistant to that strain of the disease, don’t quite have the same allure as their predecessor.
Finally, we’re back to Ethereum! Now, we have close to a million validator nodes. But how resilient is our world computer? Are there ways we can reason about it more formally?
Let’s say we have 1M validators and we want at least two-thirds (about 66.7%) online at all times. Can you imagine 334K nodes going down simultaneously? Of course, there’s a certain chance that any one node will go down, e.g., power outage, network issues, failure to pay bills, forgetting to update, or the owner just turning it off. But 334K at the same time?!? From an individual validator’s perspective, the chance of going down is small. And the math, which I won’t bore you with, says that if you need a lot of small, independent chances to happen simultaneously, it becomes far less likely still. (Spoiler: The universe will probably die before 334K validators randomly fail at the same time.) We’re safe! Yay! 🎉
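For anyone who does want to be bored by the math, here is a minimal sketch. It assumes every validator fails independently with the same probability (the 1% figure is a made-up illustration, not a measured number) and uses the standard Chernoff bound for a binomial tail rather than the exact probability. The function and variable names are mine.

```python
import math


def log10_chernoff_tail(n: int, k: int, p: float) -> float:
    """Upper bound on log10 of P(X >= k) for X ~ Binomial(n, p).

    Uses the Chernoff bound P(X >= k) <= exp(-n * KL(k/n || p)),
    which is valid when p < k/n < 1.
    """
    a = k / n
    if not (p < a < 1):
        raise ValueError("the bound needs p < k/n < 1")
    # KL divergence between Bernoulli(a) and Bernoulli(p), in nats.
    kl = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
    # Convert from natural log to log base 10.
    return -n * kl / math.log(10)


N_VALIDATORS = 1_000_000
K_OFFLINE = 334_000
P_OFFLINE = 0.01  # ASSUMPTION: each validator is independently offline 1% of the time

log10_bound = log10_chernoff_tail(N_VALIDATORS, K_OFFLINE, P_OFFLINE)
print(f"P(at least {K_OFFLINE} offline) <= 10^{log10_bound:.0f}")
```

Under these assumptions the exponent comes out in the hundreds of thousands (negative), which is why purely random, independent failures are not the thing to lose sleep over.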
Actually no, the devil is in the details. Like a banana tree (here it comes), there’s only a small chance that any one banana tree will die, but there’s a significant chance that a tree will die if the one next to it dies. Why? Because there’s something more systemic going on behind it. Realistically, the way that a tree dies, e.g., flood, bugs, fungus, drought, tornado, or some hungry elephant, is not random and not isolated to a single tree. This is the same with Ethereum validators. There are systemic risks that will affect a significant number of validators at once. The biggest one that comes to mind is Geth, the client software run by more than 60% of validators (estimated). If it has a bug, it will probably affect all of them. There could also be other events, like someone nuking the US East Coast or some government (forcefully) convincing a cloud provider to ban validators. I’ll leave the rest to your imagination. Each of these is unlikely, but none is impossible, and all these mini-doomsday events in aggregate still represent a risk that we should seriously consider.
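To see how much a single shared dependency changes the picture, here is a toy Monte Carlo sketch. Everything in it is an assumption for illustration: a scaled-down network of 1,000 validators, a 1% independent failure chance, a client with a 60% share, and a 5% chance per trial that a bug takes every validator on that client offline. It is not a model of any real client or incident.

```python
import random


def fraction_of_bad_trials(n_validators, p_independent, client_share,
                           p_client_bug, trials, seed=0):
    """Fraction of trials in which more than 1/3 of validators are offline."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_on_client = int(n_validators * client_share)
    bad = 0
    for _ in range(trials):
        # One shared event: does the majority client hit a fatal bug?
        client_down = rng.random() < p_client_bug
        offline = 0
        for i in range(n_validators):
            if client_down and i < n_on_client:
                offline += 1  # taken down by the shared bug
            elif rng.random() < p_independent:
                offline += 1  # ordinary, independent failure
        if offline * 3 > n_validators:
            bad += 1
    return bad / trials


# ASSUMED toy parameters, not measurements.
independent_only = fraction_of_bad_trials(1000, 0.01, 0.6, 0.0, 2000)
with_shared_bug = fraction_of_bad_trials(1000, 0.01, 0.6, 0.05, 2000)

print("independent failures only:", independent_only)
print("with a shared client bug: ", with_shared_bug)
```

In this toy setup the second number tracks the bug probability almost exactly: once the shared client goes down, the independent failures barely matter.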
The real question is: how can we deal with these infinitely many possible problems? Good news! We don’t have to deal with each of them individually. Why? In the end, every one of these problems exploits some concentration point in the validator set. Three aspects cover most issues: software, physical infrastructure, and control. If validators run a wider mix of software, a single bug is less likely to hit all of them. If we diversify physical infrastructure, a nuke or any other physical problem will affect fewer validators. And… if we all decide not to take too much control, yeah, no one will control too much.
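One way to make "concentration point" concrete is to look at the largest single share along each of the three aspects and compare it with the one-third line from the example above (if we want two-thirds online, anything holding more than one-third can break that on its own). A rough sketch follows; the ten validators and all labels such as `client_A`, `cloud_X`, and `op_1` are hypothetical placeholders, not real data.

```python
from collections import Counter

ONE_THIRD = 1 / 3


def largest_share(labels):
    """Return (most common label, its share of all validators)."""
    name, count = Counter(labels).most_common(1)[0]
    return name, count / len(labels)


# HYPOTHETICAL toy validator set: one label per validator, per aspect.
clients = ["client_A"] * 6 + ["client_B"] * 3 + ["client_C"]
hosting = ["cloud_X"] * 4 + ["cloud_Y"] * 3 + ["home"] * 3
operators = ["op_1"] * 3 + ["op_2"] * 2 + ["op_3"] * 2 + ["solo_1", "solo_2", "solo_3"]

dimensions = {
    "software": clients,
    "infrastructure": hosting,
    "control": operators,
}

report = {dim: largest_share(labels) for dim, labels in dimensions.items()}
# An aspect is "at risk" if one label alone exceeds one-third of validators.
at_risk = [dim for dim, (_, share) in report.items() if share > ONE_THIRD]

for dim, (name, share) in report.items():
    print(f"{dim}: largest is {name} at {share:.0%}")
print("at risk:", at_risk)
```

The point of checking the maximum share rather than counting how many options exist is that ten clients do not help much if one of them still runs most of the network.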
So, where are we right now? What can we do? Stay tuned for the next part. 😊
Edited by Megan Khunakridatikarn, SCB 10X Lab