Dealing with failure in cryptocurrency

There are many ways that a blockchain can fail. It’s always good to think about what failure looks like because it lets us see 1) how bad it might be, 2) how it’s possible to recover from failure, and 3) how failure can be prevented.

Knowing how things can fail and what it is possible to do to recover puts us in a position where we can keep our cool — both during failure and while contemplating the possibility of failure. I feel like this might be important generally as a life lesson, but here is a more economic example: If you’re buying insurance (which mitigates failure, in a way) or paying for security (which prevents it), you may be willing to pay more if you have more fear of the outcome being insured against (or supposedly being prevented).

There are (at least) two classes of failures in cryptocurrency:

  1. Governance failures
  2. Technical failures

I’m going to talk about the “governance layer” as a kind of catch-all term for anything that may influence how the protocol changes over time. It’s also the governance layer’s job to make sure the assumptions required to show (in theory) that a protocol behaves as advertised are actually true (irl).

This blog post is mostly about dealing with a technical failure in a blockchain by using the governance layer. The governance layer can (maybe) achieve consensus on a soft fork/hard fork that can be used to recover from a technical failure. Governance failures can be pretty bad because governance may be required to recover smoothly from technical failure.

As a prescription for governance failures, I suggest participation by members of the community in the governance process (just fix it), or a splitting from the community (give up and get a fresh start). Or you might want to do something else while you wait it out. You might make this decision as an individual, or as a coalition/cartel. But governance in practice is (maybe?) a technosociopolitical problem. It’s not necessarily easy!


How can a blockchain fail?

The blockchain can experience many kinds of technical failures. This list is not necessarily exhaustive:

  1. Reversion of blocks expected to be consensus (safety failure)
  2. Consensus on invalid blocks (safety failure)
  3. Unavailability of consensus on new blocks (liveness failure)
  4. Censorship of transactions/blocks (liveness failure — selective use of 3.)
  5. Block unavailability (liveness failure)

I’m going to go over each of these failures one at a time, and explain what it is, how to recover from it, and how to prevent it for a proof-of-work blockchain architecture. There are going to be similarities and differences in proof-of-stake, but I want to keep it short.

Technical note: for 1) to be general, it should be written/understood as something closer to “if two nodes have distinct competing blocks and are each confident that their block is consensus, then there is a consensus safety failure. We additionally will allow these two nodes to be the same node at two (not necessarily immediately) consecutive states of the protocol”. A definition fitting this description should (I’m guessing) be able to capture consensus safety in both proof-of-work and proof-of-stake blockchain protocols, and also in traditional consensus protocols.
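The definition in the note above can be made concrete with a minimal sketch. It assumes blocks are identified by a hash and linked to a parent; the names `Block`, `is_ancestor` and `safety_failure` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Block:
    hash: str
    parent: Optional[str]  # None for the genesis block


def is_ancestor(blocks: Dict[str, Block], a: str, b: str) -> bool:
    """Return True if block `a` is `b` itself or one of its ancestors."""
    cursor: Optional[str] = b
    while cursor is not None:
        if cursor == a:
            return True
        cursor = blocks[cursor].parent
    return False


def safety_failure(blocks: Dict[str, Block], final_a: str, final_b: str) -> bool:
    """Two blocks compete when neither lies on the other's chain.

    `final_a` and `final_b` are blocks that two nodes (or one node at two
    different times) are each confident are consensus.
    """
    return not (is_ancestor(blocks, final_a, final_b)
                or is_ancestor(blocks, final_b, final_a))


# Toy block tree: g <- a1 <- a2, and a competing branch g <- b1.
blocks = {
    "g": Block("g", None),
    "a1": Block("a1", "g"),
    "a2": Block("a2", "a1"),
    "b1": Block("b1", "g"),
}
conflict = safety_failure(blocks, "a2", "b1")     # competing blocks
no_conflict = safety_failure(blocks, "a1", "a2")  # same chain
```

Because the check only looks at ancestry, it does not care whether confidence came from proof-of-work confirmations or from some other finality rule.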

Reversion of consensus blocks (51% double-spend)

What is it?

In proof-of-work, reversion of blocks is the classic “51% double-spend attack”.

Imagine: After waiting for more than 6 (or 200) confirmations, your confirmed transaction becomes unconfirmed.

This is normally understood to be the result of an attack by an adversary with a majority of the hashrate* because the only way to reverse a block in proof-of-work is to present a heavier chain without that block.

*Technical note: It can also occur due to network asynchrony.

<insert picture of 51% attack>

How does a blockchain recover from that?

In a proof-of-work blockchain, the protocol (officially) is that nodes always choose the blockchain with the most proof-of-work (even if the choice reverts their transactions). The less official side of the story is that light clients and even full nodes have checkpointing.
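As a rough illustration of the official rule, here is a minimal sketch of a heaviest-chain fork choice and of the blocks a node loses when it switches. The `Block` type, the integer `work` field and the helper names are assumptions made for this example, not any client's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    hash: str
    work: int  # hypothetical unit of proof-of-work contributed by this block


def total_work(chain: List[Block]) -> int:
    return sum(block.work for block in chain)


def heaviest_chain(chains: List[List[Block]]) -> List[Block]:
    """Pick the chain with the most accumulated work (first seen wins ties)."""
    return max(chains, key=total_work)


def reverted_blocks(old: List[Block], new: List[Block]) -> List[Block]:
    """Blocks of `old` that are no longer part of `new` after a reorg."""
    common = 0
    while common < min(len(old), len(new)) and old[common].hash == new[common].hash:
        common += 1
    return old[common:]


honest = [Block("g", 1), Block("h1", 1), Block("h2", 1)]
attacker = [Block("g", 1), Block("a1", 1), Block("a2", 1), Block("a3", 1)]

best = heaviest_chain([honest, attacker])
lost = [b.hash for b in reverted_blocks(honest, best)]
```

In this toy run the attacker's chain carries more work, so a protocol-following node adopts it and the transactions in `h1` and `h2` become unconfirmed.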

If a large number of blocks are reverted, then perhaps the damage would be high enough that it justifies attempting to recover the original blockchain’s transaction history.

If a small number of blocks are reverted, then perhaps the cost was not too high and the network won’t mind the reversion.

The simple proposal:
A hard fork can introduce a checkpoint in the blockchain above the block where the new (presumably attacking) heaviest chain forked the original history. This would let all clients who install this hard fork remain on the original chain.

This proposal can be hard to implement if the community cannot come to broad agreement about which fork appeared first.
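A minimal sketch of the simple proposal, assuming a chain is just a list of block hashes indexed by height and that chain length stands in for accumulated work. The `CHECKPOINTS` table and the function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical checkpoint shipped with the hard fork: height -> required hash.
CHECKPOINTS: Dict[int, str] = {2: "h2"}


def respects_checkpoints(chain: List[str], checkpoints: Dict[int, str]) -> bool:
    """Reject any chain that has a different block at a checkpointed height."""
    for height, expected in checkpoints.items():
        if height < len(chain) and chain[height] != expected:
            return False
    return True


def choose_chain(chains: List[List[str]], checkpoints: Dict[int, str]) -> List[str]:
    """Longest chain among those that agree with every checkpoint.

    Length is used as a stand-in for total work to keep the sketch short.
    Raises ValueError if no chain respects the checkpoints.
    """
    allowed = [c for c in chains if respects_checkpoints(c, checkpoints)]
    return max(allowed, key=len)


original = ["g", "h1", "h2", "h3"]
attacker = ["g", "a1", "a2", "a3", "a4", "a5"]

with_checkpoint = choose_chain([attacker, original], CHECKPOINTS)
without_checkpoint = choose_chain([attacker, original], {})
```

Nodes that install the checkpoint stay on `original` even though the attacking chain is longer; nodes without it follow the attacker. That gap is why broad agreement on which fork came first matters.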

The more ambitious proposal:
Add a “non-reversion rule” to the protocol, which has nodes follow the longest chain that builds on the block 6 (200) confirmations behind their current head. Now an attacker cannot ever cause any protocol-following nodes to revert more than 6 (200) blocks. This would provide a PoW blockchain with subjective finality.

Here, an attacker won’t be able to double spend, but if they can make sure that two nodes see two distinct 6 (200) block forks before seeing any other, they can make these clients permanently diverge. However, waiting for enough PoW confirmations is meant to prevent this problem, and as long as the network isn’t too asynchronous, it does work.
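The non-reversion rule can be sketched as a fork choice that refuses any reorg deeper than a fixed limit. This is an illustration under simple assumptions (chains are lists of hashes, length stands in for work); `MAX_REVERT_DEPTH`, `fork_depth` and `next_head` are made-up names.

```python
from typing import List

MAX_REVERT_DEPTH = 6  # the post's example value; it also mentions 200


def fork_depth(current: List[str], candidate: List[str]) -> int:
    """Number of blocks at the end of `current` that `candidate` would revert."""
    shared = 0
    for ours, theirs in zip(current, candidate):
        if ours != theirs:
            break
        shared += 1
    return len(current) - shared


def next_head(current: List[str], candidate: List[str],
              max_depth: int = MAX_REVERT_DEPTH) -> List[str]:
    """Adopt `candidate` only if it is longer and the reorg is shallow enough."""
    if fork_depth(current, candidate) > max_depth:
        return current
    return candidate if len(candidate) > len(current) else current


current = ["g"] + [f"c{i}" for i in range(1, 11)]      # 11 blocks
deep_fork = ["g"] + [f"a{i}" for i in range(1, 13)]    # 13 blocks, reverts 10
shallow_fork = current[:9] + ["s9", "s10", "s11"]      # 12 blocks, reverts 2

after_deep = next_head(current, deep_fork)
after_shallow = next_head(current, shallow_fork)
```

The node ignores the deep fork even though it is longer, and follows the shallow one as usual. Note that the decision depends on what the node has already seen, which is exactly what makes the finality subjective.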

(Vitalik’s solution of having a subjective discount on a blockchain’s score is a more “continuous” or “smooth” alternative to the non-reversion rule given above.)

How do we prevent this from happening?

Preventing a reversion attack requires that the governance layer make sure that the longest chain keeps growing, meaning that there is always more honest hashpower than malicious hashpower. To completely prevent 51% attacks, malicious hashpower must never exceed honest hashpower, not even for a few days!
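To see why the honest majority matters so much, it can help to compute the standard gambler's-ruin estimate from the Bitcoin whitepaper for an attacker who is z blocks behind: the chance of ever catching up is (q/p)^z when the attacker's share q is below the honest share p, and 1 otherwise. This formula is not from this post, and the function name is hypothetical.

```python
def catch_up_probability(q: float, z: int) -> float:
    """Probability that an attacker with hashpower share `q` ever catches up
    from `z` blocks behind, under the simple random-walk model."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    if z < 0:
        raise ValueError("z must be non-negative")
    p = 1.0 - q  # honest share
    if q >= p:
        return 1.0  # with half or more of the hashpower, catching up is certain
    return (q / p) ** z


minority = catch_up_probability(0.25, 2)   # (1/3) ** 2
majority = catch_up_probability(0.51, 200)
```

With a minority share the probability falls off geometrically in the number of confirmations; once the attacker holds a majority, no number of confirmations helps under this model.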

Conclusion:

A reversion attack is easy to recover from if the community governing the blockchain can come to consensus on which chain came first. And it is easy to mitigate with non-reversion rules. 51% attacks are only very scary if you cannot rely on the governance layer to reverse (or mitigate) damages.

It seems like it would likely be more efficient to recover from a reversion attack than to prevent one, particularly in settings with well-financed adversaries and low governance costs.

Consensus on invalid blocks

I think these are going to get mostly shorter as we go!

What is it?

Consensus on invalid blocks is a failure mode where the heaviest fork contains invalid blocks.

How does a blockchain recover from that?

Protocol-following full nodes will reject forks containing invalid blocks, and will remain on a fork with valid blocks.
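A minimal sketch of that full-node behaviour, assuming a toy validity rule (a cap on the block reward) and chain length as a stand-in for work. The block layout, `MAX_REWARD` and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Block = Dict[str, int]  # toy block: {"height": ..., "reward": ...}
MAX_REWARD = 50  # hypothetical consensus rule used only in this sketch


def is_valid(block: Block) -> bool:
    return 0 <= block["reward"] <= MAX_REWARD


def heaviest_valid_chain(
    chains: List[List[Block]],
    valid: Callable[[Block], bool] = is_valid,
) -> Optional[List[Block]]:
    """Longest chain in which every block passes validation, or None."""
    candidates = [c for c in chains if all(valid(b) for b in c)]
    return max(candidates, key=len) if candidates else None


honest = [{"height": h, "reward": 50} for h in range(3)]
invalid = [{"height": h, "reward": 50} for h in range(4)]
invalid.append({"height": 4, "reward": 5000})  # breaks the reward rule

chosen = heaviest_valid_chain([invalid, honest])
```

The invalid fork is longer, but the full node never considers it, so it stays on the shorter valid chain.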

If the community uses enough full nodes to provide enough services, then there will be an economic incentive for the miners to mine on a chain that is accepted by full nodes. This incentive must be high enough for miners not to willingly and knowingly mine on an invalid blockchain. It is up to the governance layer to make sure that this is the case — if it fails, then we may see consensus on invalid blocks.

Light clients do not validate blocks, and so must be helped extra-protocol. This can be done with checkpoints. However, the ideal thing is for the governance layer to cause miners to produce a valid longest chain.

How do we prevent this from happening?

The governance layer has to make sure that miners have enough incentive to mine only on valid blocks. The easiest way for it to do this is to orphan invalid blocks by using full nodes to discriminate between valid and invalid blocks (being orphaned is very expensive, for miners).

It may be realistically impossible to prevent 100% of miners from ever mining invalid blocks, but this might be a case where recovery due to good full node behaviour is as good as prevention.

Conclusion:

Consensus on invalid blocks must be prevented and recovered from by the community’s use of full nodes which validate blocks. The full nodes will not synchronize with the invalid heaviest chain, and should provide sufficient incentive to miners to return to mining on a valid chain. The governance layer should do its best to keep light clients safe during the whole process.

Unavailability of consensus on new blocks

What is it?

Unavailability of consensus on new blocks is a technical failure where consensus on new blocks does not form.

This can be because no new blocks are mined (difficulty too high, perhaps), or because no one can mine on top of the longest chain (network asynchrony, perhaps).
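A back-of-the-envelope sketch of the "difficulty too high" case: if difficulty was calibrated for the old hashrate and has not yet readjusted, the expected time between blocks scales inversely with the fraction of hashpower that remains. The 600 second target is an assumption (a Bitcoin-like setting), and the function name is made up.

```python
TARGET_INTERVAL_SECONDS = 600.0  # assumption: a Bitcoin-like 10 minute target


def expected_interval(hashrate_fraction_remaining: float,
                      target_interval: float = TARGET_INTERVAL_SECONDS) -> float:
    """Expected seconds between blocks before difficulty readjusts."""
    if hashrate_fraction_remaining < 0:
        raise ValueError("fraction must be non-negative")
    if hashrate_fraction_remaining == 0:
        return float("inf")  # nobody is mining: no new blocks at all
    return target_interval / hashrate_fraction_remaining


normal = expected_interval(1.0)
after_exodus = expected_interval(0.1)   # 90% of miners leave
nobody_left = expected_interval(0.0)
```

If 90% of the hashpower disappears, blocks arrive roughly ten times more slowly until difficulty adjusts, which may itself take a long time if adjustment is counted in blocks.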

How does a blockchain recover from that?

Since proof-of-work cannot “get stuck” (you can always add a block to a chain, if you can mine it), it is difficult in practice for an attacker to make sure that nodes are seeing distinct heaviest chains. The solution here is simple:

The governance layer has to make sure that there is enough mining power mining on the same chain, and that the chain is being propagated to clients (that blocks are being found, and being propagated fast/well enough to allow them to be chained).

How do we prevent this from happening?

The governance layer has to make sure that miners are well connected to each other and to the network of full nodes to prevent the unavailability of new blocks, and also to make sure that the miners do not all fail en masse, so that we may have a growing blockchain.

Conclusion:

Proof-of-work blockchains have a relatively simple liveness story. As long as the longest chain is getting longer, and everyone eventually sees that it is the longest chain, then there is consensus on new blocks.

As with the “invalid blocks” failure, prevention and recovery mechanisms are the same: making sure that there is enough honest hashpower on a well-enough-connected network.

Censorship of transactions/blocks

What is it?

In this attack, we can imagine that a majority coalition of miners agree to only mine on chains that either don’t include blacklisted transactions or only include whitelisted transactions. This means that blocks from miners who are not following this strategy are being orphaned, so there is a large incentive for non-censoring miners to join the censoring coalition.

How does a blockchain recover from that?

The first thing that needs to happen to reverse censorship is to recognize that censorship is occurring. This may be non-trivial if the mempool is full and the transactions suspected of being censored have low fees. However, it should be possible to detect censorship of transactions with high fees (if the network isn’t too asynchronous).
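One possible detection heuristic, sketched under stated assumptions: flag a pending transaction if several blocks mined since it was first seen each included at least one transaction paying a lower fee rate. The types, the threshold `min_blocks` and the input format are all hypothetical, and a real detector would also need to account for propagation delays.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PendingTx:
    txid: str
    fee_rate: float          # fee per unit of block space
    first_seen_height: int   # chain height when this node first saw it


def suspected_censored(
    pending: List[PendingTx],
    min_fee_rate_by_height: Dict[int, float],  # lowest fee rate included per block
    tip_height: int,
    min_blocks: int = 3,
) -> List[str]:
    """Return txids that were outbid by nothing, yet skipped by `min_blocks` blocks."""
    suspects = []
    for tx in pending:
        skipped = 0
        for height in range(tx.first_seen_height + 1, tip_height + 1):
            # Blocks with no recorded transactions are not counted as evidence.
            lowest_included = min_fee_rate_by_height.get(height, float("inf"))
            if lowest_included < tx.fee_rate:
                skipped += 1
        if skipped >= min_blocks:
            suspects.append(tx.txid)
    return suspects


pending = [PendingTx("hi", 50.0, 100), PendingTx("lo", 1.0, 100)]
lowest = {101: 5.0, 102: 4.0, 103: 6.0}
flagged = suspected_censored(pending, lowest, tip_height=103)
```

The high-fee transaction is flagged because three blocks in a row included cheaper transactions instead; the low-fee one is not, since a full mempool explains its delay just as well.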

After censorship is detected, the governance layer needs to decide on a course of action: ensure that a majority coalition which does not censor takes power, or do nothing and accept the censorship (wait it out, maybe).

If a majority coalition doesn’t censor, then miners who refuse to mine on blocks that don’t follow the censorship policy will have their blocks orphaned. If a majority does censor, then miners who are not following this strategy will have their blocks orphaned.

There are many ways that the governance layer can consider making sure that there is a non-censoring majority coalition. They can add honest hashpower to the network. They can persuade or bribe existing miners to stop censoring. They can change the hashing algorithm in a way that obsoletes ASICs, in an attempt to make it easier to dislodge the censoring cartel.

How do we prevent this from happening?

To prevent censorship, the governance layer needs to guarantee that no majority coalition will ever form and choose to censor transactions/blocks.

Conclusion:

Reversing censorship requires that the governance layer bust the censoring cartel. Preventing censorship requires that the governance layer prevent censoring cartels from forming. This can happen in multiple ways and is not guaranteed to be easy, but is the only way to prevent and recover from censorship attacks.

Block unavailability

What is it?

Block unavailability is a failure mode where the heaviest blockchain has blocks that are unavailable (impossible to (fully) download).

An unavailable block is scary because we don’t know whether or not an unavailable block is valid.

How does a blockchain recover from that?

Protocol-following full nodes will reject forks containing unavailable blocks, and will remain on a fork with available blocks.
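A minimal sketch of an availability check, assuming each header commits to its block body with a plain SHA-256 hash (real chains typically commit via a Merkle root) and that downloaded bodies sit in a local dictionary. All names here are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def body_commitment(body: bytes) -> str:
    """Stand-in for the commitment a header makes to its block body."""
    return hashlib.sha256(body).hexdigest()


def is_available(commitment: str,
                 fetch_body: Callable[[str], Optional[bytes]]) -> bool:
    """A block is available if its body can be fetched and matches the header."""
    body = fetch_body(commitment)
    return body is not None and body_commitment(body) == commitment


def first_unavailable(chain: List[str], store: Dict[str, bytes]) -> Optional[int]:
    """Height of the first block whose body cannot be obtained, or None."""
    for height, commitment in enumerate(chain):
        if not is_available(commitment, store.get):
            return height
    return None


bodies = [b"block 0", b"block 1", b"block 2"]
chain = [body_commitment(b) for b in bodies]

full_store = {body_commitment(b): b for b in bodies}
partial_store = {c: b for c, b in full_store.items() if c != chain[1]}

ok = first_unavailable(chain, full_store)
missing_at = first_unavailable(chain, partial_store)
```

A full node running a check like this would refuse to follow the chain past the missing block, however much work sits on top of it.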

If the community uses enough full nodes to provide enough services, then there will be an economic incentive for the miners to mine on a chain that is accepted by full nodes. This incentive must be high enough for miners not to willingly and knowingly mine on an unavailable blockchain. It is up to the governance layer to make sure that this is the case — if it fails, then we may see consensus on unavailable blocks.

Light clients do not ensure that blocks are available, and so must be helped extra-protocol. This can be done with checkpoints. However, the ideal thing is for the governance layer to cause miners to produce an available longest chain.

How do we prevent this from happening?

The governance layer has to make sure that miners have enough incentive to mine only on available blocks. The easiest way for it to do this is to orphan unavailable blocks.

Conclusion:

Consensus on unavailable blocks in a PoW blockchain must be prevented by the governance layer through the use of full nodes which validate the availability of blocks. The full nodes will not synchronize with the unavailable heaviest chain, and should provide sufficient incentive to miners to return to mining on an available chain. The governance layer should do its best to keep light clients safe during the whole process.

By the way, this becomes more challenging + interesting in blockchain sharding.

Closing thoughts:

It is interesting to see that reversion of blocks (51% attack) has the most asymmetry between what it takes to recover from an attack, and what it takes to conduct an attack. It is easier to recover from a 51% attack than it is to prevent one from ever happening, if there is “extra-protocol finality” provided by the “governance layer”.

Of the other failure modes, two can be prevented/recovered from by enough use of full nodes (validity and unavailability). It is hard to see how an attacker would pull off “unavailability of consensus on new blocks” (in a well-enough-behaved network) or benefit from it. Preventing and recovering from censorship, by contrast, is relatively very difficult (the “governance layer” has to make sure a majority of miners don’t collude to censor, even though it may be in their interest to censor).

If we know we can rely on the governance layer to benefit from this kind of extra-protocol finality, then perhaps we shouldn’t pay miners so much that we can be convinced that 51% attacks never happen no matter what reasonable cost. The cost of a 51% attack to the community is mostly known in this case: it is the cost of making a governance decision on which chain came first, and the cost of disruption to business-as-usual until this decision is made. This is hopefully much less than the cost to the 51% attacker (although that can’t necessarily be guaranteed).

I hope blockchain communities will not be intimidated by threats of 51% attacks, or by the uncertainty around what happens if there is a 51% attack. It is very possible to recover from 51% attacks. I think we could all be less impressed, in this context, about what security PoW mining provides against 51% attacks relative to a context where the governance layer absolutely cannot be counted on for extra-protocol finality.

I think miner entrenchment is a big problem in the public blockchain space. As far as I’m concerned, our collective fear of 51% attacks and willingness to collectively pay ridiculous amounts of money for “security” in order to prevent attacks is a much bigger threat to the success of cryptocurrency than 51% attacks themselves actually happening. I think our fear is giving miners more clout in governance than they would have if we were more informed about what failure in cryptocurrency looks like, how it can be recovered from, and how it can be prevented. I hope this blog helps!
