To Clear Things Up: CBC Casper Liveness Issue

Published in

Pyrofex

6 min readMar 19, 2019

In December 2018, we published a note on liveness issues with Ethereum’s CBC Casper consensus algorithm. What followed was a brief Twitter exchange between us, Vitalik Buterin, and Vlad Zamfir. While the results of the conversation are crystal clear to us, we realize that they are probably not so clear to non-experts in the field. We’d like to clarify what exactly our note on CBC Casper liveness implies for the protocol itself, and for platforms built on top of it.

Casper and Liveness

First, let’s look at our writeup on Casper liveness. All consensus algorithms — including Nakamoto’s protocol, Casper, and Casanova — make assumptions about the physical network they run on. In the field, this is referred to as the network model. The network model is crucial to the way that you reason about the algorithm. You can think of it as the universe that the algorithm is supposed to live and operate in. It tells you what’s possible, how fast messages can move, and how miners or validators send messages to each other, how adversaries act, what kinds of failures are possible, and so forth.

Some network models are better than others, for various reasons. They might be more realistic, more practical, or more adaptable than others. They might allow for a stronger adversary, for massive network failures and delays, or for colluding parties in the network that try to muck everything up. Some models are cheaper to build because they allow messages to be dropped, duplicated, or reordered. Some even allow for an adversary to have complete control of message order and delivery on the entire network.

We recently wrote a paper titled Establishing Standards for Consensus on Blockchains (to appear in ICBC 2019) about these kinds of models. Our conclusion is that all things being equal, the fewer restrictions you make about the networks where your protocol will work correctly, the better. In an ideal setting, you should allow for a formidable adversary who knows how to make the worst scenarios happen and consistently does it. You should allow messages to be delivered at any time, in any order, and you should allow at least some of them to be dropped. The Casper team understands this idea and calls this principle the “ideal adversary.”

When we started to try to prove that Casper was live, we put it up to the test that all consensus algorithms should go through. Among other things, we assumed this strong, malicious adversary, we assumed that the network could have weird delays, and we assumed that messages could be delivered at any time, in any order, controlled by the adversary. We didn’t do this because we wanted to be harsh or unrealistic; on the contrary, we did it because we know that only the highest-quality protocols can withstand such conditions.

What we found was surprising, to say the least. Not only was Casper not live under our harsh conditions, but we could construct situations in almost optimal conditions where Casper would stall indefinitely and never come to consensus on a block at hand. Under almost optimal conditions that are better than you realistically obtain on real-world networks, Casper stalls and fails to make progress. That’s what we wrote about, and that’s what the example and the picture refer to.

With that introduction, we want to explain what we, Vitalik, and Vlad said to each other during this exchange. We should make a note here that neither Vitalik nor Vlad are mathematicians, despite being active researchers in crypto. It makes perfect sense that their intuition for rigorous thinking is not as finely-tuned as someone who bases their entire career on rigor. We also make the note that none of RChain, Casper labs, and Ethereum have resolved the issue we brought up in this note.

In response to our article, the following exchange ensued:

Let’s break it down.

Vitalik translated

Vitalik is referencing the ideal conditions that we set up for Casper to fail. We modeled the network on messages coming in “timeslots.” You can think of brief slices of time, broken up so that and we may talk about the state of the network in terms of its evolution from slot to slot.

In the case of our example, the adversary only has to delay messages slightly, by one timeslot or “moment,” in order to get the network to indefinitely argue about a single block, never moving forward.

Vitalik claims here is that liveness requires lower message latency than one timeslot, or step, in the algorithm. Vitalik’s claim either means that the algorithm will be slow, or that the network will run at ludicrous speed. If we want our transactions to finalize quickly over the public Internet, then neither of these is realistic.

In practice, this means that messages must arrive well before miners, or validators, make new blocks. In other words, blocks should be produced at a sufficiently slow rate so that all messages can always arrive on-time. That way, when we model the network, messages would be delivered in less than one timeslot, as Vitalik recommended.

This isn’t so bad. It happens in Bitcoin, Ethereum, and most other consensus algorithms. But there are two deep problems that affect reliability and scale. The first is that an adversary controlling the network could slow the message rate down substantially, to be slower than block production, and starve the network. One of the things our engineers learned while working on large scale systems at Google is that at scale, even improbable failures happen all the time. Say what you like about such failures being too unlikely to worry about, it’s a glaring hole and it’s unsettling to a mathematician such as myself and to our best systems engineers.

The second, much more important, problem is that this inherently limits the protocol’s scalability. If it is true that Casper is live if we slow block production enough, then this requirement makes it inherently unscalable, probably only marginally better than Ethereum is currently.

Pyrofex translated

Our response cites an academic paper by Dolev, Dwork, and Stockmeyer about consensus algorithms. Essentially what it says it that it’s not necessary to have this condition that message speed needs to be much faster than block production rate in order to prove liveness. In other words, it’s totally possible for an algorithm to have messages and blocks go at whatever rate you like, and still get liveness.

Whether Vitalik meant that any algorithm needs this message/block rate condition to be live or that Casper needs it, it makes no difference. If he meant the former, we showed him otherwise and he is simply wrong. If he meant the latter, then we’re showing that Casper is not very good compared to other possible choices.

In particular, because Casper requires this condition to resolve our issue, it’s inherently unscalable. This is mathematically proved. Remember that the race in the blockchain space is to be as good as or better than Visa transaction rates. Casper doesn’t even stand a chance.

Casanova, on the other hand, doesn’t need Vitalik’s condition. The C∆ blockchain we are building on top of Casanova, will scale just fine. In contrast with Casper, we have proofs of both safety and liveness, even in the presence of a maximally powerful adversary and with a deeply unreliable network.

Vlad translated

Vlad seems to suggest that, baked into the CBC Casper spec, there’s some validator strategy for Casper that makes everything okay. That is, some mechanism for deciding when to send messages, create blocks, and so forth will, if implemented by the nodes, make everything work again. This claim is truly mystifying because it’s so antithetical to the spirit of blockchain. If I require that validators implement this strategy, am I not placing significant trust in their good behavior? How will the network detect bad behavior? He doesn’t say.

Moreover, CBC Casper does not provide such a strategy. Mathematically, you’re allowed to assume any strategy you like, which we did.

Our assumed model is not only common but perfectly reasonable. Our model assumes that validators produce blocks at fixed intervals and that they populate those blocks with the information they had at the time of production. Producing blocks too fast violates Vitalik’s constraint, so Vlad’s validators can’t be allowed to do that. Producing them too slowly means they lose out on potential fees and rewards, so validators are disincentivized to do that. What “strategy” exactly does Vlad think we “should” have chosen? We can’t tell, but we don’t think there is one.

Conclusion

To wrap this conversation up,

Our note still shows an important liveness flaw that, at the very least, shows Casper is either slow or won’t scale.
Vitalik’s response displays Casper’s inferiority.

Casanova will be one of the first provably safe, live, and correct consensus algorithms that will operate at global transaction scale. We’ve designed C∆ to complement and augment Casanova’s inherent strengths. It will support the first cryptocurrency to truly scale and bring the benefits of peer-to-peer transactions to businesses and individuals globally.