This article is the first in a series of articles exploring consensus and finality in Polkadot.
Anyone spending enough time studying blockchains has run into the concepts of consensus, finality, and perhaps even “safety.” These concepts often represent a high barrier to entry when it comes to understanding how blockchains really work. So let’s start with a primer on consensus.
One common misconception about blockchain consensus is that it’s about producing a valid history of transactions. In fact, anybody can make a sequence of transactions on their own computer which is valid. The hard part is getting everyone to agree on which valid history of transactions to use. This is where consensus comes in.
Classical consensus algorithms provide two main properties: safety and liveness. Safety is the property that ensures honest participants don’t agree on two conflicting things (blocks, in this case). Liveness is the property that ensures something is eventually agreed upon by honest participants. In the context of blockchains, safety is the same thing as finality. That is, once a block is finalized, the canonical chain will always contain that block in the future. In blockchains, liveness means that the chain keeps growing and, furthermore, that valid transactions will eventually be included and finalized.
All consensus algorithms provide these properties of safety and liveness only under certain assumptions. These assumptions are usually phrased in terms of what proportion of the actors in the system are Byzantine — what proportion of the actors can be offline or even behave outright maliciously. There is also the network assumption: how long it takes to deliver messages, whether messages can be received out of order or even dropped. Network assumptions are generally split into these categories:
- Synchronous: Messages are always delivered within some time T of being sent, which all participants are aware of.
- Partially-Synchronous: There is some known Global Stabilization Time (GST), before which messages may be delayed arbitrarily, and after which messages are delivered within T.
- Asynchronous: Messages are eventually delivered.
It’s important to note that messages may be delivered in different orders to different participants. These network assumptions also determine what the maximum possible proportion of byzantine participants can be. When the network is partially-synchronous or asynchronous, it must be less than ⅓. To show why this is, imagine that ⅓ of nodes were malicious and that it takes a supermajority (>= ⅔) to reach consensus. Then the malicious nodes could split the remaining ⅔ of nodes into two groups: A and B. The malicious nodes could then participate in two different supermajority groups, with both A and B, leading to a safety violation. This is only possible when communication between A and B is delayed longer than the time it takes to reach consensus, which can be the case in the partially-synchronous and asynchronous network models.
When applied to blockchains, note that block safety is not always reached instantly. For example, in Proof-of-Work, or Nakamoto consensus, safety is technically never reached. Instead, the safety reached is probabilistic, and argued through a “common-prefix” property, which states that if each honest party ignores the last k blocks from their chain, the probability that they all agree on their chain is overwhelming (with sufficiently high choice of k). Proof-of-Work and other probabilistic safety protocols also have the caveat that they do not provide asynchronous safety, which means that sufficiently large network partitions or attacks can lead to chaotic failure.
In a public blockchain network, we require an economic incentive for participants not to break the safety property and revert blocks which have been agreed upon. As opposed to many consensus algorithms which lean on a simple honest majority assumption, we actually need something stronger: accountable safety. Consider a consensus protocol which is safe unless ⅓ or more of participants misbehave. As long as fewer than that amount of participants misbehave, then everything is fine. But if they do, we want to be able to prove that they did. That’s accountable safety: participants in consensus can be held accountable (i.e. have some security deposit burned) if they misbehave. Under accountable safety, we get a stronger definition of finality: a block cannot be reverted unless at least ⅓ of stake in the system is burned.
Typically, consensus protocols which provide absolute (non-probabilistic) safety require a lot of messages to be passed around for everything they agree upon. Often, the number of messages that needs to be sent between participants scales badly with the number of participants in the consensus protocol. This makes it expensive to perform this kind of agreement on every block, as is done in the instant-finality family of protocols. For this reason, it’s often beneficial to separate the creation of blocks from the finalization of them. This approach leads to a two-prong approach to blockchain consensus: a chain growth system and the finality gadget. Having the finality gadget finalize many blocks at once is beneficial for efficiency, because it reduces the number of messages that must be passed between consensus participants on the network. This hybrid model makes it possible to grow the chain just as fast as in probabilistic safety consensus like Ouroboros or Aurand, but also provide the same level of security guarantees that instant-finality consensus does. This is why we are using this kind of hybrid consensus system in Polkadot.
Since a finality gadget does not produce blocks, we need another way to think about liveness, which we defined before as the property that something is output by the consensus process. Instead, we phrase the liveness property of the finality gadget as being dependent on some properties of the block production layer. That means that as long as the method used to add blocks to the chain behaves a certain way, the finality gadget will always have something new to finalize. This is a very strong property, because it means that we will always be able to add irreversible transactions to the chain.
Polkadot uses a consensus model based on the description above, a hybrid model that separates block production from finality on those blocks. Our goals are for the block production layer to be fast and probabilistically safe. The finality gadget should be asynchronously safe, should provide accountable safety, and should be able to finalize many blocks at once. We have invented GRANDPA (GHOST-based Recursive Ancestor Deriving Prefix Agreement), a finality gadget which provides these things and more.
The key insight behind GRANDPA is the idea of incorporating the blockchain’s structure into the consensus algorithm. One piece of intuition is that considering a block valid implies considering that block’s parent valid, and so on. Here is a simplified explanation of the algorithm: rather than voting on a single block, we allow participants to vote on the highest block they think is valid, and the algorithm transitively applies the vote to all ancestors of that block. The algorithm then determines the best block which has a >⅔ supermajority amount of votes, and produces a proof-of-finality. The proof-of-finality is constructed by taking the supermajority votes, and bundling them up together into a single message. Signature aggregation can be used to make this smaller.
Because of this strategy of finding a highest common ancestor which a supermajority considers valid, GRANDPA is also adaptive: it can finalize a new block regardless of how many blocks have passed since the last block was finalized. If network latencies are low, GRANDPA can finalize blocks almost instantly, and when recovering from a long network partition, GRANDPA can finalize millions of blocks at once without any message overhead. The operation of finding a common prefix of the blockchain which all participants can agree upon is transformed into a voting procedure which quickly finds the common prefix.
Voting in GRANDPA happens off-chain, and finality is not registered on-chain. However, GRANDPA participants are staked on-chain, and conflicting proofs of finality will lead to “equivocating” or double-voting participants to be discovered and punished. The procedure for discovering which participants equivocated will be described in a future post, and consists of a challenge-response protocol off-chain, which succeeds as long as at most ⅔ of participants are malicious. GRANDPA is only secure as long as historic participants are on the hook for their punishment, and thus falls into the “weak subjectivity” model of security, where participants have to go through a fairly long withdrawal period (perhaps a few months) in order to unlock their stake and reap their rewards. As long as users of the system update periodically to the latest set of consensus participants, they will remain secure.
This is just a teaser of GRANDPA. The next articles will, among other things, dive into its inner workings, how GRANDPA provides accountable safety, liveness, deals with dynamic validator sets, benchmarking performance, and more. The first implementation is underway in Rust at https://github.com/paritytech/finality-afg, and a document describing the gadget in detail is here: https://github.com/w3f/consensus/blob/master/pdf/grandpa.pdf
Stay tuned for more information!