The origin of blockchain technology is usually simply attributed to the actions of some mysterious genius working under the pseudonym of Satoshi Nakamoto. This view of history usually makes the technology seem almost magical and implicitly hard to understand. It is a lot more grounding when one understands the actual history behind the development of these systems. In this section, I will outline the most important topics from the decades of research in consensus algorithms and distributed systems that led up to the creation of blockchains.
What is a Distributed System?
Leslie Lamport, one of the fathers of the field of distributed systems, once famously said,
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
This quote encapsulates why it is so difficult to design distributed systems. Each node in the system is part of the larger network and relies on communicating with other nodes in order to get something done. The abundance of moving pieces makes these systems some of the hardest to develop. However, the benefits of being able to use the power of many computers linked together motivated decades of computer scientists to research ways to overcome these challenges.
It is a fact of life that computers are not perfect and are prone to breaking. However, these faulty machines may not always be tolerable for services that mandate high availability such as many of today’s internet companies. Imagine if Google was run on a single server that had to be kept up and running at all times. If that single server was faulty, then the entire service would be down. One of the most important motivations of distributed systems is to make a service more fault tolerant. That would mean the system itself can tolerate some of the nodes failing and continue the service it provides.
Nodes can be faulty at different severity levels. The failure models going from least to most severe are as follows:
1. Crash Failures — Nodes in the system can crash and stop responding forever.
2. Omission Failures — Nodes in the system can reply to any message in the system infinitely late. Like a crash but the node is still up and can possibly transmit future messages like normal and can pick and choose which messages it omits.
3. Byzantine Failures — Nodes can not only crash and stop responding, but can actively lie and send conflicting messages to different nodes. Nodes can also forge messages from a node to another node. This is a failure model where you can’t simply trust nodes to tell the truth.
It is very important to note these different failure models since not all protocols can solve all failure models. But any protocol that manages a more secure failure model automatically manages any lesser failure model.
Since all nodes in a distributed system must communicate over a network, the network characteristics are central to how the system performs. A few types of networks going from most reliable to least are as follows:
1. Synchronous — A network where there is a known upper bound on message delivery times between nodes (we are assuming message passing communication) and a known upper bound on differences between node speeds. Basically in a synchronous system, if a message doesn’t appear in time then you can assume the sender is experiencing a failure (crash, omission, or byzantine).
2. Partially Synchronous —A network that is a mix of synchronous and asynchronous. In this network there is an unknown bound on message delivery times and the difference between node speeds.
3. Asynchronous — A network where there is no upper bound on message delivery times between nodes and no upper bound on differences between node speeds.
If a protocol is rated to work in a less reliable network then it also works in a more reliable network. So if a protocol works in an asynchronous environment then it works in all of the environments which is why it’s seen as the golden standard of protocols. It should also be noted that it is much more difficult to solve for less reliable network models.
Most uninitiated people may believe the internet to be a relatively synchronous network to connect nodes. This assumption is in fact very false and nodes on the internet cannot guarantee any sorts of reliability when communicating. However, it should be noted that just because the internet cannot guarantee reliability doesn’t mean it cannot provide it sometimes.
State-machine replication (SMR) is a general framework for replicating a state across many replicas by making all the replicas go thru the same deterministic events. By running the same deterministic events on all servers, all the servers should arrive at the same state. This may look familiar to our current day blockchains and Leslie Lamport came up with the formulation all the way back in 1984.
One way of solving SMR is to achieve consensus among the nodes in the network.
Consensus is a topic that is pivotal in all distributed systems including blockchains. Consensus is the algorithm that describes how the individual nodes in the network come to a common state. This problem is actually more difficult than it might seem on the surface. Throughout the years of research, there have been few seminal proofs about consensus that frame how difficult the problem really is. The proofs will also help point out any potential “too good to be true” promises made by some dubious blockchain companies. While I won’t go through the specifics of each proof, I will try to do a high level explanation of what they entail.
The most important results are as follows:
FLP Impossibility (1985)
There is no deterministic protocol that solves consensus in a message-passing asynchronous system even in the presence of a single crash failure.
This means that a protocol must be either randomized or assume some sort of stronger network model to solve consensus in the presence of any sort of failure. This doesn’t address Byzantine failures which are even harder to solve. I highly encourage you to try reading this groundbreaking paper here.
CAP Theorem (1999)
It is impossible for a distributed system to provide more than 2 out of 3 of the following guarantees:
Consistency — Every read request gets the most up to date write
Availability — Every request gets a response of success/failure
Partition Tolerance — The network will be allowed to lose arbitrarily many messages sent from one node to another and still function properly
In today’s internet, Partition Tolerance is a necessity and cannot be given up since the network isn’t reliable so the compromise must be made between Consistency and Availability. So distributed systems must make deal with the trade-offs between Consistency and Availability. If a protocol is 100% Consistent then it definitively must not be Available at some point and vice versa. The proof behind this is quite elegant and understandable so I recommend you read more about it here.
Consensus is impossible to achieve if more than 1/3 of the nodes are suffering from a Byzantine fault in a synchronous network.
This means that to reach consensus, we need 2/3 of our nodes to be honest and functional. You can read more about the proof in the original paper here. Some of you may be confused by this proof since you may know that PoW blockchains are safe up to a 51% attack which is clearly more than 1/3. This is a result of the fact that PoW blockchains are not actually reaching consistency since they every transaction can technically be rolled back and changed by an adversary with infinite computational power.
These results will help us analyze all consensus techniques and also frame blockchains in a way that we can examine the guarantees and safeties they provide.
Traditional Consensus Algorithms
There are a few noteworthy traditional consensus algorithms that became widely used before the invention of blockchains.
The most widely used early consensus protocol was Paxos. Paxos was invented by Leslie Lamport and provided safety when at least 1/2 of the nodes are not faulty in the presence of crash failures and in an asynchronous network. The fact that Paxos only protects from crash failures is important because this requires all the nodes to be trusted (they shouldn’t be able to lie or they would be Byzantine) which was most useful to large companies trying to build high availability services.
The observant readers may be thinking I made a mistake since I said Paxos can withstand 1/2 crash failures in an asynchronous setting but had already previously said this was impossible in the FLP Impossibility. The way Paxos gets around FLP is by not being live at all times. Paxos requires a leader to be live and make progress and leader election is provably impossible in an asynchronous setting. So Paxos needs some window of synchrony in order to elect a leader and then it is ready to make progress in an asynchronous setting. In all of its execution, Paxos is always safe even in an asynchronous setting.
So Paxos is seemingly the best that can be done given the constraints of the FLP Impossibility and that is why it is probably the most notable consensus algorithm in distributed systems research. Paxos is always safe and only live during periods of synchrony.
Practical BFT (1999)
The first actually feasible Byzantine Fault Tolerant (BFT) algorithm was PBFT. It, similar to Paxos, is always safe even under an asynchronous network and only live under weak synchrony.
Byzantine Paxos (2010)
A BFT version of the original Paxos that is safe in the presence of 1/3 Byzantine faulty nodes under the same network assumptions. This was invented by (you probably guessed it) Leslie Lamport as an extension of his original Paxos paper.
A popular modern consensus algorithm seen as a simpler, more understandable version of the original Paxos. The guarantees and limitations it provides are said to be equivalent to Paxos while the whole protocol is easier for engineers to work with.
While there are plenty more notable advancements in the research of distributed systems, these protocols highlight the most popular and influential works of their time.
Now that we have a good base in distributed systems, we can start getting into blockchains!
Origin of Proof of Work (PoW)
A lot of people wrongly credit Satoshi Nakamoto with the creation of the PoW algorithm in Bitcoin but in reality it is a much older idea created in the mid-1990’s. It was used in production early on as an email anti-spamming algorithm by Hashcash. Hashcash required senders to complete PoW (very similar to the algorithm Bitcoin uses today) to send an email so that spammers have to waste a lot more resources to send many emails.
However, it is important to note that early use cases of PoW did not use it to solve consensus but instead used it to force people to provably spend a desired amount of resources to do an action.
Origin of Bitcoin
This is where the mythology of Bitcoin really takes over. As the legend goes, an unnamed person going under the pseudonym Satoshi Nakamoto releases a protocol for a decentralized digital peer-to-peer system for digital money in October 31st, 2008. I would urge you to read this short and brilliant paper. While the idea behind Bitcoin is revolutionary, it can now start to make more sense in the context of all the other systems presented rather than all on its own.
Bitcoin proposed a blockchain as a solution to probabilistic consensus in a synchronous network. It is BFT from Byzantine adversaries that have less than 50% of the of the total network hash rate (a different way to gain security than other historic protocols). Blockchains favor availability over consistency by design and can support thousands of decentralized nodes. The cost of the decentralization is a much lower throughput measured in transactions per second (TPS).
I will go into more detail about these guarantees and others in my next post!
Thank you for Reading!
If you have any questions/comments feel free to ask in the comments.
Please stay tuned for more posts as I dive even deeper into blockchains!