An Explanation of Data Guarantees
TL;DR: The core components of the Klaytn blockchain performed exactly as they should, without fail, and this is precisely why the network had an outage.
Starting on November 13th, 2021, the Klaytn network experienced an extended outage. If you are interested in the technical cause and the steps the Klaytn team and community took to resolve the outage, you can find the details in our previous post, the incident report.
However, I think it is really important to add some perspective on the recent Klaytn outage, because that perspective can get lost in the technicalities of the situation (especially for those who are still learning about blockchain as a technology).
The key takeaway from the outage is the following: the core components of the Klaytn blockchain performed exactly as they should without fail, and this is precisely why it had an outage.
There is a computing concept that is particularly important for blockchain architecture: the CAP Theorem. CAP stands for Consistency, Availability, and Partition Tolerance.
- Consistency refers to the integrity of data: maintaining the correctness of the information we validate and store on the Klaytn blockchain.
- Availability refers to the blockchain continuing to operate and process transactions.
- Partition Tolerance refers to the ability to continue operating despite a lack of connectivity between some of the nodes of the network.
The CAP Theorem states that a distributed system (like the Klaytn blockchain) cannot guarantee all three of these properties at the same time, only two. This, in turn, means that when you design a blockchain system you must choose which two of these three properties your system will prefer.
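The trade-off can be made concrete with a toy model. The sketch below is only an illustration under simplifying assumptions: two copies of a single value, a boolean `partitioned` flag, and a hypothetical `prefer` setting. None of these names come from Klaytn or any real system.

```python
class Replica:
    """One copy of a single replicated value."""

    def __init__(self):
        self.value = 0


def write(replicas, target, value, partitioned, prefer):
    """Attempt a write at replica number `target`.

    prefer="consistency": refuse the write while the copies cannot sync.
    prefer="availability": accept it locally and let the copies diverge.
    Returns True if the write was accepted.
    """
    if not partitioned:
        for r in replicas:          # healthy network: every copy is updated
            r.value = value
        return True
    if prefer == "consistency":
        return False                # unavailable, but no copies disagree
    replicas[target].value = value  # available, but the copies now disagree
    return True


cp = [Replica(), Replica()]
ap = [Replica(), Replica()]
cp_ok = write(cp, 0, 42, partitioned=True, prefer="consistency")
ap_ok = write(ap, 0, 42, partitioned=True, prefer="availability")
print(cp_ok, [r.value for r in cp])  # False [0, 0]
print(ap_ok, [r.value for r in ap])  # True [42, 0]
```

During the partition, the consistency-preferring pair refuses the write and stays in agreement, while the availability-preferring pair accepts it and ends up with two different answers.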
Generally, all blockchains require some amount of Partition Tolerance because they are global networks which will (at times) have failed connections, dropped messages, and other communication faults that they need to protect themselves against.
To achieve Partition Tolerance, some blockchains (like Bitcoin) use Proof of Work for their consensus, which has a partition tolerance of about 50% of the network's hash power. Others (like Klaytn) use a Proof of Stake model based on Practical Byzantine Fault Tolerance, which generally has a tolerance of about 33% of the validators. This tolerance limit describes the network’s ability to continue operating despite participant connectivity issues or consensus disagreement about the state of the network. It represents the proportion of faults a network can handle before it has to choose which of the other two CAP properties to maintain: Availability or Consistency.
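The 33% figure for Byzantine Fault Tolerant consensus follows from the standard requirement that a network of n validators needs n ≥ 3f + 1 to tolerate f faulty ones. A minimal sketch of that arithmetic is shown below; the function names and the validator counts are illustrative assumptions, not Klaytn's actual configuration.

```python
def bft_max_faulty(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1."""
    return (n - 1) // 3


def bft_quorum(n: int) -> int:
    """Votes needed to commit a block: n - f (equals 2f + 1 when n = 3f + 1)."""
    return n - bft_max_faulty(n)


def can_commit(n: int, responsive_honest: int) -> bool:
    """True if enough validators are reachable and agree to reach quorum."""
    return responsive_honest >= bft_quorum(n)


for n in (4, 10):
    print(n, bft_max_faulty(n), bft_quorum(n))
# 4 1 3
# 10 3 7

print(can_commit(10, 7))  # True: 3 of 10 validators can be lost
print(can_commit(10, 6))  # False: the threshold is broken, no block commits
```

Once more than f validators are unreachable or disagree, a quorum can no longer form, which is the point at which the network has to pick between Availability and Consistency.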
Some blockchains (like Bitcoin or Ethereum) choose Availability. This means that if there is a partition in the network, or a disagreement in the consensus process, that exceeds the Partition Tolerance level, the system will choose to keep the network operational and Available. This choice comes with a trade-off: this kind of network has to sacrifice Consistency (data integrity), so any data recorded during the partitioned period is not 100% finalized and could be changed. In fact, Bitcoin is so strongly available that it exhibits a property called “probabilistic finalization”, which means that none of the data on the Bitcoin blockchain is ever actually 100% finalized. Instead, the longer ago the network recorded a piece of data and the deeper it sits in the blockchain, the more probable it is to be correct and consistent. Bitcoin has had two large historical cases where its data was rewritten because of its preference for Availability.
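To give a feel for what "more probable the deeper it is recorded" means, the sketch below implements the attacker catch-up calculation from the Bitcoin whitepaper (section 11). Here `q` is the attacker's assumed share of hash power and `z` is the number of blocks built on top of a transaction. The function name is my own, and the model is the whitepaper's simplified one rather than a full security analysis.

```python
import math


def attacker_success_probability(q: float, z: int) -> float:
    """Chance that an attacker with hash share q rewrites a block z deep."""
    p = 1.0 - q                 # honest share of hash power
    lam = z * (q / p)           # expected attacker progress (Poisson mean)
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        total -= poisson * (1.0 - (q / p) ** (z - k))
    return total


for z in (0, 1, 5, 10):
    print(z, f"{attacker_success_probability(0.1, z):.7f}")
```

With an assumed 10% attacker, the probability falls from certainty at zero confirmations to below one in a thousand at five, but it never reaches exactly zero. That is the sense in which the data is never 100% finalized.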
Other blockchains choose Consistency instead. For strongly consistent blockchains (like Klaytn), this means that their data is “instantly finalized”. In other words, the data recorded to the Klaytn blockchain is guaranteed to be correct and consistent with the rules of crypto-validation as soon as it is placed in the blockchain records. The rules of crypto-validation are very strong and cannot be fooled or broken, because they are based on equally strong mathematical principles. They are the foundation of how blockchains achieve their trust system despite being open distributed systems. The trade-off for choosing strong Consistency is that the network sometimes becomes unavailable when the Partition Tolerance threshold is broken.
This is exactly what happened on the Klaytn network on November 13th, 2021. There was an unchecked memory sharing bug in the Account Updating transaction code. This bug attempted to introduce data onto the Klaytn network that broke the very strong rules of crypto-validation. Klaytn’s consensus algorithm is configured not to let this happen (remember, it prefers Consistency and is instantly finalized), and it did its job perfectly. Consensus did not allow the block with this incorrect data to be recorded into the blockchain, and because of this we experienced network unavailability, more commonly known as a network outage.
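The behavior described above can be sketched as a toy commit rule. This is a simplified illustration and not Klaytn's consensus code: the `valid` flag stands in for the real crypto-validation rules, every validator is assumed honest and reachable, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    height: int
    txs: list = field(default_factory=list)


def is_valid_tx(tx: dict) -> bool:
    """Stand-in for real validation; assumes each tx carries a 'valid' flag."""
    return tx.get("valid", True)


def try_commit(chain: list, block: Block, n_validators: int) -> bool:
    """Append the block only if a BFT quorum of validators votes for it."""
    f = (n_validators - 1) // 3
    quorum = n_validators - f
    # Honest validators vote only when every transaction passes validation.
    block_ok = all(is_valid_tx(tx) for tx in block.txs)
    votes = n_validators if block_ok else 0
    if votes >= quorum:
        chain.append(block)
        return True
    return False   # nothing is recorded: consistent, but unavailable


chain = []
print(try_commit(chain, Block(1, [{"id": "a"}]), 10))                  # True
print(try_commit(chain, Block(2, [{"id": "b", "valid": False}]), 10))  # False
print(len(chain))                                                      # 1
```

The second block is never written, so the chain contains no incorrect data, but it also stops growing for as long as every proposed block fails validation.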
To make matters worse, Klaytn has a very robust gossip model for transferring data between the nodes of the network, which makes new data accessible to the rest of the network as quickly and efficiently as possible (this helps to prevent network partitions). In addition, transmitted data is temporarily stored on nodes around the network so it can be passed on to their neighbors, which increases data propagation even more. This extended the outage, because the invalid transaction (the one breaking the crypto-validation rules) was readily accessible every time the network tried to restart the process of building and accepting a new block.
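A toy loop shows why a well-propagated invalid transaction keeps the network stuck. It is a sketch under strong assumptions: a single shared pending pool, a proposer that packs everything in the pool, and a hypothetical `evict_invalid` switch that is included only for contrast and does not describe how the Klaytn team actually resolved the incident.

```python
def run_rounds(pool: list, rounds: int, evict_invalid: bool) -> int:
    """Run block-building rounds and return how many blocks were committed."""
    committed = 0
    for _ in range(rounds):
        proposal = list(pool)   # the proposer packs every pending transaction
        if all(tx["valid"] for tx in proposal):
            committed += 1
            pool.clear()
            # Simulate new traffic arriving for the next round.
            pool.append({"id": f"tx{committed}", "valid": True})
        elif evict_invalid:
            # Drop the offending transactions so the next round can succeed.
            pool[:] = [tx for tx in pool if tx["valid"]]
        # Otherwise the invalid tx stays pooled and is proposed again.
    return committed


def make_pool():
    return [{"id": "good", "valid": True}, {"id": "bad", "valid": False}]


stuck = run_rounds(make_pool(), rounds=5, evict_invalid=False)
recovered = run_rounds(make_pool(), rounds=5, evict_invalid=True)
print(stuck, recovered)  # 0 4
```

As long as the invalid transaction remains available to every new proposal, each round fails in the same way, which matches the repeated restart failures described above.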
In the end, three factors combined to cause the outage we as a Klaytn community experienced recently:
- The Klaytn network makes its data accessible as quickly and efficiently as possible through robust propagation.
- The Klaytn consensus algorithm holds data integrity as the highest standard.
- An edge-case memory sharing bug in transaction processing broke the rules of crypto-validation.
It is our hope that this explanation sheds some light on the events that led to the outage and brings some openness and visibility to our community members. That being said, Klaytn will continue to strive to improve these mechanisms so that outages can be avoided, and will continue to look for ways to make the network more robust and resilient. With the support of our team and our community, we are certain to achieve these goals moving forward.
In particular, Klaytn is launching wider initiatives to increase community inclusivity, along with open source and open network programs that focus on community developer communications and relations. Alongside these, we are working hard to bring better analytics and explorer tooling, as well as bounty programs that reward the community for helping make Klaytn the best it can be.
By Terry Wilkinson, Head of Development at Klaytn
About Terry Wilkinson — Head of Development at Klaytn
Terry Wilkinson is a distributed ledger technology expert with over 7 years of industry experience. He is currently the Head of Development for Klaytn where his focus is on building Klaytn 2.0 into a globally competitive blockchain base layer for the next generation of decentralized products and services.
Over his career, Terry has worked on several blockchain-centric startup and enterprise products, including FinTech, health industry, NFT marketplace, and DeFi solutions. He has had eyes (and hands) on many of the most popular Layer 1 protocols, among them Ethereum, Tendermint/Cosmos, Polkadot, Bitcoin, and Klaytn.