How to Efficiently Process and Manage Blockchain Data
Author: Klaytn Foundation Core Development Team
As blockchains like Bitcoin and Ethereum continue to evolve, the amount of data they generate keeps growing. This creates storage and processing challenges for the nodes that store the full ledger and validate transactions. Techniques like StateDB Live Pruning help nodes operate efficiently by reducing storage requirements and improving performance.
In this guide:
- The Growing Storage Requirements of Blockchains
- The Role of StateDB in Blockchain Storage
- Why Existing Pruning Methods Have Limitations
- Introducing Exthash to Enable Live Pruning
- Live Pruning Keeps the StateDB Lean
- Ongoing Optimizations for Different Node Types
- Conclusion
- FAQ
The Growing Storage Requirements of Blockchains
At the core of any blockchain is the ledger that records all transactions that have occurred on the network. This ledger is distributed across nodes that participate in consensus and transaction validation. As the blockchain processes more transactions over time, the ledger grows consistently.
For example, the Bitcoin blockchain adds a block of transaction data roughly every 10 minutes. Assuming about 1MB per block, that is 144 blocks and roughly 144MB per day, or over 50GB per year. The Ethereum blockchain, with its more complex data structure and smart contract capabilities, generates data even faster: it produces a block roughly every 12 seconds, so its ledger expands rapidly.
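The Bitcoin estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that only uses the figures quoted in the text (a 10-minute block interval and an assumed 1MB per block); the function names are made up for illustration.

```python
# Rough ledger growth estimate from the figures quoted in the text.
BLOCK_INTERVAL_MIN = 10  # one block roughly every 10 minutes
BLOCK_SIZE_MB = 1        # assumed average block size


def blocks_per_day(interval_minutes: float) -> int:
    """Number of blocks produced in 24 hours at the given interval."""
    return int(24 * 60 / interval_minutes)


def yearly_growth_gb(interval_minutes: float, block_size_mb: float) -> float:
    """Yearly ledger growth in decimal gigabytes (1GB = 1000MB)."""
    daily_mb = blocks_per_day(interval_minutes) * block_size_mb
    return daily_mb * 365 / 1000


print(blocks_per_day(BLOCK_INTERVAL_MIN))                   # 144
print(yearly_growth_gb(BLOCK_INTERVAL_MIN, BLOCK_SIZE_MB))  # 52.56
```

Real block sizes vary from block to block, so the result is an order-of-magnitude figure rather than a measurement.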
All this data needs to be stored by full nodes that hold a complete copy of the blockchain. In mid-2023, the Bitcoin ledger size was close to reaching 500GB. For Ethereum, it had already surpassed 1TB. Storing such massive amounts of data can be cost-prohibitive and technically challenging for individual node operators. As a result, running fully decentralized and independent node infrastructure becomes more difficult.
The Role of StateDB in Blockchain Storage
In many blockchain platforms like Ethereum and Klaytn, a key data structure called the State Database (StateDB) takes up significant storage space. The StateDB holds crucial information like account balances, contract data, and blockchain state that is necessary for validating transactions.
As transactions occur, the StateDB gets updated continuously to reflect the latest state. This means frequent modifications like account balance changes and contract data revisions.
The StateDB uses a Merkle Patricia Trie (MPT) structure to store state data efficiently. The downside is that even a small change ripples upward. Every node is identified by the hash of its contents, so updating one value changes the hash of each node on the path from that value to the root. Each of those nodes is written as a new node, while the old versions stay in the database. Studies show that a single updated node can result in over 10 new nodes being added to the MPT. This write amplification quickly bloats the StateDB. The capacity required is challenging for validators and miners running full nodes to manage.
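The effect can be seen in a much simpler structure than a real MPT. The sketch below uses a plain binary Merkle tree over 1024 leaves, with SHA-256 standing in for the actual hash function. The `put` and `build_root` helpers and the account strings are illustrative assumptions, not Klaytn code. Rebuilding the tree after changing one leaf only adds the nodes whose contents changed, because unchanged nodes hash to keys that already exist.

```python
import hashlib


def put(db: dict, payload: bytes) -> bytes:
    """Store a node under the hash of its contents and return that key."""
    key = hashlib.sha256(payload).digest()
    db[key] = payload
    return key


def build_root(db: dict, leaves: list) -> bytes:
    """Build a binary Merkle tree bottom-up; the leaf count must be a power of two."""
    level = [put(db, b"leaf:" + leaf) for leaf in leaves]
    while len(level) > 1:
        level = [put(db, b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


db = {}
leaves = [f"account-{i}:balance=100".encode() for i in range(1024)]
old_root = build_root(db, leaves)
nodes_before = len(db)

# Change a single account and rebuild. Old nodes are never removed.
leaves[3] = b"account-3:balance=90"
new_root = build_root(db, leaves)
nodes_added = len(db) - nodes_before

print(nodes_before)  # 2047
print(nodes_added)   # 11 (one leaf plus its 10 ancestors)
```

One changed value leaves 11 obsolete nodes behind in this toy tree. Without pruning, every later update adds to that pile.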
Why Existing Pruning Methods Have Limitations
To handle the ever-expanding storage needs of the StateDB, blockchain developers have worked on pruning techniques to safely delete old and unnecessary data.
Pruning methods like StateDB Offline Pruning require taking node servers offline, migrating data to new infrastructure, and re-syncing the blockchain. This causes considerable node downtime ranging from hours to days depending on the blockchain size. Such complex coordination also limits the feasibility of frequent pruning operations.
A key obstacle to pruning a running node is that an MPT stores each node under the hash of its contents, so identical data is stored only once and can be referenced from several places in the trie. Suppose nodes A and B both point to the same stored entry. If B’s reference becomes obsolete and the entry is naively deleted, A’s reference is left dangling, causing irrecoverable data loss and corruption for A. This challenge of managing multi-referenced data has prevented efficient online pruning so far.
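The hazard is easy to reproduce with a toy hash-keyed store. In this minimal sketch, `db`, `node_key`, and `put` are hypothetical names, and SHA-256 stands in for the real hash function.

```python
import hashlib

db = {}  # hypothetical node database: key -> payload


def node_key(payload: bytes) -> bytes:
    # SHA-256 stands in for the chain's real hash function.
    return hashlib.sha256(payload).digest()


def put(payload: bytes) -> bytes:
    key = node_key(payload)
    db[key] = payload  # identical payloads land on the same key
    return key


ref_from_a = put(b"Balance: 100")  # reference held by node A
ref_from_b = put(b"Balance: 100")  # reference held by node B

print(ref_from_a == ref_from_b)  # True: both point at one stored entry
print(len(db))                   # 1

# B's balance changes, and a naive pruner deletes B's old entry.
del db[ref_from_b]

print(ref_from_a in db)          # False: A's reference now dangles
```

A safe pruner would need to know how many referrers each entry has, and tracking that on a live node is expensive.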
Introducing Exthash to Enable Live Pruning
To overcome the multi-reference issue, Klaytn developers introduced an extended hash format called Exthash. Exthash appends a 7-byte serial number to the regular 32-byte hash, giving a 39-byte key. The serial number acts as a unique identifier, so two nodes with the same content no longer end up under the same key.
By replacing regular hashes with Exthash across the StateDB, shared references can be eliminated. Where nodes A and B previously pointed to one stored entry through the same hash, each now points to its own copy under a distinct Exthash. Later, when B’s copy is deleted during pruning, A is not affected because its reference uses a different Exthash value.
To implement Exthash, the existing Merkle Patricia Trie structure was retrofitted to use the new extended hashes. Trie nodes are stored as key-value pairs, where the key is the node’s hash and the value is the node’s data. By generating keys with Exthash instead of regular hashes, no stored entry is shared between referrers.
The 7-byte serial number makes Exthash values unique even for identical data. For instance, if two nodes contain “Balance: 100”, their regular hashes are equal but their Exthash values differ in the serial number. Each is therefore stored as a separate entry, and pruning one of them leaves the other intact.
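The idea can be sketched in a few lines. The 32-byte and 7-byte lengths come from the text above. Everything else is an assumption made for illustration: SHA-256 stands in for the real hash function, the serial number comes from a simple in-process counter, and `ext_hash` and `plain_hash` are made-up names rather than Klaytn APIs.

```python
import hashlib
from itertools import count

HASH_LEN = 32    # regular hash length, from the text
SERIAL_LEN = 7   # serial number length, from the text

_serial = count(1)  # hypothetical source of unique serial numbers


def ext_hash(payload: bytes) -> bytes:
    """Return a 39-byte key: a 32-byte digest followed by a 7-byte serial."""
    digest = hashlib.sha256(payload).digest()  # stand-in hash function
    return digest + next(_serial).to_bytes(SERIAL_LEN, "big")


def plain_hash(key: bytes) -> bytes:
    """Strip the serial number to recover the regular hash."""
    return key[:HASH_LEN]


db = {}
key_a = ext_hash(b"Balance: 100")  # copy referenced by node A
db[key_a] = b"Balance: 100"
key_b = ext_hash(b"Balance: 100")  # copy referenced by node B
db[key_b] = b"Balance: 100"

print(len(key_a))                               # 39
print(key_a == key_b)                           # False
print(plain_hash(key_a) == plain_hash(key_b))   # True

del db[key_b]       # prune B's copy
print(key_a in db)  # True: A's copy is untouched
```

The trade-off is that identical data is now stored once per referrer, which is the incremental overhead the extended hashes introduce.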
With Exthash enabling safe deletion of shared data, the StateDB can now be pruned live while the blockchain is fully operational. This eliminates the downtime and complex re-syncing challenges posed by traditional offline pruning techniques.
Live Pruning Keeps the StateDB Lean
Klaytn implemented StateDB Live Pruning in version 1.11, allowing automatic deletion of old StateDB data. By default, only the StateDB data of the last 2 days is retained, while older information gets periodically pruned.
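A retention window like this can be sketched as a queue of obsolete keys. The class below is a hypothetical illustration, not Klaytn's implementation. It assumes a one-second block interval, so 2 days correspond to 172,800 blocks, and it assumes the node records each key at the block where that key was superseded.

```python
from collections import deque

RETENTION_BLOCKS = 2 * 24 * 60 * 60  # 2 days at an assumed 1-second block interval


class PrunableStore:
    """Toy key-value store that deletes superseded entries after a retention window."""

    def __init__(self, retention: int = RETENTION_BLOCKS):
        self.db = {}
        self.retention = retention
        self.marks = deque()  # (block_number, key), appended in block order

    def put(self, key: bytes, value: bytes) -> None:
        self.db[key] = value

    def mark_obsolete(self, block_number: int, key: bytes) -> None:
        """Record that `key` was superseded at `block_number`."""
        self.marks.append((block_number, key))

    def prune(self, head_block: int) -> int:
        """Delete entries superseded at least `retention` blocks before `head_block`."""
        cutoff = head_block - self.retention
        removed = 0
        while self.marks and self.marks[0][0] <= cutoff:
            _, key = self.marks.popleft()
            if self.db.pop(key, None) is not None:
                removed += 1
        return removed


store = PrunableStore(retention=10)  # short window for the demo
store.put(b"k1", b"old value 1")
store.put(b"k2", b"old value 2")
store.mark_obsolete(5, b"k1")
store.mark_obsolete(12, b"k2")

print(store.prune(head_block=16))  # 1: only k1 is past the window
print(store.prune(head_block=22))  # 1: now k2 is too
```

Deleting by key is only safe because each key has a single referrer, which is the property Exthash provides.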
This keeps the active StateDB size optimal for I/O performance, typically in the range of 150GB to 200GB. With storage requirements reduced substantially compared to a full archive, full nodes can operate smoothly without getting choked by bloated data. The smaller StateDB also provides caching benefits, speeding up various blockchain operations.
The introduction of Exthash does add some incremental overhead related to generating and storing the 7-byte extensions. However, benchmarks reveal the overall system efficiency still improves by over 20% from the storage and caching gains enabled by live pruning.
In addition to storage space, StateDB Live Pruning also saves bandwidth for node operators. When initially syncing a node, the pruned StateDB of the last few days minimizes the amount of data that needs to be downloaded. This makes running a node more accessible for participants across the globe.
Ongoing analysis helps determine optimal pruning frequency and duration. Factors like disk performance, network speeds, and node types impact ideal pruning configurations. The pruning parameters can be tuned over time as blockchain data characteristics evolve.
Ongoing Optimizations for Different Node Types
Live StateDB pruning delivers the most value to consensus and validation-only nodes that require fast access to recent state data. Maintaining historical data primarily for analytics may not benefit such nodes.
Future work involves efficiently separating hot recent data and cold older data across different storage systems. For example, cheaper slow hard disks can store historical data for analysis, while fast SSDs handle hot recent information for time-sensitive processing.
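As a rough illustration of that split, the sketch below routes data between two dictionaries that stand in for fast and slow storage. The `TieredStore` class and its `hot_window` parameter are hypothetical and only show the routing idea, not a planned Klaytn design.

```python
class TieredStore:
    """Toy two-tier store keyed by block number."""

    def __init__(self, hot_window: int):
        self.hot_window = hot_window
        self.hot = {}   # stands in for fast SSD-backed storage
        self.cold = {}  # stands in for cheaper, slower disks

    def put(self, block_number: int, value: bytes) -> None:
        """New data always starts in the hot tier."""
        self.hot[block_number] = value

    def demote(self, head_block: int) -> int:
        """Move entries older than the hot window into the cold tier."""
        cutoff = head_block - self.hot_window
        old = [n for n in self.hot if n < cutoff]
        for n in old:
            self.cold[n] = self.hot.pop(n)
        return len(old)

    def get(self, block_number: int):
        """Read from the hot tier first, then fall back to the cold tier."""
        if block_number in self.hot:
            return self.hot[block_number]
        return self.cold.get(block_number)


tiers = TieredStore(hot_window=3)
for n in range(1, 11):
    tiers.put(n, f"state@{n}".encode())

print(tiers.demote(head_block=10))  # 6: blocks 1 to 6 move to cold storage
print(sorted(tiers.hot))            # [7, 8, 9, 10]
print(tiers.get(2))                 # b'state@2', served from the cold tier
```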
More optimizations like extracting transaction data from recent blocks are also being considered. Transactions can be moved to separate storage while only state changes are kept in the hot StateDB. With Live Pruning proving effective, Klaytn developers can focus on further enhancements tailored to diverse node types.
Conclusion
As blockchain data keeps growing over time, efficient storage and pruning are essential for nodes to operate smoothly. By using Exthash to eliminate shared data references, Klaytn enables continuous live StateDB pruning. This keeps nodes lightweight by retaining only the recent blockchain data required for validation.
Ongoing research is improving pruning further by segregating data across hot and cold storage based on utility. With innovative solutions like Live Pruning, blockchains can scale sustainably despite the immense volumes of data they generate.
FAQ
What is a StateDB in blockchain?
- A StateDB is a database holding account and storage information representing the state of a blockchain. It uses a tree (Merkle Patricia Trie) to organize data.
Why does a blockchain’s StateDB grow so large?
- As transactions occur, the StateDB accumulates data over time. Each change also writes new nodes into the underlying trie while the old ones remain. Without pruning, this steady accumulation consumes substantial storage.
How does StateDB Live Pruning work?
- It uses an extended hash format called Exthash to eliminate shared data references. This allows old StateDB data to be pruned safely while the blockchain operates normally.
What is Exthash?
- Exthash is an extended hash that adds a 7-byte serial number to the regular 32-byte hash. This makes every stored key unique, even for identical data, eliminating the issue of duplicate keys.
What are the benefits of StateDB Live Pruning?
- It improves node performance by keeping the StateDB small. This reduces storage needs and improves I/O and caching performance. It also avoids node downtime, because Exthash makes it safe to delete shared data while the blockchain remains fully operational.