EVM Blockchain Scalability & ACID Database Properties

The main data structure that holds accounts, balances, code and contract storage in an EVM-based blockchain is the world state trie. The efficient management of this data structure is one of the most important topics in scaling these blockchains. How and where the trie is stored, and how fast it can be accessed and modified, are crucial to the performance of the network client. If the trie fit in RAM, most of the problems would be solved. But it generally does not, so how the trie is persisted is of extreme importance. Ultimately the trie is composed of nodes that are serialized and stored in a database on disk. We’ll call this state database the statebase, for short. Currently the statebase must be stored on Solid State Drives (SSDs), as hard disks are too slow for the scattered reads that network clients usually require.

Over the past few years, network clients improved as the block gas limit was increased, but improving the statebase has been challenging. The statebase is currently the main EVM performance bottleneck. For example, the Erigon/TurboGeth team switched statebases three times to improve performance. A new statebase (LMPT) was also designed to tackle this problem.

The problem has also been tackled from the protocol standpoint. Access-lists (EIP-2930) enable prefetching statebase data to reduce later stress during block execution. EVM state I/O opcodes were repriced several times to account for statebase slowdowns as the state grew (see EIP-1884 for the latest repricing). The Ethereum Foundation and Rootstock researchers have explored storage rent for several years in an attempt to reduce the I/O bottleneck in full nodes. Currently most of the research relates to the design of stateless clients, trying to remove state database accesses from network clients altogether. Yet most of the knowledge about the statebase structure is scattered over forum posts and articles.

In this article we concentrate on the ACID properties of the statebase. We show that the ACID properties generally desired for large distributed databases are not mandatory for a statebase, and that by relaxing some of these properties we can provide superior statebase I/O performance. The fact that ACID properties add a lot of overhead is not new; similar findings were reported in 2008, when large in-memory databases started to be used. The ideas and results shared in this article were the insight that triggered IOV Labs’ research and development of FlatDB, a new database engine optimized for statebases that shows a 20x improvement in read performance in our synthetic benchmarks.

ACID Properties for the Statebase

Modern databases provide the “ACID” properties. The ACID properties of a database are: Atomicity, Consistency, Isolation and Durability. We present a simplified and brief explanation of these properties, detailed enough for our purposes.

Atomicity. In a transaction involving writing two or more discrete pieces of information, either all of the pieces are committed or none are.

Consistency. A transaction either creates a new and valid state of data, or, if any failure occurs, returns all data to its previous state before the transaction was started.

Isolation. A transaction in process and not yet committed must remain isolated from any other transaction.

Durability. Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in its correct state.

The ACID properties allow databases to be used successfully for the most stressful data storage and retrieval tasks. Since a database must be generic enough to support the most demanding use cases, most modern distributed databases provide ACID guarantees.

The key element of the ACID properties is a transaction. A transaction is a data retrieval from or update to the database that must satisfy atomicity (either it is performed completely, or it is not performed at all), consistency (it takes the database from one valid state to another), isolation (all the data retrieved while processing the transaction corresponds to a single snapshot of the database, and at the end of the transaction a new snapshot can be taken) and durability (once committed, the results of the transaction are never reverted). Additionally, all changes to the database must be made through transactions. In this regard, a transaction represents the “minimum” amount of allowed change.

Most key/value databases are not considered transactional, and in LevelDB some of the ACID properties are optional. But this depends on the definition of a transaction. A typical transactional database transaction is specified in a language such as SQL. Key/value database transactions are much simpler: for example, they do not allow atomic compare-and-swap operations, nor atomically incrementing the value associated with a key. LevelDB cannot perform atomic writes that depend on reads, as it doesn’t provide a language to construct such dependencies. This restriction decreases the complexity of the database engine and enables performance optimizations. As alternative features, some key/value databases support batch writing and some support read snapshots. The Geth client works quite well using LevelDB.
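
To make the batch-write and snapshot features concrete, here is a minimal sketch using the org.iq80.leveldb Java API (the choice of binding is an assumption; other LevelDB bindings expose equivalent calls). It writes two key/value pairs together as a batch and then reads through a snapshot:

    import org.iq80.leveldb.*;
    import static org.iq80.leveldb.impl.Iq80DBFactory.*;
    import java.io.File;

    public class BatchWriteExample {
        public static void main(String[] args) throws Exception {
            // Open (or create) a LevelDB database on disk.
            DB db = factory.open(new File("statebase"), new Options().createIfMissing(true));
            try {
                // All puts in the batch are committed together.
                WriteBatch batch = db.createWriteBatch();
                batch.put(bytes("node-hash-1"), bytes("serialized-node-1"));
                batch.put(bytes("node-hash-2"), bytes("serialized-node-2"));
                db.write(batch);
                batch.close();

                // A read snapshot provides a stable view of the data.
                Snapshot snapshot = db.getSnapshot();
                byte[] value = db.get(bytes("node-hash-1"), new ReadOptions().snapshot(snapshot));
                System.out.println(asString(value));
                snapshot.close();
            } finally {
                db.close();
            }
        }
    }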

Ethereum’s Statebase

The Ethereum yellow paper does not specify what type of database needs to be used by a network client. Node implementers are free to experiment with different database engines. When designing Ethereum, Vitalik opted for a data structure where nodes are linked by their hash digest, which adds a level of abstraction to the way nodes are stored. An EVM-compatible blockchain could instead use a world state data structure where nodes are identified directly by their persistent storage locations, but no EVM-compatible blockchain has attempted to do so. Luckily, even though the hash indirection presents challenges, node data is still accessed in certain structural patterns, and clients can exploit this inherent structure to their benefit.

In this article we’ll show that when we take into consideration both the inherent structure of Ethereum’s statebase and the real-world requirements of a blockchain network client, we can use a database engine with relaxed ACID properties. The relaxed statebase can provide the same level of robustness if the full node follows a special protocol to handle writes (it must obey a certain order) and handles failure cases correctly. To justify this, we start by defining what a statebase transaction is, and then examine the ACID properties in relation to these transactions.

Simplified Transactions

The statebase mainly stores trie nodes, and sometimes also the nodes’ long values. In all standard full nodes, a statebase transaction is either a batch of key/value write operations that must be committed together, or a single key/value read operation. The full node doesn’t need to mix delete or read operations with write operations in the same statebase transaction. In fact, an efficient full node can be programmed so that it never needs to delete individual statebase values at all. While a single write operation is a transaction in all key/value databases, we focus on batch writes, because most EVM-blockchain nodes batch writes for greater efficiency.
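
As a sketch, the whole statebase interface a standard node needs can be reduced to just two operations (the Statebase name and shape below are hypothetical, not an existing API):

    import java.util.Map;

    public interface Statebase {
        // Single key/value read: returns the serialized trie node, or null if absent.
        byte[] get(byte[] key);

        // Batch write: all key/value pairs must be committed together.
        // Deletes and reads are never mixed into a write transaction.
        void writeBatch(Map<byte[], byte[]> nodes);
    }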

Now we can start investigating the requirements of a statebase. We first show that the statebase does not even require atomicity or isolation for batch operations, provided the node uses certain algorithms to find the last fully committed block state.

We introduce several unique properties of the state database: guaranteed data existence, value immutability, no key/value removals, no range queries, no need for isolation, and simplified keys.

Guaranteed Data Existence

An EVM blockchain network client never makes a consensus decision based on the existence of a trie node in the statebase. It will never need to check if a node’s key exists in the statebase. It will normally just fetch the node data associated with a key. This can be verified in the Ethereum yellow paper: there is no EVM operation that performs an existence query. If a trie node does not exist in the database, then either it needs to be fetched from somewhere else or the network client must halt processing.

As the yellow paper does not describe state pruning, but most network client implementations do prune, we must assume that as long as a trie node is reachable from the state trie of one of the latest N blocks, it will not be removed from the database. We also assume that the blockchain will never revert more than N blocks, as that would be considered a catastrophic network failure.

Value Immutability

One of the properties of the statebase is that data is uniquely bound to its key by a cryptographic hash function. This means that once a certain key has been stored in the database, the value associated with it can never change. The assignment of different values to contract storage cells does not affect this property, because new values result in completely different trie nodes referenced by different keys.
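
For illustration, here is a sketch of how a node key can be derived from the node data, assuming a Keccak-256 digest computed with Bouncy Castle (the choice of library and the NodeKeys helper name are assumptions). Because the key is a function of the value, a stored key can never end up pointing to different data:

    import org.bouncycastle.crypto.digests.KeccakDigest;

    public final class NodeKeys {
        // key = Keccak-256(serialized node): the value univocally determines its key.
        public static byte[] keyFor(byte[] serializedNode) {
            KeccakDigest digest = new KeccakDigest(256);
            digest.update(serializedNode, 0, serializedNode.length);
            byte[] key = new byte[32];
            digest.doFinal(key, 0);
            return key;
        }
    }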

No Key/value Removals

An archive node does not discard any part of the statebase. However, normal full nodes do prune the statebase to avoid storing an ever-growing state. Nevertheless, pruning of the state database can be handled entirely, and often much more efficiently, by copying the live parts of the database into a new database and completely removing the old one, much like a copying garbage collector. In fact, the RSK network client uses chained databases, periodically moving data from the oldest to the newest, to efficiently discard old state without ever performing single-element database deletions.
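
A minimal sketch of this copying style of pruning is shown below, reusing the hypothetical Statebase interface sketched earlier; decodeChildKeys() is a placeholder for the client’s node decoding, not a real API. Nodes reachable from the latest N state roots are copied into a fresh database, and the old database is then dropped as a whole:

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Deque;
    import java.util.List;

    public final class CopyingPruner {
        public static void prune(Statebase oldDb, Statebase newDb, List<byte[]> recentStateRoots) {
            Deque<byte[]> pending = new ArrayDeque<>(recentStateRoots);
            while (!pending.isEmpty()) {
                byte[] key = pending.pop();
                if (newDb.get(key) != null) continue;            // already copied
                byte[] node = oldDb.get(key);                    // guaranteed to exist
                newDb.writeBatch(Collections.singletonMap(key, node));
                pending.addAll(decodeChildKeys(node));           // follow references to child nodes
            }
            // The old database directory can now be deleted in a single operation,
            // without ever issuing individual key removals.
        }

        private static List<byte[]> decodeChildKeys(byte[] serializedNode) {
            // Placeholder: parse the node encoding and return the hashes it references.
            throw new UnsupportedOperationException("node decoding omitted in this sketch");
        }
    }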

No Range Queries

The EVM network client never needs to retrieve all keys or key/values in a certain range, neither as part of block execution nor for blockchain synchronization. Range queries can be useful for non-standard tasks such as debugging block execution or dumping the statebase, but these infrequent operations generally do not need to be especially efficient. Several database engines, such as those based on B+trees, are optimized for range queries. Removing the range query requirement enables statebase designers to consider other types of data structures that are more efficient for lookups.

No Read Isolation Requirement

The DBMS isolation levels do not apply to simple key/value databases, which do not provide a query language. Originally the Ethereum full node would never perform batch reads, as each individual read opcode (e.g. SLOAD) would trigger an independent statebase access. Later, when EIP-2930 added transactions with access-lists, some clients started making use of batch reads. Batch reads can sometimes be optimized by the database engine, making scattered reads in a single OS call or parallel reads with multiple threads. However, this does not change the fact that there can be no harmful interference between reads and writes, as we show next.

Read Isolation means that a transaction that reads the database at a specific moment will not get any partial data from concurrent batch write operations, either committed or still being processed. But with the properties previously presented (data immutability, no removals and guaranteed existence), it’s impossible for concurrent writes to affect a read operation, as the read operation, even if involving multiple key/value reads, will never access a key outside the set of keys known to be previously stored in the database. We conclude that a statebase does not need to isolate reads.

No Write Isolation Requirement

Write isolation means that the result of applying two batch write transactions concurrently will be as if both had been performed serially. We show that the trie state database does not benefit from a database engine providing write isolation.

Each trie node in the database is indexed by its cryptographic hash digest. Even if two concurrent transactions attempt to write the same keys, the values written will always be the same. Also, batch writes do not contain key/value removals, so no key being written can be simultaneously erased. Therefore no interference can occur between batch writes. Even if two threads perform two batch write operations with some key/value overlap, the result will be the same irrespective of the order. This would be true even if the two batches were applied with intermingled operations. No isolation is needed because all write operations are commutative.

Simplified Keys

Trie nodes are always accessed through pointers coming from other stored trie nodes or from the block state roots. The key of each value written to the statebase corresponds to the cryptographic hash of the node data. This makes it possible to dispose of the key itself, saving space both in memory and on disk. When the key is required, it can be computed dynamically from the data. Several data structures, such as hash tables, can be compressed by not storing the keys.
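
As a sketch of that compression, a bucketed store can keep only the values and recompute the key on access to confirm a match (reusing the hypothetical NodeKeys helper from above):

    import java.util.Arrays;
    import java.util.List;

    public final class KeylessLookup {
        // Return the value in the bucket whose Keccak-256 digest equals the requested key.
        public static byte[] find(List<byte[]> bucket, byte[] key) {
            for (byte[] candidate : bucket) {
                if (Arrays.equals(NodeKeys.keyFor(candidate), key)) {
                    return candidate;      // the key itself was never stored, only recomputed
                }
            }
            return null;
        }
    }

A real engine would typically keep a short fingerprint of each key to avoid re-hashing every candidate in a bucket, recomputing the full key only to confirm the final match.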

It’s interesting to note that if we are allowed to change the blockchain consensus rules, then we can use trie keys that contain more than one field, provided the fields can be computed from the value. For example, we can build node keys that contain a hash of the data plus a new integer field logSize, computed as the ceiling of the base-2 logarithm of the data size. The logSize field can be useful to store key/values in different databases according to their sizes: the node can look up a key/value in a database dedicated to objects of a specific size range, facilitating garbage collection.
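
A sketch of such an extended key follows (the ExtendedKeys helper and the one-byte encoding of logSize are illustrative choices, not a defined format):

    public final class ExtendedKeys {
        // Ceiling of the base-2 logarithm of the data size; logSize(1) is defined as 0.
        public static int logSize(int dataSize) {
            return dataSize <= 1 ? 0 : 32 - Integer.numberOfLeadingZeros(dataSize - 1);
        }

        // Extended key = 32-byte node hash followed by one logSize byte.
        public static byte[] keyFor(byte[] serializedNode) {
            byte[] hash = NodeKeys.keyFor(serializedNode);
            byte[] key = new byte[hash.length + 1];
            System.arraycopy(hash, 0, key, 0, hash.length);
            key[hash.length] = (byte) logSize(serializedNode.length);
            return key;
        }
    }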

Relaxed Durability Requirement

In most real-world databases the cost of the stored data is high. The destruction or corruption of all, or even a small part, of the data can have a high commercial, reputational or opportunity cost. When backups need to be restored, the most recent data may be irrevocably lost, and there is always a risk that the last backup is corrupted or inaccessible, leading to greater problems.

In contrast, if the network protocol supports fetching parts of the state, such as GetNodeData in Ethereum’s wire protocol or the Ethereum Snapshot Protocol, then an incomplete statebase can be reconstructed quickly and with very high probability, because there are other nodes in the network to download the data from. Database reconstruction will be cheap as long as we only need to reconstruct a small portion of the database.

This is essentially the same property that enables network clients to cache many key/value writes fully in memory (either in the application or inside the database engine), and flush them to the filesystem at a later time, increasing the write throughput, at the expense of a certain data loss risk in case of a power failure.

Not surprisingly, this is what LevelDB actually does by default. LevelDB sacrifices durability for performance, persisting changes asynchronously by default, although it’s possible to configure the database so that writes are performed synchronously. We define “delayed durability” as the property of a database that only periodically flushes changes to disk, such that in case of an interruption only the most recently committed transactions can be lost; the database should never skip a committed transaction while persisting a later one. LevelDB seems to satisfy this, at least if the application performs writes from a single thread. This property is a consequence of LevelDB’s write-ahead log, which appends entries and thus preserves the commit order.
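
For reference, this is how the durability trade-off is typically toggled with the org.iq80.leveldb API (a sketch; the helper name is ours): synchronous writes flush before returning, while the default asynchronous mode accepts the delayed-durability risk in exchange for throughput.

    import org.iq80.leveldb.DB;
    import org.iq80.leveldb.WriteBatch;
    import org.iq80.leveldb.WriteOptions;

    public final class DurabilityModes {
        public static void commit(DB db, WriteBatch batch, boolean durable) {
            // durable == false: delayed durability, relying on the log's append order.
            // durable == true: the call returns only after the data reaches stable storage.
            db.write(batch, new WriteOptions().sync(durable));
        }
    }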

Tolerance to Database Corruption

Normally, database corruption can have disastrous consequences for information systems. Luckily the statebase is very special, and data corruption can be reliably detected on access! This is a very strong property. Database engines usually insert checksums into the data to detect corruption, which adds overhead. But the statebase does not need checksums.

A statebase always needs to be designed to prevent structural corruption, that is, errors during database updates that can render the full database inaccessible. A statebase can be designed such that, in the case of a power failure, the only two errors that can occur are (1) a missing value for a key, or (2) a mismatch between the hash digest of the value and its key. Both can be checked on access. Reliable detection of database corruption means that the network client will never perform computations with erroneous data. Therefore a low rate of statebase inconsistencies can be tolerated, and the statebase can be repaired on-the-fly and transparently by fetching the missing pieces. If failures at different network clients are uncorrelated, a local error cannot lead to a full network halt, and statebase inconsistencies cannot be abused by an attacker to perform a targeted double-spend attack against a client.

To reduce the risk of database corruption, we only need to design the database to provide structural integrity, such that it’s always accessible even if a few entries are corrupted or missing. If the database provides structural integrity, we can drop the transaction atomicity requirement. Better still, we can architect the network client so that if a request to the statebase finds the key missing, the value is automatically fetched from a peer and the client continues operating normally without canceling the request, as sketched below.
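
Here is a sketch of such a self-healing wrapper, reusing the hypothetical Statebase and NodeKeys helpers from earlier sections; fetchFromPeer() stands in for the client’s networking layer and is not a real API:

    import java.util.Arrays;
    import java.util.Collections;

    public final class SelfHealingStatebase {
        private final Statebase db;

        public SelfHealingStatebase(Statebase db) {
            this.db = db;
        }

        public byte[] getNode(byte[] key) {
            byte[] value = db.get(key);
            // Both failure modes (missing value, hash/key mismatch) are detected here.
            if (value != null && Arrays.equals(NodeKeys.keyFor(value), key)) {
                return value;
            }
            // Repair transparently and keep serving the request.
            byte[] repaired = fetchFromPeer(key);
            db.writeBatch(Collections.singletonMap(key, repaired));
            return repaired;
        }

        private byte[] fetchFromPeer(byte[] key) {
            // Placeholder: request the node from a peer (e.g. via GetNodeData).
            throw new UnsupportedOperationException("network fetch omitted in this sketch");
        }
    }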

Relaxed Availability Requirement

We mentioned that the network client can tolerate some statebase data loss, as long as such events have low probability or the amount of data lost is small. Something similar happens with the liveness requirement of the statebase. If the state database engine prevented access to the data for 1 minute every week, at a random moment, to perform an internal housekeeping process such as garbage collection, this delay could be easily tolerated by most systems. If this is a commercial blockchain system that serves many RPC clients, then there are probably many network client instances running in parallel, so the system can rebalance the load by not sending requests to the network client performing its housekeeping tasks. If the system belongs to a home user, then it’s highly likely that the user can wait 1 minute from time to time, as this wait is comparable to the Ethereum block interval (12 seconds) or the Rootstock block interval (30 seconds).

Events such as deterministic maintenance procedures that pause block processing at specific block heights cannot be tolerated by a cryptocurrency network, because the whole network becomes unavailable simultaneously. But the network can tolerate uncorrelated pauses occurring at random in network nodes. To avoid the unintended and unexpected synchronizations between network clients that are typical of complex distributed systems, database maintenance interruptions must be scheduled pseudo-randomly by each client itself, and maintenance times should not be tied to power-on times.

As an example, if all trie databases required periodic maintenance taking 0.1% of the time, and the maintenance events were randomized, then at any moment 99.9% of the nodes would be alive on average. As long as each network node is connected to a few other nodes, the liveness property of the network would still be guaranteed.
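
As a sketch, the only requirement on the client side is that each node draws its own maintenance moment from a local pseudo-random source, independent of block heights and power-on times:

    import java.util.concurrent.ThreadLocalRandom;

    public final class MaintenanceScheduler {
        private static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;

        // Delay (in milliseconds) until this node's next maintenance window,
        // chosen independently of other nodes and of any block height.
        public static long nextMaintenanceDelay() {
            return ThreadLocalRandom.current().nextLong(WEEK_MS);
        }
    }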

Relaxed Atomicity Requirement

All read operations from the database are performed by traversing nodes via node pointers, and all stored key/values are part of a forest of trees (or DAGs). Therefore, if a tree of nodes is stored starting from the leaves up to the root, the consistency of the database is assured the moment the top node is written to the filesystem. When the network client is launched, it can search backwards, starting from the best known block, for the last block whose state root is stored in the state database; that block is the last fully committed one.

Therefore, even if the database does not provide atomicity, the network client can be written in such a way that the lack of atomicity cannot negatively affect it. We note that, to implement this algorithm correctly, the storage system must avoid type confusion between long data values and trie nodes, because they are generally stored in the same database. Otherwise, the algorithm could confuse a long value with a root node and wrongly assume that a complete state trie exists in the database.
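
A sketch of the startup search follows, using the hypothetical Statebase interface from earlier; the nested Block interface is a minimal stand-in for the client’s block representation. Because tries are written leaves-first, finding the most recent block whose state root node is present (and is a genuine trie node, per the caveat above) identifies the last fully committed state:

    public final class CommitRecovery {
        // Minimal hypothetical view of a block for this sketch.
        public interface Block {
            byte[] stateRoot();
            Block parent();
        }

        public static Block findLastCommittedBlock(Block best, Statebase db) {
            Block block = best;
            while (block != null) {
                // Leaves are written before their parents, so the presence of the
                // root node implies the whole state trie of this block was persisted.
                if (db.get(block.stateRoot()) != null) {
                    return block;
                }
                block = block.parent();
            }
            return null;   // no committed state found: resync from peers
        }
    }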

Putting It All Together

The EVM world state database, or statebase, is different from most other databases. It has a specific structure, and it is accessed by network clients in specific patterns. We can take advantage of its unique properties to design a statebase engine that maximizes network client throughput. We can relax all ACID properties to design a more efficient statebase, and yet maintain robustness for all known use cases.

This analysis led to the design and implementation of the FlatDB database engine, a high-performance key/value database designed specifically for storing the statebase. Our preliminary results are encouraging!
The FlatDB database is already open source, but the code is still experimental.

Our next steps:

  • Formally present FlatDB as an alternative for all EVM-based full nodes.
  • Publish more articles about FlatDB internals in this medium.
  • Submit a PR for the Rootstock network client to enable using FlatDB to store the world state, in addition to the currently supported databases (LevelDB and RocksDB). We expect FlatDB to improve Rootstock scaling significantly.

--


Sergio Demian Lerner
RootstockLabs: Research & Technology

Cryptocurrency Security Consultant. Head of Innovation at IOV Labs. Designer of the RSK sidechain (https://rsk.co)