The Evolution of State Sync: The path to 100k+ transactions per second with sub-second latency at Aptos

Published in

Aptos

13 min readJun 7, 2022

TL;DR: The Aptos blockchain leverages a wide range of novel techniques to ensure high-throughput, low latency, verified state synchronization in a decentralized network. Peers can validate and synchronize over 10k transactions per second (TPS) with sub-second latency in Aptos today, and we’re already on our way to 100k+ TPS.

Overview

State synchronization (or state sync) is an important but often overlooked aspect of blockchain design. In this blogpost, we discuss the evolution of state sync at Aptos and present several key insights behind the design of the latest state sync protocol. We further explore how we recently increased state sync throughput by 10x, reduced latency by 3x, and are continuing to pave the way for faster, more efficient blockchain synchronization.

What is state sync?

Most blockchains today are structured hierarchically, with a set of active validators at the heart of the network. The validators grow the blockchain by executing transactions, producing blocks and achieving consensus. The rest of the peers in the network (e.g., fullnodes and clients) replicate the blockchain data produced by the validators (e.g., blocks and transactions). State sync is the protocol that allows non-validator peers to distribute, verify and persist this blockchain data and ensures that all peers in the ecosystem are synchronized. See the diagram below for what this looks like at Aptos.

Why does state sync matter?

State sync is rarely mentioned when evaluating blockchains: it is often a footnote in a whitepaper dedicated to more interesting topics*. However, state sync has significant consequences on blockchain performance, security, and user experience. Consider the following:

Time to finality and user experience: When new transactions are executed by validators, state sync is responsible for propagating the data to peers and clients. If state sync is slow or unreliable, peers will perceive long transaction processing delays, artificially inflating the time to finality. This has a huge impact on user experience, e.g., decentralized applications (dApps), decentralized exchanges (DEXs) and payment processing will all be much slower.
Relationship with consensus: Validators that crash or fall behind the rest of the validator set rely on state sync to bring them back up to speed (i.e., synchronize the latest blockchain state). If state sync cannot process transactions as quickly as they are executed by consensus, crashed validators will never be able to recover. Moreover, new validators will never be able to start participating in consensus (as they won’t ever catch up!) and fullnodes will never be able to sync to the latest state (they’ll continue to fall further behind!).
Implications on decentralization: Having a quick, efficient and scalable state sync protocol allows for: (i) faster rotations of an active validator set, as validators can move in and out of consensus more freely; (ii) more potential validators to choose from in the network; (iii) more fullnodes to come online quickly and without having to wait long periods of time; and (iv) lower resource requirements, increasing heterogeneity. All of these factors increase decentralization in the network and help to scale the blockchain in size and geography.
Data correctness: State sync is responsible for verifying the correctness of all blockchain data during synchronization. This prevents malicious peers and adversaries in the network from modifying, censoring or fabricating transaction data and presenting it as valid. If state sync fails to do this (or does it incorrectly), fullnodes and clients could be tricked into accepting invalid data and this would have devastating consequences on the network.

Reasoning about state sync

To better reason about state sync, we first introduce a generic model for blockchain execution. While this model targets the Aptos blockchain specifically, it can be generalized to others.

We model the Aptos blockchain as a simple versioned database, where at each version V there is a unique blockchain state Sⱽ containing all on-chain accounts and resources. A transaction T can be executed by the Aptos virtual machine (VM) on state Sⱽ to produce a state delta Dᵀᵥ representing all state modifications that would occur on Sⱽ should T be committed. When Dᵀᵥ is applied to Sⱽ (that is, we consider T committed), it results in a new version V+1 and a new blockchain state Sⱽ⁺¹. We refer to the very first blockchain state as the genesis state S⁰. The diagrams below show transaction execution and state delta application.

Transaction Execution and State Delta Application (Model)

What are the goals?

With a generic model in place, we can now define several fundamental goals for state sync**:

High-throughput: State sync should maximize the amount of transactions that can be synchronized by each peer per second. That is, maximize the rate of state transitions from Sⱽ to Sⱽ⁺¹ for each T committed by the validators. If throughput is low it increases syncing time and becomes a bottleneck for the network.
Low latency: State sync should minimize the time it takes up-to-date peers to synchronize new transactions committed by validators. That is, minimize the time it takes peers at Sⱽ to sync to Sⱽ⁺¹, for each T that is newly committed by the validators. This affects the overall time to finality perceived by clients.
Fast bootstrapping time: State sync should minimize the time it takes new (or crashed) peers to sync to the latest state of the blockchain. That is, minimize the time it takes to sync to Sⱽ (where V is the highest database version agreed by the validators) regardless of the peer’s current version P and state Sᴾ. This allows peers to perform useful work more quickly (e.g., respond to balance queries or validate transactions).
Resistance to failures and malicious actors: State sync should be resistant to failures (e.g., machine and network failures) and tolerate malicious actors in the network, including other peers. This means overcoming a wide-variety of attacks, e.g., fabricated transaction data, modified or replayed network messages, and eclipse attacks.
Tolerant of resource constraints and heterogeneity: State sync should tolerate resource constraints (e.g., CPU, memory and storage) and embrace heterogeneity. Given the nature of decentralized networks, peers will have access to different types of hardware and optimize for different goals. State sync should account for this.

The required building blocks

We next introduce a set of fundamental building blocks required to construct a state sync protocol. For the sake of brevity, we provide a summary of each building block below and defer the technical details to future work (each of these could be a blogpost, itself!):

Persistent storage: To persist data across machine crashes and failures (and enable data distribution to other peers!) we require that each peer has access to trusted persistent storage. At Aptos, we’re currently using RocksDB but are actively exploring other options.
Verifiable blockchain data: To prevent malicious actors from modifying blockchain data, we require data to be authenticated and verifiable. Specifically, we need to be able to prove: (i) every transaction T that has been executed and committed by validators; (ii) the order of every transaction T executed and committed by validators; and (iii) the resulting blockchain states Sⱽ after committing each transaction T. At Aptos, we achieve this by: (i) building merkle trees over committed transactions and the resulting blockchain states; and (ii) having validators sign the merkle roots of these trees to authenticate them.
A root of trust: Given that Aptos supports dynamic validator sets (i.e., changes to validators at each epoch), peers need to be able to identify the current validator set from a verified history of the Aptos blockchain. We achieve this through: (i) a genesis blob authenticated by Aptos, which identifies the first validator set and initial blockchain state S⁰; and (ii) a recent trusted waypoint (e.g., a hash of the current validator set and blockchain state Sⱽ). Together, the genesis blob and waypoint form a root of trust allowing peers to synchronize the real Aptos blockchain and prevent attacks (e.g., long-range attacks).

Achieving 1k TPS: A naive approach

Using the model and building blocks presented above, we can now illustrate a naive state sync protocol. This protocol is a simplification of the original protocol used by Aptos (i.e., state sync v1).

The protocol works as follows: (i) Alice (the syncing peer) identifies the highest local persisted blockchain version V and state Sⱽ. If none exists, Alice uses genesis, i.e., S⁰; (ii) Alice then randomly selects a peer, Bob, and requests any new sequential transactions that have been committed by the validators; (iii) if Alice receives a response from Bob, Alice will verify the new transactions (T⁰ to Tᴺ) and execute them to produce state delta’s (D⁰ᵥ to Dᵀᵥ₊ₙ); (iv) Alice will then apply the new state delta’s to storage along with the new transactions, updating the local state of the blockchain from Sⱽ to Sⱽ⁺¹⁺ᴺ. Alice repeats this loop indefinitely. Pseudocode for the protocol looks as follows:

We implemented this protocol at Aptos, benchmarked it on devnet, and analyzed it. Some of the key observations we made were:

Throughput is network latency bound: This protocol achieves a maximum of ~1.2k TPS. However, throughput is heavily affected by network latencies as Alice requests data sequentially and must wait for the peer to respond. Given that we’re seeing an average network round trip time (RTT) of 150ms in devnet, this is suboptimal.
CPU is dominated by execution: When Alice receives a new set of transactions to synchronize, we see that 55% of the CPU time is spent executing the transactions and 40% is spent verifying the data, applying the state deltas to storage and persisting the new transactions. The other 5% is attributed to message handling, serialization and other tasks.
Latency is high: When running the network at maximum load, the average latency for Alice to receive new transactions is ~900 ms after they’ve been committed. This is primarily due to Alice randomly selecting peers when requesting data and not taking into account the network topology: peers that are closer to the validators will receive new transactions sooner.
Bootstrapping is slow: The protocol above requires Alice to replay and synchronize all transactions since genesis. If Alice is far behind the latest state, she must wait a long period of time before being able to do anything useful (this might take hours or even days!).
Performance is easily manipulated: The performance of this protocol is heavily influenced by malicious peers. As identified in 1 above, intentionally slow (or non-responsive) peers will force Alice to spend long periods of time waiting for data and not doing anything. Thus, significantly increasing synchronization time.
Resource usage is high: This protocol is expensive for all resource types: (i) CPU usage is high because Alice must re-execute all transactions; (ii) storage is high because Alice must store all transactions and blockchain states since genesis; and (iii) network usage is high because Alice must receive all transactions since genesis via the network. This automatically imposes high costs and resource requirements, reducing heterogeneity.
Resources are wasted: While Alice is synchronizing new data, peers in the network are also synchronizing from her. As the number of peers requesting data from Alice grows, there is an additional read load placed onto storage and CPU required to handle these requests. However, much of the computation that Alice performs to handle these requests is wasteful as peers are often requesting the same data.

Achieving 10k TPS: An optimized approach

Looking at the naive protocol above, we can make a number of modifications to help address the limitations. First, we extend the protocol to support 2 additional modes of synchronization:

State delta syncing: Given that validators already execute transactions and attest to the resulting blockchain states through merkle proofs, peers can rely on the state deltas produced by the validators to skip transaction execution. This avoids: (i) the high cost of execution, reducing CPU time by about 55%; and (ii) the need for an Aptos VM, greatly simplifying a delta syncing implementation. As a result, peers can now synchronize by downloading each transaction T and state delta Dᵀᵥ and applying them to storage to produce the new state Sⱽ⁺¹. We note that this comes at the cost of increased network usage (by approx. 2.5x).
Blockchain snapshot syncing: Given that validators attest to each blockchain state Sⱽ, we can further reduce bootstrapping time by allowing peers to download the latest blockchain state directly (instead of having to produce it using transactions or state deltas). This significantly reduces bootstrapping time and is a similar approach to snap-sync in Ethereum. The trade-off is that peers will not store any transactions or blockchain states prior to Sⱽ.

Next, we implement a number of general optimizations and additional features to help improve performance and scalability:

Data prefetching: To prevent network latencies from impacting throughput, we can perform data prefetching. Peers can prefetch transaction data (e.g., transactions and state deltas) from other peers before processing them, amortizing network latencies.
Pipelined execution and storage: To further increase syncing throughput, we can separate transaction execution from storage persistence and use pipelining: a commonly used optimization in processor design. This allows a transaction T² to be executed while a transaction T¹ and state delta Dᵀ¹ᵥ are concurrently persisted to storage.
Peer monitoring and reputation: To improve observability and better tolerate malicious peers, we can implement a peer monitoring service to: (i) monitor peers for malicious behaviors (e.g., transferring invalid data); (ii) identify metadata about each peer, such as a summary of all transaction data the peer has and their perceived distance from the validator set; and (iii) maintain a local score for each peer. This information can then be used to optimize peer selection when requesting new blockchain data.
Data caching: To reduce the read load on storage and prevent state sync from performing redundant computations as more and more peers synchronize, we can implement a data cache that stores commonly requested data items and responses in memory.
Storage pruning: To prevent storage from continuously growing over time (e.g., as more transactions are committed), we can also implement dynamic pruners to remove unnecessary transaction and blockchain data from storage, e.g., anything older than a few days, weeks or months, depending on the peer configuration.

We implemented these modifications and produced a new state sync protocol (i.e., state sync v2). We benchmarked it on devnet and observed:

Throughput increased by 5x-10x: When executing transactions (without parallel execution), the protocol now achieves a maximum of ~4.5k TPS, primarily as a result of pipelining and data prefetching (i.e., the protocol is now able to fully saturate the CPU). When synchronizing state deltas the protocol achieves well over 10K TPS, a further result of avoiding transaction execution. In both cases, throughput is no longer affected by network latency.
Latency was reduced by 3x: While running the network at maximum load, we now see that the average latency for Alice to receive new transactions is ~300 ms after they’ve been committed. This is due to data prefetching and more efficient peer selection: peers who are more responsive and closer to the validators are contacted more frequently.
Bootstrapping is significantly faster: Peers who use blockchain snapshot syncing are able to bootstrap much more quickly. Moreover, bootstrapping time is no longer affected by the length of the blockchain (i.e., the number of transactions), but rather the number of on-chain resources to synchronize. Currently in devnet, peers can bootstrap within minutes***.
Resource requirements are reduced: With multiple synchronization modes and storage pruning, resource requirements have been reduced. Moreover, there is now support for heterogeneity as peers have flexibility to select the syncing strategy. For example: (i) peers with limited CPU can skip transaction execution; (ii) peers with limited storage can configure the pruner to be aggressive; and (iii) peers that wish to get up-to-date quickly can perform blockchain snapshot syncing.
Resources are more efficiently used: When handling syncing requests from peers, we see a significantly reduced read load on storage and less wasted CPU. This is due to the new data cache that stores commonly requested data items and responses in memory. We also see that as the number of syncing peers grows in devnet, the data cache becomes more efficient, e.g., with 20 syncing peers we see a cache hit rate of 70%-80% per request. But, with 60 peers we see a cache hit rate of 93%-98%. This comes at the cost of an additional ~150 MB of RAM to maintain the cache.

100k TPS and beyond?

While we’ve already improved throughput by 10x and latency by 3x, we realize that there is still more to be done. Especially if we want to match Block-STM and make Aptos a Layer 1 for everyone!

So, how are we going to get there? Well, we’ve already started on our next state sync goal: 100k+ TPS! Although, we’re planning to save the details for a future blogpost we did want to provide some hints to the very keen reader:

Transaction batching: Currently, Aptos treats every transaction as verifiable, i.e., the merkle proofs used to authenticate and verify data operate at the transaction granularity. This makes verification and storage incredibly expensive. One way to avoid this is to perform transaction batching, i.e., proofs over batches (or blocks!) of transactions.
Network compression: Network bandwidth often becomes a bottleneck in peer-to-peer networks and Aptos is no exception. Currently, the state sync prefetcher can fetch around ~45K TPS in devnet before saturating bandwidth. This is an issue if we want to scale. Thankfully, we’ve already recognized that peers are distributing data using an inefficient serialization format and by using off the shelf compression we can reduce the amount of transmitted data by more than 10x.
Faster storage writes: Currently, state sync throughput is bottlenecked by the time it takes to persist blockchain data to storage. We’re actively looking at different optimizations and improvements that we can make to remove this bottleneck, including: (i) more efficient data structures; (ii) more optimal storage configurations; and (iii) alternate storage engines.
Parallel data processing: Until now, we’ve required that state sync process data sequentially (e.g., handle transactions at sequentially increasing versions). However, there are a number of existing approaches that would allow us to avoid this requirement and exploit parallel data processing to significantly increase throughput, e.g., blockchain sharding!

Until next time!

If you are, like us, passionate about designing algorithms, putting them in practice, and making a real impact on the future of Web3, please reach out — we are hiring at Aptos.

*Like consensus :)
**We implicitly assume that state sync must synchronize the states of all on-chain resources. Partial syncing strategies (e.g., light-clients that target specific accounts) are out of scope and will be discussed in the future.
***This metric should be taken with a heap of salt. The Aptos devnet is wiped bi-weekly, so there is limited state to synchronize. More evaluation will be required on longer running networks (e.g., our incentivized testnets).