Block-STM: How We Execute Over 160k Transactions Per Second on the Aptos Blockchain
TL;DR: We designed and implemented a highly efficient, multi-threaded, in-memory parallel execution engine that can execute over 160k non-trivial Move transactions per second by leveraging the preset order of transactions and combining Software Transactional Memory techniques with a novel collaborative schedule.
Smart contract execution is a major throughput bottleneck for blockchains. After proposing blocks and agreeing on their order, validators have to execute the transactions in the ordered blocks. Crucially, validators must arrive at the same final state, which must correspond to some sequential execution of the transactions. Due to the lack of a foundational solution, current blockchains either execute sequentially or require an embarrassingly parallel workload (i.e. no inner conflicts) for improved performance. Sequential execution does not scale well, and assuming that all transactions are commutative is unrealistic when considering a broad range of smart contracts. In fact, transactions in a blockchain can have a significant number of access conflicts due to potential performance attacks, accessing popular contracts (e.g. due to economic opportunities such as auctions and arbitrage).
Design a parallel execution engine that extracts the maximum inherent speedup that is possible, given the actual access conflicts in the workload. For the best programming and user experience, the algorithm should be transparent to the users. An alternative approach taken by some blockchain parallel execution engines is to force the user to declare the dependencies upfront, which severely limits what a transaction can do and may require a transaction to be broken up or retried. Instead, to avoid incurring such costs and programming annoyances, a system design goal for our parallel execution engine is to manage all conflicts internally and automatically adapt to the workload.
The STM Approach
An academic approach pioneered by Software Transactional Memory (STM) libraries is to instrument memory accesses to detect and manage conflicts. STM libraries with optimistic concurrency control record memory accesses during execution, validate every transaction post execution and abort and re-execute transactions when validation surfaces a conflict. However, due to required conflict bookkeeping and aborts, STM libraries often suffer from performance limitations compared to custom-tailored solutions and thus are rarely deployed in production.
The Blockchain Use-Case
It was shown in the past that STM performance can be dramatically increased when they are applied to specific use-cases. Indeed, three important observations that hold for the blockchain use-case guide the design of Block-STM.
- No need to commit transactions individually: In contrast to general-purpose STMs (where each thread has an infinite stream of transactions to commit based on queries that may arrive at arbitrary times), in blockchains, the state is typically updated per block. This allows Block-STM to avoid the synchronization cost of committing transactions individually. Instead, Block-STM lazily commits all transactions in a block with light synchronization. Moreover, garbage collection is straightforward as the memory can be cleaned up in between the blocks.
- VM provides safety for optimistic memory access: Transactions are specified in smart contract languages, such as Move and Solidity, and run in a virtual machine that encapsulates their execution and ensures safe behavior. This separates the abstractions nicely and allows Block-STM to avoid handling the aftermath of inconsistent states during the parallel speculative execution (a property known as Opacity).
- Pre-defined order reduces synchronization: In general, STM libraries target non-determinism and view determinism as a limitation in the system, hindering performance. This makes them unsuitable for the Blockchain use case because validators executing the same block could result in different final states. However, in Block-STM, determinism is considered a performance blessing. In fact, the final outcome is guaranteed to match a sequential execution of transactions in a fixed, preset order, and this constraint is used to the system’s advantage. This is possible, as previously noted in the Bohm paper in the context of databases, because agreeing on a specific serialization reduces the amount of synchronization required during execution. For example, if there is a conflict between transaction tx5 and transaction tx9, then tx9 will wait for tx5 — otherwise, without order, the threads executing these transactions would need to break the tie. Thus, on an intuitive level, where a general-purpose STM would solve a harder problem (i.e., a form of consensus), Block-STM targets a simpler problem (i.e., it only needs to execute transactions).
Block-STM combines known techniques with novel ideas:
- Optimistic concurrency control: Transactions are executed optimistically in parallel and validated post execution. Unsuccessful validations lead to re-executions. Due to the preset order, validations are not independent of each other and must logically occur in a sequence. Unlike prior work, a successful validation does not imply that the transaction can be committed. On the contrary, a failed validation of a transaction implies that all higher transactions can be committed only if they get successfully validated afterward.
- Multi-version data structure: Block-STM uses a multi-version data structure to avoid write-write conflicts. All writes to the same location are stored along with their versions, which contain their transaction IDs and the number of times the writing transaction was optimistically re-executed. When transaction tx reads a memory location, it obtains from the multi-version data structure the value written to this location by the highest transaction that appears before tx in the preset order, along with the associated version.
- Validation: During an execution, transactions record a read-set and a write-set. During a validation, all the memory locations in the read-set are read and the returned versions are compared to the corresponding versions stored in the read-set.
- Collaborative schedule: Block-STM introduces a collaborative scheduler to coordinate the validation and execution tasks among threads. Since the preset order dictates that the transactions must be committed in order, successful validation of a transaction execution does not guarantee that it can be committed. This is because an abort and re-execution of an earlier transaction in the block might invalidate the read-set and necessitate re-execution. Therefore, a static mapping between transactions and executing threads does not work. Instead, the collaborative scheduler prioritizes execution and validation tasks of lower transactions. However, ordered sets and priority queues are notoriously challenging to scale in multi-core environments. Block-STM sidesteps this problem using a counting-based approach, which is made possible by preset ordering and the compact indexation of transactions.
- Dynamic dependency estimation: Block-STM leverages the preset order to dramatically avoid aborts, which is the name of the performance game for STM systems as aborts can cascade and lead to an excessive amount of wasted work. When a validation fails, the write-set of the transaction’s last execution is used to estimate dependencies by marking all its writes in the multi-version data structure as an ESTIMATION. When another transaction reads an ESTIMATION value from the multi-valued data structure it can then wait for the dependency to be resolved — while without the estimation, it would have proceeded but be likely (if the transaction that wrote ESTIMATION does indeed write to the same location in the next re-execution) to later fail validation and be aborted. Compared to the textbook approach of generating write estimates by pre-executing all transactions embarrassingly in parallel from the pre-block state, our approach has two benefits: (a) estimates are generated only when needed (not for every transaction), and (b) the estimates are generally based on a state much fresher than at the beginning of the block.
We Implemented Block-STM in safe Rust in the Aptos open source codebase, relying on Rayon, Dashmap, and ArcSwap crates for concurrency. We evaluated the system with non-trivial peer-to-peer Move transactions (8 reads and 5 writes). In the graph below we compare Block-STM against a sequential execution of the block. Every block contains 10k transactions and the number of accounts determines the level of conflicts and contention. Under low contention Block STM achieves 16x speedup over sequential execution with 32 threads, while under high contention Block-STM achieves over 8x speedup. Importantly, Block-STM incurs a small overhead when the workload is inherently sequential. Overall, Block-STM is able to dynamically and transparently (without any hints from the user) extract the inherent parallelism from a workload. Detailed comparisons to related work can be found in the paper.
Block STM performance with different contention levels
So How Do We Get This Performance?
The collaborative scheduler together with the multi-versioning technique allows Block-STM to leverage the preset order of transactions in order to estimate dependencies and dramatically reduce the amount of wasted work. The collaborative scheduler is the key piece of the algorithm and contains most of the performance-critical logic. Among other things, it ensures that:
- Every aborted transaction is re-executed, but the same transaction is never executed concurrently by more than one thread.
- If a transaction is re-executed, then all higher transactions must be revalidated. As a result, the same transaction execution might be validated concurrently by different threads, but at most one can abort it.
- A transaction that encounters a dependency is resumed after the dependency is resolved.
- All transactions are eventually committed.
The challenge is to ensure the above with as little synchronization overhead as possible. A naive but costly way to keep track of execution and validation tasks is to have two priority queues (or ordered sets to avoid duplicates). Once a dependency is resolved or a validation is aborted, an execution task for the corresponding transaction is added to the execution queue. Similarly, once a transaction is executed, validation tasks are created for all higher transactions in the preset order and added to the validation queue.
Efficient concurrent ordered set
Instead, Block-STM leverages the preset order of transactions and uses two atomic counters.
These counters track a lower bound index for any transaction that needs to be executed or validated. Combined with a way to know the status of each transaction, these indices can be used to efficiently emulate the ordered set semantics from the naive approach. Threads repeatedly try to grab a task by fetch-and-incrementing the counter with a lower index, and read the status to check if the corresponding transaction is ready for execution (or validation, depending on the counter). Fetch-and-increment also naturally disperses threads to different indices, ensuring that the status information is not heavily contended, which allows us to use mutexes and simplify the implementation (a lock-free implementation is possible, but benchmarking does not reveal significant improvements).
The possible statuses of a transaction are given in the diagram below. The parameter i in the following diagram is the number of times the transaction was re-executed.
We also use locks to synchronize the list of dependencies, but, without care, some subtle races are still possible that the collaborative scheduler has to avoid — the glorious details are in the paper.
Lazy commit mechanism
Block-STM commits the entire block to eliminate the synchronization cost of tracking when individual transactions can safely be committed. In a nutshell, the entire block can be committed when all of the following conditions hold:
- Both execution and validation indices reach the block size.
- There are no ongoing validation and execution tasks.
- The statuses of all transactions are EXECUTED.
As we prove in our #way-too-rigorous-for-a-systems-result proof, the first two conditions in the Block-STM algorithm imply the third one. To atomically verify (1) and (2), we introduced two additional counters. The first counter tracks the number of ongoing tasks. It is too easy to increment or decrement this counter at a wrong place in the code and never observe (2). To avoid such bugs we used the Resource Acquisition Is Initialization (RAII) pattern to couple the counter with task guards, ensuring it is increased and decreased before the task is dispatched and after it is completed, respectively.
To atomically read this counter together with the counters that track the execution and validation indices, we use the double collect technique. This requires introducing a fourth atomic counter to track the number of times the execution and validation counters were decreased. Then, we collect the values of all counters twice (order is important!) and if in both cases conditions (1) and (2) are satisfied, then (as we prove), at some time in between, conditions (1) and (2) were indeed simultaneously satisfied. In this case, the entire block is safely committed, as no more validation and no execution tasks are either needed or will be created.
If you are, like us, passionate about designing algorithms, putting them in practice, and making a real impact on the future of Web3, please reach out — we are hiring at Aptos.