Aptos Labs


Loader V2: Eliminating the Biggest Performance Bottleneck in Move VM

7 min read · Mar 21, 2025


TL;DR: Loader V2 is a major upgrade of Move VM’s code loading and caching infrastructure.

  • Loader V2 uses multi-level thread-shared code caches to significantly reduce code loading times. In experiments on the Aptos mainnet, block execution becomes up to 60% faster.
  • Loader V2 is integrated into the Aptos parallel execution engine (Block-STM), making parallel upgrades to smart contracts possible. In our benchmarks, blocks that contain transactions upgrading Move code execute ~10x faster.
  • The new design also makes the Move VM stateless, thread-safe, and more reliable.

The background

The Aptos blockchain executes a block of transactions via the Block-STM execution engine. Transactions are executed optimistically in parallel, as if there were no data dependencies between them. Block-STM tracks the reads and writes made by each transaction to detect conflicts, and re-executes conflicting transactions if necessary. Block-STM also uses rolling commit, a mechanism that dynamically detects when a transaction can be committed, i.e., when it will no longer be re-executed.

Each Block-STM thread runs an instance of the Move virtual machine (VM), which interprets the Move bytecode specified by a transaction. Throughout execution, the Move VM fetches the modules that store smart contract code using a component called the loader. The loader deserializes modules, verifies them along with their transitive dependencies (e.g., uses of other contracts), and caches the verified modules in the Move VM's cache. Modules can also be republished (i.e., the smart contract is upgraded), and the loader ensures that the new code can be used and linked correctly.

Figure 1: Legacy loader architecture and how it fits into Block-STM parallel execution engine used by the Aptos blockchain. Each thread runs a VM instance with its own module cache, populated as different transactions get executed.

The problem

The legacy loader's semantics around module loading and upgrades are implementation-specific, nuanced, and fragile, sometimes leading to unexpected errors when users of the Aptos blockchain try to publish packages. Beyond that, the legacy loader is a performance bottleneck.

The Move VM owns the module cache. This means that module caches are per-thread and are not shared across the VM instances run by Block-STM. As a result, loading (i.e., fetching from storage, deserializing, verifying) the same module and its many dependencies in parallel is computationally expensive. For instance, if Block-STM uses 32 threads and each thread tries to load the same module and its dependencies, in the worst case each module is loaded 32 times.
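To make the cost concrete, here is a minimal, hypothetical sketch (not the Aptos implementation) that simply counts how many loads happen with per-thread caches versus a single shared cache:

```rust
use std::collections::HashSet;

// Per-thread caches: every thread pays the full cost of fetching,
// deserializing, and verifying each module it touches.
fn loads_with_per_thread_caches(threads: usize, modules: &[&str]) -> usize {
    let mut total = 0;
    for _ in 0..threads {
        let mut cache: HashSet<&str> = HashSet::new();
        for m in modules {
            if cache.insert(*m) {
                total += 1; // cache miss: load the module from scratch
            }
        }
    }
    total
}

// One shared cache: each distinct module is loaded exactly once.
fn loads_with_shared_cache(modules: &[&str]) -> usize {
    let mut cache: HashSet<&str> = HashSet::new();
    modules.iter().copied().filter(|m| cache.insert(*m)).count()
}
```

With 32 threads each touching a module and one dependency, the per-thread design performs 64 loads where a shared cache performs 2.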

Module caches are used by a single block of transactions and then discarded, because the lifetime of the module cache is bounded by the lifetime of the VM instance used by Block-STM. Hence, if transactions across blocks use the same smart contract, it is reloaded for every block. This is particularly bad for hot contracts that are accessed frequently, e.g., the Aptos framework.

Figure 2: Example when executing transactions speculatively in parallel may lead to non-deterministic behavior. Block-STM detects these cases and executes transactions sequentially.

Block-STM may execute a block of transactions sequentially if smart contract code is upgraded. Because Move smart contracts on the Aptos blockchain are upgradable, some transactions in a block may publish new versions of modules, e.g., adding new functions or changing existing implementations. Unfortunately, the legacy loader architecture simply does not work with the speculative parallel execution employed by Block-STM. One thread (T2) can read code speculatively published by another thread (T1), load it, and cache it in T2's cache (Figure 2). But if Block-STM detects that T1 should have had a different execution outcome in which the module is not published, and re-executes T1, the loader cache of T2 is no longer consistent: it stores a version of the code that should never have been published. This behavior results in surprising user errors and, in the worst case, non-deterministic execution results among validators. To mitigate this problem, Block-STM falls back to sequential execution for blocks where smart contracts are upgraded and accessed at the same time. As a result, performance regresses by an order of magnitude in this case.

The solution

Loader V2 is a complete redesign of the legacy loader. The most important change is that the loader and its cache are no longer part of the Move VM, making it possible to share caches between threads. In practice, this also means that to execute a Move smart contract in the VM, one must also provide a loader to it, as shown in Figure 3.

Figure 3: Loader V2 architecture and how it fits into the Move VM and Block-STM execution. Initially, L3 contains module 0x1::x. In block 1: (1) the first transaction reads 0x1::x from the L3 cache and records it in its local L1 cache; (2) the first transaction reads 0x1::x again, now from the L1 cache; (3) the second transaction reads 0x1::y, which misses in both the L3 and L2 caches; (4) the module is taken from storage and cached in the L2 cache; (5) the third transaction reads module 0x1::y from the L2 cache (after missing in the L3 cache again). When execution of block 1 completes, cached entries from L2 are moved to L3: (6) module 0x1::y is moved to the L3 cache. Then a second block is executed. In block 2: (7) the first transaction reads 0x1::y from the L3 cache and copies the read module into its local L1 cache; (8) the second transaction writes a new version of 0x1::y, invalidating the entry in the L3 cache and adding the new one to the L2 cache. Subsequent accesses to 0x1::y resolve to the L2 cache entry from then on.

The module cache used by Loader V2 persists across multiple blocks and has three tiers.

  1. L3: A global lock-free module cache with a lifetime of up to an epoch (up to 2 hours on the Aptos mainnet at the time of writing). It is flushed only at epoch boundaries, when the node configuration changes, or when memory usage exceeds a certain limit.
  2. L2: A concurrent module cache used by a single block of transactions. When transactions access modules that do not exist in the L3 cache, these modules (and their dependencies, if loaded) are placed in the L2 cache instead. When block execution terminates, entries from the L2 cache are promoted to the lock-free L3 cache, so future accesses to these modules incur no synchronization overhead.
  3. L1: A thread-local module cache used by a single transaction execution. When a transaction (the Move VM) reads a module from the L2 or L3 cache, it stores a copy (a pointer, to be precise) in the L1 cache. This way, future reads of the same module are lock-free and resolve to the same module.
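The lookup order through these tiers can be sketched as follows. This is a hypothetical illustration, not the Aptos implementation: `Module`, `TieredCache`, and `TxnCache` are made-up names, a `String` stands in for verified module code, and a `RwLock`-protected map stands in for the real lock-free L3 structure.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Stand-in for a loaded, verified module; sharing is by pointer (Arc).
type Module = Arc<String>;

struct TieredCache {
    l3: RwLock<HashMap<String, Module>>, // global, long-lived tier
    l2: Mutex<HashMap<String, Module>>,  // concurrent per-block tier
}

struct TxnCache<'a> {
    tiers: &'a TieredCache,
    l1: HashMap<String, Module>, // thread-local, per-transaction tier
}

impl TieredCache {
    fn new() -> Self {
        TieredCache {
            l3: RwLock::new(HashMap::new()),
            l2: Mutex::new(HashMap::new()),
        }
    }

    // At the end of a block, promote L2 entries into the long-lived L3 tier.
    fn promote_l2_to_l3(&self) {
        let mut l2 = self.l2.lock().unwrap();
        let mut l3 = self.l3.write().unwrap();
        for (id, module) in l2.drain() {
            l3.insert(id, module);
        }
    }
}

impl TxnCache<'_> {
    fn get(&mut self, id: &str, storage: &HashMap<String, String>) -> Option<Module> {
        if let Some(m) = self.l1.get(id) {
            return Some(m.clone()); // L1 hit: no synchronization at all
        }
        let mut found = self.tiers.l3.read().unwrap().get(id).cloned(); // L3
        if found.is_none() {
            found = self.tiers.l2.lock().unwrap().get(id).cloned(); // L2
        }
        if found.is_none() {
            // Miss in every tier: load from storage and cache in L2 (not L3).
            let m: Module = Arc::new(storage.get(id)?.clone());
            self.tiers.l2.lock().unwrap().insert(id.to_string(), m.clone());
            found = Some(m);
        }
        let m = found?;
        self.l1.insert(id.to_string(), m.clone()); // remember the pointer in L1
        Some(m)
    }
}
```

This mirrors the flow in Figure 3: a miss in L3 and L2 pulls the module from storage into L2, and `promote_l2_to_l3` models the end-of-block promotion.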

Next, let’s see how these tiered caches work with module publishing.

When a transaction writes data, Block-STM records its writes in a multi-version data structure. Writes can invalidate speculative executions of other transactions; invalidated transactions are scheduled for re-execution, producing new versions of their writes.

In the new design, module writes made by transactions that publish code are not made visible to other transactions at the end of execution. Instead, Loader V2 relies on the rolling commit mechanism of Block-STM, making module writes visible only when the transaction is committed (i.e., it is guaranteed to never be re-executed). The benefit of this approach is that module information does not need to be versioned: the module cache only needs to store the most recent version of each module.

When module writes are made visible, a few things happen:

  1. The corresponding entry in the L3 cache is marked as invalid.
  2. The published module is put into the L2 cache.
  3. Transactions following the committed transaction are re-validated, and re-executed if they read the old module version. As a result, affected transactions eventually read the newer version of the code and execute using the upgraded version.

Note that steps (1) and (2) optimize for the common case: read-intensive workloads. Adding new modules to the L3 cache directly would require locking and synchronization primitives, introducing extra overhead on every access.
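Steps (1) and (2) can be sketched as follows. This is a hypothetical illustration, not the Aptos types: the entry's atomic validity flag lets invalidation flip a flag in place rather than mutate the map, and a `RwLock` again stands in for the real lock-free L3 structure.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, RwLock};

// Illustrative L3 entry: an atomic flag allows in-place invalidation.
struct Entry {
    valid: AtomicBool,
    code: Arc<String>, // stand-in for a verified module
}

struct Caches {
    l3: RwLock<HashMap<String, Entry>>,
    l2: Mutex<HashMap<String, Arc<String>>>,
}

impl Caches {
    // Called only when the publishing transaction commits, so the caches
    // never need to hold more than one version of a module.
    fn publish(&self, id: &str, new_code: Arc<String>) {
        // (1) mark the stale L3 entry invalid: a read lock suffices,
        //     because the map itself is not mutated.
        if let Some(e) = self.l3.read().unwrap().get(id) {
            e.valid.store(false, Ordering::Release);
        }
        // (2) the new version lives in L2 until the end of the block.
        self.l2.lock().unwrap().insert(id.to_string(), new_code);
    }

    fn get(&self, id: &str) -> Option<Arc<String>> {
        if let Some(e) = self.l3.read().unwrap().get(id) {
            if e.valid.load(Ordering::Acquire) {
                return Some(e.code.clone());
            }
        }
        // Invalidated or absent in L3: fall through to L2.
        self.l2.lock().unwrap().get(id).cloned()
    }
}
```

After `publish`, readers that hit the invalidated L3 entry fall through to the new version in L2, which matches the resolution order described in Figure 3.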

The results

Loader V2 is described in AIP-107, and the change is included in the 1.26 release of the Aptos node binary. Loader V2 was recently enabled on the Aptos mainnet.

Backwards-compatibility

Loader V2 is almost fully backwards compatible with the legacy loader implementation. However, there are a few cases throughout the transaction history of the Aptos mainnet and testnet networks where execution behavior diverges. We analyzed all cases of divergence and concluded that the historical behavior was not intended, but rather the result of flaws in the legacy loader implementation. For example, we found instances where historical transactions failed because the legacy loader incorrectly used older versions of code. With Loader V2, these transactions use the correct versions of the code, resulting in different execution outputs. Maintaining compatibility with the legacy implementation in these cases is out of the question (and, strictly speaking, not even possible).

Performance

Originally, Loader V2 was evaluated on a set of artificial benchmarks measuring the throughput of a single node when executing different workloads such as transfers, module publishes, and NFT mints. For workloads that publish modules, we observed a ~10x speedup in throughput.

However, the existing benchmark suite does not have the diverse set of modules found on the real network, and was not representative of the common case: read-heavy transactions. As a result, we conducted an experiment on the Aptos mainnet.

Because Loader V2 is almost fully backwards compatible, it was possible to enable it on one of the validator nodes run by Aptos Labs. Once enabled, we compared the average block execution times between the node using Loader V2 and the other nodes, as shown in Figure 4.

Figure 4: The average block execution time (in seconds) for 8 different validator nodes. The mainnet-validator-usce1-0-aptos-node-validator-0 node has Loader V2 enabled (bottom line). The other nodes use the legacy loader.

With Loader V2, the average block execution time was significantly reduced, sometimes by up to 60%. Such speedups are possible due to the long-lived L3 cache used by the new loader implementation.

Conclusion

Loader V2 is a complete redesign of the Move VM's legacy loader, responsible for code loading and caching. The new implementation has been moved outside of the Move VM, making the VM stateless and thread-safe. Loader V2 is integrated into Block-STM, allowing it to execute blocks of transactions that contain module publishes in parallel. Long-lived caches for loaded and verified Move smart contracts have also been introduced, drastically improving block execution times on diverse sets of real workloads.


Written by Aptos Labs

Aptos Labs is a premier Web3 studio of engineers, researchers, strategists, designers, and dreamers building on Aptos, the Layer 1 blockchain.
