Shardines: Aptos’ Sharded Execution Engine Blazes to 1M TPS

Aptos Labs · Published in Aptos · Feb 5, 2025

By Manu Dhundi, Sital Kedia, Igor Kabiljo, Zhoujun Ma, Jan Olkowski

TLDR:

  • We are excited to unveil Shardines, the next evolution of blockchain execution engines designed to achieve horizontal scalability and more than 1 million transactions per second (TPS). This achievement is a significant milestone for the blockchain industry, proving that our architecture is capable of meeting the growing demand for Web3 applications and supporting global-scale innovation.
  • Blockchain execution engines can scale independently of consensus and storage. This enables validator operators to optimize resources and enhance system flexibility.
  • Shardines uses a cutting-edge dynamic partitioner to minimize cross-shard network communication and wait times, adapting seamlessly to evolving workloads.
  • An innovative ‘micro-batching and pipelining’ strategy further mitigates the impact of network latencies by amortizing the costs, ensuring high performance.
  • Aptos remains dedicated to providing open-source and verifiable performance results, fostering transparency in Web3. We have open-sourced a fully reproducible benchmark of Shardines for blockchain enthusiasts to verify the results.

Scalability has long been a critical challenge for blockchains. As Web3 adoption grows and use cases expand, traditional blockchains often become bottlenecked on throughput (measured in transactions per second), leading to high latency and gas costs. First-generation blockchains like Ethereum pioneered Web3 with a single-threaded execution model, which limited throughput to about 15 TPS and proved to be a major bottleneck. To tackle Ethereum’s scalability challenges, next-generation blockchains like Aptos and Solana adopted parallel execution models. On Aptos, we have Block-STM, a highly efficient, multi-threaded, in-memory parallel execution engine that leverages Software Transactional Memory techniques with optimistic concurrency control. Block-STM has set the industry standard for parallel transaction execution, inspiring several other chains to adopt similar approaches.

Today, we are excited to unveil Shardines, the next evolution of blockchain execution engines designed to achieve horizontal scalability. With Shardines, we can achieve near-linear throughput scaling, surpassing 1 million TPS on our 30-machine cluster. This milestone is not the end but a critical step forward, showcasing a blockchain architecture ready to support global-scale applications and the evolving demands of Web3.

Scaling Blockchains

A typical blockchain stack consists of three main components: consensus, execution, and storage.

  • The consensus layer receives incoming transactions, and is responsible for ensuring that all nodes within the network agree on a particular order of transactions.
  • The execution layer takes the current state and the incoming transactions agreed upon by consensus, and is responsible for processing smart contracts and executing transactions.
  • The storage layer is responsible for persisting all of the data associated with the blockchain, including the state of the ledger and any associated smart contract data. It provides the current state to the execution layer and updates the state based on the execution results.

To achieve horizontal scalability, our vision is to shard and scale these components independently. That is, consensus can be scaled by increasing the number of data dissemination service shards, which are independent of the number of execution and storage shards. Similarly, if execution becomes a bottleneck, it can be scaled by adding more CPU cores without the need to increase storage capacity and data dissemination service capacity. Furthermore, these sharded services are designed as logical services, decoupled from the physical machine layout. This decoupling allows validator operators to flexibly choose machine types, optimizing resource use while enhancing system elasticity.
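To make this decoupling concrete, here is a minimal Rust sketch with hypothetical type and field names (not part of the Aptos codebase) showing how each layer could carry its own shard count and machine pool, so a validator operator can size the layers independently:

```rust
// Hypothetical illustration only: these types and field names are not part of
// the Aptos codebase. The point is that each layer carries its own shard
// count and machine pool, so one layer can grow without resizing the others.
#[derive(Debug)]
struct LayerTopology {
    shard_count: usize,         // logical shards for this layer
    machine_type: &'static str, // physical machines backing those shards
}

#[derive(Debug)]
struct ValidatorTopology {
    data_dissemination: LayerTopology, // consensus bandwidth scales here
    execution: LayerTopology,          // CPU-bound layer
    storage: LayerTopology,            // IO-bound layer
}

fn main() {
    // Adding execution shards does not require adding storage shards or
    // data dissemination shards, and vice versa.
    let topology = ValidatorTopology {
        data_dissemination: LayerTopology { shard_count: 4, machine_type: "network-optimized" },
        execution: LayerTopology { shard_count: 30, machine_type: "compute-optimized" },
        storage: LayerTopology { shard_count: 1, machine_type: "storage-optimized" },
    };
    println!("{topology:#?}");
}
```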

Independently Scalable Blockchain Architecture

Consensus

Our implementation of Narwhal-based Quorum Store has shown how to scale out consensus bandwidth by adding more nodes. Quorum Store decouples data dissemination from metadata ordering, which allows data dissemination to be scaled horizontally. The architecture above shows multiple data dissemination shards responsible for disseminating transaction data to other validators and obtaining the proof-of-store certificate. The consensus coordinator is responsible for ordering the metadata; since the metadata is orders of magnitude smaller than the transaction data, consensus can scale to support very high throughput without becoming a bottleneck.

Storage

Aptos uses the Jellyfish Merkle Tree (JMT) design to store the blockchain state. JMT is a space- and computation-efficient sparse Merkle tree optimized for Log-Structured Merge-tree (LSM-tree) based key-value stores like RocksDB. At a high level, sharding the storage entails sharding the JMT that holds the state. We have implemented sharding of the JMT data structure and recently deployed single-node sharding to production on mainnet. Since the design of the JMT ensures that key-value pairs are uniformly distributed, this sharding technique ensures that the load is evenly distributed across shards.
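As a rough illustration of why hashed state keys spread load evenly, the sketch below (simplified, and not the production JMT implementation) buckets keys into shards by hash; because hash outputs are uniformly distributed, each shard ends up with a roughly equal share of keys:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Simplified sketch: real JMT keys are already cryptographic hashes of state
// keys, so a prefix of the key itself can pick the shard. Here a generic
// hasher stands in just to show the bucketing idea.
fn shard_for_key(state_key: &[u8], num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    state_key.hash(&mut hasher);
    // A uniformly distributed hash output taken modulo the shard count spreads
    // keys, and therefore load, evenly across shards.
    hasher.finish() % num_shards
}

fn main() {
    let num_shards = 16u64;
    let mut counts = vec![0u64; num_shards as usize];
    for i in 0..100_000u64 {
        let key = format!("account/{i}/balance");
        counts[shard_for_key(key.as_bytes(), num_shards) as usize] += 1;
    }
    println!("keys per shard: {counts:?}"); // roughly equal buckets
}
```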

Sharding a Merkle Tree

Execution

Execution is inherently CPU-bound, placing strict limits on the TPS achievable through scaling up a single machine due to the finite number of available cores. As a result, scaling out across multiple machines becomes the only practical solution.

Most existing blockchain sharding solutions, including L2 rollups, tightly couple execution and storage together on a single machine. While this approach appears to offer strong scalability, it creates a sub-par user experience and a fragmentation problem, as there is no single shared state. Our approach to scaling execution involves decoupling it from storage, allowing both to scale independently. In fact, we believe that in any real-world sharded blockchain system, it is impossible to achieve 100% execution-to-data colocation; there will always be a cost to fetch data that is not co-located with execution. To address this, we employ a dynamic partitioner and a novel ‘micro-batching and pipelining’ strategy (both discussed in detail later) that amortize the cost of remote storage reads, ensuring efficient performance.

Handling Conflicts in Sharded Execution

One of the major challenges in scaling execution for blockchains is handling write-read conflicts across transactions. A write-read conflict occurs when one transaction in a block (say, T2) reads state that is modified by another transaction (say, T1) in the same block.

Conflicting Transactions

Transactions that conflict on a shared state pose significant challenges to performance, as they cannot be executed in parallel. Access to the shared state must be serialized to maintain consistency, effectively creating a bottleneck that slows down execution. The problem becomes even more pronounced when these conflicting transactions are distributed across different shards (nodes). In such cases, cross-shard communication is needed to coordinate serialized access to the shared state, adding latency and complexity to scaling-out execution.

A naive approach to handling conflicts is to create blocks containing only non-conflicting transactions. However, this is not practical: there may not be enough non-conflicting transactions available to fill the blockspace. More critically, it is unfair to users who pay higher gas fees to get their transactions on-chain. For instance, during a popular NFT mint, many high-fee conflicting transactions may demand immediate inclusion on-chain.

Conflicting transactions are an inherent aspect of any widely adopted blockchain like Aptos. Recognizing this, we design our system to treat conflicts as a first-class concern. On a single-node executor, Block-STM efficiently manages conflicts, while for horizontally sharded execution, we minimize cross-shard conflicts through a hyper-graph partitioning approach described below.

Partitioning Transactions

When there are write-read conflicts across two transactions assigned to different shards (nodes), an additional network hop is required, where the updated value is transmitted over the network, causing the later transaction to wait before it can begin execution. The primary goal of partitioning is to minimize this cross-shard communication and the associated wait times.

The most direct way to reduce cross-shard communication is to partition transactions so that each resource is needed on as few machines as possible, and then reorder transactions to remove the need to send resources back and forth between nodes and to reduce wait times. To do this, we model the conflicts as a hypergraph, where nodes are transactions and each resource is a hyperedge connecting all transactions that read or write it. In the example below, we draw the hypergraph as a bipartite graph, with transactions on one side and resources on the other.

Partitioning Transactions Across Shards

In such a graph we want to minimize a metric called fanout, which is the number of shards a hyperedge spans (in our case, the number of shards on which a resource is needed). Resources with fanout (f) = 1 do not require coordination across shards, while others do, and as the fanout of a resource increases, so does the number of shards that need to coordinate on that resource. To accomplish this, we adopt and implement the strategy from ‘Social Hash Partitioner’, because it is fully scalable while maintaining high quality. It smooths the objective function by generalizing the fanout into a “probabilistic fanout”, allowing simple local search algorithms to find high-quality solutions. Finally, we use a single-pass heuristic to determine a transaction order that minimizes back-and-forth communication between shards and the corresponding wait times, producing an optimized order and shard assignment for execution.

The shards produced by this approach are dynamic; that is, the shard on which a given resource is accessed can vary from one block to the next. This results in high throughput across changing workloads.
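The sketch below illustrates the fanout objective and a single greedy local-search pass on a toy conflict hypergraph. It is only a didactic approximation: the production partitioner optimizes the scalable, probabilistic-fanout formulation from the Social Hash Partitioner paper rather than this brute-force recomputation, and it also balances load across shards.

```rust
use std::collections::{HashMap, HashSet};

// A transaction's conflict footprint: the resources it reads or writes.
struct Txn {
    resources: Vec<&'static str>,
}

// Fanout of a resource under an assignment = number of distinct shards that
// hold a transaction touching it. We sum fanout over all resources.
fn total_fanout(txns: &[Txn], assignment: &[usize]) -> usize {
    let mut shards_per_resource: HashMap<&str, HashSet<usize>> = HashMap::new();
    for (i, txn) in txns.iter().enumerate() {
        for &r in &txn.resources {
            shards_per_resource.entry(r).or_default().insert(assignment[i]);
        }
    }
    shards_per_resource.values().map(|s| s.len()).sum()
}

// One greedy local-search pass: move each transaction to the shard that
// lowers total fanout the most. Note: a real partitioner also enforces load
// balance across shards; this toy objective considers fanout only, so tiny
// examples may collapse onto a single shard.
fn refine(txns: &[Txn], assignment: &mut [usize], num_shards: usize) {
    for i in 0..txns.len() {
        let mut best = (total_fanout(txns, assignment), assignment[i]);
        for shard in 0..num_shards {
            assignment[i] = shard;
            let fanout = total_fanout(txns, assignment);
            if fanout < best.0 {
                best = (fanout, shard);
            }
        }
        assignment[i] = best.1;
    }
}

fn main() {
    // T0 and T1 share "coin_A"; T1 and T2 share "nft_B".
    let txns = vec![
        Txn { resources: vec!["coin_A"] },
        Txn { resources: vec!["coin_A", "nft_B"] },
        Txn { resources: vec!["nft_B"] },
    ];
    let mut assignment = vec![0, 1, 0]; // a poor initial placement
    println!("total fanout before: {}", total_fanout(&txns, &assignment));
    refine(&txns, &mut assignment, 2);
    println!(
        "total fanout after: {} with assignment {:?}",
        total_fanout(&txns, &assignment),
        assignment
    );
}
```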

Shardines Design

High-Level Design of Shardines

As depicted in the diagram above, transaction execution is scaled out across multiple nodes, each functioning as an ‘Executor Shard’. Each shard runs an instance of Block-STM to efficiently execute the transactions assigned to it. The number of executor shards is configurable and can be set based on the anticipated workload. An executor shard receives transactions for execution and makes requests to a remote storage service to get the data required to execute them. For transactions with data dependencies on transactions in other shards, known as cross-shard dependencies or conflicts, shards exchange cross-shard messages to obtain the updated data. After completing execution, the executor shard sends the results back to the coordinator.

To demonstrate the capabilities of our sharding design, we abstracted the other parts of the system (consensus and block generation, partitioning, read-write set generation, storage, aggregation, and the benchmark itself) into a single-node application called the ‘Execution Coordinator’. Once the block generator produces transaction blocks, the coordinator partitions the transactions within a block and dispatches the partitioned transactions to their respective executor shards for execution. Once the shards complete execution, the coordinator collects the results, which include updated state and metadata, for subsequent processing such as aggregation and finalization. It finally updates the storage service with the outputs, ensuring consistency and durability.

In this design, the storage service is not sharded across multiple nodes, and the coordinator hosts the remote storage service. At 1M TPS, the storage service on a single node becomes the bottleneck. We plan to address the storage bottleneck in the future.
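The following sketch shows the coordinator's dispatch-and-collect flow described above. It is heavily simplified: hypothetical names, in-process threads standing in for networked shards, a trivial round-robin split standing in for the dynamic partitioner, and a placeholder computation standing in for Block-STM.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical, heavily simplified types: in the real system the coordinator
// and executor shards are separate machines talking over the network, and
// each shard runs Block-STM on its slice of the block.
type Txn = u64;

struct ShardResult {
    shard_id: usize,
    outputs: Vec<u64>,
}

fn main() {
    let num_shards = 4;
    let block: Vec<Txn> = (0..1_000).collect();

    // "Partitioning": a trivial round-robin split stands in for the
    // conflict-aware dynamic partitioner described earlier.
    let mut partitions: Vec<Vec<Txn>> = vec![Vec::new(); num_shards];
    for (i, txn) in block.into_iter().enumerate() {
        partitions[i % num_shards].push(txn);
    }

    // Dispatch each partition to an executor shard and collect the results.
    let (result_tx, result_rx) = mpsc::channel::<ShardResult>();
    for (shard_id, txns) in partitions.into_iter().enumerate() {
        let result_tx = result_tx.clone();
        thread::spawn(move || {
            // Stand-in for Block-STM execution of this shard's transactions.
            let outputs = txns.iter().map(|t| t * 2).collect();
            result_tx.send(ShardResult { shard_id, outputs }).unwrap();
        });
    }
    drop(result_tx); // close the channel once only shards hold senders

    // The coordinator collects per-shard results, then aggregates them and
    // persists the outputs to the (currently unsharded) storage service.
    let mut total_outputs = 0;
    for result in result_rx {
        println!("shard {} returned {} outputs", result.shard_id, result.outputs.len());
        total_outputs += result.outputs.len();
    }
    println!("committing {total_outputs} outputs to storage");
}
```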

Extracting Transaction Dependencies

Predicting the read-write sets of transactions is critical for our partitioning approach. They are generated by static analysis at runtime, specifically at bytecode verification time. For the scope of our experiments, we only considered transactions whose read-write sets can easily be inferred from the sender and receiver parameters of the transaction. We are building a static analysis framework for extracting transaction dependencies at runtime that will work for all types of transactions.
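As a simplified illustration of the approach used in the experiments, the sketch below infers the read-write set of a peer-to-peer transfer directly from its sender and receiver; the transaction shape and resource naming are hypothetical, not the actual Aptos representation.

```rust
// Hypothetical transaction shape and resource naming, for illustration only:
// for a simple peer-to-peer transfer, the read and write sets can be derived
// directly from the sender and receiver addresses.
struct TransferTxn {
    sender: &'static str,
    receiver: &'static str,
}

struct ReadWriteSet {
    reads: Vec<String>,
    writes: Vec<String>,
}

fn infer_read_write_set(txn: &TransferTxn) -> ReadWriteSet {
    // A transfer touches both parties' coin balances and the sender's
    // sequence number; real extraction via static analysis of Move bytecode
    // would cover arbitrary transactions.
    let resources = vec![
        format!("{}::coin_balance", txn.sender),
        format!("{}::coin_balance", txn.receiver),
        format!("{}::sequence_number", txn.sender),
    ];
    ReadWriteSet { reads: resources.clone(), writes: resources }
}

fn main() {
    let txn = TransferTxn { sender: "0xA11CE", receiver: "0xB0B" };
    let rw = infer_read_write_set(&txn);
    println!("reads: {:?}\nwrites: {:?}", rw.reads, rw.writes);
}
```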

Pipelining

To address performance bottlenecks caused by network delays, we employ a novel approach of micro-batching and pipelining throughout the entire execution flow. The idea is to split the transactions in a block into micro-batches and pipeline them through the stages of execution, which are as follows:

  1. Sending a micro-batch of transactions from the coordinator to a shard
  2. Pre-fetching the data and dependencies for the transactions in the micro-batch
  3. Executing the transactions on the shard with Block-STM
  4. Sending the execution results to shards containing dependent transactions, as required
  5. Sending the results of the micro-batch back to the coordinator

Microbatching and Pipelining Transactions During the Execution

The execution involves five stages, requiring a total of five network hops per batch. Micro-batching mitigates delays caused by serializing and deserializing large data chunks, as well as network delays from transmitting large packets. Additionally, combining pipelining with micro-batching ensures efficient resource utilization by overlapping the network delays of individual micro-batches with execution tasks. Hence, in standard data center environments with single-hop latency below 1 millisecond, the total network delay is effectively amortized to around 5 milliseconds (five network hops) per block, instead of five network hops per batch. This strategy enables efficient processing even when data and execution are not colocated, maintaining high throughput.
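The toy pipeline below, with made-up stage costs and in-process channels in place of the network, shows why overlapping micro-batches hides most of the per-batch hop latency; it is an illustration of the idea, not the Shardines implementation.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Toy pipeline with made-up stage costs. Each micro-batch flows through
// "hop to shard" -> "execute" -> "hop back"; because the stages overlap,
// the hop latency is largely hidden behind execution instead of adding up
// once per batch.
fn main() {
    const BATCHES: u32 = 20;
    let hop = Duration::from_millis(1);  // single network hop (illustrative)
    let exec = Duration::from_millis(3); // per-micro-batch execution (illustrative)

    let (to_shard_tx, to_shard_rx) = mpsc::channel::<u32>();
    let (to_coord_tx, to_coord_rx) = mpsc::channel::<u32>();

    let start = Instant::now();

    // Stage 1: the coordinator streams micro-batches to the shard.
    thread::spawn(move || {
        for batch in 0..BATCHES {
            thread::sleep(hop);
            to_shard_tx.send(batch).unwrap();
        }
    });

    // Stage 2: the shard executes each micro-batch as it arrives; in the real
    // system this is Block-STM plus prefetching and cross-shard messages.
    thread::spawn(move || {
        for batch in to_shard_rx {
            thread::sleep(exec);
            to_coord_tx.send(batch).unwrap();
        }
    });

    // Stage 3: results stream back to the coordinator, one hop each, while
    // later batches are still being sent and executed upstream.
    let mut done = 0u32;
    for _batch in to_coord_rx {
        thread::sleep(hop);
        done += 1;
    }

    let pipelined = start.elapsed();
    let serial = (hop + exec + hop) * BATCHES;
    println!("{done} micro-batches pipelined in {pipelined:?} vs ~{serial:?} if run serially");
}
```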

Aggregation of Results

An interesting challenge of scaling out execution across shards is the need for aggregation. Some resources, like the total supply of the gas token, are modified by every transaction. If these resources were included in the conflict graph, every transaction would conflict, forcing sequential execution across shards. To address this, each transaction computes its delta to the total supply during execution on its respective shard, and these deltas are aggregated at the end to update the total supply in the outputs. This aggregation process is also pipelined, contributing only a negligible overhead to the overall execution time.

Aggregation of Execution Results from Shards

While it may seem that the aggregator is completely serialized, the pipelined architecture ensures that there is no noticeable overhead of aggregation.
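Here is a minimal sketch of the delta-aggregation idea, assuming each shard reports a signed net change to the total supply for the block (the type and field names are illustrative only):

```rust
// Simplified sketch of delta aggregation for a hot resource like the gas
// token's total supply: each shard reports a signed net delta for the block,
// and the coordinator applies the sum once at the end instead of serializing
// every transaction on the shared counter.
struct ShardOutput {
    total_supply_delta: i128, // net burn (negative) or mint (positive) on this shard
}

fn aggregate_total_supply(initial_supply: u128, shard_outputs: &[ShardOutput]) -> u128 {
    let net_delta: i128 = shard_outputs.iter().map(|o| o.total_supply_delta).sum();
    // Apply the net delta once, after all shards have finished executing.
    (initial_supply as i128 + net_delta) as u128
}

fn main() {
    let outputs = [
        ShardOutput { total_supply_delta: -150 },  // gas burned on shard 0
        ShardOutput { total_supply_delta: -90 },   // gas burned on shard 1
        ShardOutput { total_supply_delta: 1_000 }, // tokens minted on shard 2
    ];
    println!("total supply after the block: {}", aggregate_total_supply(1_000_000, &outputs));
}
```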

Results

Setup

Shardines’ architecture demonstrates the ability to scale blockchain execution throughput beyond 1 million transactions per second (TPS) for non-conflicting transaction patterns. To achieve this, we abstracted the non-execution components (like block generation and storage) and implemented the benchmark in a single-node application called the execution coordinator. By scaling up the coordinator, we ensured that horizontal scaling of execution was minimally impacted by these abstracted components. The sharded execution time is the time it takes to distribute transactions to the executor shards, execute the transactions, collect the results, and then perform any additional processing that may be required.

The experimental setup included the following specifications:

  • Executor Shard Nodes: T2d-60 on GCP with 32 Gbps bandwidth, 60 vCPUs (physical cores) running at 2.45 GHz on AMD Milan, 240 GB RAM.
  • Coordinator Node: A NUMA machine with 360 vCPUs (180 vCPUs on a single NUMA node used), 100 Gbps bandwidth, AMD Genoa processor, 708 GB RAM, and 2 TB SSD persistent disks.

Workloads

Shardines’ scalability was evaluated using two distinct workloads: one simulating a workload without conflicts and another simulating a workload with conflicts. These workloads show how the architecture handles various transaction patterns efficiently at scale.

Workload Without Conflicts

To evaluate the scale-out capabilities of our architecture with minimal overhead, we designed a workload consisting of transactions that:

  • Update the account state by incrementing the sequence number, adjusting the balance, and modifying the total supply of the gas token.
  • Emit an event.

These transactions update the state of a single account, making them inherently non-conflicting with other transactions from different accounts.

We tested this workload across 50 blocks, each containing 500,000 transactions. The results demonstrated linear throughput scaling, achieving approximately 1.033 million transactions per second (TPS) with 30 shards. At the shard level, performance remained consistent at ~40,000 TPS per shard across most configurations. With 30 shards, there was a slight dip to around 35,000 TPS per shard, reflecting the overhead introduced by increased coordination.

These results highlight the efficiency of our scale-out architecture in handling high-throughput workloads with minimal performance degradation as the system scales.

Conflicting Workload

Blockchains have a vibrant ecosystem of decentralized applications (DApps) catering to various use cases such as gaming, finance, social networking, and supply-chain management. Within a block boundary, users typically interact with a single “main DApp” aligned with their primary interests, while interactions with multiple DApps occur much less frequently. This workload pattern creates a graph where users (via their transactions) cluster around the resources of a single DApp, with few resource accesses crossing clusters.

To simulate such a workload, we generated a block containing 500,000 transactions across 400 DApps, with each DApp managing 5 resources. The transaction distribution followed a normal pattern across DApps, while the number of transactions per user was log-normally distributed with a mean of 5. Each user had a primary DApp, with a 99.9% probability of accessing resources exclusively within that DApp. This means that most users interacting with the same DApp will have transactions that are conflicting — writing to and reading from the same DApp resource.
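For reference, a sketch of such a workload generator is shown below. It assumes the rand and rand_distr crates, and any distribution parameters not stated above (for example the standard deviations) are illustrative guesses rather than the values used in the benchmark.

```rust
// Sketch of the conflicting-workload generator described above, assuming the
// `rand` and `rand_distr` crates. Parameters beyond those stated in the text
// (400 DApps, 5 resources each, log-normal mean of 5 transactions per user,
// 99.9% primary-DApp affinity) are illustrative guesses.
use rand::Rng;
use rand_distr::{Distribution, LogNormal, Normal};

const NUM_DAPPS: usize = 400;
const RESOURCES_PER_DAPP: usize = 5;

fn main() {
    let mut rng = rand::thread_rng();
    // Log-normal transactions per user with mean ~5:
    // mean = exp(mu + sigma^2 / 2), so mu = ln(5) - 0.5 for sigma = 1.
    let txns_per_user = LogNormal::new(5.0f64.ln() - 0.5, 1.0).unwrap();
    // DApp popularity roughly normal around the "middle" DApp (std dev is a guess).
    let dapp_popularity = Normal::new(NUM_DAPPS as f64 / 2.0, NUM_DAPPS as f64 / 6.0).unwrap();

    let mut txns: Vec<(usize, usize)> = Vec::new(); // (dapp_id, resource_id)
    let mut users = 0;
    while txns.len() < 500_000 {
        users += 1;
        // Each user gets a primary DApp drawn from the popularity distribution.
        let primary = (dapp_popularity.sample(&mut rng).round() as i64)
            .clamp(0, NUM_DAPPS as i64 - 1) as usize;
        let n = txns_per_user.sample(&mut rng).round().max(1.0) as usize;
        for _ in 0..n {
            // 99.9% of accesses stay within the user's primary DApp.
            let dapp = if rng.gen_bool(0.999) { primary } else { rng.gen_range(0..NUM_DAPPS) };
            let resource = rng.gen_range(0..RESOURCES_PER_DAPP);
            txns.push((dapp, resource));
        }
    }
    txns.truncate(500_000);
    println!("generated {} transactions for {} users", txns.len(), users);
}
```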

Across 50 blocks, we exceeded 500,000 TPS. While throughput scaling to 30 shards is not perfectly linear, it remains impressive. The per-shard TPS decreases as the number of shards increases because partitioning becomes less optimal, which leads to a rise in cross-shard dependent transactions. We plan to further refine our partitioning algorithm while also exploring alternative approaches to achieve more optimal partitioning and closer-to-linear scaling.

Saturation Due to Unsharded Storage

Interestingly, the execution throughput was eventually constrained not by the execution layer itself, but by the storage layer’s scaling limitations.

In our experimental setup, the execution coordinator manages all non-execution tasks, with storage being the major component. Currently, the storage service running on a single-node coordinator has a limited read/write throughput. This limitation acts as a bottleneck, restricting the horizontal scalability of execution beyond 30 shards. This indicates that the scale-out architecture for execution has room for further growth. Future improvements will focus on scaling out the storage layer to push these boundaries further.

Reproducible Open Source Benchmark

Aptos is committed to delivering open-source and verifiable performance results, promoting transparency in the blockchain world. The horizontally sharded executor code and performance benchmark are fully open source, and we encourage blockchain enthusiasts to explore, experiment, and share their suggestions for improvement.

Future Work

With Shardines, we have demonstrated that horizontal scalability is attainable on Aptos, achieving a remarkable milestone in blockchain performance. We’ve already hit 1 million TPS using 30 shards, but this is just the beginning. By further scaling up single node Block-STM performance and scaling out beyond 30 shards, Aptos is poised to unlock even greater transaction processing capabilities and redefine the boundaries of blockchain performance.

Looking ahead, we will focus on horizontally scaling the storage layer to eliminate current bottlenecks and unlock even greater throughput. Orthogonally, as part of Block-STM V2, we will also work on optimizing both single-threaded (AptosVM) and single-node (Block-STM) performance. Furthermore, we are committed to bringing end-to-end horizontal scalability to the Aptos mainnet with an architecture that supports the independent scaling of the consensus, storage, and execution layers.

Welcome to a paradigm shift in the future of blockchain technology as we know it.


Written by Aptos Labs

Aptos Labs is a premier Web3 studio of engineers, researchers, strategists, designers, and dreamers building on Aptos, the Layer 1 blockchain.
