Subspace road towards 1M+ TPS

Nazar Mokrynskyi
Published in WASM conference
12 min read · Jun 11, 2022

Post Material after WASM conference 2022

Why?

At Subspace we believe Web3 is the future, which in turn means more and more apps will need to run on-chain. While current-generation blockchains push the boundaries of what is possible, we need a system that can truly run at Internet scale.

More

So we need more, much more TPS and on-chain persistent storage capacity to support use cases like the Metaverse, where quite literally everything is an NFT that needs on-chain ownership and some assets associated with it. In this article we’ll focus on transaction throughput specifically.

The plan

In this article we’ll start with a deep dive into some aspects of the Wasm Runtime environment that might impact performance both for pallets and for smart contracts, then we’ll dive into challenges specific to smart contracts (pallet-contracts in Substrate).

Once we are done with internals, we’ll look at the blockchain from the other side of the stack: how transactions are scheduled for execution, what opportunities for improvement exist, and how we can leverage some vertical scalability techniques to achieve higher performance on a single chain.

Then we’ll talk about scaling single chain performance horizontally with sharding.

We’ll conclude with some simple math and fun numbers based on public information that will give us an idea of what the performance of a blockchain utilizing some of the described techniques might look like.

Overall this article is the result of extensive research and of following upstream work in Substrate and ink!, and I hope you’ll find this information useful. Be prepared for a bunch of links with additional reading material in case anything is of special interest to you.

Fundamental limits of physics

Before we dive into how blockchains look internally, we need to understand some of the fundamental limits we can’t bypass on one hand and should strive for on the other.

The ideal blockchain would utilize all of the bandwidth available, work at the network propagation delay, use all of the CPU cores/RAM/GPUs available, and would probably end up bottlenecked by the random access performance of the disk subsystem.

When looking at existing live blockchains, however, they typically utilize only a fraction of those resources:

  • coupled block authoring and execution limits bandwidth utilization as increasing block size beyond optimal leads to increase in propagation delay
  • reducing block time typically leads to increased forking rate and thus security decrease
  • it is still a big challenge for blockchains to utilize more than one core for transaction processing, let alone take advantage of GPUs capable of processing large amounts of parallel tasks
  • situation around storage is challenging as well both in terms of preserving the whole history of the chain, which requires a lot of space, as well as performance-wise due to the nature of frequent random reads/writes during smart contract execution

It doesn’t mean, however, that we can’t get closer to those limits. One of the papers that describes how to overcome the limitations of vertical scalability while improving latency is Prism: Deconstructing the Blockchain to Approach Physical Limits, where researchers separate different parts of the regular operation of a blockchain (transaction processing, block authoring and block confirmation) into separate chains that can be scaled independently, improving both transaction throughput and confirmation latency at the same time: Scaling Bitcoin by 10,000x.

Wasm Background

WebAssembly is a binary instruction format for a stack-based virtual machine. In short that means you don’t run it directly on the hardware, but rather with an engine that can turn WebAssembly into actual hardware instructions specific to your CPU.

There are a few types of those:

  • Interpreters (Substrate uses Wasmi)
  • Ahead-of-Time compilers (Substrate has Wasmtime and Wasmer)
  • Just-in-Time (JIT) compilers

The interpreter is the simplest of the bunch: it just reads and executes the code instruction by instruction, meaning it doesn’t have to do much preparation before it starts running, and it has predictable performance. Ultimately, though, its performance is ~1–3% of native machine code, which is not particularly impressive.

An AoT compiler will generally produce much more efficient code for the CPU, reaching ~50% of native performance (sometimes higher), but compilation takes a long time due to the various optimizations the compiler does, meaning startup takes orders of magnitude longer compared to an interpreter.

JIT compilers are an interesting mix: they can start interpreting code right away, apply various smart optimizations to hot paths in the background, and swap slower code for faster code on the fly.
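To build intuition for why interpretation tops out at a few percent of native speed, here is a toy stack-machine interpreter; the decode-and-dispatch loop it runs for every single instruction is exactly the overhead AoT and JIT compilers eliminate by emitting native code:

```rust
// A toy stack-machine interpreter (not Wasmi): each iteration pays for
// decoding the opcode before doing any actual work.
enum Op {
    Push(i64),
    Add,
    Mul,
}

fn interpret(code: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    for op in code {
        // Decode-and-dispatch happens on every instruction.
        match op {
            Op::Push(v) => stack.push(*v),
            Op::Add => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a + b);
            }
            Op::Mul => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a * b);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    // (2 + 3) * 4
    let code = [Op::Push(2), Op::Push(3), Op::Add, Op::Push(4), Op::Mul];
    println!("{}", interpret(&code)); // 20
}
```

A compiler would turn the same five opcodes into a handful of native instructions with no dispatch loop at all, which is where the order-of-magnitude gap comes from.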

Here are a few useful links to read about Wasm performance:

Substrate Wasm Runtime environment

In Substrate, the Runtime is a bunch of code compiled into a Wasm blob that implements the state transition function, which means it takes the state as of the previous block and a set of transactions and produces a new state.

For this use case Substrate uses the Wasmtime AoT compiler. This is primarily because runtime upgrades (one of the important distinguishing features of Substrate) happen relatively infrequently, so it is fine to spend some time getting a faster binary out of the compiler.

In order to make sure individual blocks can be executed within some known amount of time, a weight system is used for extrinsic calls (think transactions) to limit the number of extrinsics that can be included per block. This works because runtime code is considered “trusted”, so it is possible to rely on it to report that information to the client (the part of the node outside of the runtime).
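The weight-limited block filling described above can be sketched as follows (toy types, not Substrate’s actual `Weight` or transaction pool APIs):

```rust
// Toy sketch: include extrinsics from a pool until the per-block weight
// budget would be exceeded, so block execution time stays bounded.
#[derive(Clone)]
struct Extrinsic {
    id: u32,
    weight: u64, // benchmark-derived cost, reported by the trusted runtime
}

const MAX_BLOCK_WEIGHT: u64 = 100;

fn fill_block(pool: &[Extrinsic]) -> Vec<u32> {
    let mut used = 0u64;
    let mut included = Vec::new();
    for xt in pool {
        // Skip anything that would push the block past its weight limit.
        if used + xt.weight <= MAX_BLOCK_WEIGHT {
            used += xt.weight;
            included.push(xt.id);
        }
    }
    included
}

fn main() {
    let pool = vec![
        Extrinsic { id: 0, weight: 40 },
        Extrinsic { id: 1, weight: 50 },
        Extrinsic { id: 2, weight: 30 }, // skipped: 120 > 100
        Extrinsic { id: 3, weight: 10 },
    ];
    println!("{:?}", fill_block(&pool)); // [0, 1, 3]
}
```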

The client also caches different versions of the runtime in order to avoid unnecessary re-compilation and make sure the blockchain runs as fast as possible.

For performance reasons, and due to the WebAssembly sandbox, there is a set of host functions exposed to the runtime that allows pallets to manipulate state, spawn other Wasm VM instances, and offload heavy cryptographic and other computations to native implementations with hardware support.

While Substrate maintainers at Parity are working hard on things like enabling more WebAssembly features and other low-level optimizations, there is still something you can do to make things faster in your own runtime.

First of all, you can introduce more host functions to offload heavy computations, for instance when you verify a ZK proof on a specific curve. The thing to remember here, though, is that host functions are not free: your client will have to support any introduced host APIs basically forever, or old runtimes would break, for instance when syncing from genesis. There are quite a few primitives already available in the sp-io crate, take advantage of that!

Also, benchmark your runtime: sometimes approaches that lead to faster native code, like monomorphization, can be more expensive than dynamic dispatch in Wasm. Another example: passing data structures to functions by reference instead of by value may lead to smaller and faster code.

Finally, it might be the case that some kind of smart caching combined with the Wasmer singlepass compiler for popular contracts (as observed in the Ethereum ecosystem) could be used to significantly improve throughput, but more work needs to be done in that direction.
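One way such caching could look (purely hypothetical types, not an existing Substrate API): key the compiled artifact by the hash of the contract’s Wasm code, so repeated calls to a popular contract skip compilation entirely:

```rust
// Hypothetical sketch of a compiled-module cache for popular contracts.
// CodeHash and CompiledModule are illustrative stand-ins, not real APIs.
use std::collections::HashMap;

type CodeHash = u64;

struct CompiledModule {
    // In a real system: native code produced by a singlepass compiler.
}

struct ModuleCache {
    modules: HashMap<CodeHash, CompiledModule>,
    compilations: u32, // counts how often we actually had to compile
}

impl ModuleCache {
    fn new() -> Self {
        Self { modules: HashMap::new(), compilations: 0 }
    }

    fn get_or_compile(&mut self, hash: CodeHash) -> &CompiledModule {
        if !self.modules.contains_key(&hash) {
            self.compilations += 1; // cache miss: pay compilation cost once
        }
        self.modules.entry(hash).or_insert_with(|| CompiledModule {})
    }
}

fn main() {
    let mut cache = ModuleCache::new();
    // Three calls to the same popular contract trigger a single compilation.
    for _ in 0..3 {
        cache.get_or_compile(0xDEAD);
    }
    println!("compilations: {}", cache.compilations); // 1
}
```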

Here are some of the interesting optimizations on the table in Substrate, worth following, that will make the runtime even faster:

Pallet-contracts

This is Substrate’s default environment for WebAssembly smart contracts (typically written in the ink! DSL, but there are other options).

The first challenge with smart contracts comes from the fact that user-provided code can’t be considered “trusted” the same way runtime can, which means:

  • AoT compilers are not generally applicable
  • Gas metering is required to prevent DoS instead of static weights

By default pallet-contracts uses the slow Wasmi interpreter. Making matters worse, the Wasmi interpreter itself is compiled to Wasm and runs fully inside of the runtime, which doesn’t help with performance, as you might have guessed.

There is ongoing work to make use of the Wasmer singlepass AoT non-optimizing compiler, which is generally faster at running compiled code, but takes time to prepare the Wasm beforehand. In the end the benefit is a wash when end-to-end execution time is measured but, hopefully, that’ll change in the near future. Follow the latest progress on Wasmer singlepass at https://github.com/paritytech/substrate/issues/10298.

Gas metering is a big topic in any VM and Wasm is no exception. Most of the research easily searchable online compares eWasm with EVM for the Ethereum 2.0 upgrade and rarely covers WebAssembly smart contracts more generally.

Approaches to gas metering can be generalized into 3 broad categories:

Native instrumentation is fast and great overall, but both NEAR and Substrate want to be VM-agnostic and need a generic solution. The way it works at a high level in both today is to instrument the Wasm bundle with injected gas metering code before execution. This allows interesting optimizations: gas for a piece of code that doesn’t branch can be charged all in one go, removing most of the gas metering calls. Where NEAR and Substrate differ, however, is that in NEAR gas metering is done with a counter local to the WebAssembly code, while Substrate currently does host calls all the time, which has significant implications described below.
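The straight-line-block optimization described above can be sketched as follows (a toy model, not the actual pallet-contracts instrumentation):

```rust
// Toy model of basic-block gas metering: instead of charging gas per
// instruction, charge the whole straight-line block once, so metering
// calls only happen at branches.
enum Inst {
    Arith,
    Load,
    Branch,
}

fn cost(i: &Inst) -> u64 {
    match i {
        Inst::Arith => 1,
        Inst::Load => 3,
        Inst::Branch => 2,
    }
}

/// Returns (total gas charged, number of metering calls) when charging once
/// per straight-line block that ends at a branch.
fn meter_blocks(code: &[Inst]) -> (u64, u64) {
    let (mut total, mut metering_calls, mut block_cost) = (0, 0, 0);
    for inst in code {
        block_cost += cost(inst);
        if matches!(inst, Inst::Branch) {
            total += block_cost; // one charge covers the whole block
            metering_calls += 1;
            block_cost = 0;
        }
    }
    if block_cost > 0 {
        total += block_cost;
        metering_calls += 1;
    }
    (total, metering_calls)
}

fn main() {
    let code = [Inst::Arith, Inst::Load, Inst::Arith, Inst::Branch, Inst::Load, Inst::Arith];
    // 6 instructions, but only 2 metering calls instead of 6.
    println!("{:?}", meter_blocks(&code)); // (11, 2)
}
```

If every metering call is a host call (as in Substrate today), the number of branches directly drives metering overhead, which is exactly what bites the “lazy” data structures described next.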

In ink! 3.0 a bunch of smart “lazy” data structures were removed. The idea behind those was to minimize state access by reading only the necessary subset of the state available to the contract. This is a sound approach on the surface and should result in faster code, but in practice there turned out to be two issues with it that led to the removal of those data structures:

  • smart data structures mean a lot of extra code added to the smart contract’s Wasm bundle, which makes it both slower to run than it would otherwise be and adds to startup latency
  • sophisticated data structures added a lot of extra branching to the code of the contract, meaning the gas metering mentioned above happened much more often

In the end, while the gas charged might have been lower in some cases, the cost of gas metering itself (host calls) meant that the contract call as a whole ran SLOWER. Yeah, I found that whole situation quite funny, to be completely honest. See https://github.com/paritytech/ink/pull/1111 for some of the details.

Starting point

Let’s step back from the low-level details of how things work internally and how you can make your code faster: what is our baseline anyway?

I was able to find these Tweets from Gavin Wood himself:

https://twitter.com/gavofyork/status/1255859146127179782
https://twitter.com/gavofyork/status/1270025498580656134

I think it is fair to assume that things have only gotten better since, but we can take those as a baseline anyway and estimate how far off we are from the goal of 1M+ TPS if we just use an off-the-shelf Substrate-based node:

1_000_000/1_500 ≈ 666×

Not great, not terrible.

Where is block time spent anyway?

Turns out we can slice block time into 3 big chunks:

  • Block authoring (collecting transactions from the transaction pool, executing them, putting the block header together, signing)
  • Block propagation
  • Block import

Only after the newly produced block has been imported can we move on to building the next block on top.

Turns out most of the time nodes are not executing transactions…

…but it would have been really nice if they did!
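To make the utilization argument concrete, here is a toy model, under the simplifying assumption (used again later in the article) that the three stages take roughly equal time:

```rust
// Toy model: in a coupled chain, authoring, propagation and import happen
// sequentially per block, and transactions only execute during one of the
// three stages, so the execution engine idles most of the time.
fn coupled_busy_fraction(stages: u32, executing_stages: u32) -> f64 {
    executing_stages as f64 / stages as f64
}

fn main() {
    // Three roughly equal stages; execution happens in one of them.
    let busy = coupled_busy_fraction(3, 1);
    println!("coupled execution utilization: {:.0}%", busy * 100.0); // 33%
    // If decoupling keeps the execution engine busy ~100% of the time,
    // the effective throughput gain is the inverse of that fraction:
    println!("potential gain: {:.0}x", 1.0 / busy); // 3x
}
```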

Decoupled execution

In a standard blockchain each full node will:

  • Propose new blocks
  • Verify new transactions
  • Maintain chain history (blocks)
  • Maintain chain state (result of blocks execution)

In Subspace we apply a technique similar to the one described in the Prism paper linked earlier, where we split those responsibilities between two distinct roles on the network: farmers and executors.

Two roles in Subspace: farmers and executors

Farmers provide safety to the protocol using a novel Proof-of-Archival-Storage consensus and are responsible for proposing blocks and maintaining history, while staked executors provide liveness to the protocol by verifying transactions and maintaining state (doing execution).

So instead of coupling everything together we have the following workflow:

  • users submit transactions
  • executors validate transactions and combine them into transaction bundles
  • farmers include transaction bundle headers into blocks that they produce
  • executors then deterministically follow farmer blocks and apply state transitions that result in execution receipts
  • farmers include execution receipts in upcoming blocks, completing the journey of the transaction
Decoupled execution

Such decoupling allows farmers to produce blocks at their own pace and propose relatively small transaction bundle headers, resulting in low overhead and opening possibilities for mass participation in consensus (we had over 25_000 consensus nodes during a stress-test of our non-incentivized test network), while executors can pack megabytes of transactions into bundles and keep their execution engine busy 100% of the time.
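The workflow above can be sketched with some hypothetical types (names are illustrative, not Subspace’s actual data structures):

```rust
// Illustrative types for the decoupled-execution workflow; the real
// protocol's structures differ.
struct Transaction {
    payload: Vec<u8>,
}

/// Executors batch validated transactions into bundles...
struct Bundle {
    transactions: Vec<Transaction>,
}

/// ...but farmer blocks only ever carry the small, fixed-size header.
struct BundleHeader {
    bundle_hash: [u8; 32],
    executor_id: u64,
}

/// After deterministically executing a block's bundles, executors publish a
/// receipt committing to the resulting state, included in a later block.
struct ExecutionReceipt {
    block_number: u64,
    state_root: [u8; 32],
}

fn main() {
    let bundle = Bundle {
        transactions: vec![Transaction { payload: vec![1, 2, 3] }],
    };
    let header = BundleHeader { bundle_hash: [0u8; 32], executor_id: 42 };
    // A bundle can hold megabytes of transactions; the header stays tiny.
    println!(
        "bundle txs: {}, header bytes: {}",
        bundle.transactions.len(),
        std::mem::size_of_val(&header)
    );
    let _ = ExecutionReceipt { block_number: 1, state_root: [0u8; 32] };
}
```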

This is certainly great for performance! If we assume that block authoring, propagation and import take about the same amount of time, we can increase effective transaction throughput by a factor of 3, leading us to 4_500 TPS. How far off are we from 1M now?

1_000_000/4_500 ≈ 222×

This is certainly better than before, but we are still 2 orders of magnitude below target.

For some context, here are some claimed (and somewhat theoretical) numbers from already live protocols:

There is still room to grow then.

Cores? Cores!

Modern CPUs are multi-threaded, why not make blockchain multi-threaded too?

https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake

There are two options for how transaction processing in blockchains can be made parallelizable that I’d like to cover:

Speculative execution: run transactions in parallel optimistically, detect conflicts over the same state, and re-execute the conflicting ones.

Explicit:

A simpler route that both NEAR and Solana took is to change the design of the environment and make it easy for the executor to identify which transactions touch non-conflicting parts of the state, so that the executor knows precisely which transactions can run in parallel. In both cases the state is conceptually tied to the account instead of the contract, so transactions from different accounts often do not overlap. In Solana it is even envisioned that SIMD instructions and eventually even GPUs can be used to run the same contract with multiple accounts (inputs) massively in parallel.
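A minimal sketch of the explicit approach, with hypothetical types: each transaction declares the account keys it touches, and a scheduler greedily groups transactions with disjoint key sets into batches that can safely run in parallel:

```rust
// Illustrative scheduler for explicitly-annotated transactions; real
// runtimes use their own state-access declarations.
use std::collections::HashSet;

struct Tx {
    id: u32,
    keys: HashSet<u32>, // account keys this transaction reads/writes
}

/// Greedily place each transaction into the first batch whose key set it
/// doesn't conflict with; every batch can then run on its own core.
fn schedule(txs: &[Tx]) -> Vec<Vec<u32>> {
    let mut batches: Vec<(HashSet<u32>, Vec<u32>)> = Vec::new();
    for tx in txs {
        match batches.iter_mut().find(|(keys, _)| keys.is_disjoint(&tx.keys)) {
            Some((keys, ids)) => {
                keys.extend(&tx.keys);
                ids.push(tx.id);
            }
            None => batches.push((tx.keys.clone(), vec![tx.id])),
        }
    }
    batches.into_iter().map(|(_, ids)| ids).collect()
}

fn main() {
    let tx = |id, ks: &[u32]| Tx { id, keys: ks.iter().copied().collect() };
    let txs = [tx(0, &[1]), tx(1, &[2]), tx(2, &[1, 3]), tx(3, &[4])];
    // txs 0, 1, 3 touch disjoint accounts and can run in parallel;
    // tx 2 conflicts with tx 0 and lands in a second batch.
    println!("{:?}", schedule(&txs)); // [[0, 1, 3], [2]]
}
```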

Let’s not be greedy, though, and assume we can build a simpler system with explicit annotations about which state will be read and written by a contract call and still get an 8x improvement; then we get much closer to 1M TPS with 36_000 TPS, but are still quite far from it (and even from NEAR/Solana theoretical performance):

1_000_000/4_500/8 ≈ 27.8×

Horizontal scaling (Sharding)

Once we’ve reached the reasonable limit of vertical scaling, all that is left is to apply the classical Web2 approach: horizontal scaling with sharding.

In Subspace, due to the decoupled execution nature of the protocol, we can implement nice tricks that don’t require sacrificing safety when shards are added. The insight is that even with multiple shards the history of the blockchain (which farmers store and use as tickets for a chance to produce the next block) can be global: the histories of all individual chains tie together in the end, and it is that global history farmers use for, well, farming. Once a farmer wins, the solution is sortitioned to one of the shards, where they get to produce a block. With farmers effectively solving on all shards at the same time, all shards get the same, full security in a permissionless setting.

Executors, however, are free to choose which shard they stake and execute on, using a protocol based on Free2Shard, so they don’t have to rotate often and incur significant overhead.

Together those two aspects allow for a sharding design that isn’t limited in the same way as other sharding designs are and can scale without limits. The working number for calculations is 2¹⁶ shards.
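To make the sortition idea concrete, here is an illustrative sketch (the real protocol uses its own solution format and sortition rules, not a generic hash):

```rust
// Illustrative sortition: a winning farming solution over the global history
// is mapped deterministically to one of 2^16 shards.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SHARDS: u64 = 1 << 16; // 65_536

/// Deterministically assign a winning solution to a shard index.
fn assign_shard(solution: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    solution.hash(&mut h);
    h.finish() % SHARDS
}

fn main() {
    // Every farmer solves over the same global history; only the winner's
    // shard assignment differs, so all shards share the full security.
    let shard = assign_shard(b"winning-solution");
    assert!(shard < SHARDS);
    println!("assigned shard: {shard}");
}
```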

Horizontal sharding

When we take that into account, the situation with the 1M TPS goal changes significantly:

1_000_000/4_500/8/65_536=0.000_423_855

When we turn it the other way around we get something that is hard to believe without explanation above:

4_500*8*65_536=2_359_296_000

It doesn’t mean that the protocol will ever process over 2B TPS. What it means is that it could if there were enough demand for it; otherwise we can have 4.5 TPS per shard, or 65 shards, or some mix of those, and still push 1M+ TPS on the Subspace Network.
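All of the back-of-the-envelope math from this article, collected in one place (the article’s working assumptions, not measurements):

```rust
// The article's scaling arithmetic: baseline single-chain throughput,
// times the pipelining gain from decoupled execution, times explicit
// parallelism across cores, times the number of shards.
fn main() {
    let baseline_tps: u64 = 1_500;  // rough off-the-shelf Substrate baseline
    let decoupling_gain: u64 = 3;   // authoring/propagation/import pipelined
    let parallelism_gain: u64 = 8;  // explicit parallel execution on 8 cores
    let shards: u64 = 1 << 16;      // 65_536 shards

    let per_shard = baseline_tps * decoupling_gain * parallelism_gain;
    assert_eq!(per_shard, 36_000);

    let total = per_shard * shards;
    println!("theoretical ceiling: {total} TPS"); // 2_359_296_000
    assert!(total > 1_000_000);
}
```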



Co-founder & CTO @ Subspace Labs, Open Source enthusiast