Optimizing the Internet Computer Memory System’s Performance

DFINITY
The Internet Computer Review
Oct 21, 2021

The growth of canister smart contracts on the Internet Computer has led to substantial improvements in query and update call performance.

By Ulan Degenbaev, Principal Software Engineer | DFINITY

The Internet Computer blockchain makes it possible for the world’s systems and services to be created using advanced canister smart contracts that run entirely on-chain. Canisters enable autonomous service composability that can drive extraordinary network effects, allowing practically any enterprise to be fundamentally reimagined. Since network Genesis in May, the DFINITY Canister SDK has been used to create thousands of canister smart contracts on the Internet Computer, many of which are complete Web3 dapps.

The rapid growth of canisters and users on the Internet Computer blockchain presents interesting engineering challenges. A recent increase in memory-intensive canisters demonstrated that the memory system had a performance bottleneck under heavy load. This blog post describes performance optimizations from NNS proposal 20461, providing details about scaling and WebAssembly (Wasm) memory.

Results

The optimizations were incrementally deployed to all Internet Computer subnets after the adoption of NNS proposal 20461 on Sept. 14. Figures 1–3 show the impact of the optimizations on one subnet under heavy load at the time of the upgrade. You can see two main improvements:

  1. Increased and more stable finalization: The choppy finalization rate recovered from 0.5 blocks per second to the expected level of 1 block per second.
  2. Improved message execution time: The average message execution time improved by ~3x and the maximum improved by ~10x.
Figure 1. Block finalization rate before and after the rollout of the optimizations. The red line shows the time when the replica was restarted after the upgrade.
Figure 2. The average message execution duration before and after the rollout of the optimizations.
Figure 3. The maximum message execution duration before and after the rollout of the optimizations.

Background: orthogonal persistence

There are two types of messages that a canister can receive and execute: queries and updates. Queries are read-only in the sense that all modifications that a query performs in the Wasm memory are discarded after execution. In contrast, a successful execution of an update message automatically persists all memory changes and makes them available to subsequent update messages and queries. This concept is known as orthogonal persistence.

Any implementation of orthogonal persistence has to solve two problems:

  1. How to map the persisted memory into the Wasm memory.
  2. How to keep track of all modifications in the Wasm memory so that they can be persisted later on.

The current implementation uses page protection to solve both problems. When a message starts executing, we divide the entire address range of the Wasm memory into 4KiB chunks called pages. Initially, all pages are marked as inaccessible using the page protection flags of the operating system. This means that the first memory access triggers a page fault, pauses the execution, and invokes our signal handler. The signal handler then fetches the corresponding page from persisted memory and marks the page as read-only. Subsequent read accesses to that page will succeed without any help from the signal handler. The first write access will trigger another page fault, however, and allow the signal handler to remember the page as modified and mark the page as readable and writable. All subsequent accesses to that page (both read and write) will succeed without invoking the signal handler.
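The per-page state transitions described above can be sketched as a small state machine. This is a minimal illustration, not the replica's actual code: `MemoryTracker`, `PageState`, and `on_fault` are hypothetical names, and the real signal handler changes OS page protection flags rather than updating an in-memory map.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum PageState {
    Inaccessible, // initial state: any access triggers a page fault
    ReadOnly,     // contents fetched from persisted memory; writes still fault
    ReadWrite,    // tracked as modified; no further faults
}

#[derive(Clone, Copy)]
pub enum Access {
    Read,
    Write,
}

pub struct MemoryTracker {
    states: HashMap<usize, PageState>,
    dirty: HashSet<usize>, // pages to persist after a successful update
}

impl MemoryTracker {
    pub fn new() -> Self {
        Self { states: HashMap::new(), dirty: HashSet::new() }
    }

    pub fn state(&self, page: usize) -> PageState {
        *self.states.get(&page).unwrap_or(&PageState::Inaccessible)
    }

    pub fn dirty_pages(&self) -> &HashSet<usize> {
        &self.dirty
    }

    // Models one signal-handler invocation for a faulting access.
    pub fn on_fault(&mut self, page: usize, access: Access) {
        match (self.state(page), access) {
            // First read: fetch the page from persisted memory (elided
            // here) and mark it read-only.
            (PageState::Inaccessible, Access::Read) => {
                self.states.insert(page, PageState::ReadOnly);
            }
            // First write, from either state: remember the page as
            // modified and mark it readable and writable.
            (PageState::Inaccessible, Access::Write)
            | (PageState::ReadOnly, Access::Write) => {
                self.dirty.insert(page);
                self.states.insert(page, PageState::ReadWrite);
            }
            // Read-write pages never fault again.
            _ => {}
        }
    }
}
```

A page written during an update message therefore faults at most twice: once on the first read and once on the first write.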

Invoking a signal handler and changing page protection flags are expensive operations. Messages that read or write large chunks of memory cause a storm of such operations, degrading the performance of the whole system. This is the performance bottleneck observed under heavy load. Note that the signal handler was written before the launch of the Internet Computer, and its main priority was correctness and not performance.

Background: concurrent query execution

A canister executes update messages sequentially, one by one. Queries, in contrast, can run concurrently with each other and with update messages. The support for concurrent execution makes the memory implementation much more challenging. Imagine that a canister is executing an update message at block height H. At the same time, there could still be a long-running query that started earlier at block height H-K. This means the same canister can have multiple versions of its memory active at the same time.

A naive solution to this problem would be to copy the entire memory after each update message. That would be slow and would use a lot of storage. The current implementation therefore takes a different route: it keeps the modified memory pages in a persistent tree data structure called PageDelta that is based on Fast Mergeable Integer Maps. At a regular interval (every N rounds), a checkpoint event commits the modified pages into the checkpoint file after cloning the file to preserve its previous version. Figure 4 shows how the Wasm memory is constructed from PageDelta and the checkpoint file.
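The page lookup this scheme implies can be sketched as follows. This is a simplification: `BTreeMap` stands in for the Fast Mergeable Integer Map that backs the real PageDelta, and the checkpoint file is likewise modeled as a map rather than a file on disk.

```rust
use std::collections::BTreeMap;

pub const PAGE_SIZE: usize = 4096;
pub type Page = [u8; PAGE_SIZE];

// Resolves a page the way Figure 4 describes: a page modified since
// the last checkpoint shadows the checkpointed copy.
pub fn get_page<'a>(
    page_delta: &'a BTreeMap<usize, Page>, // pages modified since the last checkpoint
    checkpoint: &'a BTreeMap<usize, Page>, // committed state at the last checkpoint
    index: usize,
) -> Option<&'a Page> {
    page_delta.get(&index).or_else(|| checkpoint.get(&index))
}
```

Because PageDelta is a persistent (immutable) data structure, each concurrent execution can hold its own cheap snapshot of it, which is what allows a long-running query to keep seeing the memory version from the round in which it started.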

Figure 4. a) The checkpoint file stores the Wasm memory at the last checkpoint. b) Pages modified since the last checkpoint are stored in a persistent data structure called PageDelta. c) The Wasm memory is constructed lazily by the signal handler by copying the checkpoint file pages and modified pages.

Optimization 1: memory-mapping the checkpoint file

The first optimization is to memory-map the checkpoint file pages. This reduces memory usage by sharing the pages between multiple messages running concurrently. It also improves performance by avoiding page copying on read accesses. The number of signal handler invocations remains the same as before, however, so the issue of signal storms remains open after this optimization.
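The effect of the memory mapping can be illustrated with reference counting. In the sketch below, `Arc` models what the OS page cache does for a memory-mapped file: concurrent executions share one physical copy of a read-only checkpoint page instead of each receiving its own 4 KiB copy. A real implementation would `mmap` the checkpoint file read-only; this is only an analogy for the sharing.

```rust
use std::sync::Arc;

// Hands out another view of the same checkpoint page. Cloning the Arc
// increments a reference count; no page contents are copied.
pub fn read_only_view(page: &Arc<Vec<u8>>) -> Arc<Vec<u8>> {
    Arc::clone(page)
}
```

Every concurrent query that only reads a page thus touches the same physical memory, which is what brings the memory usage down.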

Optimization 2: page tracking in queries

All memory pages that a query modifies are discarded after execution. This means that the signal handler doesn’t have to keep track of modified pages for queries. But the old implementation of the signal handler did not distinguish between update messages and queries. We introduced a fast path for queries that marks pages as readable and writable on the first access. This low-hanging fruit optimization made queries 1.5x-2x faster on average.
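The fast path can be sketched as a single decision in the fault handler. The names below (`Mode`, `protection_after_first_fault`) are hypothetical, and the string return value stands in for the OS protection flags the real handler installs.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Update,
    Query,
}

// Decides what protection to install on the first fault for a page,
// recording the page as dirty only when it must be persisted later.
pub fn protection_after_first_fault(
    mode: Mode,
    is_write: bool,
    page: usize,
    dirty: &mut HashSet<usize>,
) -> &'static str {
    match mode {
        // Fast path: a query's modifications are discarded after
        // execution, so grant full access immediately and skip the
        // dirty-page bookkeeping. Each page faults at most once.
        Mode::Query => "read+write",
        Mode::Update => {
            if is_write {
                dirty.insert(page); // must be persisted after execution
                "read+write"
            } else {
                // A later write must fault again so it can be tracked.
                "read"
            }
        }
    }
}
```

Halving the worst-case number of faults per page (one instead of two) lines up with the observed 1.5x-2x average speedup for queries.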

Optimization 3: amortized prefetching of pages

The idea behind the most impactful optimization is simple: if we want to reduce the number of page faults, then we need to do more work per signal handler invocation. Instead of fetching a single page at a time, the new signal handler tries to speculatively prefetch more pages. Striking the right balance is important here because prefetching too many pages may degrade the performance of small messages that access only a few pages. The optimization computes the largest contiguous range of accessed pages immediately preceding the current page and uses the size of that range as a hint for prefetching more pages. This way, the cost of prefetching is amortized over previously accessed pages. As a result, the optimization reduces the number of page faults in memory-intensive messages by an order of magnitude.
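The heuristic can be sketched as follows. `prefetch_hint` is an illustrative name; the real handler works over protection flags rather than a set, but the amortization argument is the same: a long sequential scan earns a large speculative fetch, while the first faults of a small message prefetch nothing extra.

```rust
use std::collections::HashSet;

// Given the set of pages the message has already accessed and the page
// that just faulted, returns how many extra pages to speculatively
// fetch: the length of the contiguous run of already-accessed pages
// immediately preceding the faulting one.
pub fn prefetch_hint(accessed: &HashSet<usize>, faulting_page: usize) -> usize {
    let mut run = 0;
    // Walk backwards while the preceding pages were already accessed.
    while faulting_page > run && accessed.contains(&(faulting_page - run - 1)) {
        run += 1;
    }
    run
}
```

During a sequential scan the hint roughly doubles on each fault, so the number of signal handler invocations grows only logarithmically in the number of pages touched (possibly subject to some cap on the maximum prefetch size).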

Conclusion

The original signal handler was written before the launch of the Internet Computer and prioritized correctness over performance, so it was not surprising that this area would eventually need optimization. The rapid growth of the Internet Computer required those optimizations much earlier than expected, however. They removed one bottleneck, but more bottlenecks may be discovered as the Internet Computer continues to grow. If you are interested in working on performance, DFINITY is hiring!

Acknowledgments

Thanks to Akhi Singhania, Alin Sinpalean, Dimitris Sarlis, Dominic Williams, Johan Granström, Kiran Joshi, Roman Kashitsyn, Saša Tomić, Stefan Dietiker, Stefan Kaestle for discussing ideas and reviewing code changes. Also thanks to Diego Prats for reviewing this blog post.
