Open VDF: Accelerating the Nova SNARK-based VDF

Supranational
Jan 25, 2023

Approximately one year ago we announced a collaboration with the Ethereum Foundation, Filecoin Foundation, Protocol Labs, Microsoft, and the Electric Coin Company to improve the performance of SNARKs, and to make SNARK-based Verifiable Delay Functions (VDFs) practical. Over the past year we have collectively made a huge amount of progress towards that goal, including an ASIC implementation of the MinRoot algorithm, implementation of a recursion-enabled proof system (Nova), and support for GPU accelerated proof generation. In today’s post we’ll provide an overview of the current performance of the Nova-based VDF and how it can be further improved.

About Nova

Nova is a novel approach to realizing incrementally verifiable computation (IVC). IVC is a powerful computing primitive that not only enables use cases such as VDFs, but also more general use cases like zkVMs. At a high level, the performance of Nova is dominated by two multi-scalar multiplications (MSMs), each of a size approximately the number of constraints in the circuit. In the case of the MinRoot VDF circuit, each iteration of the circuit requires 3 constraints. For the purposes of this article we will focus predominantly on the performance of computing these MSMs. However, it should be noted that the other computations required for generating Nova proofs become increasingly important as MSM performance improves.

Current CPU and GPU Performance Estimates

To evaluate the performance of a Nova VDF, we can begin by looking at MSM performance on modern CPU and GPU hardware. The pasta-msm repository supports both CPUs and GPUs, and contains a simple benchmark that can be run using `cargo bench` to estimate the MSM performance of your system. Below are the results for two common hardware platforms:

As we can see, GPU acceleration offers a significant speedup in MSM over the CPU hardware found in many desktops and laptops today.

Next, given that each iteration of the Nova MinRoot VDF requires two MSMs, each of size 3, we can estimate the number of VDF iterations that can be proven per second given the above MSM performance.
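As a sanity check on that estimate, the arithmetic can be sketched in a few lines (the ~28M bases/second GPU figure appears later in this post; any other throughput number is just an illustrative input):

```python
# Estimate provable MinRoot iterations per second from raw MSM throughput.
# Each iteration contributes ~3 constraints and requires two MSMs sized by
# the constraint count, so ~6 MSM bases are processed per iteration.
CONSTRAINTS_PER_ITERATION = 3
MSMS_PER_ITERATION = 2

def iterations_per_second(msm_bases_per_second: float) -> float:
    bases_per_iteration = MSMS_PER_ITERATION * CONSTRAINTS_PER_ITERATION
    return msm_bases_per_second / bases_per_iteration

# ~28M bases/second is the GPU figure referenced later in this post.
print(f"GPU at 28M bases/s: {iterations_per_second(28e6):,.0f} iterations/s")
```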

Fortunately, rather than rely on theoretical calculations, we are able to test the Nova VDF implementation directly by running `cargo run --release --example minroot` in the Nova repository:

While the actual CPU performance is near what was expected, the GPU performance falls substantially below the estimated performance due to Amdahl’s law. Digging in deeper, let’s look at the time spent on the MSM operations in comparison to the other ancillary operations:

Alternatively, we can look at the relative time spent in each of these categories between the two platforms:

The above illustration shows that even if MSM were entirely eliminated from the computation, the largest speedup that could be achieved over the CPU baseline is ~6.3x (100% / 16%) without tackling the additional operations.
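This ceiling is a direct application of Amdahl's law: the achievable speedup is bounded by the fraction of time spent in the parts that are not accelerated. A minimal sketch:

```python
# Amdahl's law: overall speedup when a fraction `p` of the runtime is
# accelerated by a factor `s` (the remaining 1 - p is unchanged).
def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# If MSM is 84% of the runtime and is eliminated entirely (s -> infinity),
# the bound is 1 / 0.16 = 6.25x, matching the ~6.3x figure above.
print(f"{amdahl_speedup(0.84, float('inf')):.2f}x")
```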

Future GPU Performance Estimates

After substantially improving the performance of MSM it is clear that the ‘other’ operations now take up the bulk of the computation time and need to be accelerated through optimization and parallelization. For the purposes of this blog we will assume that the various ‘other’ operations can be moved to the GPU and sped up by a factor of 10x. Once this is done we will once again find ourselves in the position where MSM is the dominating computational cost. While our previous simple benchmarks suggested a GPU can perform an MSM of size ~28M in one second, planned optimizations to the `pasta-msm` kernel will increase performance beyond 112M bases/second, a further 4x improvement, and would result in the following anticipated performance:
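Taking the assumptions above at face value (the 'other' operations sped up enough that MSM again dominates, and ~112M bases/second of MSM throughput), the new MSM-bound ceiling on proving throughput can be sketched as follows; the 6 bases-per-iteration figure comes from the two size-3 MSMs per iteration discussed earlier:

```python
# Projected MSM-bound ceiling on MinRoot proving throughput, assuming the
# planned pasta-msm kernel optimizations reach ~112M bases/second.
FUTURE_MSM_BASES_PER_SECOND = 112e6
BASES_PER_ITERATION = 6  # two MSMs per iteration, ~3 constraints each

msm_only_ceiling = FUTURE_MSM_BASES_PER_SECOND / BASES_PER_ITERATION
print(f"MSM-bound ceiling: {msm_only_ceiling:,.0f} iterations/s")
```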

VDF Speed and Required Proving Hardware

Now, with a rough roadmap for the future GPU performance of Nova, we can start to evaluate whether it will be sufficient to meet the needs of the VDF use case. To do this we must first understand the speed of the VDF evaluator in order to determine how many CPUs or GPUs will be needed to generate the associated proofs. Below are estimates of the anticipated performance for a MinRoot VDF evaluator across a variety of platforms:

A brief look at the table above shows that there is a reasonable path to making the Nova VDF possible both today, and in the future, by using one or more GPUs, depending on how the amax parameter is set.

Making the VDF User Friendly

While the above table shows that there is a reasonable path to making the Nova VDF possible, another goal of the VDF project is to make running the VDF practical. Building a system with several GPUs is not only expensive, potentially costing thousands of dollars, but will also generate noise and heat while consuming over a kilowatt of power. With U.S. electricity prices averaging around $0.16/kWh, running such a machine could cost over $1,400 per year. In order to improve the cost and usability of the VDF system there are two obvious options:

  1. Improve the algorithmic performance of the Nova proof system through “academic” advances such as a more efficient proof system or a reduced number of circuit constraints
  2. Move the VDF proving process to alternative hardware architectures that offer better unit costs and power consumption

While this collaborative effort is pursuing both of these options, in this blog post we’ll focus on item #2 by evaluating the potential cost and performance of an ASIC-based architecture.

ASIC Background

ASICs are chips that are developed for a particular use; in this case, that use is generating Nova VDF proofs. In general, ASICs trade off the programmability available in CPUs, GPUs, and FPGAs for higher performance, lower unit costs, and lower power consumption. In the case of the Nova VDF, the ASIC will not provide any capabilities that existing hardware cannot already perform; rather, its purpose is to reduce the cost and improve the usability of the VDF system. In order to assess these benefits, we will compare the power, performance, and cost characteristics of a proposed ASIC to the anticipated GPU proof performance characteristics. Once again, while there are a variety of components required for Nova proof acceleration, for the purposes of this article we will discuss the core computational element, multi-scalar multiplication.

Potential MSM Accelerator Limitations

Before analyzing the potential performance of an ASIC, it is important to understand the limitations that can inhibit the maximum potential performance of an accelerator. In general there are four main constraints on accelerator performance:

  1. Compute — Current platforms like CPUs and GPUs can be compute limited when performing MSM. This means that while there is sufficient bandwidth, storage, and power, there are not enough computational units to saturate these other resources. For instance, multiplication arithmetic logic units (ALUs) are a core compute resource required for computing MSMs, and CPUs, GPUs, and FPGAs have only a limited number of these ALUs.
  2. Bandwidth — When designing an ASIC it is possible to include as many ALUs as desired. Once the ASIC has a sufficient amount of computation capabilities, the next limitation is ensuring that enough data can be sent to the ALUs to keep them working. MSM calculations require a large amount of data to be worked on, namely the scalar and point pairs. This data needs to be transferred from the host to the accelerator so data movement over interfaces such as PCI express (PCIe) can become the limitation. Furthermore, once the data is moved to the accelerator, bandwidth may also be needed to move data from on-accelerator storage resources such as SRAM or DDR, to the compute units.
  3. Storage — In order to reduce the bandwidth requirements to and from the ASIC it is often useful to be able to store data on the MSM accelerator chip or board. In particular for Pippenger’s algorithm, an algorithm used to compute an MSM, it is useful to store the points, scalars, and buckets directly on the device. Storage requirements will grow as the size of the MSM increases, as the field sizes increase (scalars and points), and as the bucket size increases.
  4. Power — Assuming the ASIC is designed with enough compute units to saturate the bandwidth and storage of the accelerator, the next bottleneck that can present itself is power, and power density. Power must be provided to the accelerator and heat must be removed to ensure continued operation of the accelerator. A bandwidth-limited ASIC may require more ALUs than can be effectively powered and cooled.

Theoretical Performance Limits — Bandwidth

Host <> Accelerator Interface
Platforms such as CPUs and GPUs are not currently limited by bandwidth when computing MSMs, but rather by an insufficient number of ALUs and, sometimes, by power and cooling constraints. The goal of an MSM accelerator would be to provide a sufficient number of ALUs such that bandwidth becomes the limiting factor. Below is a table showing the maximum possible throughput from a host to an accelerator when limited by external bandwidth:
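To illustrate how such a table is derived, the pair throughput of a link is simply its bandwidth divided by the size of one scalar/point pair. The sketch below assumes 32-byte scalars, 64-byte affine points (96 bytes per pair), and a nominal ~32 GB/s for PCIe v4 x16; all of these are illustrative assumptions:

```python
# Maximum scalar/point pairs per second deliverable over a host link,
# assuming 32-byte scalars and 64-byte affine points (96 bytes per pair).
PAIR_BYTES = 32 + 64

def pairs_per_second(link_bytes_per_second: float) -> float:
    return link_bytes_per_second / PAIR_BYTES

# Nominal PCIe v4 x16 throughput of ~32 GB/s:
print(f"{pairs_per_second(32e9):,.0f} pairs/s")
```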

Accelerator Internal Computation Bandwidth
While the table in the above section shows the maximum number of base/scalar pairs per second that can be transferred to the accelerator, additional bandwidth is required to perform the MSM itself. For example, when using Pippenger’s algorithm, each base needs to be accessed once per window: with 16-bit windows over a 256-bit scalar, that is 16 times per base. For the purposes of this article we will assume a window size of 16 bits and that the scalars can be stored in SRAM. Below is a table showing the ‘internal bandwidth’ requirements of an MSM accelerator, corresponding to the maximum performance estimates shown in the table above.

The amount of bandwidth required for these configurations greatly exceeds the maximum throughput of PCIe v4 (32GB/s) and therefore the MSM accelerator would require either SRAM, off-chip memory (DDR/HBM), or both, to meet the computational bandwidth needs.
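The internal-bandwidth requirement follows directly from the access pattern: with 16-bit windows over a 256-bit scalar there are 16 windows, and each base is read once per window. A sketch of that calculation, again assuming 64-byte affine points:

```python
# Internal read bandwidth needed to keep the MSM units fed: each 64-byte
# base is fetched once per Pippenger window.
SCALAR_BITS = 256
WINDOW_BITS = 16
POINT_BYTES = 64

def internal_bandwidth_bytes_per_s(bases_per_second: float) -> float:
    windows = -(-SCALAR_BITS // WINDOW_BITS)  # ceiling division -> 16 windows
    return bases_per_second * windows * POINT_BYTES

# e.g. at ~333M bases/second arriving over a PCIe v4 x16 link:
print(f"{internal_bandwidth_bytes_per_s(333e6) / 1e9:,.0f} GB/s")
```

At hundreds of gigabytes per second, this is indeed far beyond the ~32GB/s a PCIe v4 x16 link can supply, which is why on-accelerator SRAM or DDR/HBM is required.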

Multiplier ALU

With an understanding of the maximum amount of data that can be sent to the accelerator, we can begin to look at how many compute units would be needed to process the data. The core arithmetic building block of computing MSMs is finite field multiplication. Below is a table showing example characteristics of a 256-bit and 384-bit finite field multiplication ALU:

Storage Requirements

As mentioned above, storage will be required on the accelerator in order to reduce the amount of bandwidth needed between the host and the accelerator during the computation. Below are the minimum storage requirements for the points, scalars, and buckets assuming an MSM of size 2²⁶, a window size of 16, and 4 windows’ worth of buckets.

For the purposes of this article we will assume that the buckets will be stored in SRAM, co-located with the ALU, and the points and scalars will be stored in off-chip memory like DDR or HBM. While the total amount of off-chip memory storage required above is achievable (~8GB), additional storage may be required to meet the computational bandwidth requirements of the workload.
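The stated parameters make the ~8GB figure easy to sanity check; the 96-byte projective bucket representation below is an assumption:

```python
# Minimum storage for an MSM of size 2^26 with 16-bit windows.
MSM_SIZE = 2 ** 26
SCALAR_BYTES = 32
POINT_BYTES = 64          # affine: two 256-bit coordinates
BUCKET_BYTES = 96         # assumed projective: three 256-bit coordinates
WINDOW_BITS = 16
BUCKET_WINDOWS = 4        # windows' worth of buckets held concurrently

points_bytes = MSM_SIZE * POINT_BYTES
scalars_bytes = MSM_SIZE * SCALAR_BYTES
buckets_bytes = BUCKET_WINDOWS * (2 ** WINDOW_BITS) * BUCKET_BYTES

print(f"points : {points_bytes / 2**30:.1f} GiB (off-chip)")
print(f"scalars: {scalars_bytes / 2**30:.1f} GiB (off-chip)")
print(f"buckets: {buckets_bytes / 2**20:.1f} MiB (SRAM)")
```

The ~6 GiB off-chip total for points and scalars lands in the same ballpark as the ~8GB quoted above once overheads and additional working storage are included, while the bucket footprint is small enough to live in SRAM next to the ALUs.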

ASIC Power and Area

Given the bandwidth limitations, ALU performance, and an assumed additional utilization factor (50%), we can estimate the number of ALUs that would be required in a given ASIC configuration to saturate the interface bandwidth and then derive the resulting power and area:
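A rough version of that derivation: take the pair rate the interface can deliver, multiply by the field multiplications needed per base (one point addition per window), and divide by per-ALU throughput and the 50% utilization factor. Every per-ALU and per-addition number below is a placeholder, not a claim about the actual design:

```python
import math

# Rough ALU count needed to saturate an interface, with placeholder numbers.
WINDOWS = 16                  # 256-bit scalars, 16-bit windows
MULS_PER_POINT_ADD = 11       # assumed cost of one mixed point addition
ALU_MULS_PER_SECOND = 1e9     # assumed: one 256-bit modular mul/cycle @ 1 GHz
UTILIZATION = 0.5

def alus_required(pairs_per_second: float) -> int:
    muls_per_second = pairs_per_second * WINDOWS * MULS_PER_POINT_ADD
    return math.ceil(muls_per_second / (ALU_MULS_PER_SECOND * UTILIZATION))

# e.g. ~333M pairs/second delivered over a PCIe v4 x16 link:
print(alus_required(333e6))
```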

While many of the above power and area estimates are unlikely to be the optimal point for a single ASIC, the bandwidth of a host system can be maximized by spreading the ALUs across a number of ASICs and/or boards. These separate accelerators can then be connected into a single PCIe interface using a process called “PCIe bifurcation”. For instance, rather than having all of the ALUs on a single chip or board that are required to saturate a PCIe v4 x16 slot, the ALUs could be spread across four PCIe v4 x4 boards and connected into a single PCIe v4 x16 slot. However, it should be noted that PCIe bifurcation is only supported on some motherboards and can add cost and complexity to an accelerator deployment. Furthermore, cooling at a system level with this approach can also present its own challenges.

Example ASIC Architecture

There are many potential approaches to structuring the multi-scalar multiplication operation that trade off silicon area, PCIe bandwidth, DDR/HBM bandwidth, and power. One approach could look as follows:

In this architecture base points and scalars would be transferred from the host to the accelerator over PCIe. The transfer process could be managed directly by the host or by an on-chip DMA engine. Once on the accelerator, scalars and bases could stream directly into the off-chip memory of the accelerator. The buckets used in Pippenger’s algorithm could be instantiated and stored directly in SRAM co-located with the multiplier ALU. The predictable access pattern means prefetch would be effective at hiding the relatively long latency to off-chip memory. With all the elements in place, the MSM operation would start, processing one or more windows simultaneously depending on the available memory and bandwidth until all windows have been completed. Once this was complete, final accumulation would take place on the host. Individual window results could be streamed to the host on the fly or the entire result could be retrieved upon completion of the full operation.
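For readers unfamiliar with the bucket method referenced throughout, here is a minimal Pippenger-style sketch over plain integers (integer addition stands in for curve point addition, to which the same structure applies):

```python
# Windowed bucket (Pippenger-style) multi-scalar "multiplication": computes
# sum(s_i * p_i) one window at a time, mirroring the per-window bucket
# accumulation the accelerator would perform in SRAM.
WINDOW_BITS = 4  # small window for readability; the post assumes 16

def msm(scalars, points, scalar_bits=16):
    windows = scalar_bits // WINDOW_BITS
    result = 0
    for w in reversed(range(windows)):
        # shift the running result into the next window position
        result <<= WINDOW_BITS
        # accumulate points into buckets keyed by this window's digit
        buckets = [0] * (2 ** WINDOW_BITS)
        for s, p in zip(scalars, points):
            digit = (s >> (w * WINDOW_BITS)) & (2 ** WINDOW_BITS - 1)
            buckets[digit] += p
        # running-sum trick: sum(j * buckets[j]) using only additions
        running, window_sum = 0, 0
        for b in reversed(buckets[1:]):
            running += b
            window_sum += running
        result += window_sum
    return result

scalars, points = [3, 5, 250], [7, 11, 13]
print(msm(scalars, points))  # 3*7 + 5*11 + 250*13 = 3326
```

The per-window bucket array is exactly the structure the architecture above keeps in SRAM next to the multiplier ALU, and the final accumulation across windows corresponds to the work handed back to the host.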

Accelerator Development and Unit Costs

With an understanding of the silicon area, IP requirements, and board requirements, we can estimate the development and unit costs of an MSM ASIC. Below is a high-level estimate of the development costs to design and manufacture an MSM accelerator with on-board memory and PCIe connectivity.

In addition to development costs, below are the assumed unit costs to manufacture the various bandwidth-constrained accelerators. Each accelerator will need to be accompanied by sufficient off-chip memory to meet its storage and computational bandwidth requirements, as well as a cooling solution and enclosure. Factoring in the above information we arrive at an estimated unit cost for the various accelerators:

ASIC vs. GPU Cost Comparisons

After understanding the unit costs of the various ASIC configurations, we can analyze their relative performance against GPU-based acceleration.

As can be seen in the table above, moving to an ASIC-based architecture can increase the performance/dollar by a factor of 5 or more, while simultaneously improving the performance/watt by up to 10x.

Conclusion

SNARK-based VDFs are a promising technology for enabling unbiasable randomness in blockchains and beyond, and are rapidly becoming practical for production use. Furthermore, improvements made to the proof generation performance of Nova will enable not just the VDF use case, but also improved performance for zkVMs, proof-of-storage networks, and more. In a coming post we’ll take a deeper look into the MinRoot ASIC and system design. In the meantime, if you’d like to learn more about the Nova ecosystem check out these great talks:

Nova Overview by Srinath Setty

Nova Whiteboard Session by Justin Drake

Lurk, a zkVM built on Nova by Chhi’mèd Künzang

Hardware Acceleration of Lurk by Kelly Olson
