New Paradigm in Designing ZK-ASICs, the zkVM way

Cysic
Apr 9, 2024

We would like to thank Justin Drake and Luke Pearson for insightful discussion.

TL;DR

Real-time ZK proof generation requires end-to-end hardware acceleration. A zkVM not only simplifies ZK-ASIC design but also makes the hardware more performant and cost-effective.

Introduction

Zero-knowledge proof (ZKP) provides a means for one party, called the prover, to convince another party, called the verifier, that a statement is true without revealing any information beyond the validity of the statement itself. As Silvio Micali (a co-inventor of ZKP, along with Goldwasser and Rackoff) once put it: just as encryption garbles data, ZKP garbles computation. More concretely, an encryption algorithm such as AES or RSA converts data into a ciphertext, which hides the underlying data, whereas ZKP converts a computational statement into a proof, which not only hides the details of the computation but also attests to the validity of the statement.

ZKP has become one of the most widely used advanced cryptographic primitives because of two nice properties: zero-knowledgeness and succinctness. Zero-knowledgeness, as just mentioned, means the proof itself leaks no information about the computation process or the private inputs. This property is useful for building privacy-oriented applications such as Aleo, which, unlike Bitcoin, Ethereum and other public blockchains, hides transaction details. The second property, succinctness, refers to the small size of the proof and the short verification time: a complicated computation can be compressed into a tiny piece of data, called the proof, which can then be verified almost instantly on a weak machine such as a cellphone or even a Raspberry Pi. This property is enormously useful for scaling Ethereum, where the EVM computation corresponding to 1,000 transactions can be converted into a tiny proof and posted on Ethereum. If this tiny proof, perhaps 100 bytes, is verified on Ethereum, then all 1,000 transactions are finalized. Privacy-preserving blockchains and scaling solutions are just two examples of the ZKP fever in the blockchain community; many other ZK projects, such as ZK coprocessors, ZK bridges and ZK machine learning, are built on a combination of these two key properties.

ZK hardware acceleration, the current status

One big obstacle to wider deployment of ZKP is the huge demand for computation time and resources in the proof generation process. In general, a more complicated computation demands more time and more resources, such as compute threads and memory. For instance, in a ZKML project by Daniel Kang and his team, the proof generation for GPT-2 inference takes more than 9,000 seconds on a powerful 64-thread CPU. On the other hand, the proof generation for the ZK-EVM circuit in Scroll requires more than 280 GB of RAM. Due to these prohibitive resource requirements, the community is pursuing more effective hardware tailored to ZK computation (i.e., proof generation). The hardware options are CPU, GPU, FPGA and ASIC, ordered from immediate availability to the longest lead time. Among these options, the CPU is usually treated as the baseline implementation against which the other three are compared. There are two common metrics in hardware acceleration, with some intuitive explanation below:

  • Performance per dollar: This measures how much performance users get for the money spent on the hardware. The purchase decision depends on many factors, one of the most important of which is the maximum performance obtained for the purchase expense; in short, this metric measures the cost-effectiveness of the hardware. In general, taping out a chip on a more advanced process node yields higher performance, but it also tends to be more expensive.
  • Performance per watt: This measures how much performance the hardware delivers per unit of energy consumed. For instance, the latest Bitcoin miner, the T21 from Bitmain, computes one terahash with only about 19 joules (19 J/TH), better than the products from its competitors.
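To make these two metrics concrete, here is a small back-of-envelope sketch comparing two hypothetical provers on both axes. Every number in it (throughput, price, power draw) is an illustrative placeholder, not a measurement of any real device.

```python
# Back-of-envelope comparison of two hypothetical ZK provers.
# All numbers are illustrative placeholders, not real benchmarks.
devices = {
    # name: (proofs per second, purchase price in USD, power draw in watts)
    "gpu_rig": (2.0, 10_000, 1_200),
    "zk_asic": (8.0, 25_000, 600),
}

for name, (throughput, price, watts) in devices.items():
    perf_per_dollar = throughput / price      # proofs/s per USD of hardware
    perf_per_watt = throughput / watts        # proofs/s per watt consumed
    joules_per_proof = watts / throughput     # energy spent on one proof
    print(f"{name:8s} perf/$ = {perf_per_dollar:.2e} "
          f"perf/W = {perf_per_watt:.4f} J/proof = {joules_per_proof:.1f}")
```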

The advantages of a product mainly depend on the above two factors, plus some non-technical factors such as warranty and resale value. In general, thanks to being tailor-made, ASIC-based hardware outperforms the other three options on both metrics. So let's imagine that a ZK-specialized ASIC has been taped out, with much better performance per dollar and per watt than existing options such as GPUs and FPGAs. This ASIC can support various modules, like multi-scalar multiplication (MSM), number-theoretic transform (NTT), Merkle tree and so on. But how can we integrate this piece of hardware with the current tech stack? The most common approach is to replace the corresponding computation in the CPU code with its accelerated counterpart, which usually does not achieve a satisfying speedup. We published our findings on this simple substitution-based approach at EthCC '23 (more details are in this tweet).

CPU and FPGA/CPU Performance Comparison

Compared with the CPU baseline, we achieved substantial progress using a combination of a CPU and our customized FPGA machine, but the performance is still far from the endgame: real-time ZK proving. This subpar performance is due to Amdahl's law and the interaction cost between different hardware components. Amdahl's law states that the overall performance improvement gained by optimizing a single part of a system is limited by how much that part is actually used (paraphrasing Wikipedia), and the communication cost between different hardware modules worsens the situation. To achieve a much higher speedup over the CPU, every significant component needs to be accelerated on a single piece of hardware. However, this seems impossible due to the variety of ZK algorithms (by ZK algorithms, we mean the computational operations in ZK proof generation). For instance, the Twitter screenshot above shows three ZK circuits (Poseidon hash, EVM and GPT-2) for the same proof backend (Halo2-KZG), where the computations differ a lot, especially in the witness generation part. The screenshot does not even cover different proof backends, such as Plonky2/3 and Gnark.
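To see how sharply Amdahl's law bites, consider a short calculation. It assumes, purely for illustration, that the accelerated kernels (MSM, NTT, etc.) account for 70% of the prover's runtime and get a 50x speedup, while the remaining 30% (witness generation, data movement) stays on the CPU; neither fraction is a measured number.

```python
def amdahl_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the runtime
    is sped up by `local_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)

# Illustrative numbers: 70% of prover time in MSM/NTT kernels, accelerated 50x.
print(amdahl_speedup(0.70, 50))    # ~3.2x overall despite the 50x kernel speedup
print(amdahl_speedup(0.70, 1e9))   # ~3.3x: the untouched 30% caps the gain
```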

The point we want to make here is that the hardware needs to be general enough to accommodate the various on-chip computational operations required by ZK algorithms. This generality can be achieved via a hybrid structure of FPGA and ASIC, as we proposed in this tweet back in 2022:

FPGA-ASIC Hybrid Structure

In this hybrid structure, the ASIC performs the common operations, while the FPGA handles the circuit-specific computations. The two devices are placed on the same PCB and connected via high-bandwidth SerDes channels. Alternatively, an on-chip CPU core, RISC-V based or ARM based, can be used for a similar purpose. These hybrid approaches work in general, but they impose extremely high requirements on cost and manufacturing quality. Over the past half year, we have been asking ourselves all the time:

Is the hybrid approach the best structure we can think of? Can we rely on any technical progress from the ZK community to improve our design?

In the following, we give positive answers to the above questions.

Cysic’s approach, the past and the present

Before diving into the technical details, we would like to cover some necessary preliminaries on ZKP. A typical proof generation process for a Plonkish proof system can be divided into the following phases (for a detailed anatomy of proof generation, please read this blog by Scroll):

  1. Write out the witness: The witness, also known as the trace, is data that, along with other data, shows why a statement is true. It is written into a 2-dimensional matrix called the trace table, where every entry is an element of a finite field. The process of filling the trace table is called witness generation: it requires iterating over each cell in the table and filling in the right value. This process involves finite-field arithmetic and is tailored to the specific ZK circuit.
  2. Commit to the witness: After witness generation, we obtain a trace table in which each column is interpreted as a polynomial via Lagrange interpolation. Different commitment schemes, such as KZG and FRI, can then be used to commit to these polynomials. The main computation involved here includes MSM, (I)NTT, polynomial evaluation and Merkle trees. This is the bottleneck of proof generation, since the computation is performed over a large finite field and the amount of data involved is massive.
  3. Prove that the witness is correct: Now that the trace table is filled and the commitments are computed, the only thing left is to show that the original trace is valid, meaning that certain constraints are satisfied. The computation here includes (I)NTT, MSM and polynomial evaluations. A toy end-to-end example of these three phases is sketched below.
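To make the three phases concrete, here is a deliberately tiny sketch: a Fibonacci-style trace over a toy prime field, a hash standing in for the polynomial commitment, and a row-by-row constraint check. The field size, the hash-based "commitment" and the single constraint are simplifications for intuition, not a faithful Plonkish prover.

```python
import hashlib

P = 97  # toy prime field; real systems use 31/64/256-bit fields

# Phase 1: write out the witness -- fill the trace table column by column.
# Columns a, b hold a Fibonacci-style sequence: a[i+1] = b[i], b[i+1] = a[i] + b[i].
n = 8
a, b = [1], [1]
for i in range(n - 1):
    a.append(b[i])
    b.append((a[i] + b[i]) % P)
trace = {"a": a, "b": b}

# Phase 2: commit to the witness. A real prover interpolates each column into a
# polynomial and commits with KZG or FRI; a hash stands in for that step here.
def toy_commit(column):
    return hashlib.sha256(bytes(column)).hexdigest()[:16]

commitments = {name: toy_commit(col) for name, col in trace.items()}

# Phase 3: prove the witness is correct -- every row satisfies the transition
# constraint (a real prover expresses this as polynomial identities).
assert all(a[i + 1] == b[i] and b[i + 1] == (a[i] + b[i]) % P
           for i in range(n - 1))
print(commitments)
```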

In summary, the computation in proof generation consists of several common modules, such as MSM, NTT, Merkle tree and polynomial evaluation, as well as some miscellaneous circuit-specific pieces. In our previous blog, we showed some high-level strategies for optimizing these common modules. Over the past years, the community has also presented promising techniques for accelerating them (see the write-ups by Ulvetanna, Ingonyama and other teams), so we will not repeat those techniques here. In general, these modules are no longer the performance bottleneck, but end-to-end proof acceleration remains far from satisfying. Such a half-baked accelerator can be seen as a specialized version of a GPU with some performance gain. A rough comparison is as follows:

  • Win: In addition to the conventional GPU-style SIMD/SIMT parallel computing model, there is specialized support for ZK computations. This allows us to implement ZK operations at full performance without relying on cutting-edge CUDA programming skills (such as hand-writing large-integer arithmetic in CUDA).
  • Draw: Programming complexity
  • For the accelerator, we provide a high-level programming model similar in style to PyTorch in AI, with the goal of a coding experience that feels "as if directly translated from the paper" when part of the prover is placed on the accelerator. Although we expose flexible scheduling and control capabilities at the hardware level, using them requires an understanding of the underlying hardware design.
  • For GPUs, users have relatively complete freedom of control when programming directly in CUDA and can perform arbitrary optimizations. But this also means they have to build everything from scratch.

Obviously, this half-baked accelerator achieves neither optimal end-to-end proof acceleration nor a user-friendly programming interface. We definitely need a new ingredient in our soup, and that ingredient is the zkVM!

zkVM Intro

The virtual machine is an old topic in computer science: it is basically a program that can run other programs. The Ethereum Virtual Machine (EVM), for example, runs Ethereum smart contracts, with the supported instructions specified in the yellow paper. Since a ZK proof system deals with circuits, a zkVM is a circuit that can run a sequence of supported instructions. In addition to the execution result, the zkVM also outputs a proof showing that the execution trace of the VM corresponding to that sequence of instructions is valid. Simply put, a zkVM is a ZK circuit that runs a VM (as summarized in David Wong's post). There are two parts of the zkVM design that deserve consideration:

  • The supported instruction set: This is the set of operations the VM can perform. There are several established players in the space, such as Risc0, Succinct, Starknet, Polygon, Metis and others, which work with different instruction sets, like RISC-V, MIPS or customized instruction sets.
  • The ZK architecture: This part concerns the ZK proofs generated along with the execution result. The ZK architecture is almost agnostic to the underlying VM design, but there are still some subtle trade-offs to consider. A toy VM and its execution trace are sketched below.
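To make "a circuit that runs a VM" less abstract, the sketch below executes a few instructions of a made-up three-register instruction set and records the execution trace (one snapshot of machine state per step) that a zkVM would have to prove. The instruction names are hypothetical; they are not RISC-V or the ISA of any production zkVM.

```python
# A toy VM: three registers, a made-up mini ISA, and an execution trace.
# A zkVM proves that such a trace follows the ISA's transition rules;
# the ISA here is purely illustrative.
def run(program):
    regs = {"r0": 0, "r1": 0, "r2": 0}
    trace = []
    for pc, (op, dst, x) in enumerate(program):
        if op == "li":      # load immediate: dst = x
            regs[dst] = x
        elif op == "add":   # dst = dst + regs[x]
            regs[dst] += regs[x]
        elif op == "mul":   # dst = dst * regs[x]
            regs[dst] *= regs[x]
        trace.append((pc, op, dict(regs)))  # snapshot after each step
    return regs, trace

program = [("li", "r0", 3), ("li", "r1", 4), ("add", "r0", "r1"), ("mul", "r0", "r0")]
regs, trace = run(program)
print(regs["r0"])   # 49
for row in trace:
    print(row)
```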

There is a nice feature in zkVM design called continuation (from Risc0). In zkVM execution, continuation is a mechanism for splitting a large program into several small segments that can be computed and proven independently, as shown in the picture below:

The Segmentation Process, from Risc0

This feature is hardware-friendly for the following reasons:

  • Parallelism: Because the sliced segments are independent, they can be distributed to multiple hardware units to generate the corresponding proofs simultaneously.
  • Minimal IO bandwidth requirement: Proof generation for a zkVM follows a "small-in-small-out" pattern. For instance, in Risc0, a segment fed into proof generation is around 50 MB and the output is a FRI-based proof of roughly 250 KB. This pattern greatly reduces the IO bandwidth requirement (a back-of-envelope estimate follows after this list).
  • Controlled memory requirement: Although the input and output of each proof generation kernel are small, the memory requirement is larger, in the range of tens of GB. However, the required memory depends on the segment size, which is adjustable in the design of the zkVM.
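Using the rough Risc0 figures quoted above (a ~50 MB segment in, a ~250 KB proof out), a quick calculation shows how modest the IO requirement is; the per-segment proving time below is an assumed placeholder, not a benchmark.

```python
# Back-of-envelope IO estimate for continuation-style proving, using the rough
# figures quoted above. The proving time per segment is an assumption.
segment_in_bytes = 50 * 1024 * 1024   # ~50 MB segment
proof_out_bytes = 250 * 1024          # ~250 KB proof
proving_time_s = 10.0                 # assumed seconds to prove one segment

io_per_core = (segment_in_bytes + proof_out_bytes) / proving_time_s
print(f"sustained IO per proving core: {io_per_core / 1e6:.1f} MB/s")

# Even many cores proving independent segments in parallel need modest aggregate IO.
for cores in (1, 8, 64):
    print(f"{cores:3d} cores -> {cores * io_per_core / 1e9:.2f} GB/s aggregate")
```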

Based on these hardware-friendly properties, we describe our hardware design below:

The zkVM-based hardware design

The architecture of the system is rather simple: an executor is in charge of executing the program, controlling the hardware and distributing the segments, and a configurable number of specialized chips generate a ZK proof for each segment.

This simple architecture allows our hardware to take flexible forms. For instance, we can pack the executor (a weak CPU or an on-chip CPU core), a certain number of zkVM chips and the other necessary hardware components (such as memory) into a chassis. A simpler configuration packs a few chips into a portable device the size of a MacBook charger. A sketch of the executor distributing segments to the chips follows below.
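The control flow can be pictured as a simple work queue: the executor runs the program, cuts the resulting trace into segments, and hands each segment to whichever proving unit is free. The sketch below stands in for that flow with a process pool; the byte-string "trace", the hash "proof" and the chip count are all illustrative placeholders.

```python
# Sketch of the executor distributing segments to a configurable pool of
# proving units. Processes stand in for zkVM chips; `prove_segment` is a
# placeholder for the real per-segment proving kernel.
from multiprocessing import Pool
import hashlib

def prove_segment(segment: bytes) -> bytes:
    # Placeholder "proof": a real chip would run the full zkVM prover here.
    return hashlib.sha256(segment).digest()

def executor(program_trace: bytes, segment_size: int, num_chips: int):
    segments = [program_trace[i:i + segment_size]
                for i in range(0, len(program_trace), segment_size)]
    with Pool(processes=num_chips) as pool:
        return pool.map(prove_segment, segments)   # one proof per segment

if __name__ == "__main__":
    trace = bytes(range(256)) * 1024               # stand-in execution trace
    proofs = executor(trace, segment_size=64 * 1024, num_chips=4)
    print(len(proofs), "segment proofs")
```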

The zkVM hardware consists of several computing cores:

  • A programmable vector machine for vectorized operations.
  • Specialized NTT modules for 31-bit, 64-bit and 256-bit fields (a toy NTT is sketched after this list).
  • Specialized MSM modules for the BN254, BLS12-377 and BLS12-381 curves.
  • A configurable hash unit for hash functions based on field operations (such as Poseidon).
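As one example of the mathematics such an NTT module implements, the sketch below runs a textbook radix-2 NTT over the 31-bit BabyBear prime (p = 15·2^27 + 1, used by several zkVM provers) and cross-checks it against naive polynomial evaluation. It illustrates the algorithm only, not the chip's actual datapath.

```python
# Textbook radix-2 Cooley-Tukey NTT over the 31-bit BabyBear prime,
# checked against naive evaluation. Math illustration only, not the hardware.
P = 15 * 2**27 + 1   # 2013265921, an NTT-friendly 31-bit prime

def primitive_root_of_unity(n):
    """Return w with w^n = 1 and w^(n/2) != 1 (n a power of two dividing P-1)."""
    assert (P - 1) % n == 0
    for g in range(2, P):
        w = pow(g, (P - 1) // n, P)
        if pow(w, n // 2, P) != 1:
            return w

def ntt(coeffs, w):
    """Evaluate the polynomial with coefficients `coeffs` at powers of w."""
    n = len(coeffs)
    if n == 1:
        return coeffs[:]
    even = ntt(coeffs[0::2], w * w % P)
    odd = ntt(coeffs[1::2], w * w % P)
    out = [0] * n
    for k in range(n // 2):
        t = pow(w, k, P) * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
    return out

coeffs = [3, 1, 4, 1, 5, 9, 2, 6]   # polynomial coefficients, low degree first
w = primitive_root_of_unity(len(coeffs))
evals = ntt(coeffs, w)

# Cross-check against naive O(n^2) evaluation at the same points.
naive = [sum(c * pow(w, i * j, P) for j, c in enumerate(coeffs)) % P
         for i in range(len(coeffs))]
assert evals == naive
print(evals)
```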

In addition to the nice properties brought by the zkVM, this new paradigm of designing ZK-ASICs also translates into great products for individuals and enterprises, as shown below:

ZK ASIC products

Call for collaborations

This zkVM hardware project aims to build performant and cost-effective hardware for wide-ranging use cases, with both performance and development productivity in mind. We seek varied perspectives, innovative ideas, and unwavering dedication to improving and broadening our hardware design. We look forward to the community's input and are ready to provide guidance and support to anyone who wants to get involved.
