Unlocking Efficiency: How X Layer Optimized Polygon zkEVM Prover for GPU Acceleration

X Layer · Jun 21, 2024

Introduction

The zkEVM prover is a key component of Polygon’s zk-rollup framework that enables scalable and low-cost Ethereum transactions. ZK-rollups work by aggregating transactions into batches and securing them with zero-knowledge proofs.

The Polygon team has performed significant optimizations on the zkEVM prover, achieving good prover performance. However, these optimizations have primarily targeted CPU computation: multithreading, AVX, AVX512, and so on. The prover contains many computations with high parallelism that are well suited to GPU acceleration, which is why we have been working on optimizing the Polygon zkEVM prover for GPUs.

Here is some of the progress we have made:

Profiling

Here is a detailed analysis of the prover's function call stack:

--29.04%--PoseidonGoldilocks::merkletree_avx
|
--29.04%--PoseidonGoldilocks::linear_hash
|
--28.76%--PoseidonGoldilocks::hash_full_result
|
|--9.34%--Goldilocks::mmult_avx_4x12_8
|
|--4.30%--PoseidonGoldilocks::pow7_avx
|
--2.42%--Goldilocks::mmult_avx_4x12

--10.31%--NTT_Goldilocks::NTT_iters
|
--0.75%--0x7f0ce7fa077b

--4.37%--ZkevmSteps::step52ns_parser_first_avx
|
--2.13%--Goldilocks3::mul_avx

Therefore, we can see that the three steps causing the greatest performance overhead are Merkle Tree building, the NTT (Number-Theoretic Transform), and step52ns_parser_first_avx.

(Profiling environment: GCP n2d-standard-224 VM, Ubuntu 22.04, AMD Rome CPU)

Optimization Approaches

Given an execution trace matrix with r rows and c columns, where each column represents a polynomial, the Merkle Tree root is computed in four steps:

  1. perform an INTT on each column to recover its coefficient form
  2. extend each column by the blowup factor b and perform an NTT to evaluate the extended polynomial
  3. hash every row (data block) to form the Merkle Tree leaves
  4. hash the Merkle Tree branches into the Merkle Tree root

For the largest matrix in zkEVM, the number of rows r is 2^23, the number of columns c is 751, and the blowup factor b is 2. Each matrix element comes from the Goldilocks finite field, which uses 64 bits (8 bytes) per element. Therefore, the total memory required is 751 * 2^23 * 2 * 8 bytes ≈ 94 GB.
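As a sanity check on that figure, here is a tiny standalone C++ snippet (not prover code) that recomputes the footprint of the extended matrix:

#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t rows   = 1ULL << 23; // r: rows in the execution trace
    const uint64_t cols   = 751;        // c: columns (polynomials)
    const uint64_t blowup = 2;          // b: blowup factor
    const uint64_t elem   = 8;          // bytes per Goldilocks element
    const uint64_t bytes  = cols * rows * blowup * elem;
    printf("extended matrix: %.1f GiB\n", bytes / (double)(1ULL << 30)); // ~93.9 GiB
    return 0;
}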

NTT Optimization

We partition the entire matrix by columns because the NTT can be performed independently on each column of data. With 8 GPUs, we can partition the entire matrix into 8 parts, each holding 94 columns (and the last part holding 93 columns). Each GPU then sequentially carries out the INTT and NTT transformations on its columns to derive the extended polynomials, as sketched below.
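A minimal sketch of this column-wise split, assuming the matrix is stored column-major on the host; ntt_extend_columns is a hypothetical stand-in for the prover's INTT/extension/NTT kernels:

#include <cuda_runtime.h>
#include <cstdint>

// hypothetical helper: runs INTT + extension + NTT on a range of columns
void ntt_extend_columns(uint64_t* cols, int ncols, uint64_t nrows);

void extend_matrix_multi_gpu(uint64_t* matrix, int total_cols /* 751 */) {
    const uint64_t nrows = 1ULL << 23;
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);                    // e.g. 8
    int per_gpu = (total_cols + ngpu - 1) / ngpu; // 94 columns per GPU
    for (int g = 0, col = 0; g < ngpu; ++g, col += per_gpu) {
        int ncols = (col + per_gpu <= total_cols) ? per_gpu
                                                  : total_cols - col; // last: 93
        cudaSetDevice(g); // subsequent kernels/copies target GPU g
        ntt_extend_columns(matrix + (uint64_t)col * nrows, ncols, nrows);
    }
}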

The most famous algorithm for fast Fourier transforms and NTT is the Cooley–Tukey algorithm. This method leverages the symmetries of the nᵗʰ roots of unity to compute the transformation in a logarithmic number of loops over the vector.

https://www.irreducible.com/posts/fpga-architecture-for-goldilocks-ntt
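For intuition, here is a minimal host-side reference implementation of the iterative Cooley–Tukey NTT over the Goldilocks field. This is a sketch for clarity, not the prover's optimized kernel: it assumes root is a primitive n-th root of unity and uses the compiler's __uint128_t type for the modular arithmetic.

#include <cstdint>
#include <vector>
#include <utility>

static const uint64_t P = 0xFFFFFFFF00000001ULL; // Goldilocks: 2^64 - 2^32 + 1

static uint64_t addmod(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a + b) % P);
}
static uint64_t submod(uint64_t a, uint64_t b) {
    return a >= b ? a - b : a + (P - b);
}
static uint64_t mulmod(uint64_t a, uint64_t b) {
    return (uint64_t)((__uint128_t)a * b % P); // simple, unoptimized
}

// In-place NTT of a (size n, a power of two); root is a primitive n-th root of unity.
void ntt(std::vector<uint64_t>& a, uint64_t root) {
    const size_t n = a.size();
    // bit-reversal permutation so the butterflies can run in place
    for (size_t i = 1, j = 0; i < n; ++i) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // log2(n) passes over the vector, as in any Cooley–Tukey FFT
    for (size_t len = 2; len <= n; len <<= 1) {
        uint64_t wlen = root; // square down to a primitive len-th root
        for (size_t l = len; l < n; l <<= 1) wlen = mulmod(wlen, wlen);
        for (size_t i = 0; i < n; i += len) {
            uint64_t w = 1;
            for (size_t j = 0; j < len / 2; ++j) {
                uint64_t u = a[i + j];
                uint64_t v = mulmod(a[i + j + len / 2], w);
                a[i + j] = addmod(u, v);
                a[i + j + len / 2] = submod(u, v);
                w = mulmod(w, wlen);
            }
        }
    }
}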

Since Goldilocks computations within a finite field are modular operations, an efficient implementation of modular arithmetic relies on assembly language. However, due to the differences in instruction set architectures between CPUs and GPUs, we needed to reimplement the basic modular operations on Goldilocks elements for the GPU instruction set in order to optimize these computations on GPUs. Our implementation uses the Goldilocks field CUDA optimizations from Supranational*.

*https://github.com/supranational/sppark/blob/main/ff/gl64_t.cuh
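To illustrate what such a GPU reimplementation looks like, here is a hedged sketch of 64-bit Goldilocks multiplication as a CUDA device function. It uses the identities 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p) to fold the 128-bit product back into 64 bits; sppark's gl64_t implements this kind of reduction with optimized inline PTX rather than the plain C below.

#include <cstdint>

#define GL_P 0xFFFFFFFF00000001ULL // p = 2^64 - 2^32 + 1

__device__ __forceinline__ uint64_t gl_mul(uint64_t a, uint64_t b) {
    uint64_t lo = a * b;              // low 64 bits of the 128-bit product
    uint64_t hi = __umul64hi(a, b);   // high 64 bits
    uint64_t hh = hi >> 32;           // product = hh*2^96 + hl*2^64 + lo
    uint64_t hl = hi & 0xFFFFFFFFULL;

    uint64_t t = lo - hh;             // 2^96 ≡ -1 (mod p)
    if (lo < hh) t -= 0xFFFFFFFFULL;  // borrow wrapped by 2^64 ≡ 2^32 - 1
    uint64_t m = hl * 0xFFFFFFFFULL;  // hl * (2^32 - 1); fits in 64 bits
    uint64_t r = t + m;               // 2^64 ≡ 2^32 - 1 (mod p)
    if (r < t) r += 0xFFFFFFFFULL;    // carry wrapped by 2^64 ≡ 2^32 - 1
    if (r >= GL_P) r -= GL_P;
    return r;
}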

Merkle Tree Optimization

First, we need to perform the Poseidon hash on rows, so we need to split the entire matrix row-wise. However, the NTT computation splits the data column-wise. We therefore have two options for re-splitting (transposing) the data:

  1. Copy all data from devices to host and re-split it, then copy it back to devices.
  2. Leverage peer-to-peer copy between devices to effectively re-split the data.

Generally speaking, a peer-to-peer copy is more efficient than a device-host copy. However, by using cudaMallocHost to allocate pinned host memory, we significantly increased the efficiency of method 1. Method 1 also requires no extra CUDA memory, unlike method 2. Another reason is that there is insufficient GPU memory for a device-to-device (peer-to-peer) copy: the receiving GPU needs two buffers for the partial matrix (one for its own matrix partition and one for incoming data). These two buffers need 2 * 94 * 2^24 * 8 bytes ≈ 23.5 GB. Since other data structures are also allocated, and the RTX 4090 GPU has only 24 GB of RAM, there is not enough space for these buffers to perform a peer-to-peer copy*.

*https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gab84100ae1fa1b12eaca660207ef585b
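A minimal sketch of method 1, assuming 8 GPUs each hold one column slice of the extended matrix and slice_bytes is the size of one slice (the buffer layout is illustrative, not the prover's exact code):

#include <cuda_runtime.h>
#include <cstdint>

void regroup_via_host(uint64_t* d_slices[8], size_t slice_bytes) {
    uint64_t* h_staging = nullptr;
    // pinned (page-locked) host memory makes device<->host copies much faster
    cudaMallocHost((void**)&h_staging, 8 * slice_bytes);

    // 1) gather every GPU's column slice into the pinned staging buffer
    for (int g = 0; g < 8; ++g) {
        cudaSetDevice(g);
        cudaMemcpy(h_staging + g * (slice_bytes / sizeof(uint64_t)),
                   d_slices[g], slice_bytes, cudaMemcpyDeviceToHost);
    }

    // 2) re-split on the host and hand each GPU a range of *rows* instead
    //    (the transpose/reindexing and re-upload are omitted for brevity)

    cudaFreeHost(h_staging);
}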

Furthermore, we need to perform independent Poseidon hash calculations on 2²⁴ rows. This offers an extremely high degree of parallelism, allowing GPUs to fully leverage their parallel computing capabilities.
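A hedged sketch of this row-parallel layout, with one thread per row: poseidon_hash_row is a hypothetical device function standing in for the prover's PoseidonGoldilocks linear hash, the matrix is assumed row-major after the re-split, and each leaf is a 4-element digest.

#include <cstdint>

// hypothetical device function: Poseidon linear hash of one row
__device__ void poseidon_hash_row(const uint64_t* row, int ncols, uint64_t out[4]);

__global__ void hash_rows(const uint64_t* matrix, int ncols,
                          uint64_t* leaves, uint64_t nrows) {
    uint64_t row = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (row >= nrows) return; // 2^24 independent rows: massive parallelism
    poseidon_hash_row(matrix + row * ncols, ncols, leaves + row * 4);
}

// launch sketch: hash_rows<<<(nrows + 255) / 256, 256>>>(d_matrix, 751, d_leaves, nrows);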

As for hashing the Merkle Tree branches into the Merkle Tree root, this process does not take long, so we use a single GPU to calculate it.

https://www.irreducible.com/posts/poseidon-merkle-trees-in-hardware

step52ns_parser_first_avx Optimization

Purely from a computational perspective, this step can be parallelized. However, each computation thread needs to read data from arbitrary positions in the entire matrix, which means we cannot partition the data across GPUs. We have therefore temporarily set aside optimizing this step.

Other Optimizations

GPU acceleration of any stage involves three key steps:

  1. Copy the data from the host (CPU) to the device (GPU)
  2. Perform the computation on the device
  3. Copy the computed results back from the device to the host

While computational efficiency is improved through parallelization, the data transfer overheads are non-negligible.

Using pinned memory accelerates memory copy operations to some extent. We just need to use cudaMallocHost instead of malloc to allocate the memory*.

*https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gd228014f19cc0975ebe3e0dd2af6dd1b
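A minimal before/after sketch of this change (sizes and names are illustrative):

#include <cuda_runtime.h>
#include <cstdint>

int main() {
    const size_t bytes = 1ULL << 30;
    uint64_t *h_buf = nullptr, *d_buf = nullptr;

    // before: h_buf = (uint64_t*)malloc(bytes);  // pageable, staged copies
    // after: pinned (page-locked) memory, so DMA can transfer directly
    cudaMallocHost((void**)&h_buf, bytes);
    cudaMalloc((void**)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // pinned memory also enables truly asynchronous copies that can
    // overlap with kernel execution
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream);
    return 0;
}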

Benchmarks

About X Layer

X Layer is a zkEVM Layer 2 network that connects the OKX and Ethereum communities, allowing anyone to participate in a truly global on-chain ecosystem.

Connect with us

Website | Twitter | LinkedIn | Discord

Disclaimer: This article/blog is provided for informational purposes only. It represents the views of the author(s) and does not represent the views of OKX. It is not intended to provide (i) investment advice or an investment recommendation; (ii) an offer or solicitation to buy, sell, or hold digital assets; or (iii) financial, accounting, legal, or tax advice. Digital asset holdings, including stablecoins and NFTs, involve a high degree of risk, can fluctuate greatly, and can even become worthless. You should carefully consider whether trading or holding digital assets is suitable for you in light of your financial condition. The products and services referred to here may not be available in your region. Please refer to the OKX Terms of Service for more details.
