The end game of MSM-batch affine in elliptic curve point addition
Since my idea about hardware implementation of batch affine in elliptic curve point addition, which originally came from Aztec barretenberg library, a lot of things have passed.The Zprize 23 winners, Superscalar and Snarkify, particularly in the FPGA track, all used affine points as their coordinates.
Benefits
The reason is simple: it uses the minimum amount of modular multiplications, for all curves. If you took all the costs into consideration, such as tree compression and decompression, the amortized cost per point addition is 6, with one shared modular inversion module. In comparison, twisted Edwards coordinate for Aleo bls12–377 needs 6 modular multiplications and for Projective or Jacobian coordinate for bn256 and bls12–381 curves, it needs 12.
What’s more, the storage per point in buckets is 2 field elements, compared with 4 in twisted Edwards and 3 in Projective or Jacobian. It means the i/o for each process unit are 6 field elements, compared with 12 in twisted Edwards and 9 in Projective or Jacobian.
And if you go for ASIC, there is no complex placement and routing for such a batched affine point addition module. It resembles the following image. I’ve deliberately decoupled and removed the control module because it operates independently with different optimization strategies.
If you go to Ingonyama’s blog about their implementation with Snarkify Niall, you could find their detailed explanation about compression, etc. They have different data widths between modules, such as the inversion module. It takes 3*336 Fq and compresses them to 3*42, which is equivalent to 8/1 compression, a tree with height 3. It then uses another tree to compress from 126 to 4, achieving approximately 32:1 compression with a height of 5. After that, it simultaneously inverts 4 Fq values. (Well, need to dig into the code if I have some time). So the overal design is pretty similar, but the details are differnt because of their inverter or the resources of the FPGA. What I showed, is the idea data path for ASIC, the end game.
Critical part - the modular inversion
The point addition module is very simple; everybody just follows the formula.
What really matters is the modular inversion module.
The first part is the algorithm. One way is to build with the low-latency modular multiplier as in VDF. But there are still choices out there. I got my secret ingredient.
The second part is about where you put it.
strategy1:
Put the inversion in each chip. To maximize sharing among point addition modules (PAs), many PAs are placed on a single chip, which increases the chip area and reduces manufacturing yield. If you put just 4 or 8 PAs in the chip, it doesn’t reduce the cost too much.
strategy2:
Put the inversion in just one chip. Since all the chips must have the same design, this is also not possible. Besides, the chips need to communicate with each other at high bandwidth, as follows:
strategy3:
If you are interested in building one ASIC with MSM, please contact me and we can have a discussion.
Build an ASIC costs so much capital that
- It requires a very clear market, or so-called product-market-fit (PMF).
- It requires knowledge of existing and potential state-of-the-art design to remain competitive in the next 2–3 years.
Anyway, this is a long journey, and everybody must stay patient.