The fastest formulas for elliptic curve operations were published by Hisil, Wong, Carter, and Dawson in their 2008 paper Twisted Edwards Curves Revisited. Their paper also describes a parallel version of their formulas, designed to execute four streams of instructions on four independent processors. Until now, these parallel formulas don’t seem to have been implemented in software. But a closer look reveals that slightly modifying the formulas allows the expensive instructions to be executed in uniform, making a vectorized SIMD implementation possible.
I implemented this strategy in Rust, targeting 256-bit wide AVX2 operations. The resulting implementation performs double-base scalar multiplication faster than other Ed25519 implementations I tested, and is even faster than FourQ without endomorphisms:
More importantly, when combined with Ristretto, the result is the fastest prime-order group ever, allowing faster implementations of complex zero-knowledge protocols.
For example, using it for Bulletproofs gives a 2x speedup (compared to the
libsecp256k1 implementation on the same hardware), and using it for the algebraic-MAC anonymous credentials of Chase, Meiklejohn, and Zaverucha gives a 20x speedup (compared to their paper’s timings, which used OpenSSL’s P-256 on Sandy Bridge).
And, since our library is intended as a reusable, mid-level building block, these improvements benefit every protocol implementation using it.
The following three sections describe this strategy and its implementation:
- Parallel Edwards Formulas briefly describes the parallel formulas and how to modify them to make them SIMD-friendly;
- Implementing Parallel Formulas with AVX2 describes how the formulas are implemented;
- and Obstacles to Parallelism describes the tradeoffs between this strategy and the existing serial one, as well as some future directions.
Parallel Edwards Formulas
The Twisted Edwards Curves Revisited paper describes 4-way parallel versions of their formulas, as four instruction streams for four independently executing processors. This isn’t practical to implement in software, since the cost of synchronization far outweighs the cost of the parallelized operations: a field multiplication costs as little as tens of cycles, while a CPU context switch costs at least tens of thousands of cycles.
But a closer inspection reveals that the (more expensive) multiplication and squaring steps are uniform, while the instruction divergence occurs in the (much cheaper) addition and subtraction steps. This means that a SIMD implementation can perform the expensive steps uniformly, and handle divergence in the inexpensive steps using masking.
The exact formulas are described in detail in my notes, but the following diagram shows the (re)addition formulas for adding two points
P_1 = (X_1:Y_1:Z_1:T_1) and
P_2 = (X_2:Y_2:Z_2:T_2).
When performing scalar multiplications, we often want to perform “readdition”: repeatedly adding the same
P_2 to varying points
P_1. For example, fixed-window scalar multiplication builds a lookup table of a few small multiples of the input point, and repeatedly adds table entries to an accumulator. In this case, we can save the last entry inside the Precomputation box, so that readdition costs only two multiplications. (This precomputation was described in the Ed25519 paper, but is actually due to Hisil, Wong, Carter, and Dawson).
There are also parallel formulas for doubling, which cost one multiplication and one squaring. However, in contrast to the parallel addition formulas, the pattern of additions and subtractions between multiplications is much less regular, causing more vectorization overhead. Worse, the coefficents grow larger, requiring some finesse to avoid overflows.
Implementing Parallel Formulas with AVX2
The formulas above are written and implemented in terms of a vector of four field elements, provided by a
FieldElement32x4 type. The
FieldElement32x4 is implemented using AVX2 operations on 256-bit vectors, but abstracts away the platform-specific details. This means that point operations can be implemented only in terms of vectors of field elements, making the code much cleaner. For instance, the implementation of the readdition formulas drawn above looks like:
The internals of the AVX2 code are oriented around the vector integer multiplier available with the
vpmuluqdq instruction. This instruction operates on two 256-bit vectors, multiplying the low 32 bits of each 64-bit lane of the input operands and producing 64 bits of output:
(a0 __ b0 __ c0 __ d0 __)
(a1 __ b1 __ c1 __ d1 __)
(a0*a1 b0*b1 c0*c1 d0*d1)
This allows computing 4 integer multiplications in parallel. Since we want to compute 4 field multiplications in parallel, we don’t need to do parallelism within a field multiplication. Field elements are represented in the usual way by 10 32-bit coefficients in radix 25.5, so the parallel field multiplications and squarings are performed in the same way as the serial 32-bit code, using one SIMD lane for each field element. For more details on the AVX2 representation, see the implementation notes.
Obstacles to parallelism
Although these formulas provide 4-way parallelism, they don’t provide a 4-way speedup. The reason is that the speedup from the vectorized implementation must overcome vectorization overhead.
One overhead is in the number of field operations: the HWCD paper suggests a mixed-model strategy, dropping the
T coordinate when computing repeated doublings. This optimization isn’t possible when using the parallel formulas, so doublings using the parallel formulas do slightly more work in total. In addition, the less regular structure of additions and subtractions in the doubling formulas gives more SIMD vectorization overhead, so the parallel formulas do better when there are proportionately more additions and fewer doublings (such as in a large multiscalar multiplication).
Another overhead is in how the field operations can be performed in terms of the CPU’s integer operations: the vector multiplier can perform only
32x32->64 bit multiplications, while the serial multiplier can perform
64x64->128 bit multiplications. So while the multiplications can be done in parallel, each multiplication does less work. In addition, the serial multiplier can efficiently handle carries using the ADX instruction set, while the vector multiplier can’t.
This overhead would be greatly reduced using the IFMA52 instructions, if Intel could get their 10nm process working and ship those CPUs. These instructions allow computing the high or low half of a product of 52-bit integers, and adding it to a 64-bit accumulator, in one instruction. Since the formulas already have 4-way parallelism, using the full width of AVX-512 IFMA52 instructions would require only 2-way parallelism within a field operation, or 2-way parallelism across point operations.
Finally, Intel CPUs cannot access higher-frequency turbo states while executing wide vector instructions, so mixed workloads might suffer. However, for complex applications, such as Bulletproofs, the sustained workload seems to make it worthwhile. By contrast, AMD’s Zen processors execute AVX2 code without thermal throttling, but only at half rate, so they run the vectorized code twice as slowly as Skylake processors. (However, the vectorized strategy is still faster than the serial strategy on Zen, since Zen is also twice as slow at serial multiplications).
This implementation is already available in
curve25519-dalek using the
avx2_backend feature. In the future, it’d be interesting to try to add a NEON backend for ARM, to add an IFMA52 backend (sometime next year, when those CPUs will hopefully be available), or to use parallelism across group operations (for instance, while performing a multiscalar multiplication), or to use the field element vectors to perform SIMD-parallel point decompression for batch verification.
Thanks to Isis Agora Lovecruft, Deirdre Connolly, Jack Grigg, Sean Bowe, and Daira Hopwood for reading a draft of this post.