Even faster Edwards curves with IFMA

In a previous post, I described a SIMD-friendly implementation strategy for Edwards curve arithmetic, which makes use of parallelism in the formulas for curve operations, and presented performance results for an AVX2 implementation of that strategy.

Intel’s AVX512 instruction set includes an extension, AVX512-IFMA, aimed at accelerating integer arithmetic. I wrote a new curve25519-dalek backend in Rust, using these instructions to implement the parallel formula strategy. This gives another performance gain of about 1.5x relative to my AVX2 implementation, or 2.3x relative to ed25519-donna, and it seems likely that it’s the fastest Curve25519 implementation ever.

More interestingly, it almost closes the performance gap (for now) between Curve25519 and the newer and faster FourQ curve from Microsoft Research, even when FourQ uses the (patented) endomorphism speedup:

Creating a bar chart in Google sheets with one differently-coloured bar: among the hardest problems in computer science.

The chart shows the performance of several elliptic curve implementations on a double-base scalar multiplication, used as a proxy for performance of group operations generally. (The libsecp numbers marked with an asterisk are estimates, obtained by subtracting the cost of a scalar inversion from an ECDSA verification).

Because curve25519-dalek is designed as a reusable mid-level library, aimed at making it easier to implement elliptic-curve protocols in general, this gives a transparent speedup to any implementation using it. For instance, the Bulletproofs implementation gets about 1.5x faster, depending on the benchmark, improving its speedup to 3x (relative to libsecp) or 7x (relative to Monero).

The speedup relative to FourQ is only temporary, because the FourQ implementation isn’t using IFMA instructions. Since the SIMD-parallel formula strategy isn’t specific to Curve25519 but works for any Edwards curve, the same approach should also give a speedup for FourQ, and might actually work slightly better due to some design benefits of FourQ over Curve25519.

IFMA Instructions

AVX512-IFMA consists of two instructions, vpmadd52luq and vpmadd52huq, which multiply the low 52 bits of their source operands to form a 104-bit product, then add either the low (luq) or high (huq) 52 bits of that product into a 64-bit accumulator. Using 52-bit operands means that the instructions can be implemented using existing floating-point circuitry.
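As a mental model, the per-lane behaviour of these instructions can be sketched in scalar Rust. (The helper names here are mine, chosen for illustration; the real instructions operate on each 64-bit lane of a vector register in parallel.)

```rust
const MASK52: u64 = (1u64 << 52) - 1;

// Scalar model of vpmadd52luq, per 64-bit lane: multiply the low 52
// bits of x and y, then add the low 52 bits of the 104-bit product
// into the accumulator z.
fn madd52lo_lane(z: u64, x: u64, y: u64) -> u64 {
    let prod = ((x & MASK52) as u128) * ((y & MASK52) as u128);
    z.wrapping_add((prod as u64) & MASK52)
}

// Scalar model of vpmadd52huq, per 64-bit lane: same product, but
// add the high 52 bits into the accumulator instead.
fn madd52hi_lane(z: u64, x: u64, y: u64) -> u64 {
    let prod = ((x & MASK52) as u128) * ((y & MASK52) as u128);
    z.wrapping_add((prod >> 52) as u64)
}
```

Because the accumulator is 64 bits but each addend is at most 52 bits, several partial products can be summed before a carry step is needed, which is what makes these instructions attractive for multi-limb field arithmetic.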

These instructions are useful because they give SIMD code access to much more powerful integer multipliers. Previously, SIMD code could only use a 32x32->64-bit multiplier via the vpmuludq instruction, whereas serial code could use a 64x64->128-bit multiplier via the mulx instruction.
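To see why multiplier width matters, here’s a hedged sketch (my own illustration, not code from the implementation) comparing a single wide multiply against the schoolbook decomposition a narrower multiplier forces. With only a 32x32->64 primitive, one 64x64 product costs four partial products plus shifts and adds:

```rust
// One wide multiply, as mulx provides to serial code.
fn mul_64x64(x: u64, y: u64) -> u128 {
    (x as u128) * (y as u128)
}

// The same product built from 32x32->64 pieces, as SIMD code had to
// do with vpmuludq: split each operand into 32-bit halves and form
// four partial products.
fn mul_64x64_via_32(x: u64, y: u64) -> u128 {
    let (x0, x1) = (x & 0xffff_ffff, x >> 32);
    let (y0, y1) = (y & 0xffff_ffff, y >> 32);
    let p00 = (x0 as u128) * (y0 as u128);
    let p01 = (x0 as u128) * (y1 as u128);
    let p10 = (x1 as u128) * (y0 as u128);
    let p11 = (x1 as u128) * (y1 as u128);
    p00 + ((p01 + p10) << 32) + (p11 << 64)
}
```

The 52-bit IFMA multipliers don’t eliminate this decomposition entirely, but they shrink it dramatically compared to 32-bit pieces.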

IFMA instructions are currently only available in Cannonlake CPUs, whose availability has been stymied by the years of delay in Intel’s 10nm manufacturing process. However, Intel finally released a dual-core Cannonlake CPU, the i3-8121U, for use in the Lenovo IdeaPad 330, available for sale only in China, which I used to develop and test the implementation. (Thanks to Cathie Yun for helping me get a hold of one).

purr programming on the IFMA backend

The i3-8121U implements 512-bit IFMA instructions at half the rate of 256-bit ones. As yet, there’s no word on whether or not AMD’s Zen 2 architecture will support IFMA, but if it does, it will also execute IFMA instructions 256 bits at a time, so there doesn’t seem to be a clear downside to using 256-bit vectors (which match the 4-way parallelism available in the formulas).

Adding intrinsics to Rust

Rust doesn’t yet have support for AVX512 intrinsics, let alone AVX512-IFMA. However, LLVM does have support for both — what’s missing in Rust is just the glue connecting the LLVM internals to a Rust API. To work around this until the intrinsics are added upstream, it’s possible to hook into the LLVM internals manually:

// Need this to enable the intrinsic definitions
#![feature(simd_ffi, link_llvm_intrinsics)]
// The original Rust std::simd had typed vectors, which
// moved into the packed_simd crate. I like them better.
use packed_simd::u64x4;

// The `link_name`s below are pulled out of LLVM tablegen, have
// changed in the past, and might change again in the future.
#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.x86.avx512.vpmadd52l.uq.256"]
    fn madd52lo(z: u64x4, x: u64x4, y: u64x4) -> u64x4;
    #[link_name = "llvm.x86.avx512.vpmadd52h.uq.256"]
    fn madd52hi(z: u64x4, x: u64x4, y: u64x4) -> u64x4;
}

Unfortunately, the lack of wide availability of these processors makes development difficult. For instance, I couldn’t get Linux’s perf infrastructure to work on the i3-8121U, so the entire implementation was written somewhat blindly, without the ability to precisely measure performance using CPU performance counters. On the upside, this means that there’s still probably a bit of room to optimize it later. More details on the areas that could be improved can be found in the implementation notes, linked below.

Implementation details

Medium doesn’t support math, so there are lots of interesting details covered in the implementation notes that I can’t easily put here.

  • The notes on the IFMA backend give an overview of the IFMA instructions, a discussion of previous work which uses them for big-integer arithmetic, and an overview of the strategy for using them to implement field arithmetic.
  • The notes on the parent module, the curve25519-dalek vector backend, explain the parallel Edwards strategy in general.

These notes are embedded with the implementation rather than standalone, but in the future, I’m planning to write up some standalone notes on the parallel strategy and these implementations.