Last week, I published version 1.1 of
curve25519-dalek, which has two main features over 1.0: a new SIMD backend using IFMA instructions, and a new API that allows using precomputation for verification checks. As far as I’m aware, the IFMA backend is the fastest Curve25519 ever; I wrote about it in a previous post, and it’s now integrated into the released library.
This post will describe the other new feature: a flexible precomputation API aimed at accelerating verification checks across a wide range of protocols.
Multiscalar multiplications versus multiple scalar multiplications
Many elliptic curve protocols spend most of their time performing scalar multiplication operations: computing
a*A for a curve point
A and a scalar (integer modulo the curve order)
a. In this post I’ll use capital letters
A,B,P,Q,... for curve points and lower-case letters
a,b,x,y,... for scalars, so it’s always possible to distinguish the type from the variable name.
Often, rather than just needing to compute a single scalar multiplication, an implementation needs to compute an equation like
a_1*A_1 + ... + a_n*A_n
It turns out that performing this computation as one operation (a multiscalar multiplication) is potentially much more efficient than computing
n individual scalar multiplications.
At a high level, this is because specifying the entire problem at once (“compute this entire equation”), rather than specifying multiple sub-problems (“compute this part, then this part, then this part”) allows the implementation to share work between the different parts, in a way that’s not possible when the scalar multiplications are done individually.
For example, one method for computing a scalar multiplication decomposes the scalar multiplication operation into a sequence of additions and doublings; in a multiscalar multiplication, the doublings can be shared between all the points, rather than done once per point. (More details are available in the
curve25519-dalek internal documentation.)
This operation is provided in
curve25519-dalek by the
VartimeMultiscalarMul traits, which do constant-time and variable-time operations respectively. These interfaces take iterators of points and scalars, allowing them to be used very flexibly, as in these examples.
Saving time with precomputation
Unfortunately, what these traits don’t allow is taking any advantage of precomputation. If the points are known up-front, it’s possible to trade memory for time by performing some precomputation on the points, then using it for each multiscalar multiplication.
For instance, an implementation can save time by building lookup tables of small multiples of the points, then using these to perform “chunks” of the scalar multiplication. Larger tables save more time (at the cost of more memory), but when the points aren’t known in advance, the time cost of the precomputation also has to be balanced against the time savings.
Specifying the problem at a high level
As noted above, one of the core reasons that multiscalar multiplication can be faster than multiple scalar multiplications is that the entire problem is specified at once, and an implementation can share work between all the parts. We can write the general problem of multiscalar multiplication with precomputation in the form
a_1*A_1 + ... + a_n*A_n + b_1*B_1 + ... + b_m*B_m,
B_1,...,B_m are points fixed in advance and the
A_1,...,A_n are points that vary with each computation.
The VartimePrecomputedMultiscalarMul trait
This functionality is now provided in curve25519-dalek by the
VartimePrecomputedMultiscalarMul trait, which takes the fixed points
B_1,...,B_m in a constructor and returns a precomputation struct with three slightly different variants of the problem:
vartime_multiscalar_mul, which handles the special case where
n=0and there are no dynamic points;
vartime_mixed_multiscalar_mul, which takes the dynamic points as already-validated
Points and is infallible;
optional_mixed_multiscalar_mul, which takes the dynamic points as
Option<Point>s and returns an
Option<Point>, so that a verification check on compressed point data can compose point decompression/validation into the input iterators.
This trait is implemented for ristretto255 points by the
VartimeRistrettoPrecomputation struct and for Curve25519 points by the
VartimeEdwardsPrecomputation struct. At the moment, these all share a common backend implementation, but they could be specialized in the future.
Currently, only variable-time computations are supported. This is because these are the computations used for verification, and many protocols can use multiscalar multiplications for verification even when they can’t use them for proving (because a verifier can combine multiple statements into one, or verify proofs in a batch).
The speedup varies by the problem size, but for medium sizes can give a respectable 20–30% speedup.
More importantly, by specifying the problem at a high-level, the underlying implementation of the multiscalar multiplication can be optimized transparently to the implementation of the protocol.
This means that any future optimization work will propagate its speedup to users of the precomputation API when they run
cargo update, with no changes required to each protocol’s implementation.