Optimizing Threshold Secret Sharing
The Threshold Secret Sharing library is Snips' first venture into the Rust world. Snips already uses multiple technologies and languages, so adding one more to the mix was not an obvious decision. We will show here what a great tool Rust can be when high-level programming concepts have to coexist with bit-fiddling tricks.
Rust
Snips develops mobile applications and technologies. Android and iOS, the two leading platforms, both offer high-level languages to application developers: Apple is putting Swift front and center, while Android supports various languages from the Java ecosystem, like Scala or Kotlin. But so far, sharing code between the two platforms requires getting one's hands dirty with C or C++ to build native libraries to embed in the applications.
In recent months, though, there has been a flurry of activity around Rust, the newcomer in the low-level language field. Rust has several properties that make it attractive to platform-agnostic developers:
- Rust leverages LLVM, a battle-proven compiler framework. LLVM lets the compiler developers focus on the language frontend and reap the benefits of LLVM's ability to target most modern architectures, on mobile, desktop or server.
- Rust is a language of its time: most developers today are convinced of the benefits of injecting functional style and patterns into a mostly-imperative object-oriented language. After Scala showed the way a few years ago, both Swift and Kotlin are making similar choices. These three languages being already widely used at Snips makes Rust a natural candidate for low-level programming.
- Rust tackles head-on some problems that older languages have traditionally left for third parties to deal with: the compiler ships with Cargo, a dependency and build manager. Cross-compilation is also a first-class feature: concerted efforts have been made to make it easy to install cross-compiling versions of rustc, and to switch from one version to another. It's not yet a smooth ride, but it's never been that easy.
- Finally, the language offers zero-cost abstractions in the same spirit as C++, while maintaining the same safety level as a language with a garbage collector. This is particularly attractive when high performance is required: Snips, as a company, deals with on-device machine learning, as well as modern cryptographic techniques, on a phone, a relatively modest device.
Fast Fourier Transform optimization
Back to the problem at hand. As we explained in our previous post, we are working on the secret sharing problem, generalizing Shamir's scheme to share a vector of numeric values instead of a single value. Our packed scheme relies heavily on both interpolating and evaluating polynomials in modular arithmetic.
The scheme's security requires working on two disjoint sets of points, where a polynomial will be either evaluated or interpolated. By making sure these two sets of points have certain properties, we can effectively perform both the interpolation and the evaluation using a Discrete Fourier Transform (DFT).
Why bother? Well, in the general case polynomial interpolation and evaluation are expensive operations: if n is the number of points we are working with and d the degree of the polynomial, we reach asymptotic complexities of O(n·d) for evaluation and O(n·d²) for interpolation.
DFT, on the other hand, can be performed in O(d·log(d)) operations (n is necessarily d+1 in the DFT case: the transform evaluates a polynomial with n coefficients at n points).
Optimized DFT is called the "Fast Fourier Transform" (FFT), and it is so recurrent an exercise that both terms are often confused. As far as terminology is concerned, the specific topic of DFTs in modular arithmetic is called number-theoretic transforms.
The most common algorithm for the FFT is named Cooley-Tukey, even though it was a rediscovery of a method by Gauss. The basic principle is divide and conquer: the vector is divided into parts of equal size, an FFT is performed on each sub-vector, and the results are recombined with a linear number of operations. The more factorizable your vector size is, the better. We picked sizes in the form of powers of 2 and powers of 3, so we have two excellent factorizations.
Our first operational Rust implementation of Cooley-Tukey looked like this, with a very similar method for power-of-3 vector sizes. (We will call them fft2 and fft3 in the rest of this post.) It takes a vector of power-of-2 size, a principal root of unity omega, and the modulo we work with as prime. This implementation is very clear: it makes the three steps of each stage very explicit: first divide the vector into two sub-vectors holding the even- and odd-numbered elements, then recurse over each sub-vector, and finally recombine the results.
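In spirit, it looks like this minimal sketch (function names and the exact signature are ours, not necessarily the library's code; we assume the prime is small enough that products fit once widened to 128 bits):

```rust
/// Recursive radix-2 Cooley-Tukey FFT over Z/prime.
/// `a_coef` must have power-of-2 length, and `omega` must be a principal
/// root of unity of that order modulo `prime`.
fn fft2(a_coef: &[u64], omega: u64, prime: u64) -> Vec<u64> {
    if a_coef.len() == 1 {
        return a_coef.to_vec();
    }
    // divide: form two sub-vectors with the even- and odd-numbered elements
    let even: Vec<u64> = a_coef.iter().step_by(2).cloned().collect();
    let odd: Vec<u64> = a_coef.iter().skip(1).step_by(2).cloned().collect();
    // recurse over each sub-vector, with omega squared as the new root
    let omega_sq = mul_mod(omega, omega, prime);
    let even_fft = fft2(&even, omega_sq, prime);
    let odd_fft = fft2(&odd, omega_sq, prime);
    // recombine the results with a linear number of operations
    let n = a_coef.len();
    let mut result = vec![0u64; n];
    let mut factor = 1u64;
    for i in 0..n / 2 {
        let term = mul_mod(factor, odd_fft[i], prime);
        result[i] = (even_fft[i] + term) % prime;
        result[i + n / 2] = (even_fft[i] + prime - term) % prime;
        factor = mul_mod(factor, omega, prime);
    }
    result
}

/// Widen to u128 so the product cannot overflow before reduction.
fn mul_mod(a: u64, b: u64, prime: u64) -> u64 {
    ((a as u128 * b as u128) % prime as u128) as u64
}
```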
Now, while this implementation is very nice looking, there are other well-known ways to organize the computation: without changing the number of arithmetic operations performed, it is possible to perform it "in place", without relying on allocated vectors to store the intermediate input and output of each stage.
In-place implementation
The way to achieve great performance when computing FFTs is already well covered: variants of the Cooley-Tukey algorithm are described everywhere, and many implementations exist in different languages, including Rust. Most of the time, though, the "payload", the actual numbers they work on, consists of complex numbers. We are interested in modular integers, but the benefits of reorganization are the same.
Note also that nothing we do here changes the asymptotic complexity of the FFT: we perform roughly the same number of additions and multiplications, but "organizing" the data and the computation in a different order allows the processor to make better use of its cache and its pipeline.
In order to be as cache-friendly as possible, the most efficient variants first perform a counter-intuitive rearranging of the data vector (a bit-reversal permutation, in the radix-2 case). Reshuffling the vector helps the upcoming step by making indices easier to compute, and by grouping together the data points that will be needed at the same time. By making a relatively small number of random accesses in this reordering step, it lets the expensive second step access the data in a "more natural" order.
The second step performs the body of the computation itself. Instead of a recursion, it takes an iterative form. Recursive functions cannot be inlined by the optimizer in the general case; rewriting them in iterative form helps the compiler a lot: we get rid of cycle-consuming function call overhead, and we also stop moving up and down the stack, which does not play well with the cache.
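Sketched in the same style (reusing mul_mod from the previous snippet), the reordering is the classic bit-reversal permutation, and the butterflies then walk the vector stage by stage:

```rust
/// In-place iterative radix-2 FFT: bit-reversal first, then iterative
/// butterfly stages. A sketch, not the library's exact code.
fn fft2_in_place(data: &mut [u64], omega: u64, prime: u64) {
    let n = data.len();
    // step 1: bit-reversal permutation, a few random accesses that let
    // the next step touch the data in order
    let mut j = 0;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 {
            j ^= bit;
            bit >>= 1;
        }
        j |= bit;
        if i < j {
            data.swap(i, j);
        }
    }
    // step 2: iterative butterflies, block size doubling at each stage
    let mut len = 2;
    while len <= n {
        // the root of unity for blocks of size `len` is omega^(n/len)
        let w_len = pow_mod(omega, (n / len) as u64, prime);
        for start in (0..n).step_by(len) {
            let mut w = 1u64;
            for i in 0..len / 2 {
                let u = data[start + i];
                let v = mul_mod(w, data[start + i + len / 2], prime);
                data[start + i] = (u + v) % prime;
                data[start + i + len / 2] = (u + prime - v) % prime;
                w = mul_mod(w, w_len, prime);
            }
        }
        len <<= 1;
    }
}

/// Square-and-multiply modular exponentiation.
fn pow_mod(mut base: u64, mut exp: u64, prime: u64) -> u64 {
    let mut acc = 1u64;
    base %= prime;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul_mod(acc, base, prime);
        }
        base = mul_mod(base, base, prime);
        exp >>= 1;
    }
    acc
}
```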
Rust has a built-in bench runner, so it is very easy to set up micro-benchmarks of our two versions:
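Something along these lines, using the nightly-only bench runner (the prime and root below are illustrative placeholders, not our actual parameters):

```rust
#![feature(test)]
extern crate test;

#[bench]
fn bench_fft2_in_place(b: &mut test::Bencher) {
    // placeholder parameters: `omega` is assumed to be a principal 256th
    // root of unity modulo `prime` for the transform to be meaningful
    let prime = 7681u64;
    let omega = 330u64;
    let mut data: Vec<u64> = (0..256).collect();
    // `fft2_in_place` is the in-place variant sketched above
    b.iter(|| fft2_in_place(&mut data, omega, prime));
}
```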
The first two lines of the benchmark results are for fft2, for vectors of power-of-2 size; the last two are for power-of-3 sizes. In each group, the first line is the recursive implementation, the second the in-place iterative one. Numbers are nanoseconds per FFT operation: the smaller, the better.
We get a 5× improvement on fft2, and a 10× improvement on fft3.
Lazy modulo
But we can do better.
The "core business" of our FFT is actually performing a few multiplications and additions followed by a modulo. And the modulo is a pretty expensive operation in the general case: a processor will typically run a comparison, an addition or a subtraction in one cycle, and a multiplication in three, but a modulo requires a full integer division and takes about 45 cycles.
The modulo is distributive over the additions and multiplications the FFT is doing. From a mathematical point of view, we could make the iterative part of the computation work on plain integers and apply the modulo at the end... except of course that this could generate temporary values much bigger than the 64-bit integers we can easily store and manipulate.
Without delaying the modulo application that far, we implemented a variant where we manage not to apply the modulo at every single step of the computation. The pattern of additions and multiplications applied to each piece of data is actually very predictable, and if the prime we operate with is significantly smaller than our register size, it is relatively easy to suppress about half of the expensive modulo operations.
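One way to realize the idea, sketched on the radix-2 butterfly: assuming the prime fits in 32 bits, values on the additive path drift above the prime by at most one prime per stage, so they stay far below the u64 limit across the log2(n) stages, and a single final reduction per element suffices.

```rust
/// Butterfly with lazy reduction: the additions skip their modulo, and an
/// unreduced value is only brought back below `prime` inside the next
/// multiplication, which has to reduce anyway. `w` is a reduced twiddle
/// factor; `u` and `v` may exceed `prime`, and so may the outputs.
fn butterfly_lazy(u: u64, v: u64, w: u64, prime: u64) -> (u64, u64) {
    // the multiplication reduces `v` for free as part of its own modulo
    let t = ((w as u128 * v as u128) % prime as u128) as u64;
    // no modulo on the additions: results grow by at most `prime` per
    // stage, and one reduction at the very end finishes the job
    (u + t, u + prime - t)
}
```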
This attempt cut the fft2 time in half again in our micro-benchmark. The impact on fft3 was less impressive: about one third of the time went away.
All in all, this hackish optimization got us to a tenfold overall improvement in both cases, and validates the concern about the cost of the modulo operation.
Modulo optimization
But we can still do better.
Modular arithmetic being a subject of interest in cryptography, a number of strategies to work around the huge cost of the modulo have been devised. Many of them target huge numbers specifically (the kind that RSA works with). In our case, though, the focus is on word-sized numbers, that is, 32-bit (or possibly 64-bit) numbers. Algorithms exist, but their most common implementations are not necessarily a good fit.
The Montgomery scheme, for instance, allows us to "exchange" the modulo of the field we are working with. We can transform the input numbers from the "prime" field to another field where we can choose the modulo freely. In the new field, multiplication is a bit more expensive (it requires three multiplications instead of one). But the good news is, we get to choose the modulo of the new field: we can swap a general-case modulo for a power-of-2 modulo, which can be implemented with a bit-mask and a shift, in one cycle each. A general modulo takes about 45 cycles, a multiplication about 3: in the initial field we pay about 48 cycles per multiplication, while in the new field we get to an estimated 15.
We also need to pay for the translation from the original field to the new one, and back once the FFT is performed. These two operations each require a modulo reduction in the initial field, but their cost is amortized over the whole transform.
Abstracting the numeric field at zero cost
While we could change our actual FFT functions and plug in a new integer scheme directly, it would make a mess of an already difficult function (or functions, if you count fft2 and fft3).
Rust making a point of offering zero-cost abstractions, we can use method templating (generics) to re-wire our improved FFT algorithms: what we are actually abstracting over is the numeric field. This keeps our two levels of optimization (FFT and arithmetic) segregated, and lets us wire together various combinations while keeping the implementation of each part at a manageable complexity.
Let’s define our abstract Field first:
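A sketch of such a trait (the exact definition in the library may differ slightly):

```rust
/// Abstraction over the numeric field the FFT operates in.
pub trait Field {
    /// The type of field elements, distinct from raw integer primitives.
    type U: Copy;

    fn zero(&self) -> Self::U;
    fn one(&self) -> Self::U;
    fn add(&self, a: Self::U, b: Self::U) -> Self::U;
    fn sub(&self, a: Self::U, b: Self::U) -> Self::U;
    fn mul(&self, a: Self::U, b: Self::U) -> Self::U;
}
```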
The concrete list of functions is a bit arbitrary here; we have limited it to what we needed to get the FFT to work. Field::U must be defined by a concrete field implementation as the type of its elements. Using a distinct type allows us to statically check against mix-ups of field elements with Rust integer primitives.
So far, this Field trait would work for our modular fields as well as the complex field. To deal specifically with modular arithmetic, we can say a few more things and define an extended ZpField: such a field can be built from a modulo, and provides conversions `from` and `back` between raw integers and field elements.
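Again as a sketch (method names are ours):

```rust
/// Fields of integers modulo a prime: constructible from the modulo,
/// with conversions from and back to raw integers.
pub trait ZpField: Field {
    fn new(prime: u64) -> Self;
    fn from_u64(&self, a: u64) -> Self::U;
    fn back_to_u64(&self, a: Self::U) -> u64;
}
```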
A simple reference implementation of ZpField is equivalent to the "eager" case: after each operation, apply the modulo.
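For instance, assuming a prime below 2^32 so that products fit in a u64:

```rust
/// "Eager" reference implementation: reduce after every operation.
pub struct NaiveZp {
    prime: u64,
}

impl Field for NaiveZp {
    type U = u64;
    fn zero(&self) -> u64 { 0 }
    fn one(&self) -> u64 { 1 }
    fn add(&self, a: u64, b: u64) -> u64 { (a + b) % self.prime }
    fn sub(&self, a: u64, b: u64) -> u64 { (a + self.prime - b) % self.prime }
    fn mul(&self, a: u64, b: u64) -> u64 { (a * b) % self.prime }
}

impl ZpField for NaiveZp {
    fn new(prime: u64) -> NaiveZp { NaiveZp { prime } }
    fn from_u64(&self, a: u64) -> u64 { a % self.prime }
    fn back_to_u64(&self, a: u64) -> u64 { a }
}
```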
Finally, we need to re-wire the actual FFT functions to use field values and their operators. The rearranging step is nearly identical, but the computing step needs a bit more work:
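A sketch against the Field trait above, with the rearranging helper and the generic power operator included for completeness:

```rust
/// Field-abstracted in-place radix-2 FFT: same structure as before, but
/// all arithmetic now goes through the field `zp`.
pub fn fft2_in_place<F: Field>(zp: &F, data: &mut [F::U], omega: F::U) {
    let n = data.len();
    bit_reverse(data); // rearranging step, nearly identical to before
    let mut len = 2;
    while len <= n {
        let w_len = pow(zp, omega, (n / len) as u64);
        for start in (0..n).step_by(len) {
            let mut w = zp.one();
            for i in 0..len / 2 {
                let u = data[start + i];
                let v = zp.mul(w, data[start + i + len / 2]);
                data[start + i] = zp.add(u, v);
                data[start + i + len / 2] = zp.sub(u, v);
                w = zp.mul(w, w_len);
            }
        }
        len <<= 1;
    }
}

/// The ancillary power operator, defined as a generic method too.
pub fn pow<F: Field>(zp: &F, mut base: F::U, mut exp: u64) -> F::U {
    let mut acc = zp.one();
    while exp > 0 {
        if exp & 1 == 1 {
            acc = zp.mul(acc, base);
        }
        base = zp.mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Bit-reversal permutation, independent of the element type.
fn bit_reverse<T>(data: &mut [T]) {
    let n = data.len();
    let mut j = 0;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 {
            j ^= bit;
            bit >>= 1;
        }
        j |= bit;
        if i < j {
            data.swap(i, j);
        }
    }
}
```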
As promised, we are now abstract over the field we work with. F::U replaces the raw primitive we were using before, and instead of a prime, we pass an actual implementation of Field (called zp) to the function. The zero and one representations now have to be obtained from the field, and the addition and multiplication operators are replaced by calls on zp. We use an ancillary power operator which is defined as a generic method as well (included at the end of the sketch above).
Now for the real "zero-cost" test: theory says that an fft2 method will be generated for our specific zp field, and that our add and mul calls will get inlined in the generated code. The optimizer can then work on them and perform the same optimizations it would if the operations were spelled out in the code. In the end, we should get performance similar to the earlier "eager" case (first line). Our zero-cost field-abstracted version comes on the second line.
There is a bit of variability (a laptop is not the best environment for reproducible micro-benchmarks), but we are in the same ballpark. We are actually consistently a tiny bit faster with the naive_zp abstracted implementation, for a reason that still intrigues us. So hurrah for the zero-cost abstraction: the cost even looks slightly negative.
Montgomery multiplication
Let's now throw our Montgomery field into the mix: it is a bit more complicated to implement, but still manageable. We use Hacker's Delight's implementation notes as a reference, transposing their work from 64-bit to 32-bit (and from C to Rust).
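A sketch of that transposition, for an odd prime below 2^31 with R = 2^32 (names are ours, not the library's):

```rust
/// Montgomery representation: an element a is stored as a*R mod p.
pub struct MontgomeryZp {
    prime: u32,
    prime_neg_inv: u32, // -p^-1 mod 2^32, precomputed once
}

impl MontgomeryZp {
    pub fn new(prime: u32) -> MontgomeryZp {
        // Newton iteration: each step doubles the number of correct bits
        // of p^-1 mod 2^32 (p odd, so p is its own inverse mod 8)
        let mut inv = prime;
        for _ in 0..5 {
            inv = inv.wrapping_mul(2u32.wrapping_sub(prime.wrapping_mul(inv)));
        }
        MontgomeryZp { prime, prime_neg_inv: inv.wrapping_neg() }
    }

    /// Montgomery reduction: maps t < p*R to t*R^-1 mod p, trading the
    /// division by p for a truncation and a shift by R = 2^32.
    fn redc(&self, t: u64) -> u32 {
        let m = (t as u32).wrapping_mul(self.prime_neg_inv);
        let u = ((t + m as u64 * self.prime as u64) >> 32) as u32;
        if u >= self.prime { u - self.prime } else { u }
    }

    /// Multiplication in Montgomery form: three multiplications and a few
    /// cheap operations, no general-case modulo.
    pub fn mul(&self, a: u32, b: u32) -> u32 {
        self.redc(a as u64 * b as u64)
    }

    /// Enter the Montgomery representation: a -> a*R mod p.
    pub fn from_u32(&self, a: u32) -> u32 {
        (((a as u64) << 32) % self.prime as u64) as u32
    }

    /// Leave the Montgomery representation: a*R -> a.
    pub fn back_to_u32(&self, a: u32) -> u32 {
        self.redc(a as u64)
    }
}
```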
The code for Montgomery multiplication is not much more difficult than the naive field code. The fft2 code is unchanged, but let's have a look at the generated assembly for a change. The Montgomery scheme replaces one multiplication plus a modulo by three multiplications and a few cheaper operations. By looking for the atypical imull (a 32-bit multiplication used in the scheme), we can find this pattern appearing in many places in the code, more or less distorted by the optimizer to fit each specific location.
Now that we are confident the method calls to the Montgomery scheme have been inlined and optimized despite their higher complexity, let's check on the performance:
So our final result is a 19× improvement for fft2, and 15× for fft3.
On a laptop.
But as our real targets are smaller devices (phones), results on 32-bit ARM are also very interesting for us. The Raspberry Pi is a supported platform for Rust: compiling and running the bench on a Pi is just as simple as on a Mac. Just a tad slower.
On the Raspberry Pi, the improvements are even bigger: 66× for fft2, 57× for fft3. Doing most operations on native 32-bit integers instead of going through the 64-bit instructions helps a lot.
Conclusion
First, we have greatly improved the raw performance of our library: depending on the architecture and the function, we measure a 15- to 66-fold improvement. It is unlikely to actually extend the possible scope of our application, as the bottleneck is currently on the network side. Yet it will allow us to use the library with fewer concerns about the occasional slowdown it could incur, or the battery-life penalty. It will save a bit of energy: a good deed, if nothing else.
Secondly, we have tested some of Rust's promises. Specifically, we have an instance of an actual zero-cost abstraction, in a spot where even a marginal cost would be very visible, and we have verified that code generation was as good as we could expect it to be. In the context of a company working on on-device AI with a strong focus on privacy, Rust's combination of raw performance and safety may make it an important language in the toolbox.