Open VDF: Polynomial Bitwidth Analysis

Published in

Supranational

5 min readApr 17, 2020

In this blog we will dig into the ideal bitwidth to use for the coefficients of our polynomial multiplier. Recall from the blog on the Öztürk multiplier that we take the original 2048-bit number and represent it as a polynomial with redundant coefficients. This limits the need for carry propagation in any one modular squaring iteration to twice the width of a coefficient. In the FPGA design we chose a coefficient width of 16 bits plus 1 redundant bit, for a total of 17 bits. This size fit well into the available DSP resources on the FPGA we used.

Since we can build functional units of any size on an ASIC the coefficient size becomes an optimization point. We start with some basic assumptions:

D is the coefficient size, in bits
The degree of polynomial needed to represent 2048 bits will be the ceiling of 2048 divided by D
In redundant form D gets one extra bit to hold the carry from summing the next lower degree coefficient overflow bits
In redundant form the polynomial gets two extra coefficients to hold overflow bits from the highest degree coefficient

These assumptions do break down if the coefficient becomes too small or too large. However, as we’ll see, for practical sizes these assumptions hold.

For a 2048 bit modulus with 16 bit digits we get:

Coefficients: ceil(2048 / 16) + 2 = 130Bits in redundant form: ceil((2048 / 16) + 2) * 17 = 2210

If instead we choose to have 9 bit coefficients:

Coefficients: ceil(2048 / 9) + 2 = 230Bits in redundant form: ceil((2048 / 9) + 2) * 10 = 2300

Or 24 bit coefficients:

Coefficients: ceil(2048 / 24) + 2 = 88Bits in redundant form: ceil((2048 / 24) + 2) * 25 = 2200

At a high level the primary tradeoffs are:

As the coefficient size decreases, additional redundant bits are needed, resulting in a larger representation and therefore more logic and wires overall
As the coefficient size increases, the carry chain length grows, resulting in a longer latency to compute a single coefficient

The goal of this work is to find the optimal coefficient size where these competing trends produce a design with the best performance.

Results

We started by performing a theoretical complexity analysis to estimate the number of full adders and half adders required to construct the CSAs and CPAs for a range of coefficient bitwidths. Since these structures make up the bulk of the logic in the design it’s a good high level approximation for the amount of logic needed in the design. While area is not our primary concern, power and routing are, and can also be expected to scale with the number of instances.

MSU relative complexity vs. a baseline of 16-bits in non-redundant form

In the chart above the complexity (count of full adders and half adders) follows the expected trend. As coefficient size gets smaller additional bits are needed to store the redundant representation, resulting in more logic overall. The reverse is true as bitwidth increases. However, on an absolute level the differences are quite small, ranging from about 0.95 to 1.1, relative to a baseline of 16 bits. Given the level of abstraction, and the assumption that the carry chain depth will increase with larger bitwidths, this isn’t compelling enough to push the design in any particular direction.

Next we estimated power and area for a fixed latency design across various coefficient bitwidths using data from CSA and CPA synthesis results:

These results point to an optimal range between approximately 12 bits and 21 bits. Zooming in on the y-axis:

At this level the data shows a clear trend of increasing power as bitwidth increases, but is quite noisy. The area is pretty flat above 12 bits. There are some interesting dips at 14 bits, 16 bits, 18 bits, and 21 bits.

As a next step we looked at a few bitwidths by synthesizing a complete, but scaled down, 256-bit version of the design. Directionally this should correlate with a larger design, while still completing in a reasonable amount of time. The results are shown below:

The results for the full design diverge from the earlier estimates at the lower bitwidths (9 bits, 12 bits), where power is actually worse than the 16-bit baseline power. On the whole the differences are again quite small, with results overall pointing to 14 or 16 bits being optimal.

An important result to note is that bitwidths above 16 failed to meet the timing constraint. It is possible that this could be overcome with additional work, but does exemplify the challenges that result from longer carry changes at higher bitwidths.

Based on these results we did not make any changes to the starting assumption of using 16-bit coefficients. This data is still quite noisy compared to results from later stages of the ASIC design flow, so we believe it is premature to make changes. However, this data suggests that smaller sizes such as 12 bits and 14 bits are worth keeping in mind as an option down the road should timing or other issues become a problem.

Conclusions

In this work we looked at various coefficient bitwidths to determine the ideal bitwidth for the full ASIC design using both theoretical and synthesis results.

Overall, the results point to an optimal size in the 12 bit to 16 bit range. With synthesis results for power and area falling in a narrow range across these sizes we decided to continue forward with the 16-bit coefficient size at this time, while keeping in mind that a smaller coefficient bitwidth may help alleviate certain timing challenges in the future.

Source Code

Source code for the designs presented here are available on GitHub at supranational/hardware and is licensed under the Apache 2 license.

Acknowledgements

We’d like provide a special acknowledgement to Synopsys® for their support of this work.

Open VDF: Polynomial Bitwidth Analysis

Results

Conclusions

Source Code

Acknowledgements

Written by Supranational