IOTA Crypto Core FPGA — 1st Progress Report

This is the first progress report on my EDF-funded project, the IOTA Crypto Core FPGA. I’ll try to publish one every four weeks until all milestones are finished …^^

About the project

IOTA core functions like address generation, signing, “Mini-PoW” and Proof of Work (PoW) need a lot of computational power, which makes them almost impossible (in a practical sense) to run on small embedded systems.

The aim of the project is to develop several modules that can be used by existing or new embedded applications needing IOTA core functionality.

The first is an IOTA Core FPGA module which provides most IOTA core functions with hardware acceleration. It will offer an easy-to-use high-level API, while computationally intensive low-level calculations are off-loaded to specialized logic. This gives a significant speed advantage over a software-only solution and makes the module well suited to embedded applications.

Additionally, the FPGA module implements several security mechanisms which will make it very hard for attackers to gain unauthorized access to seeds stored on the module.

The second module will be a System-on-Module (SoM) which uses the FPGA module. This SoM will have enough resources for a large number of applications. It could even run Linux. The SoM can be seen as an integration example for the FPGA module. It can be used unmodified for your own applications, but other microcontrollers could use the FPGA module just as easily.

The third module is an application board using the SoM which will be an IOTA sensor gateway for simple and cheap sensors.

Overall, the architecture looks like this:

Currently, I’m working on the second milestone (the first was the PiDiver PoW), which will be the actual FPGA core. A stock FPGA board is used for this task, and an SoC system will be developed.

FPGA is an abbreviation for “field-programmable gate array”. It consists of logic blocks which can be configured inside the FPGA to build larger logic functions, up to complete CPUs. Such logic is described in a hardware description language like Verilog or VHDL, which is then turned into logic by synthesis tools. The best-known FPGA manufacturers, Xilinx and Altera, offer complete IDEs for free (for small to mid-range FPGAs). FPGAs have the advantage that you can describe logic that works truly in parallel, but you can also describe serially executed logic by using state machines. FPGAs are not good for everything, though: for certain tasks it is better to use a microcontroller, because logic utilization can easily explode when trying to do everything in Verilog or VHDL. For such cases there are soft CPUs (e.g. from ARM) which can be used inside the FPGA.

This project uses such a soft CPU (a Cortex M1) combined with hardware accelerators. These accelerators are logic blocks (described in VHDL), each of which can do one task (like SHA3) very fast, but nothing else. They can be integrated into the 32-bit address space of the CPU, which gives very tight coupling and quick transfer times.

The system looks like this in the BlockDesigner of Xilinx Vivado:

(the red components were newly developed; the others were available as IP cores in the catalog or were part of the example design)

Comparisons

The following sections compare the Cortex M1 without accelerators, the Cortex M1 with accelerators, Cortex M3, Cortex M4, Raspberry Pi 3B and my desktop PC (a Core i5).

So far, the following accelerators have been developed:
- Curl-P81 (used in PoW)
- Keccak384 (used in Kerl for address generation & signing — it’s a variant of SHA3)
- Converter Bytes <-> Trits
- Troika

Curl-P81, Keccak384 and Troika are hashing algorithms that need only one clock cycle per round on the FPGA. The converter is more expensive because it needs lots of divisions and multiplications.

Bytes To Trits

This conversion is done a lot because Kerl uses Keccak384, which works in binary, while most IOTA functions work in trinary afterwards. The task is computationally intensive because it requires lots and lots of divisions. Cortex M3 and M4 have hardware dividers by default, but the Cortex M1 doesn’t. So it was essential to build an accelerator for this conversion.
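To illustrate why this direction is so division-heavy, here is a minimal Python sketch that turns a byte string into balanced trits (-1, 0, 1) by repeated division by 3. The function names and the 243-trit default are assumptions made for illustration; it is a simplified sketch of the idea, not the exact Kerl conversion and not the logic of the accelerator.

```python
def int_to_trits(value: int, length: int) -> list[int]:
    """Convert a signed integer to `length` balanced trits, lowest trit first."""
    trits = []
    for _ in range(length):
        remainder = value % 3   # one division/modulo per output trit
        value //= 3
        if remainder == 2:      # balanced ternary: 2 becomes -1 plus a carry
            remainder = -1
            value += 1
        trits.append(remainder)
    return trits


def bytes_to_trits(data: bytes, length: int = 243) -> list[int]:
    """Interpret `data` as a big-endian signed integer and convert it to trits."""
    value = int.from_bytes(data, byteorder="big", signed=True)
    return int_to_trits(value, length)


# Example: 5 = -1 - 3 + 9
print(int_to_trits(5, 3))  # [-1, -1, 1]
```

Every output trit costs one division of a 384-bit number, which a CPU without a hardware divider has to emulate in software.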

Trits To Bytes

This is the opposite direction. Unlike division, multiplication is supported in hardware on the Cortex M1, which results in almost the same speed as the Cortex M4. The Cortex M3 is slower roughly in proportion to its lower system clock.
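The reverse direction only needs multiply-and-add steps, as the following Python sketch shows. The function names and the 48-byte default are assumptions made for illustration, and the sketch is again simplified compared to the real Kerl conversion.

```python
def trits_to_int(trits: list[int]) -> int:
    """Combine balanced trits (lowest trit first) into a signed integer."""
    value = 0
    for trit in reversed(trits):
        value = value * 3 + trit   # one multiplication and one addition per trit
    return value


def trits_to_bytes(trits: list[int], size: int = 48) -> bytes:
    """Encode the trits as a big-endian signed integer of `size` bytes.

    Raises OverflowError if the value does not fit into `size` bytes.
    """
    return trits_to_int(trits).to_bytes(size, byteorder="big", signed=True)


print(trits_to_int([-1, -1, 1]))  # 5
```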

Keccak384

SHA3 runs really well on all (binary) CPUs. It’s safe and fast. But since FPGAs are very good at doing a lot of logic within one clock cycle, the M1 can benefit a lot from hardware acceleration.

Proof-of-Work (Curl-P81)

Proof-of-Work is currently used for spam protection and uses Curl-P81, which is a trinary hash algorithm. The PoW algorithm tries to find hashes with a certain number of trits set to 0 at the end of the hash. It is not uncommon to need millions of iterations before a valid solution is found. This was the first accelerator I developed, and it was used on the PiDiver. The image doesn’t show PoW times for Cortex M3 and M4 because measuring average times would have taken a long time, but I would expect them to be about ten times those of the Raspberry Pi.
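The search loop itself is simple, as this Python sketch of the principle shows. Note the assumptions: SHA3-384 from the standard library stands in for Curl-P81 (which is not available there), the digest is expanded into 243 trits in a made-up way, and the 8-byte nonce appended to the payload is hypothetical. Only the "try nonces until enough trailing trits are 0" structure reflects the description above.

```python
import hashlib


def stand_in_hash_trits(data: bytes, length: int = 243) -> list[int]:
    """Stand-in for Curl-P81: SHA3-384 digest expanded into balanced trits."""
    value = int.from_bytes(hashlib.sha3_384(data).digest(), "big")
    trits = []
    for _ in range(length):
        value, remainder = divmod(value, 3)
        if remainder == 2:          # balanced ternary: 2 becomes -1 plus a carry
            remainder = -1
            value += 1
        trits.append(remainder)
    trits.reverse()                 # low-order trits end up at the end of the list
    return trits


def trailing_zero_trits(trits: list[int]) -> int:
    """Count how many trits at the end of the hash are 0."""
    count = 0
    for trit in reversed(trits):
        if trit != 0:
            break
        count += 1
    return count


def find_nonce(payload: bytes, min_zero_trits: int, max_tries: int = 1_000_000):
    """Try nonces until the hash ends in at least `min_zero_trits` zero trits."""
    for nonce in range(max_tries):
        candidate = payload + nonce.to_bytes(8, "big")
        if trailing_zero_trits(stand_in_hash_trits(candidate)) >= min_zero_trits:
            return nonce
    return None                     # no solution within max_tries


nonce = find_nonce(b"demo transaction", min_zero_trits=3)
print("found nonce:", nonce)
```

Each additional zero trit required makes the search about three times longer on average, which is why the required iteration counts quickly reach the millions.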

Bundle creation

In this performance test an IOTA bundle with 4 transactions (1 input, 2 outputs, 1 signature) was created. Because of a lack of memory I was not able to get one generated on the Cortex M3 in a short time, so I simply skipped it.

Not surprisingly, the winner is the Core i5. But the Cortex M1 with hardware acceleration is faster than a single core of the Raspberry Pi 3B.

In this test PoW is not done — that would have been unfair^^

Troika

Troika is the new light-weight hash algorithm especially optimized for trinary CPU architectures. It probably shouldn’t be too surprising that it doesn’t run well on binary CPU architectures. The algorithm uses a lot of modulo operations and divisions by powers of 3, which is exceptionally bad for smaller microcontrollers. On a trinary CPU a division by 3 is only a trit shift and a modulo 3 only trit masking. Neither operation takes much time on such CPUs, but currently we don’t have trinary CPUs and have to work with what we have. Cortex M4 and M3 have hardware dividers (although they need a couple of clock cycles for each division), but on the M1 it is really bad.
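The trit-shift argument can be made concrete with a small Python sketch. It models a number as a list of base-3 digits (0, 1, 2, lowest trit first); the helper names are made up for illustration and are not taken from the Troika reference code.

```python
def to_trits(n: int, length: int) -> list[int]:
    """Base-3 digits (0, 1, 2) of a non-negative integer, lowest trit first."""
    trits = []
    for _ in range(length):
        n, digit = divmod(n, 3)
        trits.append(digit)
    return trits


def from_trits(trits: list[int]) -> int:
    """Inverse of to_trits."""
    return sum(trit * 3**i for i, trit in enumerate(trits))


def divide_by_3(trits: list[int]) -> list[int]:
    """Division by 3 in trinary: drop the lowest trit (a trit shift)."""
    return trits[1:] + [0]


def modulo_3(trits: list[int]) -> int:
    """Modulo 3 in trinary: keep only the lowest trit (trit masking)."""
    return trits[0]


t = to_trits(47, 5)
print(t)                           # [2, 0, 2, 1, 0]
print(from_trits(divide_by_3(t)))  # 15, the same as 47 // 3
print(modulo_3(t))                 # 2, the same as 47 % 3
```

On a binary CPU the same two results need a real division, which is exactly the operation the Cortex M1 lacks in hardware.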

So bad that I feared it could render my project unusable if IOTA switches from Kerl to Troika at some point in the future … So I wanted to know what can be done in the FPGA, and surprisingly a Troika hash round can also be calculated in a single clock cycle, just like Curl-P81 or Keccak384.

I have to note explicitly that performance was measured with the reference implementation of Troika, which can certainly be optimized a lot for binary CPU architectures.

The image above directly compares Keccak384 with Troika. The former runs really well on all binary CPU architectures; the latter probably only on trinary ones.

If you ask me if I think that Troika is light-weight … I would say no … Not on binary CPUs and currently we don’t have others (hopefully coming soon).

But that doesn’t mean Troika isn’t usable in the interim, for these reasons:
- Troika probably won’t be used before 2020 and
- if Troika proves secure enough, the number of rounds could be decreased.
- Additionally, the reference implementation can certainly be optimized a lot for binary CPU architectures, e.g. by working with look-up tables to avoid div-by-3 and mod-3 calculations. The large memory footprint (6kB constant table) can be reduced to a 16-byte ring buffer by calculating the LFSR instead of looking it up (of course this has an impact on the calculation time), and so on ...
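As a rough illustration of the look-up-table idea, the following Python sketch replaces div-by-3 and mod-3 with table reads. The table size of 27 entries and the function names are assumptions made for this example; an optimized Troika implementation may organize its tables quite differently.

```python
# Hypothetical input range 0..26. On a microcontroller these tables would be
# precomputed constants in ROM; here they are simply built once at start-up.
TABLE_SIZE = 27
DIV3 = tuple(n // 3 for n in range(TABLE_SIZE))
MOD3 = tuple(n % 3 for n in range(TABLE_SIZE))


def add_mod3(a: int, b: int) -> int:
    """Add two trits (each 0..2) modulo 3 with a table read instead of a modulo."""
    return MOD3[a + b]


def split_low_trit(value: int) -> tuple[int, int]:
    """Return (value // 3, value % 3) for 0 <= value < 27 using table reads only."""
    return DIV3[value], MOD3[value]


print(add_mod3(2, 2))      # 1
print(split_low_trit(14))  # (4, 2)
```

The trade-off is the usual one: a few bytes of constant memory are spent to avoid a division on every step of the hash.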

Will the Crypto Core FPGA be usable?

Without accelerators I would have said absolutely no … The M1 is a lot slower than expected, but the picture changed with acceleration. Performance is quite good in most disciplines, so I have a good feeling about the further progress of this project.

Risks

The start was smooth, and the critical components are working and fit in the FPGA. However, it could turn out that the internal memory of the Cortex M1 is insufficient for everything I want to do, because it’s limited to about 256kB in total. The split between RAM and ROM can be changed, though: for instance, it’s possible to increase RAM and decrease ROM, and vice versa. So I’m confident it should work :-)

Next Steps

The Cortex M1 is running and the accelerators are developed. Next on the list is to attach a secure element to the FPGA and to implement AES for secure communication. It hasn’t been decided yet whether AES will be done by the Cortex M1 or by an accelerator. That will depend on speed on the one hand and on logic utilization on the other.

Thank you for reading this much text :-)