Introducing Blaze: ZK Acceleration for FPGA

Ingonyama
May 15, 2023

by Immanuel Segol

At Ingonyama, we are developing FPGA acceleration solutions for common ZK primitives, with the goal of making FPGA acceleration as developer-friendly as GPU acceleration.

To this end, we introduce Blaze, a Rust library for ZK acceleration on Xilinx FPGAs. Our goal with Blaze is to make FPGA acceleration accessible to ZK developers.

What is Blaze?

Blaze is a Rust library that allows access to Ingonyama’s implementation of primitives such as MSM, NTT, and Poseidon hash on an FPGA, without much hassle or overhead. Blaze will abstract away the complexities of reading/writing from FPGAs and flashing FPGAs.

It is also possible to use Blaze with your own FPGA programs. However, you will need to define your own configuration as well as implement a client.

Currently, we support the C1100/U55C Xilinx FPGA card.

Who is Blaze for?

Blaze is for anyone who requires high-performance ZK primitives in their application.

We envision Blaze being used by protocol developers and proof providers wanting to upgrade existing protocols with FPGA acceleration for increased performance.

What problem is Blaze trying to solve?

Before Blaze, a team that wanted to design an FPGA and write the software integration for it needed software developers who were very knowledgeable about interfacing with FPGAs and who could write the integration logic at the driver level.

Blaze now allows the software team to simply get a configuration file from the Hardware team and interact with the high-level Blaze API.

Blaze — architecture overview

Blaze is a library for interacting with FPGAs (DriverClient in the diagram). It implements low-level methods for interacting with the XDMA and other parts of the FPGA. Blaze also allows developers to define FPGA APIs in the form of JSON objects; from these, we can generate clients that implement high-level, easy-to-use drivers.

We can combine Blaze with our Driver Manager (future feature release). The Driver Manager keeps a pool of DriverClients representing the existing FPGAs on the machine. Applications communicate with the Driver Manager by sending requests in the form of data written to shared memory and a payload with request details. The Driver Manager will then distribute the task requests among the available Blaze clients according to availability and task load.
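Since the Driver Manager is not released yet, the following is only a minimal sketch of the pooling idea described above, not its actual design. The names PooledClient, DriverPool, dispatch, and complete are hypothetical, and "task load" is reduced here to a simple count of pending tasks per card.

```rust
/// Hypothetical stand-in for a DriverClient entry tracked by a manager.
#[derive(Debug)]
struct PooledClient {
    slot: String,         // card slot identifier, e.g. "0"
    pending_tasks: usize, // tasks currently queued on this card
}

/// Hypothetical pool that hands each new task to the least-loaded card.
struct DriverPool {
    clients: Vec<PooledClient>,
}

impl DriverPool {
    fn new(slots: &[&str]) -> Self {
        let clients = slots
            .iter()
            .map(|s| PooledClient { slot: s.to_string(), pending_tasks: 0 })
            .collect();
        DriverPool { clients }
    }

    /// Pick the card with the fewest pending tasks and record the new task.
    /// Ties go to the card that appears first in the pool.
    /// Returns the chosen slot, or None if the pool is empty.
    fn dispatch(&mut self) -> Option<String> {
        let client = self.clients.iter_mut().min_by_key(|c| c.pending_tasks)?;
        client.pending_tasks += 1;
        Some(client.slot.clone())
    }

    /// Mark one task on the given slot as finished.
    fn complete(&mut self, slot: &str) {
        if let Some(c) = self.clients.iter_mut().find(|c| c.slot == slot) {
            c.pending_tasks = c.pending_tasks.saturating_sub(1);
        }
    }
}
```

A real manager would also have to track card health and the program currently loaded on each card; this sketch only illustrates load-based selection.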

Getting Started with Blaze

Before we dive into the Blaze API, we would like to provide an overview of some basic FPGA concepts and terminology. Though it’s not necessary to be familiar with this terminology to use Blaze out of the box, we think it’s important to understand the protocols and interfaces Blaze is abstracting away.

What is an FPGA?

Field Programmable Gate Arrays (FPGAs) are semiconductors, but unlike other integrated circuits such as ASICs, you can think of an FPGA as a clean slate. It is as close as a developer can get to designing custom hardware without actually needing to manufacture anything. Out of the box, FPGAs don’t do anything at all, but they can be programmed to do anything you wish. They are also nearly stateless; the moment you switch them off they are reset.

High-level overview of an FPGA

In general, FPGAs have three main components:

  1. Logic cells
  2. Interconnects
  3. IO blocks

A TL;DR of these components is that logic cells are state machines allowing you to define your program. A logic cell is constructed of lookup tables (these act as RAM for combinatorial logic functions), flip-flops for storing state information, and multiplexers, which route logic between elements of the block and to external resources.
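To make the "lookup table as RAM" idea concrete, here is a small software model of a 2-input lookup table. It is only an illustration: the Lut2 type and its constants are made up for this example, and real devices use wider tables.

```rust
/// A 2-input lookup table: the low 4 bits of `truth_table` store one output
/// bit for each of the four possible input combinations.
struct Lut2 {
    truth_table: u8,
}

impl Lut2 {
    /// Evaluate the table by using the two inputs as an index into it.
    fn eval(&self, a: bool, b: bool) -> bool {
        let index = ((a as u8) << 1) | (b as u8);
        (self.truth_table >> index) & 1 == 1
    }
}

/// "Programming" the cell is just choosing the table contents.
/// Only index 3 (a = 1, b = 1) outputs 1.
const AND: Lut2 = Lut2 { truth_table: 0b1000 };
/// Indices 1 and 2 (exactly one input set) output 1.
const XOR: Lut2 = Lut2 { truth_table: 0b0110 };
```

The same structure computes any 2-input boolean function; only the stored bits change, which is why an FPGA can be reconfigured without changing the silicon.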

Interconnects connect the logic cells and route data between them, and IO blocks allow the FPGA to read from and write to the outside world via different interfaces. All of these components can be programmed with hardware description languages (HDLs); one example would be Verilog.

We will now briefly touch on some of the main components used in an FPGA environment, just to solidify a high-level understanding.

IP Cores

IP cores are “modules” designed and tested by 3rd party companies such as AMD / Xilinx or another team member. These modules come in both software and hardware form and offer thoroughly tested solutions for common challenges FPGA designers may encounter.

AXI

The AXI (Advanced eXtensible Interface) is a communication protocol used primarily within FPGAs, ASICs, and SoCs (System-on-Chip) for facilitating communication between different components, such as processor cores, memory controllers, and peripherals. AXI is part of the AMBA (Advanced Microcontroller Bus Architecture) family of protocols, which is developed by ARM. You can think of AXI as an efficient internal communication protocol helping the FPGA’s inner components speak with each other.

HBM

High Bandwidth Memory (HBM) is a high-performance memory technology that is integrated closely with the FPGA to provide high memory bandwidth and low power consumption. HBM is particularly useful for applications that require large amounts of data to be processed quickly (MSM for example).

HBM is an actual physical component in the FPGA; it uses a 3D-stacked memory architecture and delivers better performance than traditional DDR RAM. The HBM memory controller, usually a hard IP core, uses AXI to communicate with other components in the FPGA.

DMA

DMA (Direct Memory Access) allows for data transfer between external devices and the FPGA without the need for a processor to be involved (DMA can also facilitate internal data transfer between FPGA components).

Xilinx provides an IP core called XDMA (Xilinx DMA). XDMA is specifically designed for high-performance data transfers between the FPGA and the host system over the PCIe (Peripheral Component Interconnect Express) interface.

HBI

The Host Bus Interface in an FPGA serves as the interface, or bus, that connects the host system to the FPGA, enabling communication, control, and data transfers between them.

Introduction to Blaze

Blaze is our solution for developer-friendly FPGA integration. Blaze abstracts away the complexities of interacting with an FPGA and provides developers with a simple and composable way to build FPGA-accelerated applications.

Blaze also makes it easy for developers to define application-specific clients (or use Ingonyama clients), which can easily harness the power of Ingonyama’s FPGA solutions for many of the most important ZK primitives.

FPGAs can be pooled together to perform a specific task, or each can run its own program individually. Blaze is stateless at the moment; however, we are working on a management layer that can be used to seamlessly orchestrate multiple devices.

Setting up a project

What platforms are supported?

We have been primarily developing and testing on the Xilinx C1100/U55C installed locally.

While our current release does not support AWS F1 Instances, an upcoming release will.

Flashing Device

First, make sure your FPGAs are set up correctly, then flash them using the xbflash2 utility.

For more details have a look at this repo.

Warpshell

Warpshell is the FPGA OS we use for developing all of our accelerators.

It offers a couple of very interesting features.

  1. Firewall — The firewall protects the FPGA from user input that may cause the FPGA to enter an undefined state and require a hard reset; this saves a lot of development time and delivers a more friendly experience.
  2. Run time configuration — Our FPGAs are able to load programs at run time allowing them to switch functionality. Programs are compiled into the form of a binary and loaded onto the FPGA.

Flash using the xbflash2 utility

First find the card BDF (Bus:Device.Function).

With the card plugged in, run:

sudo lspci -d 10ee:

If the card has an XRT compatible image loaded then you will see something like this:

01:00.0 Processing accelerators: Xilinx Corporation Device 5058
01:00.1 Processing accelerators: Xilinx Corporation Device 5059

Note the first function of each device (in this case 01:00.0), written in Bus:Device.Function notation.

There should be two functions for each device while using the Xilinx default image.
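If you want to pick the BDF out of the lspci output programmatically, a small parser is enough. The sketch below is not part of Blaze; the Bdf type and both function names are hypothetical. It assumes the short "bus:device.function" form shown above, where all three fields are hexadecimal, and it does not handle the longer form with a PCI domain prefix.

```rust
/// PCIe address in Bus:Device.Function form, as printed by lspci.
#[derive(Debug, PartialEq)]
struct Bdf {
    bus: u8,
    device: u8,
    function: u8,
}

/// Parse a string such as "01:00.0". All three fields are hexadecimal.
/// Returns None if the string does not have the expected shape.
fn parse_bdf(s: &str) -> Option<Bdf> {
    let (bus, rest) = s.trim().split_once(':')?;
    let (device, function) = rest.split_once('.')?;
    Some(Bdf {
        bus: u8::from_str_radix(bus, 16).ok()?,
        device: u8::from_str_radix(device, 16).ok()?,
        function: u8::from_str_radix(function, 16).ok()?,
    })
}

/// Extract the BDF from one line of lspci output (its first token).
fn bdf_from_lspci_line(line: &str) -> Option<Bdf> {
    parse_bdf(line.split_whitespace().next()?)
}
```

Applied to the first line of the output above, bdf_from_lspci_line yields bus 1, device 0, function 0, which corresponds to the 01:00.0 value passed to the flashing tool.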

Flash the C1100/U55C device to a warpshell image:

sudo ./Ingonyama_utils/xbflash2 program --spi --image \
    ./Ingonyama_utils/warpshell_xilinx_u55n_xdma_gen3x8_v2.mcs \
    --bar-offset 0x1F06000 -d <BDF>

You should see something like this:

Preparing to program flash on device: 01:00.0
Are you sure you wish to proceed? [Y/n]: y
Successfully opened /dev/xfpga/flash.m256.0
flashing via QSPI driver
Bitstream guard installed on flash @0x1002000
Extracting bitstream from MCS data:
..................
Extracted 18464340 bytes from bitstream @0x1002000
Writing bitstream to flash 0:
..................
Bitstream guard removed from flash
****************************************************
Cold reboot machine to load the new image on device.
****************************************************

Now power cycle the system and you’re in the warpshell ecosystem!

After the reboot, check lspci -d 10ee: again and you should see:

01:00.0 Processing accelerators: Xilinx Corporation Device 9038

Success!

Using Blaze

Adding Blaze to an existing Rust project

Adding Blaze to a Rust project is very simple. Using cargo, you can add it to your project:

cargo add ingo-blaze --git "https://github.com/ingonyama-zk/blaze.git"

You should see the following added to your Cargo.toml

ingo-blaze = { git = "https://github.com/ingonyama-zk/blaze.git"}

Now that we have Blaze installed we can go ahead and see how to use it.

Building with Blaze

Let’s go over the Blaze API. As an example, we will review some of the source code for our PoseidonClient. The PoseidonClient was created with the intention of accelerating ZK protocols that use Poseidon (for example, Scroll and Filecoin’s PC2 process). PoseidonClient takes input values (256 bits each) and returns a tree according to the initialization parameters; the input data for this tree can reach 374 GB.

After reading this you should feel comfortable using our drivers and building your own.

Defining a DriverClient

Let’s begin by creating our first DriverClient. A DriverClient is at the core of Blaze. It represents an FPGA instance, and the DriverClient exposes all methods required to read and write from the FPGA.

In the example below, we create a preconfigured (meaning the DriverConfig and driver have all been provided by Ingonyama for “out of the box” use) DriverClient and use Ingonyama’s DriverConfig::driver_client_c1100_cfg().

use ingo_blaze::driver_client::dclient::{DriverClient, DriverConfig};

let dclient = DriverClient::new(
    "0", DriverConfig::driver_client_c1100_cfg());

The first parameter states the card slot to be used, and this is important as your machine may have many FPGAs installed. The second parameter is the DriverConfig, which we shall cover next.

DriverConfig

The driver config is simply a JSON object representing address memory space for different components of an FPGA.

Usually, when using Ingonyama’s drivers you will just be calling DriverConfig::driver_client_c1100_cfg().

However, if you design your own hardware, you may need to create your own driver configuration. Creating your own custom DriverConfig is straightforward.

You first define a <config_file_name>.json and save it here:

{
...
"ctrl_baseaddr": "0x00000000",
"ctrl_cms_baseaddr": "0x04000000",
"ctrl_qspi_baseaddr": "0x04040000",
...
}

Currently, you must then add a method to DriverConfig that parses your parameters and generates an object.
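The parsing step mostly comes down to turning the "0x…" strings into numeric addresses. Below is a minimal sketch of that idea, written with the standard library only so that it stays self-contained. CustomConfig, parse_hex_addr, and from_pairs are hypothetical names, not Blaze's API; the sketch covers only the three fields visible in the JSON fragment above and assumes the key/value pairs have already been extracted from the JSON file.

```rust
/// Hypothetical subset of a driver configuration, holding only the three
/// base addresses that appear in the JSON fragment above.
#[derive(Debug, PartialEq)]
struct CustomConfig {
    ctrl_baseaddr: u64,
    ctrl_cms_baseaddr: u64,
    ctrl_qspi_baseaddr: u64,
}

/// Convert a "0x..." string from the config file into a number.
/// Returns None if the prefix is missing or the digits are not valid hex.
fn parse_hex_addr(s: &str) -> Option<u64> {
    let digits = s.strip_prefix("0x").or_else(|| s.strip_prefix("0X"))?;
    u64::from_str_radix(digits, 16).ok()
}

impl CustomConfig {
    /// Build the config from (key, value) pairs already extracted from JSON.
    /// Returns None if a key is missing or a value fails to parse.
    fn from_pairs(pairs: &[(&str, &str)]) -> Option<Self> {
        let get = |key: &str| {
            pairs
                .iter()
                .find(|(k, _)| *k == key)
                .and_then(|(_, v)| parse_hex_addr(v))
        };
        Some(CustomConfig {
            ctrl_baseaddr: get("ctrl_baseaddr")?,
            ctrl_cms_baseaddr: get("ctrl_cms_baseaddr")?,
            ctrl_qspi_baseaddr: get("ctrl_qspi_baseaddr")?,
        })
    }
}
```

Failing early on a missing or malformed address is a deliberate choice here: a wrong base address would otherwise only show up later as reads and writes landing in the wrong part of the card's address space.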

You can then create your driver like this:

let dclient = DriverClient::new(
"0", DriverConfig::driver_client_custom_cfg());

Defining a Primitive

Now that we have a DriverClient instance, we need to use it somewhere.

A Primitive is a wrapper around a DriverClient that should implement an API for using the driver.

The DriverPrimitive is a trait that must be implemented by any Primitive. If you are writing your own driver, for example, you must implement the DriverPrimitive which includes core functionality common across all drivers.
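To illustrate the pattern of a Primitive wrapping a client, here is a simplified sketch that runs without any hardware. It is not the real DriverPrimitive trait: the Primitive trait, MockClient, and ReverseClient are all hypothetical, and the mock stores bytes in memory instead of talking to an FPGA.

```rust
use std::cell::RefCell;

/// Hypothetical trait, loosely modelled on the description above.
/// The real DriverPrimitive trait in Blaze may look different.
trait Primitive {
    type Params;
    fn reset(&self) -> Result<(), String>;
    fn initialize(&self, params: Self::Params) -> Result<(), String>;
    fn set_data(&self, input: &[u8]) -> Result<(), String>;
    fn result(&self) -> Result<Vec<u8>, String>;
}

/// Mock device: records writes in memory instead of touching hardware.
#[derive(Default)]
struct MockClient {
    written: RefCell<Vec<u8>>,
}

/// Toy primitive whose "computation" is returning the bytes in reverse.
struct ReverseClient {
    dclient: MockClient,
}

impl Primitive for ReverseClient {
    type Params = ();

    /// Clear any data left over from a previous run.
    fn reset(&self) -> Result<(), String> {
        self.dclient.written.borrow_mut().clear();
        Ok(())
    }

    /// This toy primitive has no parameters, so initializing is a reset.
    fn initialize(&self, _params: Self::Params) -> Result<(), String> {
        self.reset()
    }

    /// "Write" the input to the device by appending it to the mock buffer.
    fn set_data(&self, input: &[u8]) -> Result<(), String> {
        self.dclient.written.borrow_mut().extend_from_slice(input);
        Ok(())
    }

    /// "Read" the result back from the device.
    fn result(&self) -> Result<Vec<u8>, String> {
        Ok(self.dclient.written.borrow().iter().rev().copied().collect())
    }
}
```

The point of the split is that application code depends only on the trait methods, while everything device-specific stays behind the client.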

Let’s have a look at the Poseidon hash driver.

// PoseidonClient is the Primitive
let poseidon = PoseidonClient::new(Hash::Poseidon, dclient);

PoseidonClient takes in the DriverClient (dclient) we defined above. PoseidonClient implements all the custom logic for the Poseidon hash and uses the DriverClient to interact with the FPGA. An example of this is the reset function on PoseidonClient:

fn reset(&self) -> Result<()> {
    self.dclient.set_dfx_decoupling(1)?;
    self.dclient.set_dfx_decoupling(0)?;
    sleep(Duration::from_millis(100));
    Ok(())
}

As you can see, it calls set_dfx_decoupling from the DriverClient.

Loading programs and initializing

With every version of a driver comes a .bin file; think of this file as the program you want to run on the FPGA.

Before using the driver you must first tell the FPGA what program you are going to run. This .bin file needs to be loaded.

Below we have an example of how we would go about loading a program.

// read the .bin file from a path
let buf = read_binary_file(&bin_file)?;
// set the FPGA into the correct state for loading a binary
poseidon.driver_client.setup_before_load_binary()?;
// load the binary
let ret = poseidon.driver_client.load_binary(buf.as_slice());

It’s important to note that we can load different programs “on the fly” programmatically, so you could use one FPGA for many operations just by reloading programs to it during runtime. Only a single program may be loaded at a time. If you wish to read more about this, the source code here is well commented and explains the entire process.

After we are done loading our program, we must initialize the FPGA. This step differs between drivers. In this article, we are covering the Poseidon driver as an example.

For the Poseidon driver, this step is very similar to load_binary, but instead we load a CSV file containing all the program instructions and set the FPGA to generate a specific tree type.

let params = PoseidonInitializeParameters {
    tree_height,                // how many layers are in the tree
    tree_mode: TreeMode::TreeC, // tree type
    instruction_path,           // path to CSV file
};

poseidon.initialize(params);

Let’s have a look inside the PoseidonClient initialize method.

fn initialize(&self, param: PoseidonInitializeParameters) -> Result<()> {
    self.reset()?; // reset the firewall and other stuff

    self.set_initialize_mode(true)?; // enter init mode

    // load the instructions from csv
    self.load_instructions(&param.instruction_path)
        .map_err(|_| DriverClientError::LoadFailed {
            path: param.instruction_path,
        })?;

    // exit init mode
    self.set_initialize_mode(false)?;

    // set some important parameters for the tree we are going
    // to generate
    self.set_merkle_tree_height(param.tree_height)?;
    self.set_tree_start_layer_for_tree(param.tree_mode)?;

    // init the Card Management Subsystem; this allows us to
    // monitor card temperature, for example
    self.dclient.initialize_cms()?;

    self.dclient.set_dma_firewall_prescale(0xFFFF)?;
    Ok(())
}

We are now ready to go and actually use Poseidon hash!

Reading/Writing to the FPGA

PoseidonClient implements both a read and write method for us.

fn set_data(&self, input: &[u8]) -> Result<()> {
    self.dclient
        .dma_write(self.dclient.cfg.dma_baseaddr, DMA_RW::OFFSET, input)?;
    Ok(())
}

(Write)

self.dclient.dma_read_into(
    self.dclient.cfg.dma_baseaddr,
    DMA_RW::OFFSET,
    result_buffer // buffer ptr to read data into
);

(Read)

We can read or write up to 1TB at a time in the case of PoseidonClient. However, reading with such sizes is not always optimal or necessary, depending on the application (reading and writing sizes are dependent on the driver’s design).

After receiving inputs, our PoseidonClient generates a tree and returns the elements of the tree in an unordered fashion. To avoid sorting a massive amount of data at once, we split the read / write / sort across multiple threads.

The way you decide to read and write from and to the FPGA depends on a lot of details such as the sizes of data you are dealing with, latency, RAM available on the machine, etc.

// Second thread: Receive and process the buffer pointers
while let Ok(buffer_ptr) = rx.recv() {
    poseidon.set_data(buffer_ptr.lock().unwrap().get_mut().as_mut_slice());
}
});

(An example of a thread writing data to the FPGA in chunks)
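The fragment above is only part of a larger program. To show the overall shape of chunked writing through a channel, here is a self-contained sketch that runs without an FPGA. The function write_in_chunks is hypothetical, and its write_chunk callback is a stand-in for a call such as set_data; the sketch copies each chunk into an owned buffer, which a real implementation handling very large inputs would likely avoid.

```rust
use std::sync::mpsc;
use std::thread;

/// Split `input` into chunks of `chunk_size` bytes (must be non-zero), send
/// them over a channel, and let a second thread "write" each chunk.
/// Returns the number of chunks the writer thread processed.
fn write_in_chunks<F>(input: &[u8], chunk_size: usize, mut write_chunk: F) -> usize
where
    F: FnMut(&[u8]) + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Writer thread: receive chunks until every sender has been dropped.
    let writer = thread::spawn(move || {
        let mut chunks_written = 0usize;
        while let Ok(chunk) = rx.recv() {
            write_chunk(&chunk);
            chunks_written += 1;
        }
        chunks_written
    });

    // Producer: hand over one owned chunk at a time, in order.
    for chunk in input.chunks(chunk_size) {
        tx.send(chunk.to_vec()).expect("writer thread stopped early");
    }
    drop(tx); // closing the channel ends the writer's loop

    writer.join().expect("writer thread panicked")
}
```

Because a single channel preserves send order, the writer sees the chunks in the same order the producer created them. The chunk size is the knob that trades host memory use against the number of DMA transfers.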

Debugging

We have implemented some debugging tools to allow a developer to gain insight into the state of the FPGA.

Logging

Each driver implements a log_api_values method; this method will log all current states of the addresses defined for this driver.

pub fn log_api_values(&self) {
    log::debug!("=== api values ===");
    for api_val in INGO_POSEIDON_ADDR::iter() {
        self.dclient
            .ctrl_read_u32(self.dclient.cfg.ctrl_baseaddr, api_val)
            .unwrap();
    }
    log::debug!("=== api values ===");
}

The output for the Poseidon driver looks like this:

ADDR_HIF2CPU_C_IMAGE_ID value: 1
ADDR_HIF2CPU_C_IMAGE_PARAMTERS value: 64
ADDR_CPU2HIF_C_MERKLE_TREE_HEIGHT value: 4
ADDR_CPU2HIF_C_MERKLE_TREE_START_LAYER value: 0
ADDR_CPU2HIF_C_INITIALIZATION_MODE value: 0
ADDR_HIF2CPU_C_NOF_ELEMENTS_PENDING_ON_DMA_FIFO value: 720
ADDR_HIF2CPU_C_NOF_RESULTS_PENDING_ON_DMA_FIFO value: 502
ADDR_HIF2CPU_C_MAX_RECORDED_PENDING_RESULTS value: 502
ADDR_HIF2CPU_C_NOF_CLOCKS_SPENT_ON_CURRENT_TASK_LO value: 7766236
ADDR_HIF2CPU_C_NOF_CLOCKS_SPENT_ON_CURRENT_TASK_HI value: 0
ADDR_HIF2CPU_C_LAST_HASH_ID_SENT_TO_RING value: 446
ADDR_HIF2CPU_C_LAST_ELEMENT_ID_SENT_TO_RING value: 1
ADDR_HIF2CPU_C_LAST_HASH_ID_SENT_TO_HOST value: 440
ADDR_HIF2CPU_C_LAST_LAYER_IDX_SENT_TO_HOST value: 0
ADDR_HIF2CPU_C_RING_NODE_ALMOST_FULL value: 0
ADDR_HIF2CPU_C_NOF_CLOCKS_PASSED_FROM_LAST_RING_TRANSMIT_LO value: 1020417
ADDR_HIF2CPU_C_NOF_CLOCKS_PASSED_FROM_LAST_RING_TRANSMIT_HI value: 0
ADDR_HIF2CPU_C_PROGRAM_MEMORY_INITIALIZATION_COUNTER value: 8224

Using ADDR_HIF2CPU_C_PROGRAM_MEMORY_INITIALIZATION_COUNTER, for example, you can check if you initialized your program correctly.

ADDR_HIF2CPU_C_NOF_ELEMENTS_PENDING_ON_DMA_FIFO value: 720
ADDR_HIF2CPU_C_NOF_RESULTS_PENDING_ON_DMA_FIFO value: 502
ADDR_HIF2CPU_C_MAX_RECORDED_PENDING_RESULTS value: 502

These values are extremely helpful in understanding if you are writing correctly to the DMA and if your data is being processed correctly.
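Some of the logged counters come as a pair of 32-bit registers with _LO and _HI suffixes. The sketch below shows one way to turn such a pair into a single number, assuming that _HI holds the upper 32 bits and _LO the lower 32 bits of one 64-bit counter; the source does not state this explicitly. Both function names are hypothetical, and the clock frequency is a parameter you would need to take from your own hardware design.

```rust
/// Combine the low and high 32-bit halves of a counter into one 64-bit value.
/// Assumes the _HI register holds the upper 32 bits and _LO the lower 32 bits.
fn combine_counter(lo: u32, hi: u32) -> u64 {
    ((hi as u64) << 32) | (lo as u64)
}

/// Convert a cycle count into seconds for a given clock frequency in Hz.
fn cycles_to_seconds(cycles: u64, clock_hz: u64) -> f64 {
    cycles as f64 / clock_hz as f64
}
```

With the values from the log above, combine_counter(7766236, 0) is simply 7766236 cycles, since the high half is zero.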

Temperature and power consumption monitoring

You can use the CMS to monitor your card’s temperature.

let (temp_inst, temp_avg, temp_max) =
poseidon_temp.dclient.monitor_temperature().unwrap();

Just make sure that you have enabled the CMS by calling self.dclient.initialize_cms()?; during the initialization phase.

Conclusion

With Blaze we aim to light a fire under the integration of FPGAs into a wide array of ZK projects.

We are excited to see what the community will build with Blaze, and we welcome you to contribute to the project on GitHub.

Follow Ingonyama

Twitter: https://twitter.com/Ingo_zk

YouTube: https://www.youtube.com/@ingo_zk

LinkedIn: https://www.linkedin.com/company/ingonyama

Join us: https://www.ingonyama.com/careers

Ingonyama

Ingonyama means Lion. We are a next-generation semiconductor company, designing accelerators for Zero Knowledge cryptography.