Simplifying full-stack FPGA development from RTL to software: 1st CLaaS on PYNQ! (Part 1)

Shrihari
Aug 5, 2022

Deploying Field Programmable Gate Arrays beyond classroom and research prototyping involves more than the RTL-to-bitstream flow: it means developing a complete FPGA-based system design on the hardware side and enabling the software stack to use that hardware in the “field”. While this may seem intimidating for beginners, 1st CLaaS for PYNQ largely simplifies it. Developing a complex FPGA-based system feels like the familiar RTL-to-bitstream approach, while the framework handles the complexities, at the expense of some predefined constraints (which can be modified later!).

1st CLaaS on PYNQ — a high level overview

The framework focuses on simplifying hardware design at every step. Besides supporting Verilog, it emphasizes TL-Verilog, a high-level HDL built on a higher level of abstraction called Transaction-Level Design. 1st CLaaS (Custom Logic as a Service) makes cloud FPGAs readily deployable for accelerating client-side software functions, without the need for extensive software expertise. 1st CLaaS on PYNQ brings the same model to local FPGA boards running PYNQ.

With both ends of the development cycle, design and software, simplified by TL-Verilog and 1st CLaaS, the main focus of the framework is to provide a quick automation flow for developing and prototyping hardware accelerators on Xilinx FPGAs running the PYNQ framework. The RPHAX (Rapid Prototyping of Hardware Accelerators for Xilinx FPGAs) framework not only automates IP, block design, and bitstream generation but also includes features such as support for high-level HDLs, integration with an external IDE, and a connection manager for a remote FPGA lab setup.

RPHAX Framework in a browser

While the above may sound complicated to a beginner, just follow along with this post if you have a hardware design in mind and want to develop a complete system design and write software that uses your hardware, all from within the browser or terminal!

With a single command, you can put your hardware design (say, an ML accelerator) behind a Python function. Under the hood, the framework packages your design into an AXI Stream (or AXI4 Lite) based IP, creates a Zynq-based block design, and delivers the bitstream.

Let us take a quick look at the usage to get you started with the framework.

Clone the repository

git clone https://github.com/shariethernet/RPHAX.git

cd RPHAX

git checkout dev_s

Requirements: Vivado 2018.2, Python 3.x

Example 1 — Inversion (AXI Stream IP)

Walkthrough

Now let us take a look at the inverter IP (inverter2.tlv). The image inversion IP inverts each pixel. For example, a pixel whose value is 1 becomes 255 - 1 = 254. Performing this operation on every pixel produces the negative of the image. To keep things simple, let us stick to the use model where we stream values into the kernel and read the stream of inverted values from its output.

RPHAX is invoked in generate mode. With the -b switch, the flow runs through to the bitstream stage; without it, the flow halts after block design generation. To get going with the framework, we will now generate the bitstream to be used in the Jupyter Notebook:

python rphax.py generate -b inverter2.tlv

(or)

python rphax.py generate -b -interface=axi_l inverter2.tlv

The command carries out the following steps:

  • Generation of RTL (Verilog) from the TL-Verilog source
  • Packaging of the RTL into an FPGA IP based on the starting template
  • Generation of the block design and top-level HDL wrapper
  • Generation of the bitstream for the PYNQ-Z2 development board

All runs are stored in the runs directory, and a unique sub-directory is created for each run.
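To pick up the outputs of the most recent run from a script, something like the following could be used. This is only a sketch: the helper name `latest_run_artifacts` is hypothetical, the `.bit` and `.hwh` extensions are assumed for the bitstream and hardware hand-off, and the newest run is chosen by modification time because the sub-directory naming scheme is not described here.

```python
from pathlib import Path


def latest_run_artifacts(runs_dir="runs"):
    """Return (bitstream, handoff) paths from the newest run sub-directory.

    Hypothetical helper, not part of RPHAX. It assumes every run lives in
    its own sub-directory of `runs_dir` and that the outputs end in .bit
    and .hwh. Either element is None when no matching file is found.
    """
    run_dirs = [p for p in Path(runs_dir).iterdir() if p.is_dir()]
    if not run_dirs:
        raise FileNotFoundError(f"no runs found in {str(runs_dir)!r}")

    # Newest run = most recently modified sub-directory.
    newest = max(run_dirs, key=lambda p: p.stat().st_mtime)

    # Search recursively, since the files may sit several levels down.
    bitstream = next(iter(sorted(newest.rglob("*.bit"))), None)
    handoff = next(iter(sorted(newest.rglob("*.hwh"))), None)
    return bitstream, handoff
```

Calling `latest_run_artifacts()` from the repository root would then return the two files to copy to the board.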

Armed with the bitstream and hardware hand-off, adding -pynq [URL] to the generate command uploads the bitstream and the hardware hand-off and opens a Jupyter Notebook template (if the URL is available).

The Jupyter Notebook containing the software is present here. Execute the Python code to see the inverted output values.
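For readers without the notebook at hand, the software side of an AXI Stream kernel on PYNQ typically looks like the sketch below. It is not the notebook's actual code. It assumes the block design exposes an AXI DMA named `axi_dma_0`, that the bitstream is called `inverter2.bit`, and that the installed PYNQ release provides `pynq.allocate`; all three are assumptions to adapt to your overlay. The DMA object and the allocator are passed in as arguments so the transfer logic can be tried off-board with stand-ins.

```python
def stream_through_kernel(dma, allocate, values):
    """Send `values` through a streaming kernel and return its output.

    `dma` is expected to behave like a PYNQ AXI DMA driver object
    (sendchannel/recvchannel with transfer() and wait()), and `allocate`
    like pynq.allocate. Both are passed in rather than imported here.
    """
    n = len(values)
    in_buf = allocate(shape=(n,), dtype="u4")   # 32-bit words in
    out_buf = allocate(shape=(n,), dtype="u4")  # 32-bit words out
    in_buf[:] = values

    # Arm the receive channel first so no output word is dropped.
    dma.recvchannel.transfer(out_buf)
    dma.sendchannel.transfer(in_buf)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return [int(v) for v in out_buf]


def run_on_board(bitfile="inverter2.bit", dma_name="axi_dma_0"):
    """Board-only entry point; both default names are assumptions."""
    from pynq import Overlay, allocate  # available on the PYNQ image

    overlay = Overlay(bitfile)          # loads the bitstream and hand-off
    dma = getattr(overlay, dma_name)
    return stream_through_kernel(dma, allocate, [0x01, 0x02, 0x03])
```

On the board, `run_on_board()` would be called from a notebook cell; the hardware hand-off file needs to sit next to the bitstream with the same base name for `Overlay` to find the DMA.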

Input Buffer (from the Jupyter Notebook)

The first value of the input buffer is 0x01

Output buffer

Output buffer: the first value is 0xfffffffe, whose low byte is 0xfe = 255 - 1 = 254
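The relation between the 8-bit arithmetic (255 - 1 = 254) and the 32-bit value in the output buffer can be checked with a small software model. The sketch assumes the kernel inverts every bit of a 32-bit stream word; that width is inferred from the output value above rather than stated by the framework, and the function names are illustrative.

```python
def invert_value(value, bits=8):
    """Invert `value` within a `bits`-wide word: (2**bits - 1) - value."""
    mask = (1 << bits) - 1
    return mask - (value & mask)


def invert_image(rows, bits=8):
    """Produce the negative of an image given as a list of pixel rows."""
    return [[invert_value(pixel, bits) for pixel in row] for row in rows]


pixel = invert_value(0x01)          # 8-bit pixel view
word = invert_value(0x01, bits=32)  # assumed 32-bit stream word view

print(pixel, hex(word), hex(word & 0xFF))
```

The low byte of the 32-bit result is the same 254 that the 8-bit pixel arithmetic gives, which is why both numbers describe the same operation.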

You can run the framework with different examples (more to be added) in the examples folder, or create your own kernel! Just remember to keep the filename and the top module name identical.

Seems simple? Let us look at what is happening under the hood.
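The naming rule is easy to check before starting a long run. The helper below is a hypothetical pre-flight check, not part of RPHAX: it assumes the source declares its top module with an explicit `module <name>` line and only tests whether some module in the file is named after the file.

```python
import re
from pathlib import Path

# Matches "module <name>" at the start of a line; "endmodule" does not match.
_MODULE_RE = re.compile(r"^\s*module\s+([A-Za-z_]\w*)", re.MULTILINE)


def top_module_matches_filename(path, source_text=None):
    """Return True if the file declares a module named like the file stem.

    `source_text` can be supplied directly; otherwise the file is read.
    """
    path = Path(path)
    text = source_text if source_text is not None else path.read_text()
    return path.stem in _MODULE_RE.findall(text)
```

For example, `top_module_matches_filename("inverter2.v")` is expected to return True only if that file contains a `module inverter2` declaration.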

Working

Once the generate phase begins, RPHAX validates the TL-V template and invokes SandPiper-SaaS to generate Verilog/SystemVerilog. This step is skipped if the design is already in Verilog, SystemVerilog, or VHDL. If the design uses another HDL or HLS, the template and the framework must be modified to enable RTL generation. After this step, the design is packaged into an FPGA IP based on the starting template.

Generated AXI Stream based IP

Once the design is packaged into an IP (AXI Stream/AXI4 Lite/AXI4), a Zynq-based block design is created using the configurations in the starting template. The starting template includes not only the RTL but also the configurations, connections, and parameters of the block design. The generated block design can be modified as required.

Generated ZYNQ based Block design. The red arrow points to the kernel (Inverter IP)

A top-level HDL wrapper is created for the block design and then the design is synthesized, placed and routed on the FPGA Fabric to generate the bitstream and the associated power, utilization and timing reports.

When the “-b” flag is included in the “rphax” command, the flow proceeds automatically up to bitstream generation. Otherwise, the automation flow ends at top-level HDL wrapper generation.
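The stage selection described in this section can be summarized as a small planning function. This is a paraphrase of the text, not RPHAX's implementation: the function name, the stage labels, and the set of file extensions treated as ready-made RTL are all assumptions.

```python
from pathlib import Path

# Extensions assumed to denote RTL that needs no TL-Verilog translation.
RTL_EXTENSIONS = {".v", ".sv", ".vhd", ".vhdl"}


def plan_stages(source, bitstream=False):
    """List the stages the flow would run for `source`.

    `bitstream=True` models the -b flag; without it the plan stops at
    top-level HDL wrapper generation.
    """
    ext = Path(source).suffix.lower()
    stages = []
    if ext == ".tlv":
        stages.append("TL-Verilog to Verilog (SandPiper-SaaS)")
    elif ext not in RTL_EXTENSIONS:
        # Other HDLs or HLS need changes to the template and framework.
        raise ValueError(f"no RTL generation step known for {ext!r}")

    stages += ["package IP", "create block design", "create HDL wrapper"]
    if bitstream:
        stages.append("synthesize, place and route, write bitstream")
    return stages


for stage in plan_stages("inverter2.tlv", bitstream=True):
    print(stage)
```

A plain Verilog source skips the first stage, and leaving out the bitstream option drops the last one.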

Post Implementation
Timing report
Power report
Hierarchy wise utilization report
Utilization report

If required, the run can be interrupted at any stage and a manual run can be done outside the framework.

Example 2— Adder (AXI4 Lite IP)

Walkthrough

Let us take a look at a relatively simple four-bit adder example (harness_axi.tlv). While the design remains simple, packaging it into a memory-mapped IP involves far more effort than packaging it into an AXI Stream based IP. However, the complexities are handled under the hood, and no difference is seen on the user end.

python rphax.py generate -b -interface axi_l harness_axi.tlv

The framework generates the bitstream and the hardware handoff file, no different from the previous example.

Working

You may have noticed the absence of signal bindings in the TLV template. Since AXI4 Lite is a memory-mapped interface that requires registers to be instantiated according to the inputs, the outputs, and their widths, the framework detects the input and output ports from the template and automatically generates a top-level wrapper in order to package the design into an AXI4 Lite IP. The framework also performs default address mappings. The user can configure custom address mappings in ip_create.tcl if required. In future versions of the framework, an input YAML file would supply such data and the TCL scripts would be generated dynamically.
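The port detection and default address mapping described above might look roughly like this. The sketch is not RPHAX's code: it assumes ANSI-style Verilog port declarations with numeric ranges, skips ports named `clk` and `reset`, and assigns one 32-bit register per port at consecutive 4-byte offsets starting from 0, which is an assumed layout. The port names in the example are illustrative.

```python
import re

# Matches e.g. "input clk", "input [3:0] a", "output wire [4:0] sum".
_PORT_RE = re.compile(
    r"\b(input|output)\s+(?:wire\s+|reg\s+|logic\s+)?"
    r"(?:\[(\d+)\s*:\s*(\d+)\]\s*)?([A-Za-z_]\w*)"
)


def extract_ports(verilog_text, skip=("clk", "reset")):
    """Return [(name, direction, width)] for every port not in `skip`."""
    ports = []
    for direction, msb, lsb, name in _PORT_RE.findall(verilog_text):
        if name in skip:
            continue
        width = abs(int(msb) - int(lsb)) + 1 if msb else 1
        ports.append((name, direction, width))
    return ports


def default_address_map(ports, stride=4):
    """Assumed layout: one register per port, `stride` bytes apart."""
    return {name: index * stride for index, (name, _, _) in enumerate(ports)}


header = "module harness_axi(input clk, input reset, input [3:0] a, input [3:0] b, output [4:0] sum);"
ports = extract_ports(header)
print(ports)
print(default_address_map(ports))
```

The resulting name-to-offset table is the kind of information a generated wrapper and the software both need to agree on.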

First, the flow takes a harness_axis.v file (for AXI Stream) or a harness_axi.v file (for AXI Lite) as input. The filename and module name are fixed, and the clk and reset signals must be the first and second ports. The script then generates the IP from the user design. To make an AXI Lite IP from this design, we have created an ip_create_axi.tcl script template (there is one such template per interface inside the src directory). Since each user design has a different number of input and output ports, the port names and their widths must be instantiated correctly in the IP template. RPHAX handles this by extracting the input and output ports along with their widths, instantiating the module, and generating the IP. Next, the bd_bitstream_axi.tcl script generates the block design as below.

Generated AXI4 Lite Adder Block Design

The automation script for AXI Lite supports up to eight 32-bit registers, excluding the clock and reset signals, so a design can have up to eight input/output ports.
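On the software side, a memory-mapped kernel like this adder is driven with register writes and reads. The sketch below assumes an object with `write(offset, value)` and `read(offset)` methods, which is how PYNQ exposes IP blocks in an overlay by default. The register offsets for `a`, `b`, and `sum` are hypothetical and must match the address mapping actually generated for your design.

```python
REGISTER_BYTES = 4   # each register is 32 bits wide
MAX_REGISTERS = 8    # limit stated for the AXI Lite automation script

# Hypothetical layout: one register per port, in declaration order.
ADDER_OFFSETS = {"a": 0x00, "b": 0x04, "sum": 0x08}


def check_offsets(offsets):
    """Reject offsets that are unaligned or outside the register window."""
    window = REGISTER_BYTES * MAX_REGISTERS
    for name, offset in offsets.items():
        if offset % REGISTER_BYTES or not 0 <= offset < window:
            raise ValueError(f"bad offset {offset:#x} for {name!r}")


def add_on_kernel(ip, a, b, offsets=ADDER_OFFSETS):
    """Write two four-bit operands and read back the sum register."""
    check_offsets(offsets)
    ip.write(offsets["a"], a & 0xF)
    ip.write(offsets["b"], b & 0xF)
    return ip.read(offsets["sum"])
```

On the board, `ip` would be the attribute of the loaded overlay that corresponds to the adder IP; its name depends on the generated block design.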

TLV/RTL to Bitstream via RPHAX

Other examples

  • Examples from Makerchip tutorials
  • Mandelbrot (In Progress)
  • WARP-V (In Progress)

With this, you can start using the framework to design your kernel, deploy it on an FPGA, and write the software to utilize your Custom Logic as a Service on PYNQ!

Upcoming Work

Some of the features in the framework are currently under development.

  • Remote FPGA Framework
  • WARP-V and/or Mandelbrot bring-up via RPHAX

Acknowledgements

This project is a part of Google Summer of Code 2022.

Organization: Free And Open Source Silicon Foundation.

Mentors

Details on PYNQ Overlays, Remote Access setup, and management will be covered in Part 2 of the blog post.


Shrihari

MS,PhD Student @ UT Austin | GSoC ’22 | Ex- Silicon Labs | RTL Design Engineer | Founder-Technowiz