Developing applications for Cloud FPGAs is easier than you think

Vineet Jain
10 min read · Aug 23, 2021


Developing a many-core accelerator using the 1st CLaaS framework

Hi everyone,

I am Vineet Jain, a final year Electronics and Communication undergrad at LNMIIT, Jaipur. I am penning this blog to describe my journey in open source and contributions made over this summer as a Google Summer of Code participant under the Free and Open Source Silicon Foundation with the project titled WARP-V Many-Core in the Cloud.

Abstract

WARP-V is a highly configurable, adaptable, and flexible open-source CPU core generator. It supports various ISAs, such as MIPS and RISC-V, and can even be configured with a custom one. It can be tuned to your requirements, from a low-power, low-frequency, single-cycle FPGA implementation to a high-frequency 6-cycle ASIC design. It is designed using the emerging “transaction-level” methodology, taking advantage of the advanced digital design features and timing-abstract nature of TL-Verilog, which provides a far better abstraction than traditionally used HDLs like Verilog and VHDL.
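To give a flavor of that abstraction, here is a minimal TL-Verilog pipeline (an illustrative sketch, not WARP-V code): logic is simply placed into numbered pipeline stages, the flip-flops between stages are implied, and retiming the design is just a matter of moving lines between @ stages.

\TLV
   |calc
      @1
         // Squares computed in stage 1 ($aa and $bb are left undriven here;
         // Makerchip would feed them random stimulus).
         $aa_sq[31:0] = $aa[15:0] * $aa[15:0];
         $bb_sq[31:0] = $bb[15:0] * $bb[15:0];
      @2
         // Sum computed one cycle later; the staging flops are implied.
         $cc_sq[32:0] = $aa_sq + $bb_sq;

Moving the @2 line up into @1 would turn this into single-cycle logic without touching the expressions themselves.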

After CPUs and GPUs, FPGAs are the next wave of cloud computing, opening up new and exciting computation models. With their highly configurable logic blocks and inherent parallelism, they help accelerate compute-intensive cloud workloads. But all this comes with a tradeoff: FPGAs are expensive and not freely available to everyone.

1st CLaaS is an emerging generic framework that helps eliminate these up-front costs by supporting the development of custom hardware accelerators and their integration with web applications and cloud infrastructure, bringing FPGA acceleration within everyone's reach.

About Project

This project aims to harness the advantages brought by WARP-V and the 1st CLaaS framework by adding support for a many-core NoC design, building a custom kernel interface, and accelerating the complex computation of web applications by deploying it in the cloud.

This project aims not only to provide a highly configurable many-core hardware accelerator in the cloud, but also to drive the industry toward FPGA-accelerated web applications and cloud computing, while demonstrating the flexibility of TL-Verilog to motivate the industry toward a better design methodology.

The coding period begins -

Phase-1: 💻

Debugging the many-core design and building the visualization for the insertion ring

The first task was to fix the many-core and insertion-ring hardware itself: after a few test instructions, we found that packets were not contiguous and could not be pulled out of the ingress buffers.

🔷 Fixed the issue with the ingress buffer

The ring interconnect itself was functioning properly for a small set of stimuli, and the flits were being stored in the ingress FIFO at the destination end. But the core was reading the flits speculatively every time, which caused flits to be dropped if the read occurred on a bad path (in the shadow of a replay). To solve this, logic was added to force a replay of the packet reads so that they always complete on the good path, making them non-speculative in nature. Although this approach replays every flit in a packet, it is more robust and correct.
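Conceptually, the change gates the destructive ingress-FIFO read so that a flit is consumed only on the good path; a rough sketch of that condition (the signal names here are hypothetical, not WARP-V's actual ones):

\TLV
   |core
      @1
         // Pop the ingress FIFO only on the (replayed) good-path read, never
         // on a speculative bad-path read, so a flit can never be lost in the
         // shadow of a replay.
         $pop_ingress = $ingress_rd && $good_path;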

🔷 Visualization for the Memory banks and support for Multiple Cores

Initially, the WARP-V visual diagram (VIZ) showed only a single memory bank, but RISC-V also supports load/store instructions that operate on bytes (8 bits) rather than a complete word (32 bits). Hence, we decided to extend the VIZ to support multiple memory banks. Another student, Mayank Kabra, did a fantastic job incorporating cool animations into the diagram, giving it a real feel. Alongside him, I added support for multiple cores (aligned vertically) as well.

WARP-V Visualization (with insertion ring on the side)

Have a look at this Sandbox link -

🔷 Architectural changes and Visualization for Insertion Ring

The insertion ring behaved correctly for small inputs, but for large inputs packets were getting lost and taking a different path, losing their contiguity and order. Since the design was fairly complex and interconnected, it was difficult to pull up the waveforms every time to see the behavior of a particular module and its impact on the others. The easiest way we found to tackle this was the Visual Debug feature of Makerchip, which helps in debugging by abstracting the design at a high visual level, giving an idea of the overall functionality of the separate pipelines.

This was a challenging task, as monitoring and managing each and every packet in a rather complex design is quite difficult. Thankfully, TL-Verilog's wildcard construct $ANY pulls signals from where they are assigned to the stage where they are consumed and stitches the necessary flip-flops in between. So, in this case, a unique ID (UID) field is attached to the flits coming out of a core into the ring (through the egress port), and at any point a packet can be referenced by its UID field to animate it along its path.
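A rough sketch of the idea (the pipelines, scopes, and signal names below are illustrative, not the actual ring code):

\TLV
   |egress
      @1
         /flit
            // Tag each flit leaving the core with a unique ID; *cyc_cnt is
            // the free-running cycle counter provided by the Makerchip shell,
            // used here simply as a convenient unique value.
            $uid[31:0] = *cyc_cnt;
   |ring
      @1
         /flit
            // $ANY pulls every field that is consumed here (or later) but
            // assigned on the egress flit, e.g. $uid, stitching the staging
            // flip-flops in between automatically.
            $ANY = /top|egress/flit>>1$ANY;
            // VIZ can then reference $uid at any node to animate the packet.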

This took a bit longer than it should have, but we finally got a visualization that helps in fixing bugs.

Blue means the data is “valid”, and red means “blocked” (not ready) is asserted from upstream.
In the rectangular boxes, pink means a “header flit”, green a “normal flit”, and blue a “tail flit”.

And finally, we found the issues:

a. Packets were always taking the deflected path (i.e., going up into the FIFO). It should be the other way around: flits should follow the horizontal path until this particular core is inserting onto the ring (only in that case should they take the upper path).

b. Header-flit generation at the egress port was not consistent: the field order was {src, vc, dest} when it should have been {vc, src, dest}. Because of this, packets were received on a different VC (virtual channel), causing the issue.
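For (b), the fix is conceptually just a reordered concatenation when the header flit is assembled at the egress port (the field widths and names below are hypothetical):

\TLV
   |egress
      @1
         // Corrected field order: VC first, then source, then destination.
         $header_flit[11:0] = {$vc[3:0], $src[3:0], $dest[3:0]};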

To solve the above problems, a slight modification was made to the architecture, as shown in the diagram below.

https://docs.google.com/drawings/u/1/d/1VS_oaNYgT3p4b64nGSAjs5FKSvs_-s8OTfTY4gQjX6o/preview
Insertion Ring Diagram

One can spend any amount of time making attractive, fancy visualizations, but the whole purpose of Visual Debug is to help debug and visualize the logic in as little time as possible.

New Visual Debug

Hence, we decided to make the VIZ a bit more abstract and easier to build, and also added different colors for the multiple VCs. The red dots in the image above represent the nodes of the insertion-ring VIZ. Although it may not look fancy, it was built in under 30 minutes (no joke).

Phase-2: 💻

Building the many-core kernel to be deployed on cloud FPGAs using AWS F1

After checking the functionality of the many-core version of WARP-V and debugging it, it was time to take the whole thing to the cloud (step by step).

🔷 Modification to WARP-V Kernel

As things get older, they begin to rust; the same happened to the previously built WARP-V kernel. The issue was older function and macro definitions that had been changed or moved elsewhere as WARP-V evolved in the meantime. So the task was to fix that and bring the WARP-V kernel back to a working state. To do this, we used the 1st CLaaS default template, which adapts the kernel's I/O (an AXI-stream interface) and supports development both in Makerchip and in the 1st CLaaS local environment.
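For context, a 1st CLaaS kernel is essentially a block that consumes one 512-bit AXI stream and produces another. A highly simplified TL-Verilog sketch of that shape (the signal names and the trivial "computation" are illustrative approximations, not the exact template interface):

\TLV
   |kernel
      @1
         $reset = *reset;
         // Accept an input word whenever one is available.
         $in_ready = ! $reset;
         $in_xfer = $in_avail && $in_ready;
         // Placeholder transformation of the 512-bit stream word.
         $out_data[511:0] = $in_data[511:0] + 512'b1;
         $out_avail = $in_xfer;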

Developing the web app and the whole infrastructure in the 1st CLaaS local environment is quite easy and simple. The images below show the profiling report and waveform generated by the Xilinx tools.

WARP-V Application timeline in hardware-emulation mode
WARP-V Real-time Simulation Waveform in hardware-emulation mode

🔷 Building support for the many-core kernel

After getting WARP-V up and working, it was time to move to the many-core version. Here we faced a few challenges. First, the memory was a simple one supporting only one write and one read operation, while we needed a mechanism for all the cores to get their instructions from the web client. Loading a separate memory for each core would be inconvenient from the user's point of view, so we decided to have a single global memory from which all the cores fetch their data. In this global memory, the number of read ports equals the number of cores, so each core can read independently.
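The read side of that arrangement looks roughly like the sketch below: the memory array itself (and its single write port, fed by the host) lives on the SV side of the kernel, while each replicated core drives its own read port. Names and widths are hypothetical:

\TLV
   /core[4:0]
      |fetch
         @1
            // Each core presents its own read address to the shared global
            // memory, which has one read port per core.
            $imem_rd_en = ! $reset;
            $imem_rd_addr[7:0] = $pc[9:2];   // word address derived from the PC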

Second, the kernel communicates with the host application and the shell via a single AXI-stream interface, so we decided to attach Core-0 to the I/O side of the kernel and route all communication through Core-0 (as it was the easiest approach to take).
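In other words, only the core with index 0 is wired to the kernel's stream ports; a trivial sketch of that selection (hypothetical names again):

\TLV
   /core[4:0]
      |io
         @1
            // Core 0 talks to the kernel's AXI-stream I/O directly; all other
            // cores reach the host indirectly through the ring via core 0.
            $is_io_core = (#core == 0);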

After a few modifications and fixes, we managed to get the many-core kernel working in the local development environment. The video below shows the process of running the “Hello Core” program on 5 cores, in which each core sends its index number to Core-0, which adds them all up and sends the result back to the web client.

Demo of the many-core kernel in the 1st CLaaS (local) environment

Have a look at this sandbox link (manycore in Makerchip)

🔷 Using AWS F1 Instance to test for correctness on Actual Hardware

After the previous step, the next task was to take the design to an actual FPGA and use the Xilinx tools to optimize it. In the process, a few changes to the scripts had to be made:

a. Added a Reset button to the web client, which stops and releases the OpenCL function calls and exits the host application. The related host and server code was changed accordingly.

b. Updated sandpiper-saas to the latest version and changed the Python version and a few links to resolve dependency issues.

c. Fixed a bug in the Terraform script that led to an AWS credential failure on the F1 instance (it was an extra space, xD).

d. Added a few debug packages and an sdaccel.ini file to override defaults and generate the profiling report and waveform.

e. Used a Xilinx Parameterized Macro (XPM) BRAM as the memory, because earlier designs took a long time to synthesize since the tools could not infer the memory properly.

After these steps, I was able to get the Mandelbrot and vector-addition applications up and running in the cloud (on an F1 instance).

Let’s see Mandelbrot in action -
(I have already gone through synthesis, implementation, and bitstream generation, so I will load it using the PREBUILT flow.)

Mandelbrot Demo Application of F1 Instance

Also the Vector Addition example -

Vector Add Demo on F1 instance

Let's go through the steps to build your custom kernel

Step-1: go to the 1st-CLaaS repo and clone it (you might want to fork it first).

git clone https://github.com/stevehoover/1st-CLaaS.git
cd 1st-CLaaS

Step-2: run the init script

./init

Step-3: Copy the Vector addition example application

cd apps/vadd/build
make copy_app APP_NAME=toy

Step-4: Modify kernel (and web client if you want)

cd <repo>/apps/toy/fpga/src
vim toy_kernel.tlv # Modify this

Step-5: Run the Application

cd <repo>/apps/toy/build
make launch

Open http://localhost:8000/index.html and see your hardware in action.

AND DON'T FORGET TO STOP YOUR INSTANCE when you are working with an AWS instance.

For more info about different modes and running over F1 instances refer to the Getting Started Guide, Developers Guide, and Getting Started with AWS and F1 Guide on the 1st-CLaaS repo.

Learning along the way -

  1. The most important thing I learned was to manage my time and use resources well, as TIME = MONEY (the AWS instances taught me this).
  2. The importance of planning and organizing tasks before executing them. When working on a big project, it becomes crucial that every task is properly broken down and planned before the actual work begins; proper structure increases efficiency.
  3. Improved my collaboration and communication skills.
  4. Sharpened my Git skills and my writing skills.

Challenges faced —

  1. Faced some AWS platform issues regarding permissions, setting up the AMI, and getting Remmina to work, and the instances get quite expensive: F1 ($1.65/hour) and c4.2xlarge ($0.85/hour).
  2. Although the Xilinx tool stack is well maintained, we faced some quite annoying dependency issues. The tools are not backward compatible, and we hit a few issues setting up the Xilinx SDAccel 2018.3 environment.
  3. Even after 2 weeks of work, I wasn't able to run the many-core design on the actual FPGA (i.e., in HW mode). It took me quite a while to understand that the congestion (~7) was occurring because the memory was not being inferred as BRAMs/RAMs.
  4. Developing VIZ for a big model like the many-core in Makerchip takes quite some time to compile (until a debug mode arrives in Makerchip).

Future enhancements:🚀

  1. Fix the remaining issue with Manycore not behaving properly on Real FPGA hardware.
  2. Add support for individual VIZ for TLV library components.
  3. Integrate the 1st-CLaaS version of warp-v and manycore kernel into the warp-v.org configurator.

Links to my Commits

Here are links-

WARP-V repo: https://github.com/stevehoover/warp-v/pull/85
https://github.com/stevehoover/warp-v/commits?author=vineetjain07

1st-CLaaS repo: https://github.com/stevehoover/1st-CLaaS/pull/63
https://github.com/stevehoover/1st-CLaaS/commits?author=vineetjain07

Final Words:

In the end, I just want to thank my mentors Shivam Potdar, Steve Hoover, and Ákos Hadnagy for their enormous help in achieving this goal and for constantly motivating me to think ahead. I am extremely happy to have had them.
I would also like to congratulate my fellow GSoC participants (Ninad Jangle, Bala Dhinesh, Nitin Mishra) and their mentors for achieving mind-blowing results in their TL-Verilog projects (do read their blogs).
Special thanks to Mayank Kabra and Shivani Shah for helping throughout the project.
