Neural Network on the Tip of a Pencil

Mapping 2-Dimensional Algorithms to Hardware with FPGAs

Daniel Hensley
Edge Analytics


Written by Daniel Hensley, Blayne Kettlewell, Lina A. Colucci, and Sidney Primas at Edge Analytics

CPUs compute along one dimension: sequentially in time. Algorithms are broken down into instructions that are always loaded and executed one after another. In the future, computation in two dimensions (2D) will be the norm via hardware (HW) accelerators that support parallel execution over space. This will unify the exploitation of algorithm and HW structure for faster and more efficient solutions.

We’ve seen this trend partially realized with the rise of GPUs that support 2D computation. This has led to huge performance increases in many “embarrassingly” parallel applications.

[Left] A GPU is optimal for embarrassingly parallel image-processing algorithms. The video is smooth. [Right] The serial limitations of the CPU result in noticeable delays and poor performance. Source for the videos

However, GPUs only accelerate specific algorithms in specific situations. In the future, interconnected CPUs and various HW accelerators will allow seamless hyper-optimization. FPGAs, which allow for extreme customization of 2D computation by programming the hardware fabric, will be a major part of this future. For a deeper dive into the new era of HW/SW co-design, we highly recommend this talk by Chris Lattner.

[Left] Depiction of how a CPU solves a problem in 1D with serial execution of instructions one at a time. [Right] Depiction of how HW accelerators such as GPUs, FPGAs, and ASICs can map computation in 2D with parallel computation in space and over time.

In this blog, we show how to map a concrete neural network for sleep tracking onto an FPGA. More importantly, we demonstrate key tools necessary to map hardware to an algorithm today and discuss how we’ll get to seamless heterogeneous compute tomorrow.

Sleep Tracker: Neural Network on the Tip of a Pencil

We made a wearable FPGA-based sleep tracker. In the process, we built a pipeline that allows us to map a neural network originally described in Python (Keras) to silicon fabric (FPGA). The whole sleep tracker, from data acquisition to the neural network predictions, runs entirely on a tiny FPGA with no processor in the loop.

We deployed 3-layer feedforward neural networks (left) on an FPGA that is smaller than the tip of a pencil (middle, circled in green) and found in the iCE40 Ultra Wearable Development Platform (right).

As you can see in the video, the user’s sleep state is classified directly on the device.

We leveraged a peer-reviewed implementation of algorithms developed by the University of Michigan (Walch et al., Sleep, 2019); this is the first open-source sleep dataset and corresponding algorithm repository of its kind.

We validated our FPGA neural network (NN) core against labeled data from this project. The neural network we used is a multilayer perceptron that takes accelerometry, heart rate, and circadian rhythm data as an input, and predicts wake, REM sleep, and non-REM sleep with 91.3% overall accuracy.
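
To make the shape of such a classifier concrete, here is a minimal pure-Python sketch of a forward pass through a small multilayer perceptron. It assumes one scalar per input feature, two hidden layers of eight units, ReLU activations, and random placeholder weights. None of these details come from the original model, and this is not the trained network.

```python
import math
import random

CLASSES = ["wake", "non-REM", "REM"]


def relu(v):
    return [max(0.0, x) for x in v]


def dense(x, weights, biases):
    # weights holds one row per output neuron.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def init_layer(n_in, n_out, rng):
    # Placeholder weights; a real deployment would load trained values.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, layers):
    x = features
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if i < len(layers) - 1:  # no ReLU on the output layer
            x = relu(x)
    probs = softmax(x)
    return CLASSES[probs.index(max(probs))], probs


rng = random.Random(0)
# Assumed shape: 3 inputs -> 8 -> 8 -> 3 classes (three dense layers).
layers = [init_layer(3, 8, rng), init_layer(8, 8, rng), init_layer(8, 3, rng)]
# Hypothetical normalized inputs: accelerometry, heart rate, circadian phase.
label, probs = predict([0.2, 0.7, 0.5], layers)
print(label, probs)
```

On an FPGA the same multiply-accumulate structure would typically run in fixed-point arithmetic rather than floating point.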

In this blog post, we introduce our open-source version of these Python algorithms deployed on wearable FPGA hardware. You can see the full technical details and source code here.

High-level architecture diagram of our FPGA sleep tracker. On the front end, an STM IMU chip is the data source connected to our FPGA. On the back end, we can connect a host computer to read off data and issue commands to the sleep tracker. We developed our own sampler, featurizer, NN core, and UART core from scratch in SystemVerilog. This allowed us to realize an efficient solution deployable on tiny FPGAs.

The parametrically defined FPGA NN core we built is vendor-independent and reusable beyond this application: our pipeline allows model shapes and parameters to be updated easily, within certain constraints.

The Future of 2D Algorithms on Adaptable Accelerators

Deploying adaptable accelerators such as FPGAs is high-friction and time-consuming today. We’ll describe three aspects of FPGA development in terms of what we did today and how it will improve in the future.

2D algorithms will be described at a high level (e.g., Python) and automatically deployed

Writing FPGA code is an arcane task that requires different expertise than what is typical for data scientists and most software engineers. This can be a barrier for teams that would otherwise greatly benefit from 2D FPGA-based acceleration. The ability to describe FPGA-targeted algorithms in familiar high-level languages such as Python is critical to democratizing FPGA use. FPGA experts will also benefit from the major efficiency gains with this infrastructure.

To deploy a new sleep tracker network in our application, a user only needs to run a script and lightly modify a couple of files. No hardware knowledge is required and there is no need to write new SystemVerilog code.

Our NN deploy pipeline enables non-FPGA experts to train new models and deploy to FPGAs. The steps of the pipeline include training and exporting a Keras model, using a script to parse the output model into data the FPGA compiler toolchain expects, updating model constants in the top-level FPGA project file, and re-compiling the FPGA bitstream.
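
The "parse the output model" step can be sketched as follows. This is a minimal illustration, not the script from our repository: it assumes a signed 16-bit fixed-point format with 8 fractional bits and one hex word per line (a layout that SystemVerilog's `$readmemh` can load). The function names `to_fixed` and `to_hex_lines` are hypothetical.

```python
def to_fixed(value, frac_bits=8, width=16):
    """Quantize a float to signed two's-complement fixed point, saturating."""
    scaled = int(round(value * (1 << frac_bits)))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    clamped = max(lo, min(hi, scaled))
    # Mask to `width` bits so negative values become two's complement.
    return clamped & ((1 << width) - 1)


def to_hex_lines(weights, frac_bits=8, width=16):
    """Flatten a 2D weight matrix row by row into fixed-width hex strings."""
    digits = (width + 3) // 4
    return [format(to_fixed(w, frac_bits, width), f"0{digits}X")
            for row in weights for w in row]


# Hypothetical 2x2 weight matrix standing in for an exported Keras layer.
weights = [[0.5, -0.25], [1.0, -1.0]]
lines = to_hex_lines(weights)
print("\n".join(lines))
```

Saturating instead of wrapping on overflow keeps an out-of-range weight from silently flipping sign when it is quantized.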

This works because we only allow a highly constrained set of models. More general High-Level Synthesis (HLS) tools such as Xilinx’s Vitis HLS and Google’s XLS will, in the future, allow users to provide generic, high-level descriptions of algorithms they want deployed to adaptable accelerators.

2D algorithms will be efficiently tested and debugged in languages like Python

Simulation, validation, and debugging are critical parts of the design process for FPGA applications. These processes will also see major improvements from high-level interfaces and tools.

Diagram of the top-level test bench for our sleep tracker application. The ability to use Cocotb and Python is a boon for testing, validating, and debugging FPGA designs. The convenience of Python async/await syntax and the ease of mocking subcomponents/importing test data in Python greatly accelerated our work.

There is great progress already. For example, although we wrote all of our components directly in SystemVerilog, we used Cocotb for all of our off-device validation and test benches, for each module and for the sleep application as a whole, without ever leaving Python. With Cocotb, we get the cycle-accurate simulation that FPGA validation depends on while staying in the Python ecosystem that is so productive for developers.
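
The async/await and mocking style mentioned above can be illustrated without a simulator. The sketch below uses plain `asyncio` rather than Cocotb itself, since a Cocotb test needs an HDL simulator to run. The `MockImu` class and the `sampler` coroutine are hypothetical stand-ins, not our actual modules. In a real Cocotb test bench, the awaits would be on simulator triggers such as `RisingEdge` instead of `asyncio.sleep`.

```python
import asyncio


class MockImu:
    """Hypothetical stand-in for the IMU: replays recorded samples."""

    def __init__(self, samples):
        self._samples = list(samples)

    async def read(self):
        # Yield control once, standing in for waiting on a clock edge.
        await asyncio.sleep(0)
        return self._samples.pop(0) if self._samples else None


async def sampler(imu, window):
    """Hypothetical unit under test: average each full window of samples."""
    out, buf = [], []
    while (sample := await imu.read()) is not None:
        buf.append(sample)
        if len(buf) == window:
            out.append(sum(buf) / window)
            buf.clear()
    return out


async def test_sampler():
    imu = MockImu([1, 3, 5, 7])  # imported test data would go here
    result = await sampler(imu, window=2)
    assert result == [2.0, 6.0]
    return result


result = asyncio.run(test_sampler())
```

Because the mock exposes the same awaitable interface as the component it replaces, the unit under test does not need to know whether it is talking to recorded data or to a simulated device.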

Rust will be the glue that holds heterogeneous systems together

A common scenario for embedded engineers is hooking up communication between a HW accelerator, such as an FPGA, and a host CPU. This work is notoriously tedious and buggy.

We built our FPGA sleep app device driver and higher-level Session API in Rust. The driver implements our custom packet protocol, and we used the Session API to create various programs that interact with the FPGA sleep app. Rust is a great solution because its type system and static checks make it much easier to write safe low-level code and ergonomic higher-level APIs. The second half of this talk describes some of these features in detail.

We believe Rust is the best choice to glue together heterogeneous compute systems. In this role, Rust will provide safety in low-level communications, reduce driver fragility, and provide ergonomic APIs for algorithms to communicate across HW boundaries.

A vision for using Rust as CPU <-> FPGA glue, including code generation facilities for common classes of I/O.

We published our open source repository with additional technical details here.

The FPGA work here was certainly a team effort by the Edge Analytics team! A major thank you to Blayne Kettlewell, Andrew Weitz, and Vasiliy Nerozin for all their help building the tools and code behind this work.

Edge Analytics is a company that specializes in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at