Benchmarking Go code running on FPGAs

Go comes with out-of-the-box benchmarking support that we can use to benchmark kernels running on FPGAs.

Things you’ll need

What we’ll do

We will take the program from our second tutorial and add a new benchmarking command to it. Then, we’ll simulate the benchmark to verify it works. After that, we’ll build our kernel and run a real benchmark.

Downloading the example

First, let’s download the example kernel from GitHub.

$ git clone
$ cd examples

Next, set up a project using the histogram-array example.

$ cd histogram-array
$ reco projects create histogram-bench
$ reco projects set histogram-bench

Creating the bench-histogram command

If we look at the project, it already has a single command, test-histogram.

$ tree
.
├── cmd
│   └── test-histogram
│       └── main.go
└── main.go

We need to create a new command so that we can run the benchmark. Let’s call it bench-histogram.

$ mkdir cmd/bench-histogram

We’re going to use Go’s benchmark support, but outside of the normal go test machinery. To do this, we'll need to create a function the benchmark machinery can call multiple times. Add the code below to the file cmd/bench-histogram/main.go.

package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
	"testing"

	// xcl bindings from the Reconfigure.io SDK (path assumed)
	"github.com/ReconfigureIO/sdaccel/xcl"
)

const (
	// Number of histogram bins; assumed to match the kernel
	// from the histogram-array example
	HISTOGRAM_WIDTH = 512
)

func BenchmarkKernel(world xcl.World, B *testing.B) {
	// Get our program
	program := world.Import("kernel_test")
	defer program.Release()

	// Get our kernel
	krnl := program.GetKernel("reconfigure_io_sdaccel_builder_stub_0_1")
	defer krnl.Release()

	// For each iteration, we process 4 bytes.
	// We need to create an input of size B.N, so that the kernel
	// iterates B.N times
	input := make([]uint32, B.N)

	// Seed it with random values, bound to 0 - 2**16
	for i := range input {
		input[i] = uint32(uint16(rand.Uint32()))
	}

	// Create input buffer
	buff := world.Malloc(xcl.ReadOnly, uint(binary.Size(input)))
	defer buff.Free()

	// Create output buffer
	resp := make([]byte, 4*HISTOGRAM_WIDTH)
	outputBuff := world.Malloc(xcl.ReadWrite, uint(binary.Size(resp)))
	defer outputBuff.Free()

	// Write input buffer
	binary.Write(buff.Writer(), binary.LittleEndian, &input)

	// Clear output buffer
	binary.Write(outputBuff.Writer(), binary.LittleEndian, &resp)

	// Set args
	krnl.SetMemoryArg(0, buff)
	krnl.SetMemoryArg(1, outputBuff)
	krnl.SetArg(2, uint32(len(input)))

	// Report throughput: each iteration processes 4 bytes
	B.SetBytes(4)

	// Reset the timer so that we only measure runtime of the kernel
	B.ResetTimer()
	krnl.Run(1, 1, 1)
}

func main() {
	// Create the world
	world := xcl.NewWorld()
	defer world.Release()

	// Create a function that the benchmarking machinery can call
	f := func(B *testing.B) {
		BenchmarkKernel(world, B)
	}

	// Benchmark it
	result := testing.Benchmark(f)

	// Print the result
	fmt.Printf("%s\n", result.String())
}

You just created a benchmark similar to one go test would run for you, but with the setup needed to execute a kernel.
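If you haven’t called testing.Benchmark outside of go test before, here’s a minimal, self-contained sketch of the same pattern, with an ordinary function standing in for the kernel call:

```go
package main

import (
	"fmt"
	"testing"
)

func main() {
	// testing.Benchmark calls the function with increasing b.N
	// until it has a statistically stable measurement.
	result := testing.Benchmark(func(b *testing.B) {
		sum := 0
		for i := 0; i < b.N; i++ {
			sum += i // stand-in for the work being measured
		}
		_ = sum
	})

	// result.String() produces the familiar "N ... ns/op" line
	fmt.Println(result.String())
}
```

This is exactly what our main function does, except that our closure captures the xcl.World so the expensive setup happens once, outside the measured code.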

Testing it

Now we just need to test it, using reco.

$ reco test run bench-histogram
2017-10-27 15:03:02| preparing simulation
2017-10-27 15:03:03| done
2017-10-27 15:03:03| archiving
2017-10-27 15:03:03| done
2017-10-27 15:03:03| uploading ..
2017-10-27 15:03:04| done
2017-10-27 15:03:04| running simulation
2017-10-27 15:03:04|
2017-10-27 15:03:04| you can run "reco simulation log 5e5ac706-7bfb-469d-b18c-47feb902812b" to manually stream logs
2017-10-27 15:03:04| getting simulation details
2017-10-27 15:03:04| status: QUEUED
2017-10-27 15:03:04| this may take several minutes
2017-10-27 15:03:04| waiting for simulation to start
<build output>
INFO: [XOCC 60-586] Created /mnt/.reco-work/sdaccel/dist/xclbin/kernel_test.hw_emu.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin
INFO: [XOCC 60-791] Total elapsed time: 0h 1m 27s
INFO: [SDx-EM 01] Hardware emulation runs detailed simulation underneath. It may take long time for large data set. Please use a small dataset for faster execution. You can still get performance trend for your kernel with smaller dataset.
. 1 42092264059 ns/op 0.00 MB/s
. INFO: [SDx-EM 22] [Wall clock time: 19:56, Emulation time: 0.056326 ms] Data transfer between kernel(s) and global memory(s)
BANK0 RD = 0.004 KB WR = 2.000 KB
BANK1 RD = 0.000 KB WR = 0.000 KB
BANK2 RD = 0.000 KB WR = 0.000 KB
BANK3 RD = 0.000 KB WR = 0.000 KB

It’s easy to miss in the simulation debug output, but the following line is the output of the benchmark:

.        1     42092264059 ns/op          0.00 MB/s

That’s really bad, but also expected: we’re running a simulator underneath, which is a lot slower than actual hardware. In this case, the simulator’s startup time alone is long enough that the benchmark framework decides a single iteration is sufficient for a stable result.

Testing on actual hardware

Now we can create an actual build to benchmark our code on real hardware.

$ reco build run
2017-10-27 15:07:56| preparing build
2017-10-27 15:07:56| done. Build id: <build id>
2017-10-27 15:07:56| archiving
2017-10-27 15:07:57| done
2017-10-27 15:07:57| uploading ..
2017-10-27 15:07:57| done
2017-10-27 15:07:57|
2017-10-27 15:07:57| you can run "reco build log <build id>" to manually stream logs
2017-10-27 15:07:57| getting build details
2017-10-27 15:07:57| status: QUEUED
2017-10-27 15:07:57| this will take at least 4 hours
<lots of output>

We’ll need to wait for that to finish, which takes around 4 hours. Once it’s done, we can run it on an actual FPGA (substituting in your own build ID):

$ reco deployment run <build id> bench-histogram

After a few minutes, that should deploy and run.

2017-10-30 15:07:12| creating deployment .
2017-10-30 15:07:13| done. Deployment id: <deployment id>
2017-10-30 15:07:13|
2017-10-30 15:07:13| you can run "reco deployment log <deployment id>" to manually stream logs
2017-10-30 15:07:13| getting deployment details
2017-10-30 15:07:13| status: WAITING
2017-10-30 15:07:13| this may take several minutes
2017-10-30 15:07:13| waiting for deployment to start
<lots of output>
xclProbe found 1 FPGA slots with XDMA driver running
Device/Slot[0] (/dev/xdma0, 0:0:1d.0)
. 10000000 90 ns/op 44.41 MB/s

These numbers are a lot better than the simulation, and tell us that for each iteration of our loop, we’re spending 90 nanoseconds to process an element. That’s a lot slower than you’d expect from a CPU, but that’s normal: FPGAs generally run at clock speeds more than 10 times slower than CPUs, and make up for it with parallel processing. We’re using a tiny portion of the FPGA right now, instead of its full processing capability. In later tutorials, we’ll explore how to scale out your program to take advantage of more of the FPGA.
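As a sanity check, the two figures in that output line agree with each other: each iteration moves one 4-byte element, and Go’s benchmark framework derives MB/s (defined as 1e6 bytes per second) from the bytes-per-iteration value reported via B.SetBytes. A quick worked example:

```go
package main

import "fmt"

func main() {
	// Each benchmark iteration processes one uint32 (4 bytes)
	const bytesPerOp = 4.0

	// Measured time per iteration from the deployment output
	const nsPerOp = 90.0

	// bytes per second, then converted to MB/s (1 MB = 1e6 bytes)
	mbPerSec := bytesPerOp / (nsPerOp * 1e-9) / 1e6

	// Close to the reported 44.41 MB/s; the framework uses the
	// unrounded per-iteration time, so the last digits differ.
	fmt.Printf("%.2f MB/s\n", mbPerSec)
}
```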

You can use this approach when developing your own kernels. Try it out, and let us know how it goes on our forum.
