Scaling up your applications

Rosie Yohannan
Jun 4, 2018 · 7 min read

We are building the next generation of developer tools, providing a highly usable service to deploy your applications using cloud-based hardware acceleration technology. Our public beta went live towards the end of 2017, and we now have hundreds of users developing applications in Go for deployment to FPGAs in the AWS cloud.

FPGAs provide an opportunity to introduce massive parallelism into applications to achieve breakthrough performance improvements in the order of 10–100x over using traditional CPUs alone. We’re focussed on improving our tooling to make it quick and easy for our customers to introduce or increase parallelism in their applications to realise these gains.

Why scale up?

We introduce parallelism into our FPGA designs by writing concurrent programs in Go, so that as far as possible work is broken down into lightweight processes that can happen at the same time. Through our compilation process, individual processes become individual blocks of circuitry on the FPGA, which can run concurrently.

Coding a large design with a high degree of parallelism by hand isn’t really feasible, due to the large amounts of boilerplate and the scope for errors to creep in, so our engineers are developing code generation and application scaling tools specifically for use with our service. The first release in this series is our MapReduce framework.

Why MapReduce?

One area in which low latency and high throughput are critical is finance, specifically in low latency trading and risk analysis. Our development team wanted to provide an example finance application for our users and settled on a Monte-Carlo model, which is a technique used across many disciplines to understand the impact of risk and uncertainty. The best fit for scaling the Monte-Carlo example was a MapReduce framework so we increased the scope of the project to not only show worth in the finance world, but also to provide our users with a new way to scale their applications.

Our Framework

MapReduce is a framework for processing problems with the potential for parallelism across large datasets using a number of nodes. This usually means using large numbers of computers in a network cluster or spread out geographically in a grid, but in our context the nodes are individual elements of circuitry on the same FPGA or, in the future, across multiple FPGAs. Put simply, you write the functions required to process the data and MapReduce farms this out to multiple nodes to introduce a high degree of parallelism, speeding up throughput.

Let’s take a simple example: find the maximum value amongst a large set of integers (you can find the code for this here). With parallel computing at our disposal, a really speedy way to get through our sample data is to compare the integers in pairs, discarding the lower value each time, until you are left with the maximum. These pair comparisons can happen concurrently, in phases, to speed up the analysis.
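The pairwise phases can be sketched in plain Go running on a CPU. This is not the generated FPGA code, just an illustration of the idea; tournamentMax and pairMax are illustrative names, and the sketch assumes a power-of-two number of inputs:

```go
package main

import "fmt"

// pairMax reads one value from each input channel and sends the larger
// of the two on, discarding the smaller.
func pairMax(a, b <-chan uint32, out chan<- uint32) {
    x, y := <-a, <-b
    if x > y {
        out <- x
    } else {
        out <- y
    }
}

// tournamentMax finds the maximum of a power-of-two-sized slice by
// comparing values in concurrent pairs, phase by phase, until one
// channel remains.
func tournamentMax(data []uint32) uint32 {
    chans := make([]chan uint32, len(data))
    for i, v := range data {
        chans[i] = make(chan uint32, 1)
        chans[i] <- v
    }
    for len(chans) > 1 {
        next := make([]chan uint32, 0, len(chans)/2)
        for i := 0; i < len(chans); i += 2 {
            out := make(chan uint32, 1)
            go pairMax(chans[i], chans[i+1], out)
            next = append(next, out)
        }
        chans = next
    }
    return <-chans[0]
}

func main() {
    fmt.Println(tournamentMax([]uint32{4, 17, 3, 42, 8, 23, 16, 15})) // prints 42
}
```

On the FPGA, each pairMax becomes its own block of circuitry, so every phase really does happen at once rather than being scheduled across CPU cores.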

To design this without any kind of code generation would be uneconomical, time-consuming and prone to errors. Instead, we can write the main functions required for the example and use our MapReduce framework to auto-generate the code. A simplified version of the data processing in this example looks like this (I’ve added in some random numbers for demonstration purposes):


If you are already familiar with our projects, you’ll know that they are made up of two pieces of Go code, one for the FPGA and one for the host CPU. Our MapReduce projects are very similar: the host code is unchanged, but the FPGA code is generated from an input.go file, which contains the functions to be used in the code, and a reco.yml file, which contains the settings required to generate the FPGA code using our framework.

For this simple example the contents of input.go looks like this:

package main

// copy one channel to another; more functionality is required here
// for more complex examples
func Deserialize(inputChan <-chan uint32, outputChan chan<- uint32) {
    for {
        outputChan <- <-inputChan
    }
}

// copy one channel to another; more functionality is required here
// for more complex examples
func Serialize(inputChan <-chan uint32, outputChan chan<- uint32) {
    for {
        outputChan <- <-inputChan
    }
}

// return 0 to be used as an empty state for mappers and reducers
func Uint32Init() uint32 {
    return 0
}

// functionality for mappers: return each individual integer from the sample data
func Identity(a uint32) uint32 {
    return a
}

// functionality for reducers: return the higher value of two inputs
func Max(a uint32, b uint32) uint32 {
    if a > b {
        return a
    }
    return b
}

// so that the test will still be able to run
func main() {}

And reco.yml looks like this:

# mapper stage
type: uint32
typeWidth: 32
deserialize: Deserialize
function: Identity
replicate: 16

# reducer stage
type: uint32
typeWidth: 32
serialize: Serialize
function: Max
depth: 4
empty: Uint32Init

Looking at reco.yml you can see:

  • The mapper stage creates 16 (replicate) instances of the function Identity
  • The reducer stage creates a tree of 4 (depth) stages of the function Max

When you generate the code for this example, a framework is automatically set up to allow data elements to be pulled into available mappers, and each reducer stage takes a pair of inputs from the previous mapper or reducer and discards the lower, until we’re left with just the highest number from the sample data.

The empty state

The empty state in the reducer is required for the event that one of a reducer stage’s pair of inputs is still awaiting a data element. For this simple example, as we’re discarding the lower value each time, the empty state is defined as 0 (the function Uint32Init returns 0) so that it’s always lowest, meaning the empty value will be discarded rather than a piece of sample data.

We have 16 mappers in this example, but the sample data length (20) is longer than that, so a dispatcher is created within the framework to dispatch data elements in rounds. Regardless of the specific example we’re looking at, there are always likely to be rounds where not all mappers are filled, so the empty state (0) can be used to complete these rounds.
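The dispatcher’s padding behaviour can be sketched in plain Go. This is an illustration of the idea, not the generated code; dispatchRounds is an illustrative name:

```go
package main

import "fmt"

// dispatchRounds splits data into rounds of `mappers` elements each,
// padding any incomplete final round with the empty value so that
// every mapper receives an input.
func dispatchRounds(data []uint32, mappers int, empty uint32) [][]uint32 {
    var rounds [][]uint32
    for start := 0; start < len(data); start += mappers {
        round := make([]uint32, mappers)
        for i := range round {
            if start+i < len(data) {
                round[i] = data[start+i]
            } else {
                round[i] = empty // pad with the empty state (0 for Max)
            }
        }
        rounds = append(rounds, round)
    }
    return rounds
}

func main() {
    // 20 sample elements across 16 mappers: two rounds, the second
    // padded with twelve copies of the empty state.
    data := make([]uint32, 20)
    for i := range data {
        data[i] = uint32(i + 1)
    }
    rounds := dispatchRounds(data, 16, 0)
    fmt.Println(len(rounds)) // prints 2
    fmt.Println(rounds[1])
}
```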

An aggregator is also created within the framework to take care of aggregating reducer values each round. Additional reducers are instantiated on an ad-hoc basis when required; they need two inputs, so the empty state is used here too, providing a second input until an output from another mapper or reducer stage becomes available.

Getting data in and out of the FPGA — Deserialize and Serialize

Deserialize and Serialize are important functions used to pipe data in and out of the fabric of the FPGA. In this simple example both functions do the same thing: copy the contents of one channel to another. Here they are placeholders for functionality that is needed in more complex examples. This will be clearer when you generate the code yourself, but this example just takes all the sample data in at once to farm out to mappers, and returns a single data element as the result. In more complex designs, sections of the sample data would need to be farmed out to multiple channels for different purposes, and multiple results would need to be collated, so Deserialize and Serialize can be written to perform the required functionality. You can take a look at our Monte-Carlo example to see more on this.
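As a rough sketch of what a more complex design might do, the hypothetical functions below (the names, channel layout, and field meanings are all illustrative, not the framework’s actual API) split one incoming stream across two channels and collate several results into one output stream:

```go
package main

import "fmt"

// deserializeSplit is a hypothetical Deserialize for a design whose
// input stream alternates two kinds of values, routing each to its
// own channel for a different part of the pipeline.
func deserializeSplit(in <-chan uint32, first, second chan<- uint32) {
    for {
        first <- <-in
        second <- <-in
    }
}

// serializeCollate is a hypothetical Serialize that gathers one result
// from each reducer output channel into a single output stream.
func serializeCollate(results []<-chan uint32, out chan<- uint32) {
    for _, r := range results {
        out <- <-r
    }
}

func main() {
    in := make(chan uint32, 4)
    first := make(chan uint32, 2)
    second := make(chan uint32, 2)
    for _, v := range []uint32{100, 7, 101, 9} {
        in <- v
    }
    go deserializeSplit(in, first, second)
    fmt.Println(<-first, <-second) // prints 100 7
    fmt.Println(<-first, <-second) // prints 101 9
}
```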

Generating the code

You can use the examples we provide to try this out for yourself. Just clone our MapReduce repository and install the framework tooling by running the following commands from a terminal:

go get
go get

You’ll also need to use Glide to vendor our SDAccel package for this example.

Next, navigate to your local copy of our Max example reco-map-reduce/examples/max. Here’s what you should see:

├── cmd
│   └── test
│       └── main.go
├── glide.yaml
├── input.go
└── reco.yml

Run `glide install` to vendor the required packages.

Then you can generate a main.go file for the FPGA. First we use the structure set out in our reco.yml to lay out the framework for the code in a function called Top. To do this run:

generate-framework -output mapreduce.go

Next, to keep our compiler happy we need to create a main.go file by combining the functions that are defined in input.go and bundling them with the Top function we now have in mapreduce.go (we’re currently working on a fix so we can remove this step…):

bundle -prefix " " -o main.go .

So now we have a runnable parallelised project that you can simulate, build and deploy as normal using our reco tooling:

├── cmd
│   └── test
│       └── main.go
├── glide.yaml
├── input.go
├── main.go
├── mapreduce.go
└── reco.yml


There are a number of constraints around the kind of example for which MapReduce is a good fit:

  • Mappers need to work with a single input element and produce a single output element.
  • Reducers need to combine two output elements into a single output element, in a way that is associative (e.g. max(max(a, b), c) == max(a, max(b, c))).
  • The reducer also needs an initial value; for those familiar with abstract algebra, an associative operation with an identity element like this is called a ‘monoid’.
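For the Max example, both monoid laws are easy to check directly: grouping doesn’t change the result, and combining anything with the empty state (0) leaves it unchanged. A quick sketch:

```go
package main

import "fmt"

// Max is the reducer from the example: return the higher of two inputs.
func Max(a, b uint32) uint32 {
    if a > b {
        return a
    }
    return b
}

func main() {
    a, b, c := uint32(3), uint32(42), uint32(17)

    // Associativity: the reducer tree can group inputs however it likes.
    fmt.Println(Max(Max(a, b), c) == Max(a, Max(b, c))) // prints true

    // Identity: the empty state never displaces a real data element.
    fmt.Println(Max(0, a) == a && Max(a, 0) == a) // prints true
}
```

These two properties are exactly what let the framework pad incomplete rounds with the empty state and build the reducer tree in any shape without changing the final answer.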

The future

Our MapReduce framework is provided as an open source template for you to expand on to fit a larger number of use cases. We would love to hear how you get on with customizing it for scaling up your own applications; let us know here.

The Recon
