Using FPGAs to Accelerate C/C++ (Loop Unrolling)
If you aren’t particularly familiar with FPGAs, you might be wondering how they can be used to accelerate your C/C++ application. Well, one important aspect of FPGAs is that their programmable fabric can be configured to perform any set of logical and arithmetic functions you desire. Also, multiple functions can be executing in the FPGA simultaneously.
In my previous column, “Using FPGAs to Accelerate C/C++ (Pipelining),” we discussed how a series of logical and arithmetic operations in a C/C++ loop could be implemented as a chain of functions in an FPGA. However, unlike the C/C++ code in which only one operation can be performed at any particular time, all of the logical and arithmetic functions can be performed simultaneously in the FPGA. Depending on the number of loop iterations and the number of operations in the loop, the result can be to dramatically accelerate the application.
But pipelining is just one or the tricks we have up our sleeves. Another technique is known as loop unrolling. We will describe this approach in detail in a moment, but first we need a suitable example to play with, and what could be better than Conway’s Game of Life?
Conway’s Game of Life
Almost every software developer on the planet — along with most hardware design engineers — is familiar with the Game of Life, which is a cellular automaton devised by the British mathematician John Horton Conway in 1970. I myself have created multiple implementations of this over the years. I must admit, however, that it wasn’t until I read the biography, Genius at Play: The Curious Mind of John Horton Conway by Siobhan Roberts that I discovered Conway didn’t actually have access to a computer when he performed his initial evaluations. Instead, Conway and his colleagues used the pebbles from a game of Go while experimenting with different rule systems and scenarios.
A traditional Game of Life “universe” consists of a two-dimensional array of “cells,” each of which may be “alive” or “dead.” Each cell is considered to be surrounded by eight neighboring cells (N, S, E, W, NE, NW, SE, SW). At the start of a game, we might seed our universe with random values or pre-populate it with specific patterns. The idea is to move from one generation to the next using the following rules to calculate the next-generation state of each cell:
Rule #1: Any live cell with fewer than two live neighbors dies, as if caused by underpopulation.
Rule #2: Any live cell with two or three live neighbors lives on to the next generation.
Rule #3: Any live cell with more than three live neighbors dies, as if by overpopulation.
Rule #4: Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.
Although these rules may appear to be simple, they can result in some very interesting behaviors, including patterns that can move around the array and patterns that can generate other patterns. Just to provide a feel for how all this works, let’s look at a small area of a Game of Life universe, let’s assume we are on generation ’n’, and let’s suppose that the cells are populated as shown in (a) below:
In order to evaluate generation ‘n + 1’, we have to consider each cell in isolation, looking at its eight surrounding cells to determine how many are currently alive or dead, as illustrated in (b) above. Based on this, we can decide which cells will live or die, resulting in generation ‘n + 1’ as illustrated in (c) above.
A C/C++ Implementation
Ideally, the Game of Life would be played on an infinite universe (array). In practice, we are limited by the amount of memory and/or screen real estate we have available. In many implementations, the universe “wraps around” such that we can visualize the left-hand side of the array connecting to the right-hand side and the top connecting to the bottom.
Assuming we are going for a CPU implementation, let’s take a high-level look at some nested loops that we can use to calculate the next generation. Let’s assume we have two 100 x 100 arrays called thisGen[] and nextGen[]. For the sake of simplicity, we will ignore any housekeeping tasks and boundary conditions (like handling the edges of the universe).
Observe that we aren’t considering any display functions here — this is just a case of looking at the current generation and working out what we want to do in the next generation. Assuming each test and each action (assignment) requires a single clock, my “back of the envelope” calculations tell me that it will require 735,100 clocks to decide who lives and who dies in the next generation.
An FPGA Implementation
In the case of an FPGA, we could design a little state machine to represent a single cell. We could then instantiate a 100 x 100 array of these little rascals. We could design our state machine such that, on the active edge of a clock, it evaluates the number of living cells surrounding it and decides whether it is to live or die.
The bottom line is that our FPGA implementation would require only a single array of cells, and it could generate the entire next generation in a single clock. Assuming a 100 MHz clock, this means that an entire Game of Life running on the FPGA may well have finished before the first generation had finished displaying on the screen.
Weasel Words
As always, everything we’ve discussed here involves ideal conditions (like the physics joke that only works for “toroidal chickens in a vacuum”), and there are additional considerations we haven’t discussed. The main point of all this is that, given an appropriate algorithm, an FPGA can be used to significantly accelerate portions of a C/C++ program, which is why CPU+FPGA combos are of interest for a wide variety of high-performance applications.