Simplifying Parallel Applications for C++, An Example Parallel Bzip2 using RaftLib with Performance Comparison

We spend way to much time building bits and pieces to put together parallel programs. How much time do we (the programmer) spend putting together boiler plate code that could easily be abstracted away by libraries like RaftLib? RaftLib enables programmers to stitch together parallel code from sequential kernels (both pipeline and task parallel), with far fewer lines of code than most other parallelization frameworks (while being just as performant). This mode of programming is known as streaming (also data-flow). Not only does RaftLib remove the burden of boiler plate coding, it also automates the optimization process. RaftLib chooses the type of memory to allocate between links, dynamically resizes them using multiple analytic models (both queueing network and Support Vector Machine learned models).

To demonstrate how easy it is to build an application, I want to demonstrate a simple app, that is widely used. I chose parallel bzip. We’ll go through a bit of RaftLib implementation of parallel bzip (using libbz2), then go on to a performance comparison of this version and the classic pbzip2 utility.

If you’ve never looked at the pbzip2 code, you’re not alone. I hadn’t either until I wanted to do this comparison. Spending about 10 minutes reviewing it, I decided it was fairly well written. At first glance, it seems like spaghetti code, however the number of global variables and error handlers is quite typical of parallel programs these days (often by necessity). Although I disdain lines of code as a metric, it is useful to understand the scale of the code base. The main file (pbzip2.cpp) contains over 4,500 lines of code, the header file an additional 300. Most of the code is devoted to book-keeping tasks (e.g., building and managing the structures for safe communication between parallel threads). To reimplement pbzip2 I need to know exactly which functions do the work of compressing, so I cross referenced the bzip library documentation with this code. Keying in on things that had the bzip library function prefix BZ2_XXX within the pbzip2.cpp file. I found that the pbzip2 authors use the buffer to buffer compress function within a worker thread, voila. Now I know exactly how I should implement my kernel within RaftLib.

I won’t have to worry about all the custom communications framework that consumes so much code within pbzip2. RaftLib manages that for me. I also don’t need to build any custom file readers. That is also handled by a stock RaftLib library call. What I will have to construct is a compress kernel, and a kernel to write the file in indexed order, since each block will be variable length (i.e., the original blocks are uncompressed and homogeneous, the compressed blocks differ based on compressibility). Let’s go over the file write kernel first. We only have to write the code as if it were sequentially executing, that is, no special care need be taken to accommodate its parallel execution (each kernel can potentially be executed as a separate thread). The only rule that must be followed is that state must be accessed only from within the kernel, or streamed in via its ports. Our filewrite kernel, looks like almost any other C/C++ code containing a file operation (granted, it’s wrapped inside a virtual “run” function). I did make one concession for this code since I wanted to benchmark it against the classic bzip2 on a per thread count basis: I added a port count parameter for the constructor, so I could control the exact number of worker threads for benchmarking (see below).

Since we’re implementing the raft::parallel_k, care must be taken to only access ports via the for loop iterator construct. This gives the run-time the freedom to add input ports as it sees fit (and prevents ports from being removed or added while the kernel is using them). The default RaftLib file reader object also extends the same raft::parallel_k class. At line 27 we see this iterator loop, accessing each port sequentially for data. In order to sequentialize the writes in the correct order, the code checks for the order id passed through by the compress kernel from the file reader. This process visually looks something like this:

On to the actual worker, the compress kernel. To build it, we need to write an object that extends raft::kernel. Then set up the run function to pull in data from an input port, compress it using the BZ2_bzBuffToBuffCompress function, and then write it to the output port. Using RaftLib, this is quite simple. We can even do it zero copy, directly using the input stream as a data source and the output stream to write the data to. We first add input ports within the constructor, with the filechunk struct as a type. This struct enables the port to contain statically sized chunks of a file, to which the filereader RaftLib library kernel attaches a block order, and file position (relative to the original file). This will allow us to maintain order when re-assembling the compressed blocks. Of course, the programmer must know to pass through the order if they intend to use it, the run-time can’t simply infer that the programmer intends on re-ordering the data within the file-write kernel.

Here’s my implementation of the compress kernel (by far the longest chunk of code in rbzip2):

I even did some basic error handling for the bzip library function. One interesting line I’d like to call out is number 87 where the CLONE() function (well, actually a macro) is inserted. This CLONE() function enables the run-time to duplicate this object as many times as needed (assuming that the programmer has given a copy constructor as well). All that is left is to define a main function, instantiate the needed kernels, and link them together.

Lets see what that looks like:

Awesomely, that is pretty much all that is necessary to write a parallel bzip compressor using the RaftLib library. It’s short and simple.

As you can see, most of the code is taken up with command line processing (using the cmdargs library). The actual kernel instantiation and linking is handled in lines 55–72. Line 72 defines a pipeline and task parallel bzip compressor topology. The “<=” tells the run-time to expand the kernels after it and before the “>=” operator, to match the number of input and output ports available. This means that the compress kernel will be automatically duplicated by the run-time, all the programmer has to do is specify it once.

The full code is available here. Total time authoring and debugging the code, about 20 minutes. Looking at the original version of pbzip2, I’m fairly certain that it took a lot longer than that.



Test Platform: 2008 Mac Pro with 2x Intel Xeon E5472 Processors, 32GB DDR2-FBDIMM @ 800MHz and two 512 GB SSD on SATA2, OS X El Capitan 10.11.2

Test Software: Apple LLVM version 7.0.2, Pbzip2(1.1.12), Apple default bzip library version 1.0.5 for both RaftLib and pbzip2 to link against. Interestingly (although not surprisingly) the OS X default library version on all my Macs differ from header (for 1.0.6). For both parallel bzip libaries (pbzip2 and rbzip2), the knobs were turned to equivalent positions (block size = 9 and effort = 250). Optimization flags used for compilation were: -O2 -mtune=native.

Methodology: For a test file to compress, I used the first 10GB of the Stack Exchange archive. To cut the file down on OS X, you can do the following:

$> perl -e 'print 10*(2**30)' | pbcopy
$> head -c [ctrl-v] [stack_exchange_posts] > 10gfile

I created a RAM disk to test out the IO requirements on a smaller file (exactly 1GB) before moving on to the bigger file to ensure I wasn’t limited by my disk or memory system. On OS X, you can do this with the below commands, for Linux this is well documented elsewhere:

$> perl -e 'print (2**30)' | pbcopy
$> diskutil erasevolume HFS+ 'RAMDisk' `hdiutil attach -nomount ram://[ctrl-v]`

To get rid of the RAM disk (in OS X, see link above for Linux), just unmount it:

$> hdiutil unmount RAMDisk

The ingest rate never topped 100MB/s for either application (with thread count equal to core count) so the SSD is more than sufficient (reading from one, writing to the other) as both are easily capable of 250+MB/s. For the actual benchmark runs, I always like to see the scaling plateau where thread count equals the physical core count, so the thread counts ranged from 1 through 13 (physical core count is 8). Each execution per thread count is duplicated ten times, so that the spread (error) in the throughput range can be shown (never trust people that give you exact numbers when multi-threaded/multi-core/mem/IO ops are involved, there’s always a range!).

Benchmarking Results

Box & whisker plot of execution times for pbzip2 compared to rbzip2 on the platform listed above. Each thread count was run 10x, each box shows the 25th — 75th percentiles in the box and the 5th/95th percentiles at the end of the whiskers. As expected the performance practically plateaus at the machine core count (8-cores).

As we can see, despite being coded up in less than a half hour (benchmark executions took about a day), the performance is still quite good. In fact, for compression, the RaftLib version actually is slightly faster (and could probably be tuned a bit further with the file read “chunk size”). Given a newer CPU (and memory system), we could probably expect performance to go up (for both apps), however this is a fair comparison since we’re running back to back on the exact same hardware (and same bzip library version). The only potential drawback to using RaftLib as a statically linked library (if you care about the number of KB a binary takes up) is the file size. The pbzip2 binary as compiled (same libbz2 archive for both) is 76KB whereas the statically linked rbzip2 binary consumes 264KB of real estate, dynamic linking could fix this, and perhaps improve performance slightly if the instruction cache foot print is too large (with only 264KB, I doubt it, but it is remotely possible).

Thanks for reading my post. If you’re interested in RaftLib, please check out the project page