Data pipeline performance (part 1)

If data scientists had their way, we would buy petabytes of memory and never think about IO performance or serialization ever again. Such is the influence of accountants, however, that few companies have realized this otherwise brilliant strategy; instead, whether the workflow is written in Hadoop or in bash, it usually amounts to “stream data in, process a little bit, stream data out.” And this means that performance isn’t purely a matter of profiling code; it’s going to depend on stuff we don’t control like gzip and grep.

This post is about some experiments I ran to build a better intuition about pipeline performance, and which later informed the design of ni and other tools I use at Factual.

Before we get going: some UNIX measurement tools

Two common and absolutely essential utilities are pv and units.

pv is cat, but it gives you a live indicator of how fast data is moving, and your progress/ETA if you use it on a file. For example:
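
giant.log here is just a placeholder for whatever you happen to be pushing around:

    $ pv giant.log | gzip > giant.log.gz      # pv knows the file size, so you get a progress bar and ETA
    $ gzip -dc giant.log.gz | pv | wc -l      # mid-pipeline, you get live throughput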

This functionality is so useful that I’ve written it into every pipeline constructor I use: nfu has a throughput monitor, and its successor ni takes the concept further with throughput and data pressure.

units is a command-line calculator that can do unit conversions (like the ones Google supports). For instance:
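
Most of the questions I care about are of the “how long will this take?” variety; the figures below are made up, but the shape of the query is the point:

    $ units '1.5 terabytes / (300 megabytes per second)' hours    # about 1.4 hours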

These tools are great because they make it easy to think about units of throughput, rather than just time — and that’s exactly the type of intuition required when you’re working on large datasets.

When time and throughput are different: fast and slow languages

A “slow language” usually means a language with a high interpreter overhead — Ruby and Perl sink to the bottom of the benchmarks game for this reason. But that doesn’t mean these languages will be slow for text-processing or similar use cases; ni, for example, is written in Perl and can process text at 800MB/s with multithreading; Ruby could easily do the same. The key is that these languages provide specialized, efficient vectorized operations that outperform obvious native implementations.

For example, I wrote three line-counting programs: wcl.c in C, and wcl1.pl and wcl2.pl in Perl.
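
wcl.c is the straightforward byte-at-a-time loop; here’s a sketch of it (the shape matters more than the exact source):

    /* wcl.c -- byte-at-a-time line counter (a sketch of the approach). */
    #include <stdio.h>

    int main(void) {
      static char buf[1 << 16];
      unsigned long lines = 0;
      size_t n, i;

      while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
        for (i = 0; i < n; ++i)
          if (buf[i] == '\n')
            ++lines;

      printf("%lu\n", lines);
      return 0;
    }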

Here’s wcl1.pl, the obvious version:
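
Modulo cosmetic details, it’s the readline loop you’d expect (a sketch rather than the exact listing):

    #!/usr/bin/perl
    # wcl1.pl: count lines the obvious way, one readline at a time.
    my $lines = 0;
    while (<STDIN>) { $lines++ }
    print "$lines\n";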

And here’s wcl2.pl, which doesn't reallocate any memory:
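
The idea, roughly, is to read fixed-size blocks into one reused buffer and let tr/// do the counting (again a sketch, not a verbatim listing; the block size is arbitrary):

    #!/usr/bin/perl
    # wcl2.pl: read fixed-size blocks into a single reused buffer and count
    # newlines with tr///, so no per-line strings are built.
    my ($lines, $buf) = (0, '');
    $lines += ($buf =~ tr/\n//) while read STDIN, $buf, 1 << 20;
    print "$lines\n";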

I then benchmarked their throughput using a high-speed data source:
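
Any source fast enough to outrun the consumers will do; yes piped through pv is an easy stand-in, and pv reports the throughput each consumer can sustain (your numbers will differ from mine):

    $ line='some reasonably long line of text for the counters to chew on'
    $ yes "$line" | pv > /dev/null        # how fast the source itself can go
    $ yes "$line" | pv | perl wcl1.pl
    $ yes "$line" | pv | perl wcl2.pl
    $ gcc -o wcl wcl.c                    # default optimization level
    $ yes "$line" | pv | ./wcl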

There’s nothing up my sleeve here: if you run the above programs, you should see Perl count lines about 3.5x as fast as C. Compiling with -O3 mostly fixes the CPU bottleneck, and the C version now outperforms Perl as we’d expect:
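
Concretely, that’s just a recompile and a rerun of the same pipeline:

    $ gcc -O3 -o wcl wcl.c
    $ yes "$line" | pv | ./wcl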

And, of course, the real wc -l is faster still, saturating the data source:
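
Same source, same pipeline:

    $ yes "$line" | pv | wc -l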

What’s going on here

C is a “fast language” in that it has no interpreter overhead, but the CPU itself can only move data so quickly. Taking a look at the assembly for wcl.c, compiled first with default options and then with -O3, tells the story.
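
The listings are easy to regenerate if you want to follow along (the exact output depends on your gcc version and target):

    $ gcc -S -o wcl-default.s wcl.c
    $ gcc -O3 -S -o wcl-O3.s wcl.c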

It makes sense that this would run slowly: each byte of input data becomes 12 CPU instructions. To get the ~320MB/sec data rate we observed, the CPU’s throughput would have to be about 3.5Ginsns/sec — about right, given that modern CPUs have some instructions that amortize to 1/2 or 1/3 of a clock cycle. When we compile with -O3, the picture changes completely: gcc unrolls the loop by a factor of 16 and uses SSE instructions:
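
In rough terms, the vectorized loop is doing something morally equivalent to this intrinsics sketch; gcc’s actual output differs in detail, for instance in how it accumulates the per-chunk counts:

    /* Sketch of the -O3 strategy: compare 16 bytes against '\n' at once,
       then count the matches.  Illustrative, not gcc's literal output. */
    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    unsigned long count_newlines_sse2(const char *buf, size_t n) {
      const __m128i nl = _mm_set1_epi8('\n');
      unsigned long lines = 0;
      size_t i = 0;
      for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *) (buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        lines += __builtin_popcount(mask);   /* one bit per matching byte */
      }
      for (; i < n; ++i)                     /* scalar tail */
        lines += (buf[i] == '\n');
      return lines;
    }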

gcc’s generated loop comes out to 38 instructions for every 16 bytes of input, or about 2.4 instructions per byte, which makes it roughly five times as efficient as before.

OK, but what’s going on with wc -l? The source provides a hint:
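
When only line counts are requested, the inner loop hands essentially all of the work to memchr; paraphrased (not the verbatim coreutils source), it amounts to this:

    #include <string.h>   /* memchr */
    #include <stddef.h>

    /* Paraphrase of wc -l's fast path: let memchr do the scanning and just
       hop from newline to newline. */
    unsigned long count_lines(const char *buf, size_t bytes_read) {
      const char *p = buf;
      unsigned long lines = 0;
      while ((p = memchr(p, '\n', buf + bytes_read - p))) {
        ++p;
        ++lines;
      }
      return lines;
    }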

Intuitively this makes sense: function call overhead would dictate that the more work memchr can do internally, the faster this will go. But memchr itself is still outperforming the -O3 byte loop by more than 2x. Here's how it works:
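
glibc’s x86 builds use hand-written vector assembly, but the portable C fallback shows the trick clearly enough. Roughly (simplified: real implementations also align the pointer first and unroll more aggressively):

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    /* Sketch of the word-at-a-time technique used by portable memchr
       implementations (not glibc's exact code): XOR each 8-byte word with
       the target byte repeated 8 times, then test whether any byte of the
       result is zero. */
    const void *memchr_sketch(const void *s, int c, size_t n) {
      const unsigned char *p = s;
      const uint64_t ones  = 0x0101010101010101ULL;
      const uint64_t highs = 0x8080808080808080ULL;
      const uint64_t cword = ones * (unsigned char) c;   /* c in every byte */

      while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);                  /* unaligned-safe load */
        w ^= cword;                        /* bytes equal to c become zero */
        if ((w - ones) & ~w & highs)       /* does w contain a zero byte? */
          break;                           /* yes: finish with the byte loop */
        p += 8;
        n -= 8;
      }
      for (; n; ++p, --n)
        if (*p == (unsigned char) c)
          return p;
      return NULL;
    }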

Like the -O3 code, it’s vectorized; but this version, although more efficient for long lines, is optimized for cases where most of the time is spent searching (i.e. where a hit on any given byte is unlikely). All of wc -l’s advantage is data-dependent, which becomes evident if we feed it nothing but newlines:
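
yes with an empty argument produces exactly that:

    $ yes '' | pv | wc -l
    $ yes '' | pv | ./wcl        # the -O3 build, for comparison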

memchr is also why Perl ends up being faster than the original C version: Perl’s line-reading code pushes most of the data through that same high-throughput scanning loop, and only a few bytes per line take the slow route through the interpreter itself. Other overhead, like copying lines into $_ and updating various internal variables, keeps Perl from competing with -O3 even for long lines.

The takeaway

In a throughput-bound world, what’s the right tool for the job? I haven’t found a particularly good way to predict it aside from understanding how things work and testing on specific data. The complicating factors are that (1) many tools have been optimized for a specific use case (e.g. Perl and text processing), and (2) many of the fastest algorithms are data-dependent and aren’t worst-case optimal. Agner Fog’s optimization guides illustrate the subtleties of even the hardware itself; as the infrastructure increases in complexity, optimization becomes an increasingly vertical pursuit.

In part 2 I’ll go into some lower-level aspects like input encoding and garbage collection, and part 3 will be an unbelievably smashing wrap-up of some kind. Stay tuned.

- Spencer Tipping, Software Engineer

Enjoy this read?

  • You can read more Factual Engineering posts at our blog.
  • You may be interested in working at Factual! See our openings.

Originally published at http://www.factual.com/blog/Data-Pipeline-Performance
