“Beating C” with 120 Lines of Rust: wc

Martin Mroz
Oct 30, 2019 · 5 min read


It’s something of a meme lately to see whether your programming language of choice can take on the venerable wc, and what that might look like. The format seems to be: first do it simply, then idiomatically, and finally much faster. Of course, we’re not really “beating C” but rather “tackling a fun interview question in our favorite programming language.” My go-to these days is Rust, and since I’ve fielded the question of whether Rust is “my Haskell,” this all was too much to pass up. Let’s get started.

We’ll be following the same rules of engagement as Chris Penner:

  • It should return identical character, word, and line counts as wc on the test files.
  • Speed will be measured against a release build of the Rust version using the time utility.
  • Maximum resident memory will be measured against a release build of the Rust version using the time utility.

I’ve also decided to structure this post to mirror Chris’s so it’s easy to compare and contrast. I think this is particularly interesting because the idiomatic Haskell and Rust versions share an awful lot in common (down to the final performance characteristics).

The Dumbest Thing That Could Possibly Work

The dumbest thing we could possibly do in Rust is pretty similar to the dumbest thing we could do in Haskell, as it turns out. Probably anywhere else, too. Load up the whole file into a single string, and iterate over it a couple of times to yield the counts.
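Sketched in a few lines (with illustrative names of my own choosing, not necessarily the article’s exact code), the naive approach might look like:

```rust
use std::fs;

/// Naive approach: load the entire file into one String, then walk it
/// once per count. Memory usage is proportional to file size.
fn count_all(contents: &str) -> (usize, usize, usize) {
    let lines = contents.bytes().filter(|&b| b == b'\n').count(); // wc -l counts newlines
    let words = contents.split_whitespace().count();              // wc -w counts whitespace-separated runs
    let chars = contents.chars().count();                         // wc -m counts characters
    (lines, words, chars)
}

/// Read a whole file into memory and count it in one shot.
fn count_file(path: &str) -> std::io::Result<(usize, usize, usize)> {
    Ok(count_all(&fs::read_to_string(path)?))
}
```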

Clean and readable, sure. Performant? Less so.

Let’s just say it performed… expectedly. Running against a ~100MB file, the amount of memory used was… ~100MB. Since we iterate over the contents of the file twice, sequentially, performance was not great — but it was in the ball-park. It could have been much worse, to be sure.

Clearly, memory usage could be better.

Doing Something Slightly Less Dumb

Clearly, there are things we could do to be more efficient in terms of CPU. We could parallelize the computation of the words and the lines. Or we could just iterate over everything once. However, unlike the Haskell implementation (which at this point was almost two orders of magnitude slower than wc and used three orders of magnitude more memory), our first stab is in the ball-park for performance already. Our low-hanging fruit is obviously going to be memory-related. I’ll save the CPU boost for last.

Memory usage at this point is directly proportional to the size of our input file. Unless we want our customers to have to buy more RAM to count the words and lines in any file on their hard drives, this probably won’t work. Luckily, Rust has some great primitives for working with streams.
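One way to sketch that streaming loop, assuming a 512 kB buffer and byte-oriented (ASCII) whitespace handling, with names of my own choosing:

```rust
use std::io::{BufReader, Read};

/// Chunked approach: stream the input through a fixed-size buffer,
/// carrying one piece of state (are we inside a word?) across chunk
/// boundaries. Returns (lines, words, bytes).
fn count_chunked<R: Read>(input: R) -> std::io::Result<(usize, usize, usize)> {
    let mut reader = BufReader::new(input);
    let mut buffer = vec![0u8; 512 * 1024]; // 512 kB chunks
    let (mut lines, mut words, mut bytes) = (0usize, 0usize, 0usize);
    let mut in_word = false; // true if the previous byte was not whitespace

    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            break; // EOF
        }
        bytes += n;
        for &b in &buffer[..n] {
            if b == b'\n' {
                lines += 1;
            }
            let is_space = b.is_ascii_whitespace();
            if !is_space && !in_word {
                words += 1; // a new word starts on this byte
            }
            in_word = !is_space;
        }
    }
    Ok((lines, words, bytes))
}
```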

Leveraging a BufReader instance to process one chunk at a time.

This is nice, and doesn’t really yield much additional complexity — but it’s not exactly functional in style. All we have to do is track whether each buffer ends on a space or a non-space character and keep a running tally of bytes, words and lines. We’re still iterating over everything twice, but at this point, we’re pretty much toe-to-toe with the C wc utility in terms of both performance and memory usage. We’re actually using a decent chunk less memory, too: I didn’t notice a material difference in runtime between a 512 kB buffer and a 1 MB buffer, so I stuck with the smaller one.

Performance parity achieved.

Moving to Monoids (?!)

Now that we’ve reached parity, the question becomes: how can we achieve even better performance than C? The answer to this question is usually some form of parallelism: data-level parallelism with SIMD or task-level parallelism with threading. This problem is sequential by nature, though, so what’s to be done?

Before I started in on this article, I didn’t know what a monoid was. I’m still not sure I do. However, I quickly realized the approach Chris used in Haskell would translate very nicely to Rust’s functional side, and I set about creating a “Flux” type as Chris did.

This type tracks the attributes of any sub-section of our input file, from a single character all the way up to the entire file. As we learned in the iterative approach, we need to know the left-most and right-most character types of any given range, along with the number of words and lines it contains.
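From that description, one plausible sketch of such a type (the field names are my guesses, not necessarily the original’s):

```rust
/// What kind of character a span starts or ends on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharType {
    IsSpace,
    NotSpace,
}

/// Summarizes any contiguous span of the input: the character type at
/// each edge, plus running counts for that span.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flux {
    leftmost: CharType,  // type of the first character in the span
    rightmost: CharType, // type of the last character in the span
    lines: usize,        // newlines inside the span
    words: usize,        // words (so far) inside the span
    chars: usize,        // characters inside the span
}
```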

Not exactly a monoid, although Option<Flux> should be, if I’m not mistaken.

This is the key to parallelization. This is what’s going to get us past the performance plateau — and what’s going to let us “defeat C.” The first thing we do is implement a method to make a Flux instance from any byte of the input. A natural way to do this in Rust is to implement the From<T> trait.

A single character is effectively a span of length 1, meaning it can represent a “word” (if it’s not a space), a “space” if it’s a whitespace character — or even a “line” if the character itself is a newline.
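Assuming a Flux type shaped like the one just described, the conversion might be sketched as follows; the article implements From&lt;T&gt;, and I’m guessing From&lt;char&gt; as the concrete instantiation:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharType { IsSpace, NotSpace }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flux {
    leftmost: CharType,
    rightmost: CharType,
    lines: usize,
    words: usize,
    chars: usize,
}

impl From<char> for Flux {
    fn from(c: char) -> Self {
        let kind = if c.is_whitespace() { CharType::IsSpace } else { CharType::NotSpace };
        Flux {
            leftmost: kind,
            rightmost: kind,
            lines: (c == '\n') as usize,                  // a newline is one line
            words: (kind == CharType::NotSpace) as usize, // a non-space starts a word
            chars: 1,                                     // a span of length 1
        }
    }
}
```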

Mapping a single character onto a Flux instance.

Then, we can merge any two adjacent Flux instances into one representing their span as follows. When operating on two Option&lt;Flux&gt; instances (as a helper method I created does), the span function represents the single associative binary operation that makes it a monoid, apparently.
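A plausible sketch of that merge, including a helper lifted onto Option&lt;Flux&gt; with None as the identity (names here are illustrative):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharType { IsSpace, NotSpace }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flux {
    leftmost: CharType,
    rightmost: CharType,
    lines: usize,
    words: usize,
    chars: usize,
}

/// Merge two adjacent spans. If a word straddles the seam (non-space on
/// both sides), each half was counted as its own word, so subtract one.
fn span(l: Flux, r: Flux) -> Flux {
    let straddling = (l.rightmost == CharType::NotSpace && r.leftmost == CharType::NotSpace) as usize;
    Flux {
        leftmost: l.leftmost,
        rightmost: r.rightmost,
        lines: l.lines + r.lines,
        words: l.words + r.words - straddling,
        chars: l.chars + r.chars,
    }
}

/// Lifted onto Option<Flux>, with None as the identity element; this is
/// the associative binary operation that makes Option<Flux> a monoid.
fn span_opt(l: Option<Flux>, r: Option<Flux>) -> Option<Flux> {
    match (l, r) {
        (Some(l), Some(r)) => Some(span(l, r)),
        (l, None) => l,
        (None, r) => r,
    }
}
```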

Merging two adjacent Flux instances into their span.

With this, we have everything we need to create a totally functional version of our original buffered processing loop.
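Under the same assumed definitions (my names, not necessarily the article’s), the whole loop collapses into a single fold:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CharType { IsSpace, NotSpace }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Flux {
    leftmost: CharType,
    rightmost: CharType,
    lines: usize,
    words: usize,
    chars: usize,
}

impl From<char> for Flux {
    fn from(c: char) -> Self {
        let kind = if c.is_whitespace() { CharType::IsSpace } else { CharType::NotSpace };
        Flux {
            leftmost: kind,
            rightmost: kind,
            lines: (c == '\n') as usize,
            words: (kind == CharType::NotSpace) as usize,
            chars: 1,
        }
    }
}

fn span(l: Flux, r: Flux) -> Flux {
    let straddling = (l.rightmost == CharType::NotSpace && r.leftmost == CharType::NotSpace) as usize;
    Flux {
        leftmost: l.leftmost,
        rightmost: r.rightmost,
        lines: l.lines + r.lines,
        words: l.words + r.words - straddling,
        chars: l.chars + r.chars,
    }
}

fn span_opt(l: Option<Flux>, r: Option<Flux>) -> Option<Flux> {
    match (l, r) {
        (Some(l), Some(r)) => Some(span(l, r)),
        (l, None) => l,
        (None, r) => r,
    }
}

/// Functional pass: every character becomes a Flux, and adjacent
/// Fluxes are spanned left to right.
fn flux_of(s: &str) -> Option<Flux> {
    s.chars().map(|c| Some(Flux::from(c))).fold(None, span_opt)
}
```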

A functional approach to computing the span over an entire byte string by spanning its constituents.

Obviously, we don’t expect this to be any faster than before as it doesn’t change our level of parallelism. Indeed, a quick benchmark showed little change. However, by re-expressing the problem in a way that lends itself to arbitrary levels of parallelism, we can take advantage of the easiest performance boost you’re likely ever to see: rayon. Rayon is a data parallelism library for Rust. With just a simple three-line change to the function above, we can leave it to Rayon to automatically spread the work performed here across all available CPU cores.

The only change? par_iter, fold and reduce.

So, what does the final result look like? Pretty darn good.

Final numbers? 2X faster, 10% smaller footprint.

To be sure, we could probably make it faster still, for instance by loading the next buffer while the current one is being processed. In my opinion, though, that’s a much less interesting optimization, and this is a learning experience after all! With that, I’m going to call this a success.

So… what did we learn?

This is by no means a Rust vs. C vs. Haskell comparison post. For me, these types of explorations are valuable because they force me to think outside the box. I wouldn’t ever have thought to leverage a monoid to achieve such a high degree of parallelism. Does this mean wc should be thrown out? Definitely not. Does this give me a new tool in my toolbox for tackling engineering problems in general? 100%. Probably a new interview question, too!

Even if we don’t have time to go out and learn every programming language, exploring some of the most “different” ones out there, understanding their approaches and the value they each bring to the table, and incorporating the good bits into our work is how we continue to grow as engineers.

As always, the source is available for your review. Thoughts welcome, especially if I’m still wrong on what a monoid is 😂 !

A Quick Shout-Out to Testing in Rust

I just wanted to take a moment and call out how much I appreciate Rust’s in-line unit testing philosophy. It’s such a relief, while developing a command-line tool, to be able to throw in a test module and cargo test it. Especially while refactoring, as I did many times over the course of this article.
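A sketch of what such an in-line test might look like; the function under test here is an illustrative stand-in for the real counting routine:

```rust
/// Illustrative counting function, matching wc's semantics:
/// (newlines, whitespace-separated words, characters).
fn count_all(contents: &str) -> (usize, usize, usize) {
    let lines = contents.bytes().filter(|&b| b == b'\n').count();
    let words = contents.split_whitespace().count();
    let chars = contents.chars().count();
    (lines, words, chars)
}

// The test lives in the same file as the code; `cargo test` compiles
// and runs it, and a plain `cargo build` strips it out entirely.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_wc_on_a_small_sample() {
        // 2 newlines, 3 words, 16 characters
        assert_eq!(count_all("hello world\nfoo\n"), (2, 3, 16));
    }
}
```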

Writing a quick unit test for the heart of the binary utility.

