A CSV Parser. Moving from Scala Parser Combinators to Parboiled2

CSVs are an ubiquitous format for all sorts of tabular data. I assume that every major programming language ecosystem has a handful of libraries handling both parsing and writing them. In Scala you can quickly turn to StackOverflow to find a 30 lines long CSV parser based on the Scala Parser Combinators library. In this short blog post I will show a Parboiled2-based a version of this parser and compare the performance of the two.

Parboiled2 is a lightweight parser generator based on Scala Macros. It compiles the defined grammar rules into JVM bytecode. It has a similar syntax as Scala Parser Combinators. If you want to know more about the project and the rule DSL I’d recommend you checking out it’s README file or the related ScalaDays 2014 talk.

Below is a gist showing the CSV (actually, you can use an arbitrary delimiter, but for simplicity’s sake I’ll keep using the word CSV) parser definition:

It’s structure is roughly the same as that of the parser from StackOverflow.

(Quick) performance comparison

I was expecting the performance differences to be large I didn’t do a thorough benchmark. Both parsers are started on the same JVM, I did not do any VM tuning nor minimize the OS noise. The performance differences, as you will see are so large that getting the exact numbers did not make much sense.

For testing the I took a 1k and 10k-row samples of the MovieLens dataset. The data is in a form of 4 element integer tuples. It was converted to a comma-separated format.

The tests were ran using ScalaMeter’s Quickbenchmark. It handles running a warmup phase and delivers aggregate results of multiple benchmark runs. The aggregate method chosen was min — the shortest out of 10 runs. All tests were performed on an late 2013 15′ Retina MacBook Pro running a Haswell quad-core 2.6GHz CPU (i7–4960HQ) on Oracle’s Java 8 (1.8.0).

The code for the benchmark is available on github.

The results

The chart above presents lines parsed per millisecond (the larger bar the better).

The Parboiled2-based parser is up to 500 times more performant than the Parser Combinators one. However if we split the input on new line characters before parsing it with the PC parser the speedup factor is around 8 and is independent of the input size. However I suspect it would degrade if the input file had more columns.

Although this is a very small benchmark it’s reasonable to assume the performance differences would be similar for other parsers.

Originally published at maciejb.me on July 11, 2014.