in the real world
“Kanceľarśka Sotńa” (The White-collar Hundred) is an NGO in Ukraine whose goal is to facilitate change in our civil society with the help of technology. This means we face various tech challenges as part of our efforts in projects such as (most sites lack an English version, sorry) digitising old paper declarations with crowd-sourcing, searching for owners of elite property, a registry of politically exposed persons, and a registry of politicians’ assistants. Some of this work cross-pollinates with other initiatives, such as the open-source natural language processing groundwork for the Ukrainian language.
Apache CouchDB 2.0 is our choice for the core of our latest (early in development) project, in which we strive to eventually process millions of tax and property ownership declarations filed periodically by Ukrainian public servants and feed them to a business intelligence tool for analysis, or act upon them based on certain rules. CouchDB’s role will be to run a number of non-trivial map functions over each document, and the documents themselves are more likely than not to be large and nested.
It is, however, important to measure performance with real datasets and functions to gain a working understanding of the timing and resource implications of such a system.
There are various excellent benchmarks out in the wild, such as this one by Kevin J. Qiu. But it measures synthetic datasets and simple functions, which is fine for raw ops or docs per second on a given set-up, but the real world frequently differs from our assumptions.
So let’s start from the beginning.
For this article we’ll be using a subset of public documents from the National Agency on Corruption Prevention of Ukraine: namely, 290,406 declarations with a rather complex structure. This dataset has undergone some preprocessing, such as normalising numeric values from string representations to floating-point numbers. Massaging the data like this is a useful procedure in general, as it saves computational costs when applying the functions later on. Better to spend some resources once than to waste them repeatedly.
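As an illustration, the normalisation step can be sketched like this (a minimal sketch; the helper name and the exact string formats are assumptions, and the real declaration schema is messier):

```python
# Hypothetical sketch of the numeric normalisation done during import.
# Ukrainian declarations often use a comma as the decimal separator
# and spaces as thousands separators.
def normalise_number(value):
    """Convert '12 345,67'-style strings to floats; pass numbers through."""
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = value.replace(" ", "").replace("\u00a0", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return 0.0  # a real importer might log or skip these instead
```

Doing this once at import time means every map invocation later works on ready-made floats.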
Raw size in JSON before importing is ≈4.8GB. CouchDB divides this dataset into 8 shards, totalling ≈0.9GB on disk (compressed with Google’s Snappy). The raw JSON (compressed) is available here. In the future I’m hoping to set up some kind of public node, available for read replication (Cloudant, possibly?)
Eventually we want to scale to millions of documents and tens of GBs, growing every year with new documents.
The tested query servers are the default “couchjs” (a pretty old Mozilla SpiderMonkey 1.8.5 under the hood), “couch-chakra” (an experimental JS query server utilising MS ChakraCore), “couchpy” with CPython 2.7.9 (unfortunately couchpy doesn’t support Python 3 in the query server) and with PyPy 5.6.0 (Python 2), as well as Erlang via the native query server. We will be testing each query server with functions as similar to each other as possible. In addition, we’ll use a “baseline” Erlang function, which is as minimal as possible while still doing something useful.
Performance metrics will be measured by a script based on Kevin’s work, which queries CouchDB’s “active tasks” sampled every second, recording the time difference between when a task was last updated and when it started, as well as the number of changes processed by the task. The rate is the number of changes divided by the time difference. We will output both the sum over all tasks in the sample and the metric for a single task. At the end of the process we’ll average all the sums for the average speed and output the total time taken by the last active task (which is the longest-running one).
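In code, a single sample boils down to something like this (a sketch, assuming the standard fields CouchDB reports in /_active_tasks; the real script also loops, samples every second and averages):

```python
# Sketch of one measurement sample over CouchDB's active tasks.
# Each task dict mirrors /_active_tasks output: 'changes_done' plus
# 'started_on' and 'updated_on' as epoch seconds.
def sample_rates(tasks):
    """Return (sum of per-task rates, list of per-task rates)."""
    rates = []
    for task in tasks:
        elapsed = task["updated_on"] - task["started_on"]
        if elapsed > 0:
            rates.append(task["changes_done"] / elapsed)
    return sum(rates), rates

# Two hypothetical indexer tasks, 180s in, ~36k changes each:
total, per_task = sample_rates([
    {"changes_done": 36000, "started_on": 0, "updated_on": 180},
    {"changes_done": 36300, "started_on": 0, "updated_on": 180},
])
```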
At the same time we will manually sample system resource utilisation as processing nears the finish line. For this we’ll employ the “pidstat” program from the awesome sysstat package with the following command:
sudo pidstat -hrudvlI -C "beam|couchjs|couch-chakra|python|pypy" 2 1 > /path/to/stat.log
This will collect and report CPU (accounting for SMP), memory, I/O and OS threads for processes which have “beam” (the Erlang VM), “couchjs”, “couch-chakra”, “python” or “pypy” in their respective commands, once over a 2-second interval.
CouchDB itself will be running from a specially built docker container in single-node mode with default settings (except for the additional query servers). Before each test run the database will be purged and the whole dataset reimported. This is because CouchDB never really deletes anything, being based on an append-only data structure, which is actually rather good for real use cases (if regularly maintained) but not so much for repeated benchmarking. Views will reuse as much data as they can, even from deleted entities (which in reality are only marked as deleted). CouchDB can be coerced to really drop the data by compacting and running the view clean-up procedure, but simply purging the DB is more reliable and I’m lazy :)
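For completeness, the compaction and view clean-up route mentioned above looks roughly like this (a sketch using only the standard library; the host, port and database name are assumptions):

```python
# Hypothetical maintenance helper: POST to CouchDB's _compact and
# _view_cleanup endpoints to actually reclaim space from deleted docs.
import urllib.request

def maintenance_requests(base="http://127.0.0.1:5984", db="declarations"):
    """Build the two POST requests; pass each to urllib.request.urlopen()
    against a live node to execute them."""
    return [
        urllib.request.Request(
            f"{base}/{db}/{endpoint}",
            data=b"",
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        for endpoint in ("_compact", "_view_cleanup")
    ]
```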
Note: this article will not test reduce or any of the other function types that CouchDB supports.
Code for docker images, reference functions and helper scripts can be found in the project’s repository.
The Test Rig
The DB will be running on my laptop, sporting a 4-core (8 with HT) Intel i7-4710HQ CPU @ 2.50GHz, 16GB DDR3L RAM and a 1TB Seagate Hybrid “SSHD” (yes, not an SSD; see the results for additional info). The host OS is Fedora Linux 25 with Linux kernel 4.9.13. All tests will be executed from a virtual terminal with the graphical environment completely turned off, to minimise external influence on resources.
Docker’s storage engine is “overlay2” supported by the in-kernel “overlay” FS. The DB’s data volume is directly mapped to the host FS, which is “ext4” mounted with
Right, that’s enough of boring blabber — moving on to the core of it!
In order to have some sort of upper bound that we can’t get too far above on this set-up, I have prepared a very simple map function in Erlang. Full disclaimer before we move further: I’m not very good at Erlang, so please forgive me if your functional programming feelings are gravely hurt by what follows (and please provide feedback!).
This maps a document to a simple representation, provided the value at step_0.declarationType equals “1”.
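For readers who don’t speak Erlang, the baseline logic amounts to roughly this (a Python paraphrase; the exact emitted value is an assumption, the real function is native Erlang):

```python
# Python paraphrase of the baseline map: emit a minimal representation
# only when step_0.declarationType equals "1".
def baseline_map(doc, emit):
    step_0 = doc.get("step_0") or {}
    if step_0.get("declarationType") == "1":
        emit(doc["_id"], step_0["declarationType"])

emitted = []
baseline_map(
    {"_id": "decl-1", "step_0": {"declarationType": "1"}},
    lambda key, value: emitted.append((key, value)),
)
```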
INFO:dragnet.addview:average = 1614.30c/s (all_tasks) 203.20c/s (one task)
INFO:dragnet.addview:total time = 180s
So this completed in exactly 3 minutes, with an average rate of 1614.30 documents (or changes) per second across all 8 tasks (one task per shard), or 203.20 for a single task. This is pretty much the best we can do on this dataset.
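As a quick sanity check, the all-tasks rate is consistent with the dataset size divided by the wall-clock time:

```python
# 290,406 documents in 180 seconds comes out very close to the
# reported 1614.30 changes/s (the small gap is plausibly just
# per-second sampling granularity).
docs, seconds = 290406, 180
overall_rate = docs / seconds  # ≈ 1613.4 changes/s
```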
It’s very difficult to measure the resource usage of native Erlang view tasks, since they run inside CouchDB’s own process and are indistinguishable from it, so I’m not providing those numbers.
Well, ECMAScript. Or what is it now?
Let’s start with the built-in “couchjs” query server. It’s not really as built-in as Erlang, in the sense that it’s actually a separate program communicating with CouchDB via a pipe rather than through internal mechanisms.
This one is written in ES5.1, which SpiderMonkey 1.8.5 understands fine. Note that it’s much more complex than the baseline function. It still maps documents into much smaller representations, but at the same time it does a bit of iterating over the data structure, some comparisons and some arithmetic. A previous incarnation also included parsing the numeric value from incomeDoc.sizeIncome, but since we moved this into the import helper it’s not needed any more.
How bad will it be?
INFO:dragnet.addview:average = 1316.46c/s (all_tasks) 164.20c/s (one task)
INFO:dragnet.addview:total time = 228s
Now, that’s somewhat worse than the baseline, but not critically so. Let’s see the resource usage (which we can separate now).
Although the CPU stats aren’t very useful in absolute terms (they’re relative), they do show the proportions. The virtual memory size (“VSZ”) and minor page faults clearly show that CouchDB managed to put the whole dataset in the page cache, while operationally using only what it needs at the moment. No process is blocked on disk I/O and the database does some moderate writes. Memory usage by the query server processes is not too bad, but could be better, I think.
All right, that wasn’t all bad. On to the fancy “couch-chakra” and its Microsoft ChakraCore engine, which powers their new Edge browser. We’ll be doing ES6 with this one since the engine supports it natively.
This isn’t hugely different in this case, just a minor convenience but it could save you some pain in the long term.
In this test run its performance was just a tiny bit better than couchjs, and that’s the general trend I’ve been seeing with it: consistently better, but only by a small margin.
INFO:dragnet.addview:average = 1330.95c/s (all_tasks) 166.98c/s (one task)
INFO:dragnet.addview:total time = 226s
Resource usage draws a slightly different picture.
There is less real memory usage by the couch-chakra processes, which is no surprise since Microsoft is marketing ChakraCore for the IoT scene. I’m not quite sure what’s up with the virtual memory usage (35GB!); quite possibly it’s just a bug somewhere, as this query server is very early in development at the moment. CPU utilisation is similar.
I’d say it’s a promising prospect, and native support for ES6+ is a good thing in my book. Moreover, the project’s author, Daniel Münch, is doing very exciting work on making it a native server on par with Erlang.
But that’s enough ECMAScript for this article; onwards to Python.
The “couchpy” project currently supports Python 3 for its client, but not for the query server. I honestly tried running it on Python 3, but it crashed in so many ways that I didn’t bother to continue, so Python 2 it is.
It’s conceptually doing the same thing as the ES functions, in the closest way possible.
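In spirit, the couchpy map function looks something like this (a sketch; field names beyond those mentioned in the article are assumptions, and couchdb-python map functions yield key/value pairs):

```python
# Sketch of a couchpy-style map function: filter by declaration type,
# then iterate a (hypothetical) income section summing sizeIncome.
def income_map(doc):
    step_0 = doc.get("step_0") or {}
    if step_0.get("declarationType") != "1":
        return
    total = 0.0
    for income_doc in (doc.get("step_11") or {}).values():
        size = income_doc.get("sizeIncome")
        if isinstance(size, (int, float)):
            total += size
    yield doc["_id"], total
```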
We’ll do this with two runtimes: CPython 2.7.9 and PyPy 5.6.0. First CPython’s results.
INFO:dragnet.addview:average = 1341.77c/s (all_tasks) 167.98c/s (one task)
INFO:dragnet.addview:total time = 222s
A step closer, but again, not substantially.
Now this is interesting. About half the memory of both ES servers, less CPU usage by the query server process, no page faults, and a more efficient CouchDB process (possibly it wastes less time waiting for results). It’s definitely doing something a lot better. I’m not sure whether the interpreter (which isn’t even JITed!), the query server or the function has hit some happy path, or perhaps it’s a combination of factors.
But let’s look at PyPy with the same function. Surely everything is even better with JIT, right?
INFO:dragnet.addview:average = 1154.04c/s (all_tasks) 143.07c/s (one task)
INFO:dragnet.addview:total time = 251s
Higher memory usage is actually expected, with the JIT and GC doing their jobs. In fact, this bad (actually the worst) result can be explained by the overhead of JIT compilation. What matters in this test is really just executing a smallish piece of code as fast as possible, which doesn’t let the JIT warm up properly. PyPy shines on heavy number crunching and long-running complex object-oriented code, not on tests like this. Quite probably some functions would hit bad paths with CPython and perform much better with PyPy.
And now to something completely different…
OK, I admit, originally I did this just to look like one of those cool FP kids on the block :) Even though it took me a few hours to write an identically working function that doesn’t crash, it was quite a fun exercise. I don’t get to do much functional programming in my daily routine, so bear with me.
The main part of it is the Calc_income/2 function, which together with lists:foldl/3 replaces the iteration of the previous examples with a tail-recursive reduction. The code is somewhat “uglified” by the fact that CouchDB only accepts a single anonymous function, so one can’t split it into several properly pattern-matched functions.
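The same fold can be paraphrased in Python with functools.reduce (a conceptual sketch only; the real Calc_income/2 handles more cases):

```python
# Conceptual Python equivalent of folding income entries with an
# accumulator, the way lists:foldl/3 does in the Erlang function.
from functools import reduce

def calc_income(acc, income_doc):
    size = income_doc.get("sizeIncome")
    if isinstance(size, (int, float)):
        return acc + size
    return acc  # skip malformed entries, keep the accumulator

incomes = [{"sizeIncome": 100.0}, {"sizeIncome": "n/a"}, {"sizeIncome": 50.0}]
total = reduce(calc_income, incomes, 0.0)
```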
Anyway, will it blend?
INFO:dragnet.addview:average = 1507.90c/s (all_tasks) 190.04c/s (one task)
INFO:dragnet.addview:total time = 197s
Oh yes. This is much closer to our baseline, and in fact this will always be the case with native query servers, unless you manage to write a very bad function (which is not that difficult, mind you!).
The reason is that it never does any pipe I/O, running inside the Erlang VM in the same process as CouchDB, which decreases the overhead to the absolute minimum.
At the same time this clearly displays how making the function more complex affects the processing rate against the baseline.
I’m not providing the resource usage stats, for the same reasons as with the baseline.
The Input, the Output and the Lack of Them
So far, even with Erlang, the results don’t appear too spread out; nothing like an order of magnitude, or even a factor of two. This got me thinking: could disk I/O be getting in the way on my HDD? Would an SSD perform better? An attentive reader will have noticed that the disk I/O stats weren’t showing many writes (maybe 2MB/s at most) and no I/O blocking was happening, but then there’s latency to take into account.
Even though I didn’t have a comparable SSD system at hand, there’s an even better way to rule out disk I/O: a ramdisk.
tmpfs is a special kind of filesystem on Linux which maps an FS location to a region of volatile memory, so that all I/O at this location happens in RAM. The only requirement is having enough RAM for the location’s whole contents, otherwise the system will start swapping (or OOM-killing if there’s no swap).
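To double-check that a data directory really sits on tmpfs, one can look it up in /proc/mounts; a small sketch (the sample mount table below is made up):

```python
# Find the filesystem type backing a path by matching the longest
# mount-point prefix, /proc/mounts style.
def fs_type(mounts_text, path):
    best_mnt, best_type = "", ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mnt, fstype = fields[1], fields[2]
            if path.startswith(mnt) and len(mnt) > len(best_mnt):
                best_mnt, best_type = mnt, fstype
    return best_type

# On a live system: fs_type(open("/proc/mounts").read(), "/tmp/couch-data")
sample = "/dev/sda1 / ext4 rw 0 0\ntmpfs /tmp tmpfs rw 0 0"
```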
So I restarted my docker container with the data volume mapped to a directory inside /tmp, which on Linux is typically mounted as tmpfs. With 16GB of RAM and about 1GB of data this is totally feasible.
It only makes sense to test this first with the baseline Erlang function. If disk I/O is the bottleneck, it should be measurably faster; if it’s not faster, none of the others will be.
The moment of truth…
INFO:dragnet.addview:average = 1622.27c/s (all_tasks) 203.39c/s (one task)
INFO:dragnet.addview:total time = 180s
It is almost exactly the same as on the HDD. Just to make sure we’re not doing any disk access, this time we’ll measure the resource usage too.
Nope, no I/O across 3 consecutive samples at a 2-second interval, so this looks correct.
This leads us to…
The conclusions, I believe, are quite self-evident by now, but here are some quick charts just to recap.
With a big enough and complex enough dataset, applying non-trivial functions is generally CPU-bound. SSDs won’t add much value for writes (nor for reads, when the data fits into memory), and different query servers can only outperform each other by so much.
This, of course, isn’t the whole story. Every benchmark should be taken with a grain of salt, as it can’t realistically cover all possible use cases. This one, I’d like to reiterate, is about a complex dataset and non-trivial functions. Another point is that even a small (but measurable) difference in performance matters for scaling: a 10-second difference on 1GB can become tens of minutes on tens of GBs. Or not; this extrapolation needs testing too (which I plan to do continuously; thankfully this is rather easy with our tools).
I did actually test this on a much less powerful system (with an SSD, and still fitting the data into RAM), and the difference between, say, native Erlang and couchjs was about 2× in favour of the former: roughly ≈400s versus ≈800s, with about ≈600s for CPython. Unfortunately I did not record the results (I’m too lazy to rigorously redo this on a slow machine), but it supports the point.
Don’t treat benchmarks as absolutes; instead, infer the trends and run your own tests on your own use cases. Also… don’t process gigabytes of complex data on your toaster.
As final words, I’d like to share some thoughts about our use case.
It would be tempting to just say “let’s use CPython and be done with it” (because saying the same about Erlang would most likely alienate me from society until the end of days). It does indeed look pretty good in all regards, but I’m not sure it’s actually a good, future-proof choice. The thing is, it’s not guaranteed to run great on all kinds of functions, while ES performance is more homogeneous. ES adoption is more widespread, and the author of “couch-chakra” has some great ideas for eliminating the pipe I/O overhead. So for now I’m more in favour of using “couch-chakra” as the main query server in our project and ES6+ as the main language.
In addition, full-dataset processing performance is not so important for us: it won’t happen too frequently, and we can implement some kind of warm-up queries and use the “stale” parameter for the real queries, in order to have almost up-to-date views and still be able to retrieve them immediately. Besides, CouchDB is actually pretty good at reusing as much data as possible and avoiding full recomputes. It also applies map functions incrementally to incoming documents, so simple dataset updates won’t typically trigger full database processing with all available views (which could obviously take ages).
When one server stops being enough, CouchDB 2.0+ makes clustering fairly straightforward: simply place shards on more nodes and balance the requests with haproxy. Master-master replication makes cluster configuration changes possible with minimal pain, so this is mostly covered.
More preprocessing should also be considered, such as making the IDs smaller (this can have a tremendous effect on view size and insert time), encoding choice values, etc.
I will follow up on this article when these plans come to fruition and this starts working with the rest of our services. But that’s it for now. Hope you enjoyed it!