Surprisingly slow UPSERT performance: A Quest for Clarity

Max Neunhöffer
9 min read · Feb 7, 2019


This article describes a quest for clarity on which I have been in recent days. I do not think that the outcome or lessons learned are in any way new or indeed very surprising, but since I have needed a few days of thinking and experimenting to get my head around things, it is my hope that this write up might help others or at least entertain them.

Introduction: the use case

It all started because somebody complained to me that “insert with overwrite” (aka “UPSERT”) is much slower in ArangoDB than pure “insert”. At first, I thought: “This cannot be right, I need to investigate and fix this problem!”. After all, in both cases, approximately the same amount of data needs to be written.

More concretely, the experiment was to insert batches of a few hundred thousand new JSON documents into a collection in an ArangoDB cluster, and initially, a batch took something like 30 s. This looked all right, but in the actual use case, some of the documents in each batch would be new documents (new primary keys), and some would be replacement documents for old ones (the same primary key as a previous document), which should simply overwrite the old revision of the document in the database.

The original API of ArangoDB has two different calls for these. The first is HTTP POST or “insert” for new documents, which returns an error if any primary key of a submitted document already exists in the collection. The other is HTTP PUT or “replace” for old documents, which returns an error if no document with the given primary key exists in the collection.

However, in the present use case we do not know a priori which documents in a batch are new and which are old, so we actually need a third kind of operation, which I would like to call “insert with overwrite” or “UPSERT” or “REPSERT”. The new ArangoDB 3.4 actually added this operation to the API in the form of an overwrite=true option for the “insert” operation.
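
To make this concrete, here is a minimal arangosh sketch of the difference (the collection name c and the document values are just placeholders, not the data from the actual use case):

// Plain insert: fails with a "unique constraint violated" error
// if a document with _key "4711" already exists.
db.c.insert({ _key: "4711", name: "alice" });

// Insert with overwrite: replaces the old revision if "4711" exists,
// otherwise behaves like a plain insert (new in ArangoDB 3.4).
db.c.insert({ _key: "4711", name: "bob" }, { overwrite: true });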

Unfortunately, the person reporting the problem to me said that in the “insert with overwrite” case a typical batch took 3 min instead of 30 s, which was 6 times as much!

Therefore, I did a quick experiment and found: nothing! “Insert with overwrite” was approximately as fast as pure “insert”. So I went back and asked the person who had complained to provide more details. As a result, I found out that this was a case where the data set in the database was considerably larger than the available RAM of the machine. This was the first hint that something more sinister was going on behind the scenes.

I set up an experiment of the same flavor, this time with a single server ArangoDB instance and lots of data on a machine with relatively small RAM. And indeed, I quickly confirmed a substantial slowdown.

What was going on? How could we fix this problem? At this time I was still very much optimistic — and naive, of course.

What is going on? Thinking!

At first, I looked at the code for “insert” and compared it with “insert with overwrite”. And indeed, there is nothing too fancy: an insert operation for a new document first looks up the primary key, and if it is not found, the document is written. Note, however, that “not finding a key” is pretty cheap in RocksDB (which is our default storage engine), because clever bloom filter technology is used to quickly rule out the existence of a given key. This even works if the data set is much larger than the available RAM, since the bloom filters for all data files can usually be held in RAM. And since RocksDB is mostly write-optimized, the normal “insert” operation is pretty quick.

However, in the “insert with overwrite” case, things are different: The code also looks up the primary key first. In this case, however, in the situation of an actual overwrite of an existing document, the key is found. And if the data set is much larger than the available RAM, a random document will usually not be resident in RAM, so that even finding the key is much more expensive, because something has to be scraped off the disk, which can be very slow in comparison to an in-memory lookup. We discuss the need for this lookup below in more detail.

Therefore, after a certain amount of code study and general pondering, I came to the conclusion that the reads are actually slowing down the writes in this case!

Therefore I set out to confirm this theory in an experiment and also find out the severity of the problem.

What is going on? Measuring!

To quantify the effect and have an experiment to test solutions (which I still hoped to find at this stage!), I set up the following situation.

I used a relatively small instance on Google Compute Engine with 4GB of RAM and 4 vCPUs, and attached a 500GB network volume. 500GB is large enough to store data that is considerably larger than the available RAM, but at the same time small enough that the volume does not allow for a lot of IO operations per second (IOPS). As most cloud providers do, Google throttles the number of IO operations to a value that is proportional to the size of the volume. For a 500GB volume, one currently gets 60MB/s read and 60MB/s write throughput, but only 375 read operations and 750 write operations per second.

On that system, I installed a single instance of ArangoDB 3.4 and inserted 10M documents of approximately 1200 bytes each in batches of 10000. It took a few experiments to get the data right, since RocksDB by default uses Snappy to compress data: initially I used the same string in each document to pad the size to 1200 bytes, which RocksDB promptly compressed very efficiently.

In the end, I produced a “random” string by picking two random numbers r and d, and concatenating the stringifications of r, r+d, r+2d, r+3d, … , r+99d into one large string. This was sufficiently “chaotic” to avoid compression.
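
In arangosh-flavored JavaScript, the generator looked roughly like this (the exact number ranges are my choice here and are not important, as long as the result defeats compression):

// Produce a "random" string from two random numbers r and d by
// concatenating the stringifications of r, r+d, r+2d, ..., r+99d.
function makeValue() {
  var r = Math.floor(Math.random() * 1000000000);
  var d = Math.floor(Math.random() * 1000000) + 1;
  var s = "";
  for (var i = 0; i < 100; i++) {
    s += String(r + i * d);
  }
  return s;   // roughly 1000 bytes of hard-to-compress digits
}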

More accurately, I used the Community Edition of ArangoDB 3.4 from the .tar.gz binary distribution with the following startup command:

arangodb --starter.mode=single --all.cache.size 256000000 --all.query.memory-limit 100000000 --all.rocksdb.block-cache-size 256000000 --all.rocksdb.enforce-block-cache-size-limit true --all.rocksdb.total-write-buffer-size 100000000

The additional options are used to limit the RAM usage of the ArangoDB instance.
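
The insertion itself was a simple loop in arangosh. A minimal sketch, assuming a collection c and the makeValue() helper from above (the real benchmark script is linked further down), looks like this:

var internal = require("internal");   // for sub-second timing
var c = db._create("c");
var next = 0;                          // running primary key counter
for (var batch = 0; batch < 1000; batch++) {   // 1000 batches of 10000 = 10M documents
  var docs = [];
  for (var i = 0; i < 10000; i++) {
    docs.push({ _key: "K" + (next++), value: makeValue() });
  }
  var start = internal.time();
  c.insert(docs);                      // plain insert, no overwrite
  print("Batch " + batch + " took " + Math.round((internal.time() - start) * 1000) + " ms");
}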

Here is the result of the insertion times for the batches of 10000 such documents, which means each batch is approximately 12MB:

+---------+---------+
| What    | Time    |
+---------+---------+
| Minimum | 415 ms  |
| Median  | 571 ms  |
| Average | 659 ms  |
| 90%ile  | 948 ms  |
| 99%ile  | 1538 ms |
| Maximum | 1857 ms |
+---------+---------+

At the median, that is approximately 20MB of data written per second (12MB in 571 ms is roughly 21MB/s), which is not too bad, in particular considering the write amplification RocksDB has (since it is a log-structured merge tree implementation).

Here are the times for performing “insert with overwrite” for batches of 10000 documents whose primary keys are all already in the database. These were the times needed to replace 10000 documents:

+---------+----------+
| What    | Time     |
+---------+----------+
| Minimum | 1358 ms  |
| Median  | 7807 ms  |
| Average | 9783 ms  |
| 90%ile  | 21212 ms |
| 99%ile  | 44423 ms |
| Maximum | 44423 ms |
+---------+----------+

I should admit that I did not replace all 1000 batches of 10000 documents but only 100, so the statistics are based on a smaller sample. Furthermore, I made sure that the replacement batches actually lead to random accesses by essentially putting 10000 random documents in each batch. That is, the “insert with overwrite” operation had to find documents in random order and had to reach out to disk for most of them!
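
A minimal sketch of one such overwrite batch, assuming the 10M documents with keys K0 … K9999999 from the insert sketch above are already in the collection:

var internal = require("internal");
var docs = [];
for (var i = 0; i < 10000; i++) {
  var key = "K" + Math.floor(Math.random() * 10000000);   // random existing key
  docs.push({ _key: key, value: makeValue() });           // new revision for that key
}
var start = internal.time();
db.c.insert(docs, { overwrite: true });                   // replaces the old revisions
print("Overwrite batch took " + Math.round((internal.time() - start) * 1000) + " ms");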

This quantifies the effect that was originally observed. On average, the overwrite operation needs approximately 15 times as much time as the insert operation. Furthermore, performance is much more volatile, since it depends on what is found in RAM caches and how many actual IO operations have to be done for each document.

I then patched ArangoDB so that in the “insert with overwrite” operation no read whatsoever was performed and only the write operations for the new document were done. Note that this is not a viable improvement, because it corrupts the internal data structures: document counts were no longer right, and subsequent queries would return both the old and the new revision of a document, which is utterly incorrect.

However, the “insert with overwrite” times were essentially down to the old “insert” times, maybe even a bit faster since I completely left out the primary index lookup. This essentially confirmed my theory.

The code for this part of the experiment can be found here. It is JavaScript code for arangosh.

What is going on? Competitors!

Next, I was wondering whether our competitors had the same problem. Therefore I looked at PostgreSQL and set up essentially the same experiment with one table. See here for the code I used. In this case, I used the golang driver for PostgreSQL and did my best to replicate the experiment. I have to admit that I am not a PostgreSQL expert, so feel free to point me to possible improvements. I simply used SQL INSERT statements for the “insert” and SQL INSERT statements with ON CONFLICT (key) DO UPDATE … for each batch.

I used PostgreSQL 9.6 which comes bundled with the Debian Linux distribution and only changed the following default configuration parameters:

data_directory = '/data/postgresdata'    # to use the right volume
shared_buffers = 256MB
temp_buffers = 64MB
work_mem = 64MB

And indeed, the performance behavior was very similar. Here are the numbers for the “insert” batches:

+---------+---------+
| What    | Time    |
+---------+---------+
| Minimum | 258 ms  |
| Median  | 310 ms  |
| Average | 389 ms  |
| 90%ile  | 468 ms  |
| 99%ile  | 1841 ms |
| Maximum | 2300 ms |
+---------+---------+

And here are the times for the “insert with overwrite” operation; this time I only did 50 batches of 10000 rows:

+---------+-----------+
| What    | Time      |
+---------+-----------+
| Minimum | 96178 ms  |
| Median  | 108445 ms |
| Average | 109383 ms |
| 90%ile  | 116039 ms |
| 99%ile  | 125041 ms |
| Maximum | 125041 ms |
+---------+-----------+

This means that on average, an “overwrite” batch of 10000 takes 281 times as long as a pure “insert” batch! So we are not so bad, after all.

For fairness, one should mention that PostgreSQL by default uses less RAM than ArangoDB for internal caches and relies more on the operating system’s buffer cache. I did not investigate what is actually done behind the scenes. I also briefly tried PostgreSQL version 10, but it showed pretty much the same behaviour.

By the way, here is a screenshot of the monitoring overview for the instance on the Google Compute Engine Console:

This shows that the only metric which really indicates a throughput bottleneck is the IOPS (labeled “Laufwerks-E/A-Vorgänge”, i.e. disk I/O operations, in the German UI).

Result: A fundamental problem

In any case, when I saw these numbers and realized that even the very renowned PostgreSQL suffers from this problem, it dawned on me that this might be a fundamental problem.

And indeed, the argument goes as follows: imagine any data store which stores “records” of some sort (e.g. JSON documents in ArangoDB’s case, rows for PostgreSQL) and allows secondary indexes. Assume that you store considerably more data in the data store than fits into the available RAM. This is a very general situation and by no means uncommon.

It is conceivable that a pure “insert” operation can essentially be implemented with approximately the throughput at which the underlying disk system can write out the data, always taking some kind of write amplification into account. Our two examples show this quite clearly.

However, for any kind of “overwrite” operation, it is clear that due to the sheer mass of data most records will not reside in RAM or in any RAM cache. At the same time, replacing a record obviously makes it necessary to update the secondary indexes. To this end, the old version of the record is needed, so that its entries can be removed from all secondary indexes.

As a concrete example, if you have a secondary index on “name”, and somebody changes a record in a way such that the name changes, then the system has to read the old version of the record to know which entry in the secondary index it has to change.
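
In conceptual pseudocode (JavaScript-flavored here, and not the actual implementation of ArangoDB or PostgreSQL), every overwrite therefore looks roughly like this:

function overwrite(collection, newDoc) {
  // The random read that hurts: the old revision is usually not in RAM.
  var oldDoc = collection.lookup(newDoc._key);
  if (oldDoc !== null) {
    collection.secondaryIndexes.forEach(function (index) {
      index.remove(oldDoc);    // old index entries can only be derived from the old document
    });
  }
  collection.write(newDoc);    // the write itself is cheap (sequential, buffered)
  collection.secondaryIndexes.forEach(function (index) {
    index.insert(newDoc);
  });
}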

But however the data is organized on disk, reading the old record needs at least one random read access to disk. Depending on the underlying disk, this can easily take three orders of magnitude longer than a simple lookup in RAM!

Therefore, after studying this problem for a while, it seems to me that this is a fundamental impediment which cannot be overcome.


Max Neunhöffer

I am a mathematician turned database engineer, developing @ArangoDB, a multi-model NoSQL database. https://arangodb.com/