That one time I had a Postgres cluster on my veranda

Marco Amann
Digital Frontiers — Das Blog
Jun 24, 2022 · 12 min read

This is the story of a performance problem with Postgres that required tearing apart every part of the tech stack and that bordered on the absurd.

Motivation

For another blog post (which will be linked here once it is ready), I did some research into the performance we can expect when working with the Axon framework and different event stores.

The initial idea came from a slide (see below) that Allard (the CTO of Axon) included in one of his talks. It seems some information is missing there. So I set out to compare a few event stores, starting with PostgreSQL. It turns out I got sidetracked before I was even able to compare them…

The slide in question, from this talk.

Context

To measure the performance of the event stores, we first need to take a look at the workload characteristics here. Axon allows for Event Sourcing (ES), which means that we store events in an event store and source (i.e. read) all events belonging to an aggregate (think of a business object) when we need to make decisions about it. That this sourcing process is resource-intensive goes without saying, but I want to have reliable numbers on when we will hit the performance wall.

But before measuring this quite complex mix of inserts and reads, I wanted to start with the simplest case of only inserting events. In ES lingo, this translates to creating aggregates with only one event. Insert-heavy workloads are what Allard describes as the source of the above graph.

So we will create entries in the DB in a loop and measure how fast we can do this.

A baseline

Before we can play around with parameters, we have to generate a baseline against which we can compare our modifications. Further, it makes sense to make a few assumptions about the system that we can confirm or reject over time. Since this is a disk-heavy workload, suspecting disk-IO to be the limiting factor is a sensible first assumption.

The Setup

Axon works on the JVM, so I wrote a bit of benchmark code in Kotlin using Spring. This code measures the time it takes to insert a batch of events and logs that time to a CSV file. To use Postgres as an event store, one can use the JPA adapter. Simply dropping it on the classpath will make the Spring autoconfiguration magic do its thing.
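Just to illustrate what “dropping it on the classpath” means, the dependencies could look roughly like this in a Gradle Kotlin DSL build (the benchmark repo itself uses Maven; the coordinates below are what I would expect, and the version is only an example):

dependencies {
    // Axon with Spring Boot autoconfiguration; pick the version matching your setup
    implementation("org.axonframework:axon-spring-boot-starter:4.6.0")
    // JPA support and the Postgres driver; versions assumed to be managed by the Spring Boot plugin
    implementation("org.springframework.boot:spring-boot-starter-data-jpa")
    runtimeOnly("org.postgresql:postgresql")
}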

For the correctness of ES, it is essential that each event is persisted to disk before we decide to act upon it. In the case of Axon with the JPA backend, this translates to one transaction per created event. So we can expect the transaction rate to be an important factor. You can find the code under synapse in the GitHub repo.
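To give an idea of the shape of the benchmark, here is a minimal sketch of the measurement loop, assuming events are published through Axon's EventGateway; the real code under synapse differs in details such as the warmup phase, the event payload, and the exact CSV format:

import org.axonframework.eventhandling.gateway.EventGateway
import org.springframework.stereotype.Component
import java.io.File
import java.util.UUID
import kotlin.system.measureNanoTime

@Component
class InsertBenchmark(private val eventGateway: EventGateway) {

    // hypothetical payload; the real benchmark uses its own event types
    data class BenchmarkEvent(val aggregateId: String, val payload: String)

    fun run(totalEvents: Int = 1_000_000, batchSize: Int = 1_000) {
        File("timings.csv").printWriter().use { csv ->
            csv.println("eventsInDb,avgMsPerEvent")
            var inserted = 0
            while (inserted < totalEvents) {
                val batchNanos = measureNanoTime {
                    repeat(batchSize) {
                        // with the JPA storage engine, each publish should end up
                        // as its own committed DB transaction
                        eventGateway.publish(BenchmarkEvent(UUID.randomUUID().toString(), "x"))
                    }
                }
                inserted += batchSize
                csv.println("$inserted,${batchNanos / 1_000_000.0 / batchSize}")
            }
        }
    }
}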

Regarding hardware, things are quite boring: the test machine has a Ryzen CPU, 32 GB of RAM and an M.2 SSD.

First measures

As a first approximation, I simply looked at the rate at which events were persisted, averaged over all 1 million inserted events. This clocked in at slightly over 250 events per second, which falls way below my expectations. So let’s have a closer look at the performance over time. Using R, I created some graphs based on the logged CSV files.

On the x axis you can see the number of events currently in the database. On the y axis, you see the average time it took to insert one event in a batch of 1000 events.

We can see that the running average (red) settles somewhere between 3.5 and 4 ms per event, matching the observed rate of about 250 events per second. We also see that there is a lot of jitter going on, and that after a certain point there seems to be some lower boundary that is never crossed again. Now let’s talk about the elephant in the room: what happened to the first 50,000 events? These first events were much faster than the rest and showed less jitter.

I asked the same question in the team chat and got a few hints, which led me to drill down on the problem way deeper than expected.

Beginning the journey

The first tip was to have a look at the GC. This was quite reasonable: during my Bachelor’s thesis, I observed comparable behavior on the JVM, pictured below. If we ignore the blurred-out scale and the orange square marking the warmup phase (I also included a warmup phase in our benchmarks that I did not tell you about), the graph shows the event rate of a (completely different) system along with its moving average. The sudden decrease in performance is comparable to our current benchmark. So did we just find the culprit?

Performance of a different JVM application, degrading because of GC pressure

The system in this image had severe problems with GC pressure, leading to ever-increasing GC pauses. So how can we tell if our benchmark is affected by GC pauses? One way is to fire up the profiler and record them.

While I did this, I found another, much easier way to rule out the GC as the cause of the problems: I re-implemented the query and measurement code in a different language with no GC. (See neuron in the GitHub repo).

Performance of implementations in Java and Rust

This image shows the performance of the JVM implementation (blue) compared to an implementation in Rust (brown). We can see similar jitter, although not as pronounced, as well as the same behavior for roughly the first 50K events.

Before we leave the JVM behind, let me summarize what the profiler told us. The garbage collector was bored and ran only sporadically. Furthermore, the profiler revealed what the application did with the time it was given:

60%: com.zaxxer.hikari.pool.HikariProxyConnection.commit()
24%: org.axonframework…JpaEventStorageEngine….appendEvents

So about 84% of everything the application does is working with the DB in these two functions. A few more places that would increase this number even further are not shown here. What is not directly visible is that the appendEvents function starts the transaction. Our initial assumption that the DB transaction rate would be crucial just got more support.

Since we have just ruled out the JVM as our problem, we need to descend one layer.

But before we continue down our stack, I want to interject that we need to exercise extreme caution when measuring performance.

A quite important interlude: be careful with your measurements!

A quick side note here: take care not to distort your results simply by measuring them. Since waiting for the benchmark to finish and for R to draw me some pretty graphs became tedious, I wanted a more direct way of seeing what the application was doing at any given moment.

So I brought in Prometheus to get live metrics from the application. This later turned out to be a major step toward identifying the problem; I just did not know it yet, much less in what way it would help. Anyway, I set up a Prometheus instance on the same machine and configured it to scrape the application metrics. Unfortunately, the scrape interval was still set to 10 ms, left over from an experiment a few months back. This led to interesting results: below, you can see the rendered CSV data of a benchmark run while Prometheus was scraping.

This is interesting for several reasons: First, I am unable to use R correctly and the graph is called “red” ¯\_(ツ)_/¯. Second, we see performance rise and drop in waves. Third, some outliers are as fast as the initial events, which, by the way, now number fewer than 50K.

Since 22 out of 24 CPU cores were doing nothing, I could draw a first conclusion: the benchmark results are highly susceptible to disk IO. This is in line with the initial assumption that the DB transaction rate would turn out to be important. So how do we make the DB go as fast as it does for the first few events? And what causes the drop in performance in the first place?

Did we just change the observation by measuring it? | From here.

Interim summary

Let’s quickly summarize what we have found out so far:

  • Neither Axon nor the JVM is causing us trouble
  • The Postgres DB, or rather our configuration of it, is our bottleneck
  • Some kind of disk IO that we have not yet identified heavily influences performance

Obviously, we need to identify what the DB does with the disk that changes after a few seconds.

Another detour: MongoDB

Since I had already spent about a day at this point and had not even started comparing PostgreSQL to another technology, I fired up MongoDB and changed Axon to use it as its event store. Who knows, maybe the performance observed so far is not that bad after all.

So, by simply switching the Maven dependencies, Axon allows us to use MongoDB. The following graph shows the first 250K events: in blue we have the data from Postgres as before, and in green we see the behavior when using MongoDB.

Blue: Postgres, green: Mongo

So can we already conclude that PostgreSQL is the wrong tool?

Let’s not jump to conclusions too quickly…

Photo by Prasesh Shiwakoti (Lomash) on Unsplash

We need to talk about ACID for a moment. With the default configuration of MongoDB, we just threw the ACID guarantees overboard. The D in ACID, durability, means that once a transaction is committed, it will survive crashes of the system. In our case this means it has to be written to persistent storage before we tell the caller that the transaction was successful. This is a bit simplified; just read up on 2PC on Wikipedia if you want to know more. To summarize: with Postgres, everything has to be synced to disk for each and every event, which is not required for MongoDB in its default configuration. Postgres uses one of the fsync-related syscalls for this. For the sake of this article, we will call them all fsync, even though Postgres actually uses fdatasync in the benchmark.

But we can tell Postgres to act like mongo in this regard. Let me show you the single line that allows you to forget everything you learned about databases in university:

synchronous_commit = off

This tells Postgres not to wait for a sync to disk when committing, which allows for better performance. Like, seriously better than before. Postgres without synchronous commit is shown in red:

Blue: Postgres, green: Mongo, red: Postgres without sync commit

Now we can draw a few conclusions:

  • PostgreSQL and MongoDB are equally performant when we do not require ACID compliance.
  • Syncing to disk seems to be related to the performance drop

We need to discuss whether using an Event Store without ACID is a valid solution, but this is the topic of a dedicated post. For now, we still need to figure out what causes the weird behavior for the first few events.

Go deeper?

To be frank, I was afraid I had delved too deep and wanted to stay with Postgres for a while before going any further. Therefore I tried out about a bazillion different combinations of Postgres settings, ranging from shared buffers to deactivated indices and autovacuum to fiddling with the WAL size, all in search of why things got slower. What part of Postgres fills up before performance degrades? Turning on logging and crunching the logs with pgBadger only revealed things we already knew from the profiler: the commit call took most of the time.

Timings as summarized by pgBadger
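Just to give an impression of the knobs involved in that settings safari, here are a few examples of the kind of parameters I touched (the values are purely illustrative, not tuning advice):

# excerpt from postgresql.conf, values for illustration only
shared_buffers = 4GB
autovacuum = off
max_wal_size = 8GB
# statement logging so pgBadger has something to crunch
log_min_duration_statement = 0
log_checkpoints = on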

Then it hit me: what if it is not Postgres’s use of fsync that gets slower, but rather fsync itself? If only we could try fsync without Postgres…

Well, we can.

There is a ready-made tool called pg_test_fsync that tests the different sync methods. So I fired up the tool with the -s argument, which defines how long the test should run, once with a duration that stays within the time it takes to insert the first 50K events and once with way more time, and compared the two.

> pg_test_fsync -f ./syncs.bin -s 2
...
fdatasync 987.984 ops/sec 1012 usecs/op
> pg_test_fsync -f ./syncs.bin -s 120
...
fdatasync 298.926 ops/sec 3345 usecs/op

Nice, this roughly resembles the values we observed in the two phases! This means that the longer we keep hammering the disk with fsync calls, the slower each individual call becomes.

Ok, time to go a bit deeper.

Buffers?

I was at a complete loss as to where to look next, so I asked a friend for advice. They recommended this link, which explains the output of /proc/meminfo. (Un)fortunately, the values changed quite fast, so I started node_exporter to have them captured by Prometheus.
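(If you only care about a single value and do not want to set up node_exporter, a quick hack like the following Kotlin snippet would also do; this is just a sketch of the idea, not what I actually used.)

import java.io.File

// naive watcher: print the "Dirty" line from /proc/meminfo once per second
fun main() {
    while (true) {
        val dirty = File("/proc/meminfo").readLines()
            .first { it.startsWith("Dirty:") }   // e.g. "Dirty:  123456 kB"
        println("${System.currentTimeMillis()},${dirty.removePrefix("Dirty:").trim()}")
        Thread.sleep(1000)
    }
}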

Below, you can find a graph showing the number of dirty writes (blue) and the rate of Postgres transactions (green) over time.

Unfortunately, the dirty writes keep showing the same pattern well beyond the point where the Postgres DB stops showing good performance, so I assume this buffer is not the root cause of our problems. The other values in /proc/meminfo did not help either.

Then I scrolled through the other node_exporter graphs until I found something quite suspicious.

See the large orange area just below the red lines? There’s our problem. It turns out the temperature rises quickly when the benchmark starts. So let’s glue together a new dashboard with the relevant graphs and have a closer look.

The temperature reaches 80°C at just the moment the performance drops. Repeatedly. Every time. It seems the fsync performance is limited by something as trivial as temperature.

There is only one way to find out: We need to do stupid things in the name of science!

Running Postgres at about 2°C. Now the 30 m Cat6 cable finally pays off.

With the PC cooled down to about zero degrees by running it on the veranda, the benchmark runs longer without dropping in performance. I would argue that this confirms the assumption about temperature-related throttling.

IOPS

We found that temperature-related throttling affects the IOPS, but aren’t the IOPS too low to start with? Even when the SSD does not throttle, it manages only about 1K fsync ops per second. Let’s compare this with a shared server rented in the cloud:

local: fdatasync   987.984 ops/sec   1012 usecs/op
VPS:   fdatasync  3691.707 ops/sec    271 usecs/op
(note: these are the values without throttling)

Even in a shared environment, the server hardware is way faster than the dedicated SSD. So our takeaway here should be that, if performance is a concern, it makes sense to choose your hardware or cloud vendor based on your requirements.

Bonus: Can you hear performance?

After giving up on my SSD, I configured an old rack server I had in my basement with three 10K SAS drives to see how the software behaves on different hardware. When running the benchmark, I could not resist putting a microphone next to it and recording a few seconds of audio.

Please don’t ask how or why the different frequencies are generated.

Conclusion

We saw that tracking down weird behavior from top to bottom, while eliminating as many side effects as possible, works. But not only that: by tearing down a system slice by slice, we can learn a lot along the way. Building systems that can be analyzed in layers is a design principle encountered often in computer science, for example in the OSI layers. We have also seen that tight monitoring of your system, with as many metrics as is reasonable, might help you track down problems faster.

As mentioned in the introduction, I am preparing a blog post on the performance of different event stores for Axon. Stay tuned for my next post, covering the results of that experiment.

Thanks for reading! If you have any questions, suggestions, or critique regarding the topic, feel free to respond or contact me. You might be interested in the other posts published in the Digital Frontiers blog, announced on our Twitter account.
