Benchmarking PostgreSQL with Different Linux Kernel Versions on Ubuntu LTS

Recently Josh Berkus published an article, Why you need to avoid Linux Kernel 3.2, claiming that this particular version has significant IO performance issues. The article shows two graphs reflecting a dramatic difference in IO utilization between kernel versions 3.2 and 3.13, but, unfortunately, for privacy reasons it does not take a deeper look into the problem. This article is an attempt to fill that gap by researching the IO behaviour of 6 kernel versions supplied with the latest 3 Ubuntu LTS releases, and, hopefully, to provide some additional insights.

Let’s take a look at the testing environment and the benchmarking method.

Host machine:

  • Lenovo ThinkPad W540, 16GB memory, 4 cores x 2 logical cores (i7-4800MQ, 2.7 GHz), Samsung Lenovo 256GB SSD drive (MZ7TD256HAFV-000L9), NUMA off;
  • Ubuntu 14.04 (64 bit), VirtualBox 4.3.18.

Virtual machines:

  • 4GB memory, 4 processors, 40GB storage (Normal, VMDK), acceleration VT-x/AMD-V, Nested Paging, PAE/NX;
  • Ubuntu 10.04 (64 bit), 2.6.32-38-server;
  • Ubuntu 12.04 (64 bit), 3.2.0-70-generic / 3.5.0-54-generic / 3.8.0-44-generic / 3.11.0-26-generic;
  • Ubuntu 14.04 (64 bit), 3.13.0-24-generic;
  • PostgreSQL 9.3.5, iostat.

The OS/FS tweaks below are based on the Database Server Configuration notes from pgcookbook.

blockdev settings:

echo noop > /sys/block/dm-0/queue/scheduler
echo 16384 > /sys/block/dm-0/queue/read_ahead_kb
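Note that these blockdev settings do not survive a reboot. One hypothetical way to reapply them at boot on an Ubuntu of this era is an /etc/rc.local fragment; the dm-0 device name is an assumption matching the setup above and may differ on your machine:

```shell
# /etc/rc.local fragment (runs as root late in the boot sequence)
# Use the noop elevator and a large readahead for the LVM device:
echo noop > /sys/block/dm-0/queue/scheduler
echo 16384 > /sys/block/dm-0/queue/read_ahead_kb
exit 0
```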

File system configuration:

/dev/mapper/vagrant--vg-root on / type ext4 (rw,noatime,nobarrier)

Kernel tuning:

fs.file-max=65535
kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=5000000
kernel.shmall=1011818
kernel.shmmax=4144406528
vm.dirty_background_bytes=8388608
vm.dirty_bytes=67108864
vm.zone_reclaim_mode=0
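A quick sanity check on the shared memory values above: kernel.shmall is measured in pages (4096 bytes on x86-64, see getconf PAGE_SIZE), while kernel.shmmax is in bytes, so the two values describe the same ~4GB limit:

```shell
# shmall (pages) * page size (bytes) should equal shmmax (bytes):
echo $((1011818 * 4096))   # prints 4144406528, i.e. kernel.shmmax
```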

kernel.sched_autogroup_enabled is not set for 2.6.32-38-server because this particular version does not support it. Also, an extra test suite has been run for 3.13.0-24-generic with kernel.sched_autogroup_enabled set to 1, which shows some interesting results.

PostgreSQL configuration:

autovacuum_analyze_scale_factor = 0.05
autovacuum_max_workers = 5
autovacuum_naptime = '5s'
autovacuum_vacuum_cost_delay = '5ms'
autovacuum_vacuum_scale_factor = 0.05
bgwriter_delay = '10ms'
bgwriter_lru_multiplier = 10.0
checkpoint_completion_target = 0.9
checkpoint_segments = 256
checkpoint_timeout = 3600
checkpoint_warning = 720
default_statistics_target = 1000
effective_cache_size = '2816MB'
effective_io_concurrency = 4
hot_standby = on
log_destination = 'stderr'
log_lock_waits = on
log_min_duration_statement = 1000
log_statement = 'ddl'
logging_collector = on
maintenance_work_mem = '240MB'
max_connections = 200
shared_buffers = '960MB'
shared_preload_libraries = 'pg_stat_statements'
synchronous_commit = off
temp_file_limit = '10GB'
track_activity_query_size = 4096
track_io_timing = on
wal_buffers = '-1'
wal_keep_segments = 512
wal_level = 'hot_standby'
work_mem = '20MB'

The host machine’s blockdev, kernel and filesystem settings are in sync with the VMs’, except for the kernel.sched_autogroup_enabled cases mentioned above.

The benchmark script performs the following 3 actions:

  1. runs the pgbench initialization with a scale factor of 1000, which makes the data set (~15GB) approximately 4 times bigger than the VM’s memory;
  2. runs iostat in the background and the read-write pgbench test with 16 concurrent database sessions and 4 test workers for 300 seconds;
  3. runs iostat in the background and the select-only (-S) pgbench test with 16 concurrent database sessions and 4 test workers for 300 seconds.
sudo -i -u postgres
mkdir -p /tmp/test/$(uname -r)
pgbench -i -s 1000
mkdir /tmp/test/$(uname -r)/pgbench-rw
cd /tmp/test/$(uname -r)/pgbench-rw
iostat -xk 1 300 > iostat.log &
pgbench -c 16 -j 4 -r -T 300 -l --aggregate-interval=1 \
    > report.txt
mkdir /tmp/test/$(uname -r)/pgbench-ro
cd /tmp/test/$(uname -r)/pgbench-ro
iostat -xk 1 300 > iostat.log &
pgbench -S -c 16 -j 4 -r -T 300 -l --aggregate-interval=1 \
    > report.txt

pgbench will give us information about TPS and statement performance, and iostat will provide IO consumption details.
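The resulting logs can be summarized with a couple of awk one-liners. This is a sketch under two assumptions: the rkB/s and wkB/s column positions in `iostat -xk` output depend on the sysstat version, and the pgbench aggregate log format shown is the 9.3 one, where each line starts with the interval start time followed by the number of transactions in that interval:

```shell
# Average read/write throughput (KB/s) for dm-0 from iostat.log;
# with sysstat 10.x `iostat -xk`, rkB/s and wkB/s are the 6th and
# 7th fields -- check your version's header line before relying on it.
awk '/^dm-0/ { r += $6; w += $7; n++ }
     END { printf "avg rkB/s=%.1f avg wkB/s=%.1f\n", r/n, w/n }' iostat.log

# Overall TPS from the pgbench aggregate log: the 2nd field is the
# number of transactions completed in each 1-second interval.
awk '{ tx += $2; n++ } END { printf "TPS=%.2f\n", tx/n }' pgbench_log.*
```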

In the read-write mode pgbench runs the sequence of statements shown below in a transaction:

UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);

In the read-only mode it is just a single SELECT statement:

SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Let’s see what the results look like with different kernels. Here are the sources of the test reports.

Read-write tests.

Transactions per second:

  • 2.6.32 showed the best result of 1215.73 TPS;
  • 3.2.0 has a noticeable ~23% degradation down to 931.40 TPS;
  • 3.5.0, 3.8.0 and 3.11.0 are ~6–12% lower too, with 1137.94, 1130.15 and 1071.64 TPS respectively;
  • 3.13.0 gave a good result of 1177.90 TPS, which is ~3% less than 2.6.32;
  • 3.13.0 with sched_autogroup_enabled was ~16% slower with 1021.82 TPS.
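The percentages above are relative to the 2.6.32 baseline; for instance, the 3.2.0 degradation can be recomputed as:

```shell
# Slowdown of 3.2.0 (931.40 TPS) relative to 2.6.32 (1215.73 TPS):
awk 'BEGIN { printf "%.1f%%\n", (1 - 931.40 / 1215.73) * 100 }'   # prints 23.4%
```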

The statements time graph basically mirrors the TPS graph with no surprises, except one tiny observation that is not easy to notice on the graph, but becomes clearer if you take a closer look at the sources, particularly pgbench.csv. It looks like the 3.13.0-24-generic kernel slightly outperforms the others on more IO-bound queries, like the first UPDATE, but is not as effective on the less IO-bound ones, which makes me guess that it probably spends more time on query context processing in some cases.

Read-only tests.

Well, it definitely proves Josh’s words about avoiding Linux Kernel 3.2.

Transactions per second:

  • the winner is 3.13.0 with sched_autogroup_enabled, at 2157.25 TPS;
  • next goes 2.6.32 with 2056.07 TPS, which is ~4.7% less;
  • 3.5.0, 3.8.0, 3.11.0 and 3.13.0 are ~11–13% lower, with 1896.39, 1912.41, 1875.32 and 1877.33 TPS;
  • and 3.2.0 is the worst one with just 1029.82 TPS, which is (!) ~52% slower.

The statements time graph fully reflects this situation, giving us 7.75 ms and 15.50 ms average statement times in the best and worst cases.

Now let’s look at the IO behaviour through the tests.

Read-write scenario.

Pattern comparison:

  • 3.5.0, 3.8.0 and 3.11.0 have exactly the same behaviour pattern, with reads trending around ~70 MB/s and writes ~30 MB/s after reaching a stable cache state;
  • 2.6.32 has the same reads and higher writes of ~50 MB/s;
  • 3.2.0 has a really high bounce rate;
  • 3.13.0 shows really good reads, 3 times lower at ~20 MB/s, with a barely noticeably higher writes pattern;
  • and 3.13.0 with sched_autogroup_enabled shows even lower writes of ~28 MB/s with the same low reads of ~20 MB/s.

The huge difference in read behaviour between the last two kernels and the rest looks pretty interesting, doesn’t it? But that’s still not all.

Read-only scenario.

Pattern comparison:

  • just as in the read-write tests, the 3.5.0, 3.8.0 and 3.11.0 patterns are very similar for the select-only tests, with ~95 MB/s reads and ~4 MB/s writes;
  • 2.6.32 has about the same reads and slightly higher writes with a stably bouncing pattern;
  • 3.2.0’s reads are much lower, ~30 MB/s in the beginning and ~60 MB/s after the cache stabilization, which was expected given the low TPS shown above; the stabilization also takes around twice as long, by the way;
  • the interesting thing is that 3.13.0, despite a very good read-only TPS, showed dramatically low reads of ~30–32 MB/s with a really low bounce rate and writes of ~6 MB/s, which might tell us that this kernel has some serious IO issues fixed, probably related to shared memory management;
  • 3.13.0 with sched_autogroup_enabled shows slightly higher reads of ~37 MB/s and slightly lower writes of ~4 MB/s, which correlates with the highest read-only TPS.

In conclusion I would like to put down these three points:

  1. if you are on kernel 3.2.0, or probably any 3.2 one, then it is worth upgrading for better performance;
  2. you should upgrade to 3.13.0, or probably a later version, due to the IO issues fix that might dramatically change the performance of your database for the better;
  3. if your database is mostly read-intensive, then set kernel.sched_autogroup_enabled to 1.
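For point 3, the setting can be applied at runtime and persisted like the other kernel parameters above (a sketch for a classic sysctl.conf-based Ubuntu setup):

```shell
# Apply immediately (requires root); has no effect on kernels without
# autogroup support, such as 2.6.32:
sysctl -w kernel.sched_autogroup_enabled=1
# Persist across reboots:
echo 'kernel.sched_autogroup_enabled=1' >> /etc/sysctl.conf
```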

It would also be very interesting to benchmark newer kernels’ performance, as well as to check PostgreSQL behaviour with different kernel settings, and I’m planning to do that in future articles. Stay tuned.

PS. A question to the Linux FS/MM group: what might be the reason for such a huge difference between 3.13.0’s IO pattern and those of the previous versions?

The article is supported by Nitro Software, Inc.
