PostgreSQL compression

Alex Kuk
Jan 20, 2020


Which options do we have?

— Auto-compression of TOAST-able data

— Postgres Pro compression / encryption

— Cstore_fdw columnar store extension

— ZSON extension for the jsonb format

— Compression file systems (like ZFS or Btrfs)

— TimescaleDB for time-series data

This article is divided into 4 parts:

— First (about auto-compression, Postgres Pro & Cstore_fdw)

— Second (about ZSON)

— Third (TimescaleDB)

— Fourth (Compression file systems)

PostgreSQL auto-compression

Briefly: PostgreSQL has no row- or page-level compression, but it can compress individual values larger than 2 kB. The compressor with the default strategy works best for attributes of a size between 1K and 1M.

In detail: PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique).

“PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows.” ©

In simple words, TOAST is about how your data is stored and compressed.

The TOAST management code is triggered when a row value is wider than TOAST_TUPLE_THRESHOLD bytes (2 kB by default), and will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (normally 2 kB) or no more gains can be had.

You can set TOAST_TUPLE_TARGET in the storage parameters when creating or altering a table:

ALTER TABLE … SET (toast_tuple_target = N)

Valid values are between 128 bytes and (block size minus header), which is 8160 bytes by default.
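For example, here's a quick sketch on a hypothetical docs table (the table and column names are just for illustration):

CREATE TABLE docs (
    id   bigserial PRIMARY KEY,
    body text
);

-- Make TOASTing kick in for rows smaller than the default ~2 kB
ALTER TABLE docs SET (toast_tuple_target = 256);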

Also, you can use ALTER TABLE … ALTER COLUMN … SET STORAGE … to change the strategy or disable TOASTing for a column entirely (there is a small sketch after the list of strategies below). Let's check the manual for the description of the storage strategies:

PLAIN prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types.

EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.

EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.

MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)
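And here is the promised sketch, again using the hypothetical docs table from above:

-- Keep body values out of line but uncompressed,
-- e.g. to speed up substring() on very wide values
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;

-- Or turn off both compression and out-of-line storage for the column
ALTER TABLE docs ALTER COLUMN body SET STORAGE PLAIN;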

The compression technique is a fairly simple and very fast member of the LZ family. Let's look at the pg_lzcompress.c source code for more details:

The compressor with default strategy works best for attributes of a size between 1K and 1M.

And here is the difference between the two included strategies:

static const PGLZ_Strategy strategy_default_data = {
    32,       /* Data chunks less than 32 bytes are not compressed */
    INT_MAX,  /* No upper limit on what we'll try to compress */
    25,       /* Require 25% compression rate, or not worth it */
    1024,     /* Give up if no compression in the first 1KB */
    128,      /* Stop history lookup if a match of 128 bytes is found */
    10        /* Lower good match size by 10% at every loop iteration */
};

static const PGLZ_Strategy strategy_always_data = {
    0,        /* Chunks of any size are compressed */
    INT_MAX,
    0,        /* It's enough to save one single byte */
    INT_MAX,  /* Never give up early */
    128,      /* Stop history lookup if a match of 128 bytes is found */
    6         /* Look harder for a good match */
};
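If you want to see these rules in action, one simple way (an illustrative sketch, the names are mine) is to compare octet_length() with pg_column_size() for repetitive vs. pseudo-random data:

CREATE TABLE comp_test (val text);

-- ~100 kB of highly repetitive text: pglz should shrink it dramatically
INSERT INTO comp_test VALUES (repeat('postgres', 12500));

-- ~100 kB of pseudo-random hex: pglz will most likely give up on it
INSERT INTO comp_test
SELECT string_agg(md5(random()::text), '')
FROM generate_series(1, 3200);

-- raw size vs. what is actually stored after TOAST compression
SELECT octet_length(val) AS raw_bytes,
       pg_column_size(val) AS stored_bytes
FROM comp_test;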

What about jsonb?

Some time ago jsonb had a problem with TOAST compression, because the leading portion of the object consisted mostly of a strictly increasing series of integer offsets. It was fairly incompressible, because the code in pglz_compress() gives up if it finds nothing compressible in the first first_success_by bytes of a value (1 kB in the default strategy, as you can see above). So, in this thread the developers solved the problem by changing item offsets to item lengths, and that's why the jsonb header now looks like this:

 * JEntry format.
 *
 * The least significant 28 bits store either the data length of the entry,
 * or its end+1 offset from the start of the variable-length portion of the
 * containing object. The next three bits store the type of the entry, and
 * the high-order bit tells whether the least significant bits store a length
 * or an offset.
 *
 * The reason for the offset-or-length complication is to compromise between
 * access speed and data compressibility. In the initial design each JEntry
 * always stored an offset, but this resulted in JEntry arrays with horrible
 * compressibility properties, so that TOAST compression of a JSONB did not
 * work well. Storing only lengths would greatly improve compressibility,
 * but it makes random access into large arrays expensive (O(N) not O(1)).
 * So what we do is store an offset in every JB_OFFSET_STRIDE'th JEntry and
 * a length in the rest. This results in reasonably compressible data (as
 * long as the stride isn't too small). We may have to examine as many as
 * JB_OFFSET_STRIDE JEntrys in order to find out the offset or length of any
 * given item, but that's still O(1) no matter how large the container is.
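A quick way to check that this works today (a hypothetical example, the table name is mine): build a large, repetitive jsonb document and compare its text length with what is actually stored:

CREATE TABLE json_test (doc jsonb);

-- one big array of 10,000 similar objects
INSERT INTO json_test
SELECT jsonb_agg(jsonb_build_object('id', i, 'status', 'ok'))
FROM generate_series(1, 10000) AS i;

-- thanks to length-based JEntries, stored_bytes should be much smaller
SELECT octet_length(doc::text) AS text_bytes,
       pg_column_size(doc)     AS stored_bytes
FROM json_test;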

Conclusion:

PostgreSQL has reasonable, balanced default auto-compression, but it's not always enough.

Postgres Pro Compression

You can find the official info about compression at the database level here.

Postgres Pro brings page compression to the PostgreSQL world and uses the zstd compression library (from Facebook with love) on Linux and zlib on Windows. The developers claim that their implementation can reduce database size by a factor of 2 to 5.

To use compression you need to create a tablespace:

postgres=# create tablespace compressed_tables location '/var/data/cfs' with (compression=true);

Once compressed, a tablespace can't be decompressed. Also, you can't compress a tablespace that was created earlier.
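Putting a table into the compressed tablespace is then plain SQL (a sketch with a made-up table):

CREATE TABLE metrics (
    ts    timestamptz NOT NULL,
    value double precision
) TABLESPACE compressed_tables;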

Cstore_fdw compression

You can also look at cstore_fdw, an open source columnar store extension for PostgreSQL. It may compress data 6x-10x (according to the developers).

But I haven't tried it myself, because it has some restrictions, like:

“We currently don’t support updating table using DELETE, and UPDATE commands. We also don’t support single row inserts”

So, I leave trying it to you.
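If you do decide to try it, the basic setup from the project's README looks roughly like this (I haven't run it, so treat it as a sketch with made-up table and column names):

CREATE EXTENSION cstore_fdw;

CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- columnar foreign table with pglz compression enabled
CREATE FOREIGN TABLE archived_events (
    event_time timestamptz,
    payload    text
) SERVER cstore_server
  OPTIONS (compression 'pglz');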

Overall conclusion.

As I said above, PostgreSQL has reasonable, balanced default auto-compression, but it's not always enough. If you have large tables with many rows and many columns, where no single value exceeds 2 kB, you may face a storage problem, because your data won't be auto-compressed.

In the next articles we'll consider the rest of the compression options.

Feel free to reach out on Medium, Telegram, or Twitter.

