A deep dive into Binary JSON formats: JSONB

Loïc Lefèvre

Follow

Published in

db-one

4 min readSep 19, 2024

--

Continuing from the previous chapters:

Chapter 4 — JSONB format

First introduced with PostgreSQL 9.4, the JSONB format is really different from BSON. But more interestingly, there are 3 distinct JSONB formats:

the on-disk format, subject to The Oversized-Attribute Storage Technique (also known as TOAST) (see below notes)
the in-memory format that supports a bit more value types (more on that later), known as JsonbValue
the JSONB data type from SQLite 3.45.0, which — according to its author “is completely different and incompatible with PostgreSQL JSONB"

Following, we’ll deal solely with the on-disk format (first bullet).

Key capabilities

Hierarchical format: key/value pairs for the same level are embedded together. Objects and arrays embed their respective children. Per level, key/value pairs are sorted by key length, then alphabetically to allow for efficient binary search (does a field name exist?). Said otherwise, the JSONB format doesn’t keep fields ordering.

Simple JSONB representation. 308 bytes as JSONB vs 199 bytes (after spaces and carriage returns removed).

Pointers optimization: given a level inside the JSON hierarchy, the structure JEntry manages a set of ordered pointers and value types. (same order as keys). This set is optimized so that
Binary encoding for numbers: this avoids parsing over and over (See the picture above. The highlighted value in yellow is the number 2 using PostgreSQL short numeric encoding).
32-bit alignment for numbers and the identical-level-keys-list (See the picture above. The gray square surrounds 3 bytes added for padding).
Maximum size of 256 MiB
Offset/Length encoding of Jentry to speed up array processing and tradeoff with compressibility. Offset is used every 32 children.

Having only children entry length helps to compress JSONB since these are usually repeated. Having only offsets helps to quickly jump to values (scalar/child) but is far less compressible.

Additional notes

JSONB JEntry denotes the value data type using 3 bits, hence 8 values. As of today, 6 values are used for string type, numeric type, false value, true value, null value, and object/array entry type.
The in-memory JSONB format supports additional date, time, timestamp, timestamp with time zone types that require parsing strings.
Indexing: JSONB columns can be indexed contrarily to JSON columns.
null, true, and false values are encoded using 4 bytes. (See the picture above with orange underlined values).
UTF-8 string values do not support the \0 character.
Duplicate keys are “supported”; only the latest key/value pair appearing inside the UTF-8 text JSON is kept.
Partial updates of JSON documents are supported from an API standpoint, but PostgreSQL MVCC model defeats this from a storage standpoint (vacuum is important for large JSON documents).

TOAST

The JSONB data type can have a size of up to 256 MiB (after encoding). Since PostgreSQL uses 8 kiB blocks, it could not fit inside it. For this purpose (and for other types), TOAST is used to compress and/or slice the binary encoded document to store it “out of line” — meaning inside another table/index structure solely for that column.

The threshold size is usually 2 kiB, and depending upon the definition of the JSONB column, TOAST may compress the encoded bytes to move them compressed or not into this out-of-line structure.

What this means is:

depending upon the compression algorithm (PGLZ by default), you may hit a compression/decompression penalty or you could see storage increase
whenever you have a JSONB larger than 2 kiB, accessing fields without an index may lead to the “detoast syndrome”

For example, you may consider the LZ4 compression algorithm available since version 14 usually requires less CPU and can have a very good compression ratio.

CREATE TABLE collection (data JSONB compression lz4);

A 1 MiB JSON document on the left compressed down to 297.2 kiB with PGLZ and 70.2 kiB with LZ4.

Regarding the “detoast syndrome”, this means that whenever you need to access a value from a JSON field, for query filtering or during the projection, the JSONB document will need first be “detoasted”, thus adding an overhead proportional to the number of times you query a field:

Projecting 1, 2, 3, 4 times the same field with PGLZ compression (p95 in ms after 60,000 runs).

Projecting 1, 2, 3, 4 times the same field with LZ4 compression (p95 in ms after 60,000 runs).

Projecting 1, 2, 3, 4 times the same field with no compression (p95 in ms after 60,000 runs).

In the previous example, the JSON UTF-8 text data is 47,321 bytes long while the JSONB encoded value takes 65,667 bytes uncompressed, 12,012 bytes with LZ4 compression, and 9,717 bytes with PGLZ compression.

The typical queries used were:

-- only 1 document present in the table
select c.data->>'age' from collection c;
select c.data->>'age', c.data->>'age' from collection c;
select c.data->>'age', c.data->>'age', c.data->>'age' from collection c;
select c.data->>'age', c.data->>'age', c.data->>'age', c.data->>'age' from collection c;

In the next chapter, we’ll focus on OSON…

A deep dive into Binary JSON formats: JSONB

Chapter 4 — JSONB format

Key capabilities

Additional notes

TOAST

Written by Loïc Lefèvre