A deep dive into Binary JSON formats: BSON

Loïc Lefèvre
db-one
Published in
2 min readSep 18, 2024

Continuing from the first chapters:

  • History
  • What is a Binary JSON format?

Chapter 3 — BSON format

MongoDB invented the BSON format, which is specified here (we are at version 1.1 right now).

Key capabilities

  • Streaming format: key/value pairs are stored in the order they appear inside the JSON text document. Also, accessing a key/field requires parsing the previous ones (true when the document is not pre-loaded in RAM, which also denotes the importance of indexes).

This is a unique difference in the sense that other Binary formats don’t follow this rule which by the way is not specified by the JSON standard itself. The only way to guarantee order of fields/values is by using an array.

  • Binary encoding for numbers: this avoids parsing over and over
  • Optimized encoding for null, true, and false values: 1 byte
  • Extensible thanks to the 8 bits reserved for defining field type (up to 256 distinct data types possible): it allows storing in binary more data types than what the JSON standard requires such as UTC datetime (Unix epoch), ObjectIDs, binary, etc. plus it allows for user-defined data types
  • Extended JSON (ejson) notation: allows forcing data into specific data type encoding
// 16 bytes (128-bit IEEE 754-2008 decimal floating point)
"price": { "$numberDecimal": "1234.5" }
  • Client-side encoding/decoding available
  • Duplicate fields/keys are supported (even the special _id field).

Additional notes

  • Strings are terminated by a \0 as for C strings
  • Array elements are also stored as key/value pairs: each item has a key/field name corresponding to the position inside the array stored as a string
"array": [ 2, 4, 8, 16, 32 ]

// BSON encoded looks like:
"array": [ { "0": 2 }, { "1": 4 }, { "2": 8 }, { "3": 16 }, { "4": 32 } ]
Underlined in yellow, the array pseudo-keys
  • There is no extended data type for managing time zones. They are usually stored in their own field, and thus, all datetime values have a time zone equal to UTC.
  • MongoDB storage engine WiredTiger compresses BSON documents using Snappy (the default). Other compression algorithms, such as Zlib and Zstd (Zstandard), are available.
Effect of compression on a 1 MiB JSON text document, encoded into 959 KiB in BSON, then compressed to 256 KiB (Snappy), 157 KiB (Zlib), and 21 KiB (Zstandard).
  • Partial updates of JSON documents are supported solely in RAM, not in storage. That’s why, over time, using the compact command is required to rewrite (and decompress/recompress) and defragment all data and indexes. Think of it as the famous vacuum process from PostgreSQL.

In the next chapter, we’ll focus on the JSONB format…

--

--