How to serialize JSON into BSON

Various schemes have been used to transfer data between the layers of a software application; XML files, POJOs and JSON objects are good examples. Among these, JSON objects are gaining popularity due to their simplicity and versatility. With a growing number of applications that use JSON objects intensively, such as NoSQL databases, use of the binary form of these data models is likely to grow.

One binary format in which JSON objects can be written is Binary JSON, or BSON for short. A BSON document is a serialized binary form of JSON-like data.

The concept of BSON was originally developed in 2009 at MongoDB with the intention of having a lightweight, traversable and efficient data storage mechanism. Although MongoDB drivers abstract its implementation, any MongoDB data is internally stored as a BSON document. Much of the indexing and querying performance boosts attained in MongoDB can also be attributed to the implementation of this concept.

Currently, the use of BSON documents seems to be confined to MongoDB. However, their simple structure makes them a good candidate to be used in other applications as well. The main disadvantage of BSON documents is that they are less efficient in space consumption as they need to store serialized field names for every entry.

In this article, we will discuss the standard way of serializing JSON objects into BSON documents. You can also refer to this in-demand npm module, which is designed to parse BSON documents.

Prerequisite

In our discussion, we will build everything up step by step. As a result, a basic knowledge of JSON and binary data is sufficient.

The BSON Specification

BSON documents are prepared in accordance with the specification outlined on the bsonspec.org website. The specification briefly describes the formats to be used while constructing BSON documents. We will first review these formats. Then, we will walk through examples of how they are applied in the following sections.

Regarding data types, the specification states that all BSON documents are built from six basic data types: byte, int32, int64, uint64, double and decimal128. In addition, it stipulates that numeric data types must be serialized in little-endian format. As for the allowed value types, everything that is part of the JSON specification can be written as BSON. BSON additionally supports types that JSON lacks, such as raw binary data and date objects.
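
As a small illustration of the little-endian rule, the sketch below writes an int32 and a double into byte buffers. It assumes a Node.js environment, which the <Buffer ...> notation used in this article suggests; the variable names are illustrative.

```javascript
// Write the int32 value 5 in little-endian order: least significant byte first.
const int32Bytes = Buffer.alloc(4);
int32Bytes.writeInt32LE(5, 0);
console.log(int32Bytes); // <Buffer 05 00 00 00>

// A double occupies 8 bytes and is written little-endian as well.
const doubleBytes = Buffer.alloc(8);
doubleBytes.writeDoubleLE(1.5, 0);
console.log(doubleBytes); // <Buffer 00 00 00 00 00 00 f8 3f>
```

In big-endian order the first buffer would read 00 00 00 05 instead, which is why the byte order has to be fixed by the specification.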

In general, any BSON document consists of a 32-bit signed integer, an e_list and a zero-terminator. The 32-bit integer stores the total length of the document in bytes, whereas the e_list is the binary representation of the key-value pairs. Diagrammatically, these components are portrayed as:

Figure 1: Format of a BSON document

Let us look into the e_list in detail. It is the concatenation of serialized key-value pairs called elements. Each element consists of three components: a 1-byte code (which describes the type of the value), a null-terminated string (the key) and the value in binary form. However, if the JSON doesn’t have any entry, there are no elements and the e_list is empty (zero bytes long).

For an empty JSON ({}), the total length of the BSON document will be 4 (int32) + 0 (e_list) + 1 (zero-terminator) = 5 bytes. Thus, a BSON representation of an empty JSON is:
<Buffer 05 00 00 00 00>
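
The framing described above (length, e_list, zero-terminator) can be sketched as a small helper. This is a minimal sketch assuming Node.js; wrapDocument is a hypothetical name, not part of any library.

```javascript
// Hypothetical helper: wraps an already serialized e_list into a BSON document.
function wrapDocument(eList) {
  // Buffer.alloc zero-fills, so the final byte is already the 0x00 terminator.
  const doc = Buffer.alloc(4 + eList.length + 1);
  doc.writeInt32LE(doc.length, 0); // total length, including these 4 bytes
  eList.copy(doc, 4);              // the e_list starts right after the length
  return doc;
}

// An empty e_list gives the 5-byte document for {}.
console.log(wrapDocument(Buffer.alloc(0))); // <Buffer 05 00 00 00 00>
```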

The BSON codes for JSON data types are tabulated as below.

Table 1: BSON codes for JSON data types
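
As a hedged sketch, the codes that the BSON specification assigns to the JSON data types can also be written as a lookup object. The names BSON_TYPE and typeCodeOf are illustrative, and mapping whole numbers within the int32 range to int32 (and all other numbers to double) is an assumption of this sketch rather than a rule stated in this article.

```javascript
// Type codes for the JSON data types, as listed in the BSON specification.
const BSON_TYPE = Object.freeze({
  DOUBLE: 0x01,
  STRING: 0x02,
  DOCUMENT: 0x03, // embedded object
  ARRAY: 0x04,
  BOOLEAN: 0x08,
  NULL: 0x0a,
  INT32: 0x10,
  INT64: 0x12,
});

// Hypothetical helper: picks a type code for a JavaScript value.
function typeCodeOf(value) {
  if (value === null) return BSON_TYPE.NULL;
  if (typeof value === "boolean") return BSON_TYPE.BOOLEAN;
  if (typeof value === "string") return BSON_TYPE.STRING;
  if (typeof value === "number") {
    // Assumption: whole numbers that fit in 32 bits become int32, the rest double.
    const fitsInt32 =
      Number.isInteger(value) && value >= -2147483648 && value <= 2147483647;
    return fitsInt32 ? BSON_TYPE.INT32 : BSON_TYPE.DOUBLE;
  }
  if (Array.isArray(value)) return BSON_TYPE.ARRAY;
  if (typeof value === "object") return BSON_TYPE.DOCUMENT;
  throw new TypeError("Unsupported value type: " + typeof value);
}

console.log(typeCodeOf(5).toString(16)); // 10
```
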

For the JSON object { "abc": 5 }, the value is 5, which is a 32-bit integer, so its type code is 0x10. The null-terminated key string "abc" in binary form is [0x61 0x62 0x63 0x00]. The integer value (5) is written in little-endian format as [0x05 0x00 0x00 0x00]. Thus, the bytes in the e_list are:
<Buffer 10 61 62 63 00 05 00 00 00>

The length of the e_list is 9. The total length of the BSON will become 4 (int32 for length) + 9 + 1 (zero-terminator) = 14 (0x0e). The BSON document finally becomes:
<Buffer 0e 00 00 00 10 61 62 63 00 05 00 00 00 00>
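
The same bytes can be reproduced with a few lines of Node.js. This is a minimal sketch; cstring and int32Element are hypothetical helper names introduced for illustration.

```javascript
// Hypothetical helper: UTF-8 bytes of a key followed by a 0x00 terminator.
function cstring(text) {
  return Buffer.concat([Buffer.from(text, "utf8"), Buffer.from([0x00])]);
}

// Hypothetical helper: type code 0x10 + key + 4-byte little-endian value.
function int32Element(key, value) {
  const valueBytes = Buffer.alloc(4);
  valueBytes.writeInt32LE(value, 0);
  return Buffer.concat([Buffer.from([0x10]), cstring(key), valueBytes]);
}

const intEList = int32Element("abc", 5);
const intHeader = Buffer.alloc(4);
intHeader.writeInt32LE(4 + intEList.length + 1, 0); // 4 + 9 + 1 = 14
const intDoc = Buffer.concat([intHeader, intEList, Buffer.from([0x00])]);

console.log(intEList); // <Buffer 10 61 62 63 00 05 00 00 00>
console.log(intDoc);   // <Buffer 0e 00 00 00 10 61 62 63 00 05 00 00 00 00>
```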

Let us consider another example.

For the JSON object { "abc": true, "def": "mybson" }, the BSON element for "abc": true is written as: code for boolean (0x08) + null-terminated key ("abc") + value for true (0x01), resulting in:
<Buffer 08 61 62 63 00 01>
For "def": "mybson", the BSON element consists of: code for string (0x02) + null-terminated key ("def") + length of the null-terminated string (7) stored as int32 (LE) + null-terminated string value "mybson\0". This results in:
<Buffer 02 64 65 66 00 07 00 00 00 6d 79 62 73 6f 6e 00>
The total length of the e_list is 6 + 16 = 22. The total length of the BSON will be 4 + 22 + 1 = 27 (0x1b). The BSON document now becomes:
<Buffer 1b 00 00 00 08 61 62 63 00 01 02 64 65 66 00 07 00 00 00 6d 79 62 73 6f 6e 00 00>
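
The two elements of this example can be built in the same way. The sketch below assumes Node.js; keyBytes, booleanElement and stringElement are hypothetical helper names.

```javascript
// Hypothetical helper: null-terminated UTF-8 key.
function keyBytes(key) {
  return Buffer.concat([Buffer.from(key, "utf8"), Buffer.from([0x00])]);
}

// Type code 0x08 + key + one byte (0x01 for true, 0x00 for false).
function booleanElement(key, value) {
  return Buffer.concat([
    Buffer.from([0x08]),
    keyBytes(key),
    Buffer.from([value ? 0x01 : 0x00]),
  ]);
}

// Type code 0x02 + key + int32 length + UTF-8 text + 0x00.
function stringElement(key, value) {
  const text = Buffer.from(value, "utf8");
  const length = Buffer.alloc(4);
  length.writeInt32LE(text.length + 1, 0); // byte length including the 0x00
  return Buffer.concat([
    Buffer.from([0x02]),
    keyBytes(key),
    length,
    text,
    Buffer.from([0x00]),
  ]);
}

const mixedEList = Buffer.concat([
  booleanElement("abc", true),
  stringElement("def", "mybson"),
]);
const mixedHeader = Buffer.alloc(4);
mixedHeader.writeInt32LE(4 + mixedEList.length + 1, 0);
const mixedDoc = Buffer.concat([mixedHeader, mixedEList, Buffer.from([0x00])]);

console.log(mixedEList.length); // 22
console.log(mixedDoc.length);   // 27
```

Note that the string length counts bytes, not characters, so it is taken from the encoded buffer rather than from the JavaScript string.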

Serialization of arrays

The serialization of arrays is a little different from what is expected and we will discuss it separately. To serialize arrays, first we need to put them in their equivalent JSON forms. For this, each element in a given array is converted to a key-value pair where the key is the index of the element in the array.

["A", "B", "C"] is converted into { "0": "A", "1": "B", "2": "C" }

After the conversion, arrays look like regular JS objects. If so, how is the distinction made between them? The answer is the type code that precedes the key: for regular JS objects (embedded documents) it is 0x03, whereas for arrays it is 0x04.
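
The conversion step can be sketched in a few lines. arrayToIndexedObject is a hypothetical helper name used only for this illustration.

```javascript
// Hypothetical helper: turns an array into an object keyed by element index.
function arrayToIndexedObject(items) {
  const result = {};
  items.forEach((item, index) => {
    result[String(index)] = item; // keys are the decimal indices "0", "1", ...
  });
  return result;
}

console.log(JSON.stringify(arrayToIndexedObject(["A", "B", "C"])));
// {"0":"A","1":"B","2":"C"}
```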

Let us see this via example.

The JSON { "abc": [1, 2, 3] } is converted into BSON as follows. First, convert the array into its equivalent JS object:
{ "0": 1, "1": 2, "2": 3 }
Since this is a JSON object in its own right, convert it into a standalone BSON document. Its elements are:
<Buffer 10 30 00 01 00 00 00> // element for "0": 1
<Buffer 10 31 00 02 00 00 00> // element for "1": 2
<Buffer 10 32 00 03 00 00 00> // element for "2": 3
The total length of the BSON will be: 3*7 + 4 + 1 = 26 (0x1a). The BSON for the array is:
<Buffer 1a 00 00 00 10 30 00 01 00 00 00 10 31 00 02 00 00 00 10 32 00 03 00 00 00 00>
The type code for arrays is 0x04 and the null-terminated key ("abc") has a binary form of [0x61 0x62 0x63 0x00]. The total length of the final BSON becomes: 26 (array document) + 4 (key) + 1 (code) + 1 (zero-terminator) + 4 (total length) = 36 (0x24), and it has the form:
<Buffer 24 00 00 00 04 61 62 63 00 1a 00 00 00 10 30 00 01 00 00 00 10 31 00 02 00 00 00 10 32 00 03 00 00 00 00 00>
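
Putting the pieces together, a recursive serializer for the JSON data types might look like the sketch below. It is a minimal illustration assuming Node.js, not the implementation used by MongoDB drivers; all function names are hypothetical, and the choice of int32 for whole numbers in range (double otherwise) is an assumption.

```javascript
function cstr(text) {
  return Buffer.concat([Buffer.from(text, "utf8"), Buffer.from([0x00])]);
}

function int32LE(value) {
  const bytes = Buffer.alloc(4);
  bytes.writeInt32LE(value, 0);
  return bytes;
}

// Returns the type code and the binary value for one JSON value.
function serializeValue(value) {
  if (value === null) return { code: 0x0a, bytes: Buffer.alloc(0) };
  if (typeof value === "boolean") {
    return { code: 0x08, bytes: Buffer.from([value ? 0x01 : 0x00]) };
  }
  if (typeof value === "string") {
    const text = cstr(value);
    return { code: 0x02, bytes: Buffer.concat([int32LE(text.length), text]) };
  }
  if (typeof value === "number") {
    if (Number.isInteger(value) && value >= -2147483648 && value <= 2147483647) {
      return { code: 0x10, bytes: int32LE(value) };
    }
    const bytes = Buffer.alloc(8);
    bytes.writeDoubleLE(value, 0);
    return { code: 0x01, bytes };
  }
  if (Array.isArray(value)) {
    // Arrays are serialized as documents keyed by index, with code 0x04.
    return { code: 0x04, bytes: serializeDocument({ ...value }) };
  }
  if (typeof value === "object") {
    return { code: 0x03, bytes: serializeDocument(value) };
  }
  throw new TypeError("Unsupported value type: " + typeof value);
}

// Serializes an object as: int32 length + e_list + 0x00.
function serializeDocument(object) {
  const parts = [];
  for (const [key, value] of Object.entries(object)) {
    const { code, bytes } = serializeValue(value);
    parts.push(Buffer.from([code]), cstr(key), bytes);
  }
  const eList = Buffer.concat(parts);
  return Buffer.concat([int32LE(4 + eList.length + 1), eList, Buffer.from([0x00])]);
}

console.log(serializeDocument({ abc: [1, 2, 3] }).length); // 36
```

One caveat of this sketch: Object.entries lists integer-like keys in ascending order before other keys, so the field order of the output can differ from the source text for objects that mix both kinds of keys.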

Deserializing BSON documents

Deserializing BSON documents is as important as serializing them. For instance, with MongoDB, serialization is used to send JSON objects to the database, whereas deserialization is required to parse database responses and turn them back into JSON objects.

To deserialize non-empty BSON documents, start from the index = 4 and extract each key-value pair piece by piece. At index = 4, we get the data type code of the first entry followed by its key. Since the key is always a null-terminated string, its length can be determined based on the location of the null-terminator. By using the type code, the extent of the value and thus the location of the start of the next entry can be determined. Through a similar iteration, the whole BSON document can be traversed.
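
The first part of this procedure, reading the type code and the key at a given index, can be sketched as follows. readElementHeader is a hypothetical helper name, and the sample bytes are the document for { "abc": 5 } from earlier in this article.

```javascript
// Hypothetical helper: reads the type code and key of the element at `offset`.
function readElementHeader(buffer, offset) {
  const typeCode = buffer[offset];
  // The key ends at the first 0x00 byte after the type code.
  const keyEnd = buffer.indexOf(0x00, offset + 1);
  const key = buffer.toString("utf8", offset + 1, keyEnd);
  return { typeCode, key, valueOffset: keyEnd + 1 };
}

const sampleDoc = Buffer.from("0e00000010616263000500000000", "hex");
console.log(readElementHeader(sampleDoc, 4));
// { typeCode: 16, key: 'abc', valueOffset: 9 }
```

The value then starts at valueOffset, and its extent follows from the type code (4 bytes for an int32 in this sample).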

Let us deserialize:
<Buffer 10 00 00 00 08 61 62 63 00 00 0a 78 79 7a 00 00>
Skip the first 4 bytes (the length) and go to index=4 (counting from 0). The byte at index 4 is 0x08, the type code of the first value, which is a boolean. Since booleans have a value of either 0x00 or 0x01, their length is 1. Starting after index=4, scan to the next 0x00 byte (index=8). The bytes between indices 4 and 8 [0x61 0x62 0x63] represent the first key. In UTF-8, the string is "abc". At index=9, we get the byte 0x00, which represents a false boolean value. Thus, the first key-value pair is:
"abc": false
At index=10, we get the byte 0x0a, which is the type code of the second entry (it designates a null value, which has zero length). After index=10, the next 0x00 byte is at index=14, and the bytes between indices 10 and 14 [0x78 0x79 0x7a] represent the key of the second entry. In UTF-8, the string is "xyz". The second key-value pair now becomes:
"xyz": null
We have now reached the end of the BSON document, as the last index (15) corresponds to the document's zero-terminator. Altogether, the JSON becomes:
{ "abc": false, "xyz": null }
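
The walk-through above can be turned into a small deserializer. The sketch below assumes Node.js and handles only the type codes discussed in this article (plus double); deserialize is a hypothetical function name, and other BSON types are rejected rather than parsed.

```javascript
function deserialize(buffer) {
  const total = buffer.readInt32LE(0);
  if (total !== buffer.length || buffer[total - 1] !== 0x00) {
    throw new Error("Malformed BSON document");
  }
  const result = {};
  let index = 4; // skip the 4-byte length
  while (index < total - 1) {
    const typeCode = buffer[index];
    const keyEnd = buffer.indexOf(0x00, index + 1);
    if (keyEnd === -1) throw new Error("Unterminated key");
    const key = buffer.toString("utf8", index + 1, keyEnd);
    index = keyEnd + 1; // first byte of the value
    switch (typeCode) {
      case 0x01: // double, 8 bytes
        result[key] = buffer.readDoubleLE(index);
        index += 8;
        break;
      case 0x02: { // string: int32 length (including 0x00) + bytes
        const size = buffer.readInt32LE(index);
        result[key] = buffer.toString("utf8", index + 4, index + 4 + size - 1);
        index += 4 + size;
        break;
      }
      case 0x03: // embedded document
      case 0x04: { // array, stored as a document keyed by index
        const size = buffer.readInt32LE(index);
        const nested = deserialize(buffer.subarray(index, index + size));
        result[key] = typeCode === 0x04 ? Object.values(nested) : nested;
        index += size;
        break;
      }
      case 0x08: // boolean, 1 byte
        result[key] = buffer[index] === 0x01;
        index += 1;
        break;
      case 0x0a: // null, zero-length value
        result[key] = null;
        break;
      case 0x10: // int32, 4 bytes
        result[key] = buffer.readInt32LE(index);
        index += 4;
        break;
      default:
        throw new Error("Unsupported type code: 0x" + typeCode.toString(16));
    }
  }
  return result;
}

const input = Buffer.from("100000000861626300000a78797a0000", "hex");
console.log(JSON.stringify(deserialize(input))); // {"abc":false,"xyz":null}
```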

The above discussion covers the two-way conversion between some common forms of JSON objects and BSON documents. The BSON specification covers the formats for all the possible data types, including binary data, ObjectIds, regular expressions and JavaScript code. It is worth consulting if you want to explore BSON documents further.

I am a software developer with a civil engineering background. I gained most of my programming skills through self-learning.