I’ve written quite a few libraries over the past few years, but none was as educational and interesting as BSON. BSON is a binary format created by MongoDB that fills the same role as JSON, but can be read and written far more efficiently.
Our BSON implementation keeps the original buffer it receives and can fetch anything you want, with high performance, using just that buffer. So if you receive this data from a socket, you often don’t need to copy it at all. When I took a stab at JSON, the results were drastically different.
BSON is a clever format
The format of a Document is roughly like this, but in binary:
[(BSONType, KeyString, BSONValue)]. Types that are empty (like null) carry no value at all. Any type with a constant length (Int, Boolean) can be parsed and mutated inline without any trouble. If you have a type with a variable length, like an array, (sub-)Document, String or Binary blob, things change slightly.
These values have a 32-bit integer in front of the data, so if you decide you don’t need this key, you can skip past it without reading the actual value. This way, if you have a User with 5 keys and only need key 4, barely any effort is spent on the keys before it and none is spent on key 5.
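The length prefix makes skipping trivial. As a minimal sketch (the function name and signature are my own illustration, not the library’s actual API), skipping a sub-Document value boils down to reading the little-endian Int32 in front of it:

```swift
// Illustrative sketch: `buffer` holds the raw document bytes and
// `offset` points at the first byte of a sub-Document value, whose
// little-endian Int32 length prefix includes the length field itself.
func skipSubDocument(in buffer: [UInt8], at offset: Int) -> Int {
    // Assemble the 32-bit little-endian length.
    let length = Int(buffer[offset])
        | Int(buffer[offset + 1]) << 8
        | Int(buffer[offset + 2]) << 16
        | Int(buffer[offset + 3]) << 24
    // The next element starts right after this value's last byte,
    // so we never touch the value's contents.
    return offset + length
}
```

No byte inside the value is ever inspected, which is why unneeded keys cost almost nothing.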
In our library we keep a small cache of all types and the offsets we found them at, so if we decide we need key 3, we don’t even need to search the keys before it anymore. This cache is a small optimization, but it really pays off on larger data sets.
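A hypothetical sketch of such a cache (the names are illustrative, not the library’s real types): every element discovered during a scan records its type byte and offset, and a later lookup only resumes scanning the buffer if the cache hasn’t reached that element yet.

```swift
// Illustrative sketch, not the library's real implementation.
struct ElementCache {
    // The type byte and buffer offset of every element found so far,
    // in document order.
    private(set) var elements: [(type: UInt8, offset: Int)] = []

    mutating func remember(type: UInt8, offset: Int) {
        elements.append((type, offset))
    }

    // If element `index` was already discovered, return it without
    // touching the buffer again; otherwise the caller resumes scanning
    // from the last remembered offset.
    func lookup(_ index: Int) -> (type: UInt8, offset: Int)? {
        index < elements.count ? elements[index] : nil
    }
}
```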
JSON is a slow format
While it’s extremely easy for humans to read, the opposite is true for computers. If you receive an object with nested objects, the computer needs to parse all of the contents before it knows where the current object ends. Storing JSON in memory in its original format is a terrible idea for performance.
So I started thinking about BSON and came to the conclusion that here, too, we can choose not to copy the entire String, but instead keep a cache of offsets into the original buffer. This way, if you don’t need all keys and values from your JSON, barely any effort is spent reading those unused values. So I started tinkering with the idea of building a cache, almost identical to BSON, but with a few small changes.
- Booleans are stored as two separate types (boolTrue and boolFalse). There is no need to parse a boolean’s text again if it was already identified while building the cache.
- Strings are marked as containing escape characters or not. The scanner already has to check for escape characters to know whether a quote really ends the string, so by recording the result, strings without escapes can skip the unescaping conversion entirely.
- Numbers store whether they are an integer or floating-point type, depending on the characters in the literal. This lightens the parsing later on.
- Strings and numbers can be parsed lazily, and we don’t need to copy them. This primarily saves memory.
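The changes above can be sketched as index entries like the following (hypothetical names and layout; the real library may differ):

```swift
// Hypothetical sketch of the per-value index entries described above.
enum JSONType {
    case object, array, null
    case boolTrue, boolFalse   // the value is already known from the scan
    case string                // contains no escape characters
    case escapedString         // needs unescaping when read
    case integer, floating     // decided while scanning the literal
}

struct JSONIndexEntry {
    let type: JSONType
    let offset: Int   // where the value starts in the original buffer
    let length: Int   // how many bytes it spans
}

// Deciding integer vs. floating while scanning a number literal:
// a '.', 'e' or 'E' makes it floating-point.
func numberType(of literal: String) -> JSONType {
    literal.contains(where: { $0 == "." || $0 == "e" || $0 == "E" })
        ? .floating : .integer
}
```

With entries like these, reading a boolean or an unescaped string later is just a slice of the original buffer; no further parsing is required.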
After implementing this I was happy to see that, with the cache, it outperformed Foundation on the same JSON data. But I was greedy for more.
A better cache than BSON
When I looked at the profiler, I noticed almost all the time was spent reading and writing the cache. That made sense at first, but memory operations shouldn’t be this heavy. Then I remembered how I had structured my cache.
The cache created a separate list of keys and values for every array and object, which means a new memory allocation for each of them. Every one of those allocations has to be loaded into the CPU each time Codable traverses the structure. So the solution is to reduce the memory usage, but how?
What I came up with is a single buffer for the entire cache. If I store all index data in one contiguous buffer, like the BSON format itself, it can be traversed easily, and unused data is mostly harmless.
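As a sketch of the idea (an assumed layout, not the actual implementation): all entries live as fixed-size records in one contiguous array, so traversing the structure walks a single allocation instead of chasing a fresh allocation per object or array.

```swift
// Illustrative sketch of a single-allocation cache: every entry is
// three Int32 fields (type, offset, length) appended to one array.
struct FlatCache {
    private var storage: [Int32] = []

    mutating func append(type: Int32, offset: Int32, length: Int32) {
        storage.append(contentsOf: [type, offset, length])
    }

    var count: Int { storage.count / 3 }

    // Reading an entry is pure arithmetic on one buffer; no pointer
    // chasing, no per-object allocations to pull into the CPU cache.
    func entry(at index: Int) -> (type: Int32, offset: Int32, length: Int32) {
        let base = index * 3
        return (storage[base], storage[base + 1], storage[base + 2])
    }
}
```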
With a bit of work at 3AM, the new cache was complete, and the results were amazing: the tests took 25ms with the initial cache and only 17ms with the new one.
So far there’s only one downside: the cache itself uses 70% of the memory of the original JSON data, just by storing the offsets and lengths. For the next update I’ll look into making the offsets and lengths more compact, although that would limit the size of the JSON data that can be parsed.
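The tradeoff can be sketched like this (these struct names are my own illustration, not the library’s): shrinking the offset and length fields from Int to UInt32 roughly halves each entry on a 64-bit platform, but a 32-bit offset can only address JSON documents up to 4 GiB.

```swift
// Hypothetical entry layouts, for comparing memory cost only.
struct WideEntry   { var type: UInt8; var offset: Int;    var length: Int }    // 24 bytes/entry on 64-bit
struct NarrowEntry { var type: UInt8; var offset: UInt32; var length: UInt32 } // 12 bytes/entry, 4 GiB limit
```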