First and foremost, you probably don’t need a database system to handle small JSON files, ranging to a few megabytes.
However, if you have to manage and query GBs of data, you should use a database system.
Usually, database systems, however, are not designed to keep the full history of your data. Often, the system overwrites data during a change or keeps the data for a short time. The latter usually happens due to transactions, which currently read slightly outdated data. Thus, a garbage collector has to wait until all reading transactions finish. Then, it’s able to delete the old data.
Instead, SirixDB makes a huge persistent tree, durable during commits. It only ever appends data. Every revision is indexed, whereas the revisions share unchanged page-fragments. Think of it like Git, but on a sub-file level. Persistent trees are also common in functional languages like Haskell and Closure. A transaction commit, which serializes pages in a postorder traversal, is depicted in the following image:
Usually, mapping these persistent in-memory structures to a durable device adds write amplification, as the full leaf-to-root path has to be adapted. We, however, use a keyed trie for our main document store. Thus no structural changes occur as in B-trees. Furthermore, we developed a new sliding snapshot algorithm to version the data pages and store variable-sized page fragments. Therefore only a few changed records are written to a new page-fragment. Reconstructing an in-memory page involves reading a sliding window of page-fragments from random locations in parallel.
Due to the requirement for fast, random, fine granular reads, modern hardware as Byte Addressable NVM will be essential to good performance.
SirixDB stores the JSON data in a binary format, whereas each transaction is bound to a specific revision. Currently, N read-only transactions on a resource can coexist with one read-write transaction on a resource (represents the JSON data).
The encoding is as follows, except that we lately also introduced a pointer from the parent to the last child node.
Query Your Data
The system uses and extends an XQuery 3.0 processor to process both XML and JSON data.
We’re able to import JSON files from a specific directory via:
It’ll create a database named
mycol.jn and several resources.
We can also store a bunch of JSON-strings within several resources in the database (mycol.jn):
Note, that by when JSON-strings we have to replace
false() as well as
We can then query the database collection via:
Or, if we want to open a specific resource in the database, we can use:
Furthermore, we can update the resource. Let’s say we have the following small JSON file:
We can insert a JSON object with:
We dereference object field names with the
=> operator. It even can drill down into arrays and find the value for the field name.
The update-operation will create a new revision with the field name and the added value. For instance
will select the first object in the array value of the field name
We implicitly query revision 2. If we instead query revision 1, we’ll still retrieve the old JSON object.
Note that we specified one as the third argument of the function. The output is:
Instead of providing a numeric value to the third parameter we can also open a specific revision by a timestamp (here: to check how a specifc resource looked like in April 2018):
With the function open-revisions we’re able to load all revisions of a resource between two points in time:
The following query will insert an object into the array value dereferenced by
=>foo as the second item:
we get the output:
Likewise, we can replace a JSON value with
or delete a value:
If you submit the queries through the REST-API, you first have to be authorized. Thus, a revision, as in Git, also stores the author’s name and a UUID.
To get the author name, who committed a specific revision:
To get the UUID:
You can also project fields:
This will output:
Time travel functions
Besides, we can use a bunch of time travel queries. The following functions retrieve different versions of a JSON item:
Function for selecting the json-item in the future or the future-or-self. The first parameter is the context item. The second parameter denotes if the current item should be included in the result or not.
Function for selecting the json-item in the past or the past-or-self. The first parameter is the context item. Second parameter denotes if the current item should be included in the result or not.
Function for selecting the json-item in all revisions.
Function for selecting the json-item in the first revision.
Function for selecting the json-item in the last / most-recent revision.
Function for selecting the json-item in the previous revision.
Function for selecting the json-item in the next revision.
To get an item in all revisions, in which it has been changed (or inserted):
We’re of course also able to retrieve diffs:
This function compares revision one and three of a resource called
The output format is a JSON-string:
For instance, this format is also used by our , which we currently develop based on Svelte.
It denotes that SirixDB compared revisions 1 and 3. Furthermore, the different diff types are
insert, delete, update and replace (the latter not shown). The
insertPositionNodeKey is the context node, where an insert occurs, the
insertPosition denotes if it's inserted as a first child, as the right sibling, or the left sibling.
We can also diff a subtree and specify two additional parameters. The
root node, which is going to be compared, and the
This compares also revision 1 and 3 of the
myresource2 resource in the
mydocs.col database. However, this time, the diff starts at the node, denoted by its unique nodeKey 7453. Furthermore the diffing should skip descendants, which are deeper than 2 levels.
In order to get the
nodeKey of a specific item, we can use:
SirixDB offers powerful ways to query the full history of your data.
We omitted the creation of index structures, but you can also create secondary index structures with XQuery, which are automatically versioned as well.
Consider the following document (serialized SirixDB resource with two revisions):
We can create a Content-And-Structure (CAS) index as follows:
In order to answer the query
Furthermore, we didn’t mention FLOWR expressions, which are the cornerstone of XQuery. A simple example with an implicit join: