Streaming Structured JSON

JavaScript Object Notation (JSON) is perhaps the most ubiquitous way of transmitting data between the components of a SaaS application. It’s the native data format for web browsers and Node.js, with practically every other programming language providing libraries to serialize data to and from JSON.

In this article, we’ll discuss the idea of JSON Streaming — that is, how do we process streams of JSON data that are extremely large, or potentially infinite in length. In such cases, the JSON messages are too large to be held entirely in a single computer’s RAM, and must instead be processed incrementally as the data is being read from, or written to, external locations.

More specifically, in this article we’ll talk about streaming JSON data that has a non-trivial structure.

The Problem — Not all JSON is Predictable

The concept of JSON Streaming isn’t new, and numerous methods are documented on Wikipedia. However, most streaming techniques assume the elements of the JSON stream are predictable, and don’t depend on each other.

For example, a JSON stream that reports data from weather stations may consist of a sequence of JSON objects, separated by newline characters. The application reads each line as a separate record, without any need to load the entire data set into RAM. Using this model, we can process GB or TB of JSON data while only using KB of RAM!

{ "stationID": 1234, "temperature": 65, "wind": 12 }
{ "stationID": 1362, "temperature": 20, "wind": 23 }
...
{ "stationID": 1362, "temperature": 19, "wind": 24 }

However, what if the JSON contained multiple sections, with the first section providing meta-data necessary to understand the later sections? We no longer have a repeating pattern, but instead must store and update information in an internal database, as we progress through the JSON stream. In the later sections of the stream, we can refer back to that database to interpret the newly-arriving values.

For example, what if our weather data includes detail of each weather station:

{
  "stations": [
    { "stationID": 1234, "city": "Seattle", "units": "imperial" },
    { "stationID": 1362, "city": "Vancouver", "units": "metric" },
    ...
  ],
  "reports": [
    { "stationID": 1234, "temperature": 65, "wind": 12 },
    { "stationID": 1362, "temperature": 20, "wind": 23 },
    ...
    { "stationID": 1362, "temperature": 19, "wind": 24 }
  ]
}

In this example, we must first read the stations array to determine whether each weather station reports in metric or imperial units. When we later process the reports array, the values for temperature and wind will be scaled appropriately.

The concern here is that the JSON input is no longer trivial or repeating. Instead, some elements of the JSON object depend on values provided in previous parts of the same object. Our example is fairly simple, but imagine a more complicated JSON structure with many more dependencies between its sections. The basic JSON streaming approaches mentioned on Wikipedia are simply not going to help.

Before we discuss solutions, it’s worth mentioning an important assumption. Experts will note that JSON objects are an unordered collection of key/value pairs. For our purposes, however, we need to assume the stations key appears earlier in the stream than the reports key. The software generating the JSON stream must abide by this rule.

Let’s look at the architecture of how this can be solved.

The Big Picture

The following diagram illustrates our overall solution for reading a stream of JSON, while maintaining derived information as we progress through the input. The two main components we should focus on are the Tokenizer and the State Machine.

In this model, the input is a sequence of text characters (streamed from a file, or from a network connection), which is tokenized into the basic building blocks of a JSON object (such as StartOfObject or StringValue — more on these later). We then use a state machine to traverse the JSON object’s structure and pull out the interesting values. As we progress through the JSON object (by transitioning between states), we update the database accordingly. Finally, the transformed data is sent to the output.

Let’s look at those steps in more detail.

The Tokenizer

This component of our pipeline reads a continuous stream of characters from the input. Its job is to group the input characters into meaningful atomic tokens. To illustrate, let’s revisit our earlier example:

{
  "stations": [
    { "stationID": 1234, "city": "Seattle", "units": "imperial" },
    { "stationID": 1362, "city": "Vancouver", "units": "metric" },
    ...
  ],
  ...
}

In this example, the Tokenizer outputs the following stream of tokens:

StartOfObject
FieldName("stations")
StartOfArray
StartOfObject
FieldName("stationID")
NumberValue(1234)
FieldName("city")
StringValue("Seattle")
FieldName("units")
StringValue("imperial")
EndOfObject
...
EndOfArray
...
EndOfObject

If you read carefully through the stream of input characters, you’ll see a one-to-one mapping with the tokens sent to the output. We record the type of each token (such as FieldName), along with an optional data value (such as "units" or "imperial").

It’s important to remember that this stream of tokens could be infinitely long, simply because the stream of input characters might be infinitely long. In reality, any JSON object that’s too large to fit into RAM is a candidate for this approach.

Streaming software generally reads input characters in small batches (for example, 4KB-8KB at a time). Although you might intuitively feel that streamed data should be processed one character at a time, that would be highly inefficient — we instead read a full disk block, or read a full network packet each time. When we run out of characters, we ask for the next block. Our memory footprint is therefore proportional to the size of an input block (such as 4KB), rather than the size of the entire JSON object.
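As a rough illustration, here is a simplified tokenizer in Python. It handles only the JSON subset used in our example (objects, arrays, string keys and values, and integer values), with no escape sequences, floats, or error handling; the token names mirror the ones above. The input can be any iterable of characters, so feeding it block-sized chunks is just a matter of chaining them together.

```python
def tokenize(chars):
    """Turn a stream of characters into (token_type, value) pairs.

    Simplified sketch: integers and plain strings only, and no
    validation of malformed input.
    """
    it = iter(chars)
    pushback = []  # one-character buffer, used when a number ends

    def getc():
        return pushback.pop() if pushback else next(it, '')

    stack = []            # open containers: 'obj' or 'arr'
    awaiting_key = False  # is the next string a field name?

    while True:
        c = getc()
        if c == '':
            return                      # end of input
        if c in ' \t\r\n':
            continue                    # skip whitespace
        if c == '{':
            stack.append('obj')
            awaiting_key = True
            yield ('StartOfObject', None)
        elif c == '}':
            stack.pop()
            awaiting_key = bool(stack) and stack[-1] == 'obj'
            yield ('EndOfObject', None)
        elif c == '[':
            stack.append('arr')
            awaiting_key = False
            yield ('StartOfArray', None)
        elif c == ']':
            stack.pop()
            yield ('EndOfArray', None)
        elif c == ',':
            # a comma inside an object means a field name comes next
            awaiting_key = bool(stack) and stack[-1] == 'obj'
        elif c == ':':
            awaiting_key = False        # a value comes next
        elif c == '"':
            text = []
            c = getc()
            while c != '"':
                text.append(c)
                c = getc()
            kind = 'FieldName' if awaiting_key else 'StringValue'
            yield (kind, ''.join(text))
        elif c.isdigit() or c == '-':
            digits = [c]
            c = getc()
            while c.isdigit():
                digits.append(c)
                c = getc()
            pushback.append(c)          # not part of the number; give it back
            yield ('NumberValue', int(''.join(digits)))

# To read a file in 4KB blocks rather than all at once:
#   import itertools
#   stream = itertools.chain.from_iterable(iter(lambda: f.read(4096), ''))
for token in tokenize('{ "stations": [ { "stationID": 1234 } ] }'):
    print(token)
```

Note that the tokenizer is a generator: it produces one token at a time on demand, which is exactly the shape the downstream state machine wants.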

Because of the way the token stream is created, we can also be confident the JSON object is syntactically well-formed. That is, all the open and close braces match, and the keys and values are paired correctly. However, we don’t yet know whether the JSON object is semantically correct. For example, we must still confirm that the "stations" key exists and that it refers to a JSON array.

This is where the State Machine comes into action.

The State Machine

The purpose of a state machine is to remember which part of the JSON object we’re currently processing. In our weather station example, we start by scanning through the "stations" section while collecting meta-data about the location and measurement units of each station. Once we reach the end of the array, we then switch to a different state for processing the content of the reports array.

The following diagram shows a (partial) state machine for scanning through the stream of tokens, transitioning from one state to another based on the token’s type.

As with all state machines, we begin at the start state (top left) and progress from one state to the next, as we consume tokens from the input stream. If a particular state doesn’t have a transition for the next token in the input, the JSON object is considered invalid (we won’t discuss this situation). Given our example token stream, you should be able to trace through the state machine and imagine all the tokens being successfully consumed.

As discussed earlier, the state machine also has the ability to record the weather station information, for later use in the same stream. When we transition from one state to another, and that transition is annotated with an action box, the state machine performs the provided action. This allows us to update our internal database with the weather station details.

Note that the “Record Field Name” and “Record Field Value” boxes are fairly simple and merely save the values into local RAM. However, the “Validate and Save Record” box has the task of ensuring that all required fields (stationID, city, and units) were correctly provided, and that they have meaningful values. The entire record can then be written to the database, or some other persistent storage.

Although we don’t show the second part of the state machine (where the reports section is consumed), the approach is generally the same. In our particular example, we’re not planning to store the output from reports in a database, but will instead send it downstream to some other consumer, or will perhaps discard the data after computing a running average. Therefore, the key difference in the state machine is that we only retrieve previous information from the database, not store it.
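Here's what the first half of that state machine might look like in Python, consuming the (token type, value) pairs described in the Tokenizer section. The state names, the action comments, and the dictionary standing in for the database are all illustrative choices, not a prescribed implementation.

```python
REQUIRED_FIELDS = {'stationID', 'city', 'units'}

def build_station_db(tokens):
    """First pass of the state machine: collect each record in the
    "stations" array into an in-memory database keyed by stationID.
    """
    db = {}
    state = 'start'   # start -> in_stations -> done
    record = {}
    field = None
    for kind, value in tokens:
        if state == 'start':
            if kind == 'FieldName' and value == 'stations':
                state = 'in_stations'
        elif state == 'in_stations':
            if kind == 'FieldName':
                field = value                      # "Record Field Name"
            elif kind in ('StringValue', 'NumberValue'):
                record[field] = value              # "Record Field Value"
            elif kind == 'EndOfObject' and record:
                # "Validate and Save Record"
                if not REQUIRED_FIELDS <= record.keys():
                    raise ValueError(f'incomplete station record: {record}')
                db[record['stationID']] = record
                record = {}
            elif kind == 'EndOfArray':
                state = 'done'   # ready to process the reports section
    return db

tokens = [
    ('StartOfObject', None),
    ('FieldName', 'stations'), ('StartOfArray', None),
    ('StartOfObject', None),
    ('FieldName', 'stationID'), ('NumberValue', 1234),
    ('FieldName', 'city'), ('StringValue', 'Seattle'),
    ('FieldName', 'units'), ('StringValue', 'imperial'),
    ('EndOfObject', None),
    ('EndOfArray', None),
]
print(build_station_db(tokens))
```

Only the current record and the accumulated station database live in memory; the token stream itself is consumed and discarded as we go.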

The Final Output

In order to do something useful, the state machine must contain an action to generate output. In our weather station example, we’ll generate a stream of comma-separated values (CSV) data showing the equivalent information, but always using metric units (degrees Celsius, and kilometres per hour).

Here’s the corresponding output:

1234,18.3,19.3
1362,20,23
...
1362,19,24

The actual data in the output, and the format you choose, are entirely your decision. The important fact is that we’ve processed a very large amount of JSON data on the input, without requiring that we load the entire JSON object into RAM at one time.
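The arithmetic behind those rows is straightforward. A Python sketch of the normalization action might look like this, where the lookup table plays the role of the station database built earlier (all names here are illustrative):

```python
def normalize(report, stations):
    """Convert one report to metric, consulting the station database
    built while scanning the earlier "stations" section.
    """
    station = stations[report['stationID']]
    temperature = report['temperature']
    wind = report['wind']
    if station['units'] == 'imperial':
        temperature = round((temperature - 32) * 5 / 9, 1)  # F -> C
        wind = round(wind * 1.609344, 1)                    # mph -> km/h
    return f"{report['stationID']},{temperature},{wind}"

stations = {
    1234: {'stationID': 1234, 'city': 'Seattle', 'units': 'imperial'},
    1362: {'stationID': 1362, 'city': 'Vancouver', 'units': 'metric'},
}
print(normalize({'stationID': 1234, 'temperature': 65, 'wind': 12}, stations))
# prints "1234,18.3,19.3"
```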

But, Isn’t This All Too Complicated?

By now, you’re probably wondering whether there’s a simpler solution. It feels like a lot of work to tokenize the input and then build a state machine, so why go to such extremes? Let’s discuss some design considerations:

1. Does Your JSON Really Need to Have Dependencies?

Our whole discussion has focused on using information from one part of the JSON message to interpret data from later parts of that same message. If you have a choice, simply avoid merging the information into the same stream in the first place. This makes parsing the data much easier.

If you don’t have a choice, read on…

2. Does the JSON Stream Have Predictable Structure?

If you do a Google search for “JSON Streaming” and throw in the name of your favourite programming language, you’ll find a bunch of libraries that address the problem in their own unique way.

Some of the advanced libraries support the JSON Path concept. That is, given a stream of JSON, and one or more path expressions, the library will extract all the JSON elements matching the specified paths. For example, we can extract all the weather station data by listening to the following two paths:

$.stations[*]   // on match, record the station details.
$.reports[*]    // on match, normalize the data and output to CSV.

Note that $ is the object root, and [*] means all elements in the array.

In our example, we need a library that can listen to multiple JSON paths for the same stream, performing different actions depending on which path was matched. As an example, for JVM-based languages (Java, Scala, etc), you could try JsonSurfer.
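To illustrate the path-matching idea without committing to any particular library, here is a hand-rolled Python sketch that matches the equivalent of $.field[*] against a token stream like the one our Tokenizer produces. It only handles a top-level array of flat sub-objects; real libraries such as JsonSurfer support far richer path expressions.

```python
def items_at(tokens, root_field):
    """Yield each flat sub-object of the top-level array under
    `root_field` -- roughly what the path $.<root_field>[*] matches.
    """
    in_target = False
    item = None
    field = None
    for kind, value in tokens:
        if kind == 'FieldName' and item is None and value == root_field:
            in_target = True       # the next array is the one we want
        elif not in_target:
            continue               # skip everything outside the target path
        elif kind == 'StartOfObject':
            item = {}
        elif kind == 'FieldName':
            field = value
        elif kind in ('StringValue', 'NumberValue') and item is not None:
            item[field] = value
        elif kind == 'EndOfObject' and item is not None:
            yield item             # one complete match
            item = None
        elif kind == 'EndOfArray':
            in_target = False      # done with this path

tokens = [
    ('StartOfObject', None),
    ('FieldName', 'stations'), ('StartOfArray', None),
    ('StartOfObject', None),
    ('FieldName', 'stationID'), ('NumberValue', 1234),
    ('EndOfObject', None),
    ('EndOfArray', None),
    ('EndOfObject', None),
]
for station in items_at(tokens, 'stations'):
    print(station)
```

A library that supports multiple simultaneous paths is essentially running several of these matchers side by side over the same token stream, dispatching each match to the appropriate action.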

3. What If the JSON Object Has Dynamic Structure?

Where our state machine becomes worth the investment is when the JSON schema is defined dynamically. For example, the following JSON message specifies the names and types of the data that will appear later in the stream.

{
  "types": {
    "name": "string",
    "birth": "date",
    "children": "array[string]"
  },
  "data": [
    { "name": "Fred", "birth": "1966/02/03", "children": [...] },
    { "name": "Mary", "birth": "1976/10/13", "children": [...] },
    ...
  ]
}

It wouldn’t be possible to construct a suitable JSON Path if we hadn’t already read the types element. Sure, we could use [*] to extract each row from data, but we’ll still need additional logic to traverse and validate the hierarchy of each sub-object, even if it’s now entirely in RAM.

Of course, building a state machine to accept a dynamically-structured JSON message isn’t easy, but that’s a topic for another day.

4. What If It’s Not JSON?

Finally, although we’ve focused extensively on JSON, the general approach of tokenizing characters and then passing them through a state machine is just a good concept to be aware of. Many types of streaming data can be processed using this technique — it doesn’t need to be JSON. In fact, this is the exact approach used by the lexing and parsing phases of most programming language compilers.

Summary

The use of state machines provides greater flexibility than most naive JSON streaming solutions. Those solutions can provide a stream of stations information, or reports information, but they can’t mix the two together. In our case, the structure of the JSON object can vary as we progress through the stream, with different actions being taken in each section.

Although our example was fairly simple, there are very few limits to the complexity of the JSON object we could handle, or the relationships between the various components. The only requirement is that data appears in the necessary order within the stream — that is, you can’t make use of data that hasn’t yet appeared.

Finally, this technique is fairly advanced, and you should consider carefully whether you actually need the full power of a state machine. Depending on your particular use-case, a simpler solution might be possible.