Data in its natural habitat: JSON documents

Sandil Srinivasan
Published in techBrews
Nov 17, 2019

Nostalgia: Before JSON, there was aeSchema

In 2004, I worked on a product called Rendezvous (also known as RV) at a bank in Dubai. RV was a truly unique approach to distributed messaging in those days. It was a logical bus, and all applications needed to do was send or receive data on the bus over UDP broadcast. In fact, the two tools that TIBCO continues to ship to date, with some of its software, are tibrvlisten and tibrvsend.

Rendezvous in its time was, simply put, science fiction. It blended the easiest setup in the world with low latency and a beautiful piece of architectural prowess that married TCP and UDP. It was super simple to troubleshoot (listening on a wildcard subject across the subnet was all you had to do), and you could see every message in and out; a debugger's dream and a security engineer's nightmare.

The typical output of tibrvlisten is a message being sent "over the bus", also known as an ae message (ae = ActiveEnterprise), and it looks very similar to this:

2019-11-15 10:54:59 (2019-11-15 10:54:59.520000000Z): subject=hello, message={DATA="hey world" value=17.10}

Look familiar? Here was something that pre-dates JSON: aeSchema. It was beautiful: key-value pairs with no fuss. It had basic data types (although we largely dealt with strings), it had sequences and repeating structures, and it was easily parsable. Fifteen years later, we're back to where we began. JSON remains one of the easiest ways to represent data.
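For comparison, here is roughly how that same ae message might be written as a JSON document today (a sketch; the field names simply mirror the message above):

{
  "subject": "hello",
  "message": { "DATA": "hey world", "value": 17.10 }
}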

But what about data that is not in flight: is JSON the right choice of technology to store all data at rest? When we look at managing data, we typically consider one or more of three use-cases.

Ability to search through a large data set


These are classic big-data use-cases that involve data exploration and full-text search across both structured data and unstructured data such as PDF/Word attachments. The outcomes that organizations expect from being able to search large data sets involve scenarios like negative-list checks, dedupes and customer profiling for risk (fraud) and opportunities (cross-sell/up-sell). Hadoop and Elasticsearch are popular technologies that address this problem really well.
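As a minimal sketch of what this looks like in the document world, assuming a MongoDB collection named transactions (the collection and field names here are purely illustrative), a text index makes a negative-list style check a one-liner in the mongo shell:

// build a full-text index over the free-text fields (field names are illustrative)
db.transactions.createIndex({ narration: "text", counterparty: "text" });

// scan the entire data set for a flagged term
db.transactions.find({ $text: { $search: "acme-sanctioned-entity" } });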

Ability to report on the data set


As organizations become more data-driven, the importance of spinning up visualizations and dashboards that aggregate data cannot be overstated, particularly for consumption by thought leaders, executives and decision makers. The more popular BI tools in the market (Tableau, Qlik, Spotfire, Power BI) all rely on some form of relationships existing between data sets. Most of these tools were built to work with RDBMSes and now support document data stores, either natively or via well-established connectors.

Ability to serve the data as APIs


This is the aspect of data that is closest to my heart and often ignored. Let's consider a simple example to put this into context and understand why it is important.

A large bank deals with hundreds of thousands of transactions (POS swipes, internet banking transfers, UPI payments) every day. One of their data science projects is to build models that classify and cluster similar transactions for a given person, to determine what's common amongst their payments and to predict whether the person is likely to spend on food, travel or tech gadgets. This requires a very strong analytical engine that can sift through thousands of records and make a prediction, not necessarily in real time, but at regular intervals (such as every 24 hours).

The outcome of this prediction is a probability score that is served on a visualization dashboard. Traditionally, we would extract that data, build whitelists and campaigns around it, and target customers with contextual offers relevant to them, nudging them towards a store or increasing their spend on the bank's co-branded card.

Today, a lot of marketing tends to happen when the customer makes contact with the bank: a phone call into the IVR drops into the call center team, a user logs into internet banking after a few weeks, or a customer walks into a shop and makes a purchase. These are the best moments to recommend a contextual offer the customer can act on; at that point in time, they're hooked.

Calculating that probability is important, but what's equally important is being able to serve that data as quickly as possible via an API, and to handle hundreds of thousands of hits per second.
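As a minimal sketch of such an endpoint, assuming Express and the official MongoDB Node.js driver (the database, collection and field names such as spendPropensity are purely illustrative):

const express = require("express");
const { MongoClient } = require("mongodb");

const app = express();
const client = new MongoClient("mongodb://localhost:27017");

app.get("/customers/:id/propensity", async (req, res) => {
  // the nightly batch job writes one score document per customer
  const scores = client.db("bank").collection("spendPropensity");
  const doc = await scores.findOne({ customerId: req.params.id });
  if (!doc) return res.status(404).json({ error: "customer not found" });
  // the stored BSON document is serialized straight back out as JSON
  res.json({ customerId: doc.customerId, scores: doc.scores });
});

// connect once, then start serving
client.connect().then(() => app.listen(3000));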

This is where JSON rules.

All hail the Omnipresent JavaScript

JS is a real language, and that's been a key factor in the adoption of JSON. Coupled with TypeScript, it is ubiquitous: from Angular and React to NodeJS/Express and MongoDB, JSON/BSON documents are everywhere.

The beauty of it is its versatility. Here are a few reasons I would pick JSON documents to store data.

1. Key-value pairs are far more versatile than rigid columns and tuples.

For the same data set (aka collection), it is perfectly alright for different records to have different attributes. Not all employees will have an exitDate, not all accounts will have a closedDate, and not all payments will have a return element. I don't end up having to design each and every attribute for each and every scenario up front; I don't have to fire a shitload of ALTER TABLE statements every time I come up with a new attribute specific to a use-case; and I don't have to stare at a hundred null values in a row.
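For instance, these two records can live side by side in the same (hypothetical) employees collection, with no migration in sight:

{ "empId": 101, "name": "Asha", "joinDate": "2015-03-01" }
{ "empId": 102, "name": "Ravi", "joinDate": "2012-07-15", "exitDate": "2019-06-30" }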

2. It makes data seem more natural.

I have never enjoyed joins. For decades, an e-commerce order with multiple line items would have to be thought of as a one-to-many relationship. It seems completely unnatural: why would I need two tables to manage and represent one dataset? The document model also discourages "filler" attributes and schemas with columns such as "Address1", "Address2", etc. JSON documents simplify it: one record (aka document) = one e-commerce order, arrays to represent multi-value datasets, nested structures to indicate grouping.
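A quick sketch (field names purely illustrative) of one order as one document, with an array for line items and nesting for grouped data:

{
  "orderId": "ORD-1042",
  "customer": { "name": "Asha", "email": "asha@example.com" },
  "shippingAddress": {
    "lines": ["221B Baker Street", "Marylebone"],
    "city": "London"
  },
  "items": [
    { "sku": "BOOK-01", "qty": 1, "unitPrice": 450.00 },
    { "sku": "PEN-07", "qty": 2, "unitPrice": 35.00 }
  ]
}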

3. The overhead of APIfication is lower.

While reporting is slow, I believe the performance overhead on reports is acceptable, given that a few humans are requesting aggregated data over large sets on demand. APIs, on the other hand, are built to scale; fetching data from a database that stores it in BSON, handling it as JSON in NodeJS/Express, and then passing it back to a UI that renders it using JSON/HTML is the norm. The overhead involved in this serialization is almost nothing.

[Figure: a comparison of how an e-commerce order is stored in an RDBMS vs a MongoDB BSON structure visualized as a JSON document]

To summarise, JSON and the document data model remain my preferred way of storing data at the moment. It certainly isn't one-size-fits-all, but it fits most use-cases.

Disclaimer: The views in this post are personal, yet many of these principles were applied to design our open-source CRUDder, which ultimately paved the way for us at CAPIOT Software to build two products we are extremely proud of, both only getting started in market adoption. XCRO is the only enterprise-grade tool out there that manages complex escrow deals, from constructing highways to producing movies, as well as high-throughput trust-retention constructs such as e-commerce transactions. Omni Data Platform is an API-first enterprise data management platform deployed on a containerized architecture. Both tools are built on top of a world-class database, MongoDB, with JSON documents at its heart. MongoDB, having added ACID transactions in version 4.0, is now starting to leverage Lucene to power its full-text search, and there are several BI connectors out there that make it easy to report on the document database, enabling us to solve a wide variety of use-cases with a single design that can be configured for different flavours of data.
