There Is No ‘Data’

Hayo van Loon
Published in incentro
Jan 26, 2021

In the early 90s, corporate IT was invaded by a new tribe: the business intelligence people. Seeking to satisfy their hunger for ‘data’, they sank their tendrils deep into existing systems. This often put them at odds with the existing IT teams, whose task was ‘keeping the main show running’.

Sadly, decades later, this still describes the position of ‘data’ workers in many organisations: invaders in IT. But what if we could start anew? With all our knowledge and experience building IT solutions; with all the currently available (cloud-based) technology? How would we approach this ‘data’? Well, first we would study it a bit.

We can quickly conclude that data in general have no shape or form. That is a bit of a dead end. So we shift our attention to how they come into being and how they disappear. In other words, we will investigate their life cycle and see where that gets us.

The Circle of Life

Compared to an organism’s, the life cycle of a piece of data is actually pretty simple. Right off the bat, we can exclude a few phases. Data do not die; they just go out of (your) field of view. This is erring on the side of caution: it is much safer than assuming they conveniently disappear. Social media are rife with awkward posts waiting to come back to haunt their authors.

Aeons later, even the dead can still tell stories (at the La Brea Tar Pits, for instance).

Data also do not age, go stale or decay. A newspaper issue might be worthless to you in a week’s time, but a historian might love it a century from now. The context determines its value. Personally though, unless I plan on painting the house, my newspapers end up in the paper bin.

With those out of the way, conveniently few stages remain. Following some business event, a piece of data is produced. Then, after some time, this piece of data may or may not be distributed. And since data are inanimate entities, nothing else will happen without some external force.

We can also pinpoint where this all happens: inside every single application. So if you are in need of some kind of data and you know your application landscape, where would you start looking? The answer should be evident:

The most natural source for data is the application that produced it.

Though attractively simple, building upon this two-phase cycle ought to make you feel a wee bit uncomfortable. If nothing happens to the data between production and distribution, sensitive parts will ultimately end up in the wrong places. Even when only distributing to trusted parties, this is a considerable risk.

Fortunately, the solution is rather simple. Prior to distribution, the pieces of data have to be prepared. You redact (delete, encrypt, hash, …) things like passwords, credit card numbers and personally identifiable information. This phase can also be used to improve data usability by adding extra information. I found that slapping a creation timestamp on every piece often helped solve or prevent all sorts of issues further down the line.
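As a rough sketch of what such a preparation step might look like (in Python; the field names and redaction choices below are purely illustrative assumptions):

import hashlib
from datetime import datetime, timezone

SENSITIVE_FIELDS = {"password", "credit_card_number"}  # dropped entirely (assumed names)
HASHED_FIELDS = {"email"}                              # pseudonymised rather than dropped

def prepare(record: dict) -> dict:
    """Redact sensitive fields and add a creation timestamp before distribution."""
    prepared = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            continue  # deleted outright
        elif key in HASHED_FIELDS:
            prepared[key] = hashlib.sha256(str(value).encode()).hexdigest()
        else:
            prepared[key] = value
    # The extra timestamp that tends to prevent issues further down the line
    prepared.setdefault("created_at", datetime.now(timezone.utc).isoformat())
    return prepared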

The actions taken in the preparation phase, combined with a well-maintained distribution list, should keep things on the safe side, legally and otherwise. Let us now shift our focus to distribution.

When you think of distribution, you think of an active process: newspapers delivered to your doorstep, for instance. So if you want to promote usage of and experimentation with data, why not design a flexible system for distributing it? Fortunately, we can adopt a solution from the real world: subscriptions. We will make interested parties sign up for subscriptions at the source application. From there, the sign-ups can be judged by that application’s managers; they ought to know best who to allow access.

You could also assign roles to (groups of) subscribers. These roles can then be used to tailor the preparation phase to each audience. And since the sign-up process is awfully generic, we can (and should) standardise it across our application landscape for both greater efficiency and ease of use.
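A minimal sketch of what such a standardised subscription record could hold (again Python; every field name here is an assumption, not a prescribed schema):

from dataclasses import dataclass

@dataclass
class Subscription:
    subscriber: str         # e.g. "finance-dashboard"
    role: str               # used to tailor the preparation phase, e.g. "internal" or "external"
    channel_type: str       # "queue" for streaming or "file_dump" for batch delivery
    channel_address: str    # a topic, queue or bucket path provided by the subscriber
    approved: bool = False  # flipped by the application managers judging the sign-up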

An application that pushes data as a side-effect of handling ‘normal’ API calls

Pushing for Simplicity

But why do we go for a newspaper-like subscription model? Why not a library subscription model where subscribers can walk in and browse at their leisure? Isn’t that easier to implement?

Pushing data might sound like a lot of work (it’s not), but it is essential to simplifying the entire system. It allows us to introduce the following rule:

Once data have been pushed into a subscriber-provided channel, they are no longer the responsibility of the producer.

‘Channel’ here is a general term covering both streaming and batch communication (message queues, file dumps). If we then add that providing, and allowing access to, the receiving channel is the responsibility of the subscriber, the entire data flow is covered.
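To make that boundary concrete, here is a small sketch (Python; the interface and both implementations are my own assumptions, not a prescribed design) of a channel abstraction covering both styles, with the producer's responsibility ending at the push:

import json
from typing import Protocol

class Channel(Protocol):
    def push(self, record: dict) -> None: ...

class QueueChannel:
    """Streaming channel backed by whatever message queue the subscriber provides."""
    def __init__(self, client, topic: str):
        self.client = client  # placeholder for a queue client (Pub/Sub, SQS, ...)
        self.topic = topic

    def push(self, record: dict) -> None:
        self.client.publish(self.topic, json.dumps(record).encode())

class FileDumpChannel:
    """Batch channel: records are appended to a file drop owned by the subscriber."""
    def __init__(self, path: str):
        self.path = path

    def push(self, record: dict) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

def distribute(record: dict, channels: list[Channel]) -> None:
    for channel in channels:
        channel.push(record)  # once pushed, it is the subscriber's responsibility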

The main reason for this exercise is to free producer applications from any (external) burdens tied to historical data, like random query access. Once the data have been pushed, it’s done (produce, push, done). It brings the entire process more in line with a normal web API flow (request, reply, done). In fact, it blends in rather well with other API specifications:

// Fetch a single book by name
GET /v1/books/<name>
// Fetch a list of books
GET /v1/books
// Receive updates on books
POST /v1/books:subscribe

As with normal API calls, changing requirements might force you to create a new version at some point. This too follows an age-old pattern: develop (beta), use, deprecate, drop.

Tar pits are lakes bubbling with history. But they are not necessarily an asset to every garden.

History? Dashboards? Analytics?

But what about all that ‘precious’ historical data? Well, that is up to the parties that actually see value in it. The principle is once again simple: "You want it? You pay for it." Since the value of data is context-dependent, it is hardly fair to burden someone who is not necessarily interested (the producer) with the costs. To be clear: those costs are hardly about storage; they are about query access and maintenance.

So if someone is interested in historical data, they can subscribe to its feed. They can then transform its output at their leisure before storing it as history. The need for ad hoc historical data access could also be pooled. Typical use cases would be analytics and mitigating cold-start in machine learning. And of course, access to historical data is pretty much a must-have when a data subscriber has to recover from an outage in its incoming data channels.

For all this, you would just create a special application. It would be subscribed to all the relevant feeds to build histories for each. Perhaps ‘data warehouse’ would be a good name for it. Maybe do a few extra transformations and plug your dashboarding and analytics tools somewhere into it as well. Important caveat: the data in the warehouse must have the same access rules as their source applications, lest you risk a data leak. Use with caution.
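A bare-bones sketch of such a history-building subscriber (Python; the in-memory storage and the role-based access check are illustrative assumptions only):

from collections import defaultdict

class Warehouse:
    """Subscribes to the relevant feeds and appends every incoming record to that feed's history."""

    def __init__(self):
        self.history = defaultdict(list)  # feed name -> list of records
        self.access_rules = {}            # feed name -> roles copied from the source application

    def on_record(self, feed: str, record: dict) -> None:
        self.history[feed].append(record)  # called whenever a subscribed channel delivers data

    def query(self, feed: str, caller_roles: set) -> list:
        # The caveat above: enforce the same access rules as the source application
        allowed = self.access_rules.get(feed)
        if not allowed or not (allowed & caller_roles):
            raise PermissionError(f"no access to feed {feed!r}")
        return list(self.history[feed])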

Take-away time!

Synthesis

And with that, we should have ticked all the major boxes on ‘data’. From this point of view, there really is not that much special about it: in essence, it is just another way of accessing ‘normal’ application data. Modern tooling in the form of managed cloud services (like Google Pub/Sub, AWS SQS, …) can greatly reduce the development time and effort needed to make it so.
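For instance, if a subscriber provides a Google Pub/Sub topic as its channel, the actual push can boil down to a few lines (the project, topic and payload below are made up):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# The subscriber provides the topic; the producer only needs permission to publish to it.
topic_path = publisher.topic_path("subscriber-project", "books-feed")
future = publisher.publish(topic_path, b'{"name": "some-book", "created_at": "2021-01-26T12:00:00Z"}')
future.result()  # raises if the message was not accepted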

Of course, unless you have a greenfield situation, you will be haunted by the ghosts of IT past: legacy systems, vendor software with poor data interfaces and so on. But addressing these should be seen as steps on the path towards the synthesis of ‘data’ and IT, rather than as compromises on it. Defining a strategy is way beyond the scope of this post, but it would likely involve terms like shared principles and rules, developer autonomy and decentralisation.

If you really want to make use of ‘data’, there should ultimately be no ‘data’. Either you fully integrate it into everyday IT, or you perpetuate a three-decade-old workaround. Your choice.

PS. Before We Forget…

Earlier I mentioned that data do not die. How does this align with the ‘right to be forgotten’, which has been anchored in legislation in many countries? First of all, the easiest way to comply is to redact (hash, encrypt, …) sensitive elements of data prior to distribution. But sometimes that is not feasible or not sufficient. So how do we handle a request to be forgotten? The data subscription system will help us. We start by purging the offending data at the source application. Then we go through its subscription list, looking for subscribers that might have received the data, and repeat the purge there, cascading down the chain as long as needed.
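In code, that cascade could look roughly like this (Python; name, delete_subject and subscribers are hypothetical methods on a hypothetical application object, purely to illustrate the walk over the subscription graph):

def purge(app, subject_id: str, visited=None) -> None:
    """Purge a subject's data at an application, then at every subscriber downstream."""
    if visited is None:
        visited = set()
    if app.name in visited:  # guard against cycles between applications
        return
    visited.add(app.name)
    app.delete_subject(subject_id)        # purge at this application first
    for subscriber in app.subscribers():  # the well-maintained distribution list
        purge(subscriber, subject_id, visited)

As long as every application keeps its subscription list up to date, the cascade ends once the last downstream copy is gone.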

Hayo van Loon

Cloud Architect, Developer and Climber. Never stop coding.