Solving the Data Problem with Context Linking


We have a problem with data. Without context, data is just bits and bytes.

This article explores a novel paradigm for tracking context and lineage alongside derived data.

It proposes the use of blockchain technology, similar to the approach adopted for Non-Fungible Tokens, as a solution to the problem of how to link otherwise unconnected context and domain expertise to any dataset.

What is data?

Have you ever really thought about that? I mean in a philosophical ‘why-is-it-there’ kind of way. What really is data?

“Everything our senses perceive is data, though its storage in our cranial wet stuff leaves something to be desired. Writing it down is a bit more reliable, especially when we write it down on a computer. When those notes are well-organized, we call them data.” Cassie Kozyrkov: Understanding Data

The challenge is that data can mean almost anything to anyone, especially when anyone can write it down.

Derived data

As soon as we apply a human interpretation by writing data down, or extracting it from its originating source, we have applied a derivation to data. Derived data here is anything other than the actual originating physical instance of each datum in its very earliest form.

In other words: almost all data in existence is derived data!

… and almost all derived data was recorded by someone other than you.

Other People’s Data is Inherently Untrustworthy

The problem with other people’s derived data is that before we can know that it is a faithful and useful derivation from its origin(s) we must understand its context, and validate all of the derivations that have been applied across its lineage.

NB: Even our own data may carry unacknowledged (and unknowable) personal bias.

Context Linking Tokens

Context Linking Tokens (or CLTs) are individual ledgers for derived datasets. They immutably track all of the derivation steps executed to derive a given dataset from all of its originating sources.
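To make the idea concrete, here is a minimal sketch of what a single CLT ledger entry might record. It is purely illustrative: the class names, fields and the `CLTLedger` container are assumptions, not part of any existing standard or implementation.

```python
# Hypothetical sketch of a CLT ledger entry; names and fields are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DerivationStep:
    """One step in the lineage of a derived dataset."""
    source_ids: tuple[str, ...]   # inputs: original datasets or earlier derivations
    operation: str                # e.g. "filter rows where country == 'UK'"
    actor: str                    # who (or what system) performed the derivation
    context: str                  # the context captured at the time of derivation
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class CLTLedger:
    """An append-only record of every derivation step behind one derived dataset."""
    dataset_id: str
    steps: list[DerivationStep] = field(default_factory=list)

    def append(self, step: DerivationStep) -> None:
        self.steps.append(step)
```

In a real token the ledger would live on a distributed chain rather than in memory; the structure above only shows what each entry would need to carry.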


It is anticipated that Context Linking Tokens could be applied across a wide range of fields, from managing data architectures at enterprise scale on one side, all the way to tracking (and potentially monetizing) personal data within online platforms on the other.

  • The concept may even enable marketplaces for expertise, where experts can ‘mint’ their insights and trade on the value they generate.

Please also see my YouTube video of this post.

The Data Architecture Challenge

Creating usable, sustainable data architectures at any scale is difficult (and expensive). It has always been so. This is because every piece of original data can so easily be abstracted into many, widely varied forms.

The utility of data is a function of its form … but data form is determined by context. The format and structure we decided to use to write the data down were influenced by how the data was originally created, and by every intervening derivation that may have been applied.

In other words the value of data, its utility, is determined by all of its originating context.

However, it is not possible to physically link data to its context.

So, the utility of derived data is a function of its context … but that context is unconnected, and ‘unconnectable’, to the data.

We have a problem.

Given that the majority of derived data was created by other people, and that we have no way of linking its originating context to any of its derivations, it is very challenging to design any kind of common approach to data management or to construct a common data architecture.

Instead we end up solving the easier but different problem of how to most effectively quality assure other people’s data.

Centralize it!

Today enterprises do this by consolidating derivation activities within a central IT function, recording the context (as ETL logic) as they re-derive the data from its originating sources.

This is done using single-purpose systems (like spreadsheets, or even simple text files), monolithic ERP systems, enterprise data warehouses or even data catalogues.

But this approach is slow, inflexible and expensive … especially at scale.

Data Lakes

We thought we had solved the data problem by building Data Lakes and filling them with raw data, but that just avoided the QA problem altogether, rather than solving it.

After all, “other people’s data is inherently untrustworthy”.

… and now Data Lakes are a particular problem in data architectures: they account for much, much more data with even less context.

Ever More Data-Centric

As a species we are ever more data-centric. The rewards are just too great to ignore. Machine intelligence is magic!

“Any sufficiently advanced technology is indistinguishable from magic” Arthur C. Clarke

But machine intelligence is entirely dependent on derived data, and as we know even personally-derived data may carry unacknowledged bias.

Baked-in bias is a significant concern for machine learned algorithms.

Data Mesh

The Data Mesh approach tries to solve the problem by holding ‘the other people’ responsible for proving the trustworthiness of their data. Data Products are deemed to be the responsibility of the teams who curate individual data derivations.

But how far do you go? … it’s turtles all the way down (and up).

And what about data mash-ups? After all, inspiration, the analysts’ ‘secret sauce’, lies in bringing data from different contexts together to uncover new relationships.

So what are data-centric individuals and enterprises to do?

Categorizing Data Might Help

Separating data into two distinct categories is a start:

  1. Original data: This is raw data in the exact form it was created; it is the first genesis of the data. It can be machine-generated or human-created (a ‘to do’ list on a napkin). It can be stored digitally, but not necessarily so.
  2. Derived data: Any data that is not original … in other words, wholly derived from original data, from other derived data, or a combination of the two. But remember: derived data is inherently untrustworthy (especially other people’s derived data).

New terminology

We have coined the following two new terms. Differentiating the specific CLT use of these terms from their generic use will hopefully reduce confusion.

  • “Dadiv” — referring to any derived dataset, or any individual element of a derived dataset
  • “Dariginal” — referring to an original dataset, or any individual element of an original dataset
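As a rough illustration of the distinction, the two categories might be modelled like this (the types and fields below are assumptions made for illustration only):

```python
# Illustrative sketch of the two categories; not a prescribed data model.
from dataclasses import dataclass


@dataclass(frozen=True)
class Dariginal:
    """Original data: a datum in the exact form it was first created."""
    id: str
    payload: bytes   # the raw content, e.g. a sensor reading or a scanned napkin


@dataclass(frozen=True)
class Dadiv:
    """Derived data: wholly derived from dariginals and/or other dadivs."""
    id: str
    source_ids: tuple[str, ...]   # the dariginals and dadivs it was derived from
```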

Solving the Problem

But how exactly can we solve the problem of validating data at scale and attaching context in a way that is immutable?

Enter Blockchain & NFTs

  • Blockchain technology shifts the protection of trust from teams of people and centralized institutions to technology: distributed, decentralized ledgers of cryptographically secured records that are effectively immutable.
  • NFTs (Non-Fungible Tokens) are an application of blockchains to immutably track ownership of unique digital assets.
https://www.theverge.com/2021/3/11/22325054/beeple-christies-nft-sale-cost-everydays-69-million
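To show the core idea in miniature, here is a hash-chaining sketch: each record embeds the hash of the previous record, so altering any earlier record is immediately detectable. This is a toy, in-memory illustration only; a real distributed ledger also needs signatures, replication and consensus.

```python
# Toy illustration of hash-chained, tamper-evident records.
import hashlib
import json


def record_hash(record: dict) -> str:
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


def append_record(chain: list[dict], payload: dict) -> None:
    previous = record_hash(chain[-1]) if chain else "genesis"
    chain.append({"prev_hash": previous, "payload": payload})


def verify(chain: list[dict]) -> bool:
    """Return True if no record has been altered since it was appended."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != record_hash(chain[i - 1]):
            return False
    return True


chain: list[dict] = []
append_record(chain, {"step": "extract sales figures from ERP export"})
append_record(chain, {"step": "aggregate by region and quarter"})
assert verify(chain)

chain[0]["payload"]["step"] = "tampered"   # any later edit breaks the chain
assert not verify(chain)
```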

CLTs — What They Are

What if we used blockchain technology to create a ledger of all the derivation steps used to derive a unique ‘dadiv’ from its founding sources?

In other words … can we apply the emerging NFT concepts (tracking ownership in unique digital assets) to tracking all of the steps applied to derive data, including all intervening contexts?

If we can apply these technologies to dadivs, then we not only link any derived dataset to its related contexts; we also create the potential to recognise the contribution of all actors along the value chain of each unique derivation.

This could even be the basis of a ‘marketplace of expertise’, with each contributor reserving rights to their individual component of insight, tracked as a distinct step in the derivation lineage.
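As a sketch of what that attribution might look like, a CLT ledger’s derivation steps could be walked to tally each contributor’s part in the final dadiv. The ledger contents and actor names below are made up for illustration:

```python
# Hypothetical attribution pass over a CLT ledger; data is illustrative only.
from collections import Counter

ledger = [
    {"actor": "sensor-network",  "operation": "capture raw temperature readings"},
    {"actor": "data-engineer-a", "operation": "clean and deduplicate readings"},
    {"actor": "analyst-b",       "operation": "derive weekly anomaly index"},
]

# Tally who contributed along the lineage of this dadiv.
contributions = Counter(step["actor"] for step in ledger)
print(contributions)
# Counter({'sensor-network': 1, 'data-engineer-a': 1, 'analyst-b': 1})
```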

We could also establish distributed trust in communication channels, such as news articles, with the full, immutable lineage of every article being publicly visible, and all actors accountable for their component contributions.

CLTs or Context Linking Tokens are a proposed adaptation of the Non-Fungible Token concept to address the challenges of managing data architectures at scale.

If we accept that all personal data could be treated as a derived dataset, and enable it to be minted automatically, then ‘personal data’ CLTs could be used to protect ownership of personal datasets, requiring a formal license for their use.

The CLT Process

  1. The initial step is to ascribe ownership to individual derived datasets. This is the ‘minting’ process, and it facilitates the creation of single or multiple copies of individual derived datasets. It may be that I have a valuable insight that I want to share and realize a return on, and I may want to do that many times over. No problemo. I simply mint as many copies of my derived dataset as I think makes the most economic sense.
  2. Registering Ownership. Public and private registers of dadivs can be maintained, along with all dariginals. These provide an important reference point for CLT ledgers, and the choice to create a private or public register allows us to manage scale and complexity.
  3. Individual CLT ledgers are created for a given dadiv and track all of its founding sources, as well as every derivation step, along with any additional and relevant context.
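A minimal, end-to-end sketch of the three steps above follows. Everything here (the `mint` helper, the `Registry` class, the ledger shape) is a hypothetical illustration, not an existing API:

```python
# Hypothetical sketch of the CLT process: mint, register, then open a ledger.
import uuid


def mint(dadiv_name: str, copies: int) -> list[str]:
    """Step 1: ascribe ownership by minting one or more copies of a derived dataset."""
    return [f"{dadiv_name}#{uuid.uuid4()}" for _ in range(copies)]


class Registry:
    """Step 2: a public or private register of dadivs (and the dariginals they cite)."""
    def __init__(self) -> None:
        self.entries: dict[str, dict] = {}

    def register(self, token_id: str, owner: str, sources: list[str]) -> None:
        self.entries[token_id] = {"owner": owner, "sources": sources}


def open_ledger(token_id: str) -> list[dict]:
    """Step 3: create an individual CLT ledger that will track every derivation step."""
    return [{"token": token_id, "step": "genesis", "context": "ledger created"}]


# Usage: mint three copies of an insight, register them, and start their ledgers.
registry = Registry()
tokens = mint("quarterly-churn-insight", copies=3)
for token in tokens:
    registry.register(token, owner="analyst-b", sources=["crm-export-2021Q1"])
    ledger = open_ledger(token)
```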

Conclusion

We know we have a problem with managing derived data. It is manifest in the challenges we face when trying to manage ‘other people’s data’, especially at scale.

We have tried to solve this problem by either:

  • Re-performing the derivation steps centrally, but this doesn’t scale well, and recently data volume, complexity and consumer expectation have all exploded. Central teams simply can’t keep pace with the fire-hose supply or the changing-at-the-speed-of-light demand; or
  • Forcing ‘the others’ to own the problem, but this has not yet been proven to be practical.

The evolution of distributed ledger technology offers a new option for linking context to derived datasets, via tokens.

We propose that it will be possible to track context along the derivation lineage of derived datasets through Context Linking Tokens. Doing so will enable us to track ownership, accountability for accuracy, and fitness for purpose across all interim stages, and to do so immutably.

We present Context Linking Tokens.
