Beyond data pipelines

Mark Burgess
Aljabr
Dec 10, 2018

Realizing Smart Workspaces

In this series of posts, we’ve described many of the issues that surround data pipelines, past and present, mainly in the traditional sense of workflow management. Yet this story, even with all its historical twists and turns, is the tip of an iceberg that is currently expanding throughout businesses and social spaces. As we extend data services into every corner of our living and working environments, and as sophistication in service engineering and bulk processing accumulates, so the problems of distributed data access and data processing take on an entirely new significance. We are heading from an abstract, far-away cloud towards the embedded modularization of pervasive specialized spaces — with data as the glue that holds it all together.

Pervasive ubiquity

When is a pipeline not a pipeline? When it’s plumbing. Your house, your workplace, your town, and your country have pipes that carry utility services — water, gas, sewage, electricity. These networks are partly centralized and partly decentralized for sharing common resources, but they have access points everywhere. We may start with simple pipelines, but as the plumbing expands, the scope of the pipelining transforms into something else entirely. Moreover, on top of this public plumbing there is even more specialized and private plumbing — the linkage of human process flows, with policies, barriers, access controls and adaptive responses.

It’s 20 years since the terms ubiquitous and pervasive computing were coined, around the innovative work of Xerox PARC. During that time, Information Technology infrastructure has undergone something of a revolution to open up hardware and software. Without that change, ubiquitous computing was never going to take hold. Today we’ve rebranded it as the Internet of Things, and a few companies have taken the first baby steps. Pervasive computing didn’t pass in the night — it’s still on its way, building momentum, and navigating the changing trajectories of business. The trajectory is clear: what a few giant companies do with data today, everyone will be doing in the next decade. But today, it’s still too hard to plumb even simple systems in a scalable way. Product developers build for one use-case at a time. How about building for growth and maintainability?

Plumbing the depths and the edges

We want to instrument human processes all the way out to the edges of our involvement in the world — from distant space probes to homes and cities, as well as from inside businesses. We might talk about the computing cloud as if it’s everything there is, but it’s not there yet. Even that pales into insignificance compared to the logistical processes that keep an economy running. The growing topic of "serverless" computing is trying to reinvent the cloud as a single transparent multi-user time-sharing system, but so far it doesn’t address the practicalities of real-world needs. The cloud itself is evolving — and will eventually encompass far more of those edge computers too.

Data sources live in environments that equip data with both context and semantics. Those are the gold that data processing is sieving for — and they are only found at the edge. Centralized datacentres can provide brute force to dispatch certain large tasks that go beyond individual resources, but there is an abundance of wasted capacity at the edge to be reclaimed by a better, more inclusive platform. If there is one lesson we’ve learned from data processing, it should be that processes benefit from being context aware — from being data aware — because a complete separation of concerns leads to an inefficiency of layers. As long as there is no unified data aware network stack to plug into, IT innovation for that new era will be choked off by its own complexity. It doesn’t have to be like that.

There’s enormous scope for the routing and scheduling of "business processes" (whether in the public or private sectors) on new multi-scale platforms. Key processes want to live in two kinds of location (a minimal sketch of the split follows the list below):

  • Detached, in situ, at the edge (ingress tasks):
    • Close to the data sources, you deploy initial logic to handle the smart sampling, selection and ingestion of data, discarding noise at the outset.
    • You keep the source data in a repository, with replication for disaster recovery, for as long as it might be needed.
  • Integrated, in vitro, in the cloud (core tasks):
    • Selected data are pulled on demand into the more powerful cloud.
    • Process containers manipulate and transform data, and publish results as accessible, available URIs.
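
To make the split concrete, here is a minimal sketch in Python of the two kinds of task, assuming a hypothetical pipeline: an ingress task that samples and filters in situ at the edge, and a core task that pulls the selection into the cloud and publishes a result under a URI. The function names, the Record type, and the example URI are illustrative only; they are not part of any Aljabr or Koalja API.

```python
# Minimal sketch of the edge/cloud split described above.
# All names (Record, ingress_at_edge, core_in_cloud) are illustrative,
# not part of any real Aljabr or Koalja API.

from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Record:
    source: str      # where the data came from (context)
    payload: dict    # the measurement or event itself


def ingress_at_edge(raw: Iterable[Record],
                    keep: Callable[[Record], bool]) -> List[Record]:
    """Ingress task, running in situ at the edge: sample and select
    data, discarding noise at the outset."""
    selected = [r for r in raw if keep(r)]
    # The selected source data would also be kept in a local repository,
    # replicated for disaster recovery (not shown here).
    return selected


def core_in_cloud(selected: Iterable[Record]) -> dict:
    """Core task, running in the cloud: pull selected data on demand,
    transform it, and publish the result under an accessible URI."""
    summary: dict = {}
    for r in selected:
        summary[r.source] = summary.get(r.source, 0) + 1
    # Publishing under a stable URI is represented here by returning a
    # mapping; a real platform would expose it via HTTP or object storage.
    return {"uri": "https://example.org/results/summary", "counts": summary}


if __name__ == "__main__":
    raw = [Record("sensor-1", {"t": 21.5}), Record("sensor-2", {"t": None})]
    edge_output = ingress_at_edge(raw, keep=lambda r: r.payload["t"] is not None)
    print(core_in_cloud(edge_output))
```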

In between these two extremes lies a limited network that cannot absorb every bit of data we produce (even if that were a good idea), and a storage layer that has mostly forgotten about garbage collection. I wonder how long it will be before IT confronts its own plastic crisis — and we move from images of plastic mountains in Indochina to documentaries about choked, derelict data graveyards in the cloud, consuming electricity and spurring on global warming through their massive energy inefficiency.

Getting DNA

Tracking what went into and what came out of a cauldron of mixed-up data isn’t just a practical issue, but one of security and integrity. That’s especially true as data processing becomes integral to embedded services at all levels of society. Chains of evidence, in business and in public processes, are important in everything from forensic records to blockchain transactional records. The traditional tools for monitoring individual time-sharing systems (apart from being 40 years old) are next to useless for tracing behaviours in a micro-modular cloud environment. Something better is on the horizon.

From inputs to outputs, data-driven processes conceal execution traces that are labelled by provenance and by intent. Standing on the edge of data processes, Koalja exploits ways to distill insight from these flows and can provide actionable, semantic data for realtime and forensic analyses — a rescaling and a modernization of kernel tracing for a new generation. This is Promise Theory in action.
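
As an illustration of what such a trace might carry, the sketch below defines a provenance-labelled trace event and a small lineage walk over a list of events. The structure is hypothetical, a reading of the idea in the text rather than Koalja’s actual data model.

```python
# Illustrative shape of a provenance-labelled trace event; this is not
# Koalja's actual data model, only a sketch of the idea in the text.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TraceEvent:
    task: str                 # which processing stage emitted the event
    inputs: List[str]         # provenance: URIs of the data consumed
    output: str               # URI of the data produced
    intent: str               # why the stage ran (policy, trigger, request)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def lineage(events: List[TraceEvent], output_uri: str) -> List[TraceEvent]:
    """Walk backwards from an output to the events that produced it,
    giving a chain of evidence for forensic analysis."""
    chain, frontier = [], {output_uri}
    for ev in reversed(events):
        if ev.output in frontier:
            chain.append(ev)
            frontier.update(ev.inputs)
    return list(reversed(chain))
```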

From the plumbing to the workspaces

The core challenges of computing are not about endless layers of APIs or programming language choices — quite apart from the scaling of physical and virtual infrastructure — they amount to a few common issues about the smart management of space and time (a sketch in code follows the list):

  • Breaking a computation into an expressive graph of stages.
  • Storing essential data “close to” users, with low latency and without loss of context.
  • Distributing data efficiently, presenting a consistent index of updates.
  • Accessing versioned data in a consistent state.
  • Tracing processes for understanding and diagnostics.
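
The sketch below puts the first, fourth and fifth of these points together in a few lines of Python: a computation broken into a small graph of stages, with each result versioned and a trace recording which versions each stage consumed. The names and the execution strategy are hypothetical, not a real pipeline API.

```python
# A minimal sketch of a computation broken into a graph of stages,
# consuming versioned data and leaving a trace. The names and structure
# are hypothetical, not a real pipeline API. Assumes the graph is acyclic.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    inputs: List[str]                 # names of upstream stages
    run: Callable[[list], object]     # the work done by this stage


def execute(stages: Dict[str, Stage], source: object):
    """Run the stages in dependency order, versioning each result and
    recording which versions each stage consumed (the trace)."""
    results, versions, trace, done = {}, {}, [], set()
    while len(done) < len(stages):
        for s in stages.values():
            if s.name in done or any(i not in done for i in s.inputs):
                continue
            args = [results[i] for i in s.inputs] or [source]
            results[s.name] = s.run(args)
            versions[s.name] = versions.get(s.name, 0) + 1
            trace.append((s.name, {i: versions[i] for i in s.inputs}))
            done.add(s.name)
    return results, trace


graph = {
    "ingest":  Stage("ingest",  [],         lambda xs: [x for x in xs[0] if x is not None]),
    "average": Stage("average", ["ingest"], lambda xs: sum(xs[0]) / len(xs[0])),
}

if __name__ == "__main__":
    print(execute(graph, [1, 2, None, 3]))
```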

The last of these is more important than we tend to admit — today’s computer systems are living, evolving things, not stable machinery. The goal of data processing is to understand stuff — a process involving humans as well as IT infrastructure.

Service ecosystems begin as simple processing pipelines, but that doesn’t capture their real significance. At Aljabr, we are more interested in the long-term trajectory of data processing, starting with the practical simplicity needed to bring it to a broad audience. Solving today’s problems, without addressing the underlying and upcoming challenges, would be missing an opportunity to provide stability to a generation of users.

Smart workspaces

We work together around single tasks because that is how humans organize. We have already seen the shift to a tradesman model in IT, with microservices allowing separation of concerns and complete lifecycle management of specializations. Heating, lighting and building engineers (Smiths, Carpenters, Coopers, even Burgesses — pick a surname) have been doing this for centuries. Overlapping but independent concerns benefit from smart assistance, and communicate through community meeting places.

Today, we are still getting into a muddle over IT complexity, grappling with relatively new challenges of the Information Age. Tomorrow, we’ll return to the old model of human society, with modular trades, while making room for automation and even AI. The enabler for that Integration Age will be smart plumbing. The next step will be to go from smart plumbing to smart spaces, with smart virtual ecosystems, extending all the way out to the edge. The ingredients are available today from leading edge companies. The composition of those maturing ideas is Aljabr’s pipeline task.

The goal of smart infrastructure, with tasks and links, DAGs and DCGs, is not just to wire stuff together, but to act as coherent data circuitry — to be the “linker” in a multiscale compiled language, binding jobs that are written in totally different tongues.
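
One way to picture that “linker” role is as a declaration that binds tasks written in different languages, each with its own runtime, while the platform owns the links between them. The sketch below is hypothetical; the image names and fields do not correspond to any real specification.

```python
# A sketch of the "linker" idea: a pipeline declaration that binds tasks
# written in different languages, with the links as the data circuitry.
# Image names and fields are hypothetical, not a real spec.

from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    image: str            # container image providing the task's own runtime
    command: List[str]    # the job itself, in whatever language it is written


@dataclass
class Link:
    source: str           # producing task
    target: str           # consuming task


pipeline = {
    "tasks": [
        Task("sample",  "python:3.11", ["python", "sample.py"]),
        Task("model",   "r-base:4.3",  ["Rscript", "model.R"]),
        Task("publish", "golang:1.22", ["go", "run", "publish.go"]),
    ],
    "links": [
        Link("sample", "model"),
        Link("model", "publish"),
    ],
}
```

The design point is that no single language runtime owns the pipeline; the links carry the data, and each task is free to be written in whatever tongue suits it.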

It’s programming…

We’ve exposed many of the basic requirements for future pipelines, and beyond, over the course of this blog, based on almost fifty years of lessons derived from history and experience. If we forget about the transitory novelty of the technologies in the news (microservices, Kubernetes, Knative, serverless, unikernels, service mesh, and so on), what it all amounts to is this: how to scale a programming system from a single PC to a world-spanning mega-computer, formed from many parts, partially shared by private individuals, and changing in realtime. The principles are known and reasonably understood — so it’s time to get on and integrate them for the emergent, interoperable cloud.

This is the final blog post for this year. In the New Year, we’ll return with new posts on the future of data aware infrastructure.
