#9: Data, data, everywhere…

And Smart Links in between

Published in

Aljabr

8 min readOct 15, 2018

In the last chapter, we talked about simple process abstractions for prototyping in cloud pipelines, but we didn’t say too much about how data move around. This is one of the most important aspects of data processing, as the volume of data is generally far greater than code required to process it. In post #7, we remarked that processing by batch versus stream is just a high level version of the queue scheduling issue that led to packet networks and TCP/IP. Even in a continuous stream, you need to transmit packet by packet to remain agile and to recover from errors. The question is: what are the right abstractions to make these issues transparent for users?

Drink from the source

Data sources are basically generalized sensors, attached to some real world process (see figure below). It might be accounting records, the Internet of Things (smart homes and factory systems, delivery trucks for logistics, or telemetry from aircraft in flight), it might be public Smart City systems, or just users clicking away about their preferences on e-commerce websites. Some data start life far away from the point of analysis, while some are integrated into business processes, already in the cloud.

Data sources act as pseudo-services, some local and some remote. In practice, the delivery mechanism for data might be represented in a number of ways — binary or ASCII text, stored as simple file objects, assembled from separate queries to online servers, data buffered through a Kafka stream, or even pushed into multiple queues for batch processing. A file stream may be a pipe, and a service might be a sensor, but in programming terms, the service model forms the bulk of the normal cases. Push and pull semantics can be added as flavors to each, where they make sense in different cases.

Data updates are sometimes fed into sensory inputs manually, like a batch of coffee beans poured into a grinder, and sometimes they arrive of their own volition. Sometimes they are fetched explicitly, like fetching water from a well. Some tools assume that there is only one appropriate model for input and output, as they are designed for a specific narrow case, and this can cause consternation for users who have designed their thinking around their own constraints. Aljabr’s Koalja embraces these variations without prejudice.

Chopping it up

How users choose to process data is going to depend on their internal constraints. One of the key issues that’s difficult for users to think about it how to handle the different timescales for inputs. The time between updates, relative to the processing rate and capacity, is key to scaling the processing. Researchers might have the luxury of time freedom, but in an industrialized setting, the processing software, however it may be written, needs to packetize data into batches, and process them with a certain Service Level Objective (SLO). Whether we deal with file objects or a stream abstraction shouldn’t matter, or require different tooling or infrastructure. What we need is an abstraction that works for all these different cases — instead of building a different technology for each case.

Think of the following example: new data arrive from a website analytical engine every few microseconds, but we decide to avoid user disruption and undue cost by processing updates only every hour. Processing an update involves a chain of processes to train a machine learning network. The processing time take 3 hours to complete. Software updates that improve the machine learning network’s algorithms change every few weeks, but may involve updating all the affected artifacts to release a new version of the online service. What do all these timescales mean for the data pipeline in continuous operation?

Sketching the DAG

A data processing DAG, like any kind of flow circuit, takes time to sketch out, develop, and debug. Some parts of the circuit will be of our own making, others will be exterior off-the-shelf services and sources that we want to integrate. A mixture of strategy and technical architecture is involved, as in any programming task. As this becomes a more common business task, users will want to do this flexibly — on their phones, pads, or laptops, lying under a coconut tree at the beach, instead of being tied to a data center terminal. They will want access to the real data, wherever they may be, rather than spending time to fake a small amount of data for testing. They won’t want to have to set up a new platform to do that. They won’t even want to know if there is a platform.

Today, most pipeline technologies are written for developers or engineers. The assumption is that users will not only be adept at cloud development (and in a variety of languages), but that they will enjoy learning new platform details and becoming part of a new tribe dancing around a git repository. Software is mainly being written for that persona, instead of for business users or data scientists. Users are thrown into the deep end, having to not only think about the reasoning behind their business process, but also confronted by a dearth of technological decisions: which database, which programming language, where to hire expert help. Users are expected to keep track of reams of configuration settings, build unfamiliar intermediary objects like containers, submit them for deployment, map port number and IP addresses, set up proxies. The Internet has become a `kit car’ on steroids rather than a self-driving delivery truck — not really suitable for global logistics without a team of traveling mechanics.

Making the drop off

Underlying the difficulties of data processing is a growing burden of technological overhead. Modern cloud computing has too many moving parts. We think of Mark Weiser’s observation about mature technologies being those that disappear into the walls, and ask: how can we make the technology disappear for ordinary users, without compromising their ability to customize and optimize their processing needs?

The essence of automation is to let the machinery do the heavy lifting, and let the humans do the thinking. Today, the heavy lifting is not as heavy, but there are many more layers of it! This is not a problem that is particular to data pipelines, but rather to situations where collaborative processes exchange data. In our minds, DAGs might look like flowcharts or microservices, but in practice they are far more intricate.

Let’s think about the ideal user experience for the average user. Data scientists like to think about the flow of reasoning using a mixture of ad hoc manual experiments, combined in a series of simple interactive steps, and eventually scripted to document the steps. They might start with single step transformations:

Run program_A < input > output
Run program_B | filter

Then they might progress to a standard set of steps that run as a service, allowing them to drop an update into a queue, where it will be picked up and processed automatically.

Drop <fileref> at <input ref>
Drop <directory> at <input ref> order bydate/random

A developer might do the same, but using a version control system to `commit` a change, instead of dropping a file. Both personas might want to have the freedom and safety to play around with data before pushing a button to commit to authorizing the next stage of a computation.

Databases are also files — structured ones. From a database perspective, a file drop is an “INSERT INTO” operation, while a sampling of data is a “SELECT FROM” operation. A database can be a partially ordered queue in between these two.

Smart links

In principle, the experience sounds easy, but, of course, things that add layers of headaches are:

The data formatting or type schema interpretation between stages
The proxy for intermediate buffering between stages

The ingenious idea of pipes in Unix made simple joins easy for text streams, but in a modern environment, we need pipes to have matching multivariate data semantics, different for each link. We also need to handle privacy, access controls, and buffer differences in timescales on the input and output, i.e. to handle batching of data, we need smart storage to be embedded within the links. That might involve a message bus, a file, or a database from one of the multitude of possible flavors and variants available (TensorFlow, Kafka, Spark, Heron, ONNX, PyTorch, etc.), with implicit semantics. Users want to focus on the data semantics, not on setting up the technology, installing endless database connector APIs, and then go through dependency hell to make the versions compatible.

In Koalja, we provide the support for smart links between smart tasks, to handle a number of policy decisions and abstract away contextually irrelevant details and highlight priorities. Suppose you could just say:

Konnect stage1 stage2a with link2a
Konnect stage1 stage2b with link2b

(see figure below)

By wrapping user transformations (tasks) in smart connectors, we can follow a long tradition of modular pipeline technology used in recording studios and performance software, from patchboards to mixing desks — like an adaptive breadboard for data processing. Imagine that each “smart link” has an adaptive “volume control” setting the sampling rate for data to a maximum level to limit the computational resources invoked by the pipeline. A coordinator can now elastically scale from a wireframe test configuration to full production density by adjusting these settings, as well as change other policy settings. Additional sliders may be used to control other aspects of smart behavior.

We can try to preserve the simplicity of something like a Node Red interface — but as users advance in their sophistication, they will want more control, perhaps through a language interface. The Internet of Things makes a fitting and highly relevant example of some other aspects of data processing, too. Smart houses are easy, but smart cities involve the collaboration of multiple jurisdictional workspaces. Just how smart do we want a process to be on our behalf? What aspects of setting up and running can be absorbed into smartness? These are key questions that we’ll delve into in the coming weeks.

Even before Kubernetes existed, we speculated on the idea of Smart Workspaces for spanning distributed environments for the Internet of Things, with role based access layers and transparent traceability — from computing at the edge of the network, in vehicles and homes. Kubernetes is a platform that could make that vision a reality. If we solve this basic problem, the applications are truly limitless.

Computing from edge to center

When dealing with large processing volumes, the principle is to move computation to where the data are, not the other way around. The less we have to transmit, the better. Computers are everywhere, so it makes sense for the first stages of pipelines to be in situ, in the field, not in a backend data center / “on-prem” cloud (i.e. EC2). There is then scope for filtering and security-privacy policy.

Koalja is a vision for a smart data processing platform — a practical way of breadboarding and wireframing data circuitry for engineers and data scientists, involved in a distributed collaboration, where all the technology is concealed in the walls.

In the next chapters, we’ll unravel what it means for links and tasks to be smart.