#8: Clutching at pipelines

The world of the Koalja

Published in Aljabr · 5 min read · Oct 10, 2018

Over the past weeks, we’ve looked at various models for data pipelines, or workflow in the form of data processing: from simple shell scripts, where workflow is commanded by edict, to makefiles and distributed convergent configurators, where workflow is steered and optimized by the guiding star of a desired end state. What we end up with is a picture of result-oriented computing at scale. This is not a new vision, per se, but everything is new again in the multi-cloud-native era. So, we’re returning to the fundamentals: how can we make pipeline processing both simple and easy in the age of Cloud? Welcome to Aljabr’s Koalja project.

Our goal of making pipelines simple points to both technical principles and a strong attention to the user experience. Let’s try to peel back some of these issues.

Utility computing

Utility computing, in all its forms, is destined to subsume all our backend systems over the next decade or two. Even computing at the edge, in our homes and environments, will likely be run by cloud providers, as they merge with telecommunications giants to serve us information infrastructure as a utility. Looking ahead along that trajectory, and to make better use of the commodity service, we need a model that can scale not only computationally and geographically but also transparently. This is not without its challenges! But let’s start with some pragmatic considerations.

The diagram above shows a tiny handful of tools for data processing, with circles for specialists and boxes for integrated overlays. Existing task managers occupy different abstraction realms, and we are starting to see overlays like Apache Beam that unify the patchwork of specialities using meta managers. Koalja belongs to this meta level too, as it spans the full range of pipeline types with a data-oriented model.

What’s missing from this pantheon of tools is a simple pluggable framework that could be used to build a library of solutions for data processing, based on use case rather than on technology. TensorFlow and recent machine learning frameworks are examples of use-case-tailored approaches for specific cultural groups. On the other hand, we don’t want to try to replace any existing technology (even if that could bring efficiency): today, technology adoption is driven to a large extent by allegiances to signature open source projects, so we should not undermine the positions of these tools, but rather find a way to subsume and abstract them appropriately!

Data and culture driven

Pipeline consumers are data scientists, developers, and business processes following recipes. None of these have a primary interest in the technical aspects of data flow, and even less interest in the internals of the cloud (i.e. compute infrastructure primitives). A practical advantage of a meta-oriented data-driven system is that, once a tool understands the data flow, the user doesn’t need to codify resource requirements directly (as you would with task management tools) because they can largely be inferred from the data relationships themselves.
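To make the idea of inference from data relationships a little more concrete, here is a minimal sketch in Python. It is not Koalja code: the `Step` class, the `infer_order` function, and the step and dataset names are all hypothetical, chosen only for illustration. The user declares what data each step consumes and produces, and the execution order falls out of those declarations rather than being commanded explicitly.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    """A hypothetical pipeline step, described only by the data it touches."""
    name: str
    inputs: tuple = ()
    outputs: tuple = ()


def infer_order(steps):
    """Return step names ordered so that every input is produced before use."""
    # Map each named piece of data to the step that produces it.
    producer = {out: s.name for s in steps for out in s.outputs}
    # A step depends on whichever steps produce its inputs.
    # Inputs with no known producer are treated as external sources.
    graph = {
        s.name: {producer[i] for i in s.inputs if i in producer}
        for s in steps
    }
    return list(TopologicalSorter(graph).static_order())


# Steps are declared in no particular order; only the data links matter.
steps = [
    Step("train", inputs=("features",), outputs=("model",)),
    Step("ingest", outputs=("raw",)),
    Step("clean", inputs=("raw",), outputs=("features",)),
]
order = infer_order(steps)
print(order)
```

The same dependency graph could, in principle, also inform placement and scaling decisions, though that goes well beyond what this sketch shows.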

There are significant cultural differences between these groups too, especially in how users like to work on coding and data collection. A meta-pipeline can’t be too opinionated and force users to change their habits. Some users will appreciate a “serverless” / event-driven functional approach to coding, with minimal technical infrastructure awareness required, while others will prefer to package containers and take control of low-level details.

The container virtualization model for packaging and execution of software, up to and including unikernels, will almost certainly survive generational shifts for managing deployment units, so we can safely assume that a form of containerization will be a desirable feature — without mentioning any specifics. Kubernetes is now dominating the runtime management of container workloads, in a fairly agnostic way, so a good starting point for cloud data processing is to take a Kubernetes platform, with the goal of making it vanish into the walls, like a transparent underlay. Only specialists will actually have an interest in the underlay, but currently everyone is confronted with the detailed setup of all the layers, so there is a huge opportunity to cater to non-specialists.

Bread and circuitry

Data processing involves a lot of trial and error, during which it is quite helpful to maintain agility and flexibility. If we’re careful about the design, we can try to preserve the sense of prototyping through to production, using abstractions around the management of runtime environments. We can explain this using an example from electronics — which is quite similar in spirit.

Think of the following analogy: Back in the 1980s, when electronics was a hobbyist pursuit, breadboards were widely used for pluggable circuit design. No need for a soldering iron, or for designing for permanence at the tinkering stage! Breadboards allowed users to plug in and rip apart. They represented the earliest kind of circuit board virtualization (see the top right figure above). It was a great idea: reversible, low risk, if not too elegant. Breadboards are still used by hobbyists to design and construct circuitry. Of course, probably no one would try to deploy such a mockup circuit “into production”. We would first turn it into a packaged, productized version, with a printed circuit board, or ultimately a VLSI chip. Now, we have languages like Verilog to assist in virtualization of chip design in code. This is all very expensive, but software is cheaper than committing to hardware.

Every early stage technology “breadboards” its way to maturity, exposing all the wiring to enthusiastic hobbyists, and celebrating its ingenuity, tweaking the system along the way. Eventually we want to make the outcome small and efficient, like VLSI. This extends all the way up to the giant chip layouts of cloud infrastructure (as in the bottom right of the figure above). Virtualization opens up the possibility of transforming mockups directly into production scale and quality, if only the underlying “cloudboard” is smart enough.

Our goal with Koalja is to create a smart platform that can adapt from prototyping to production without users having to know about scale or machinery. The Kubernetes ecosystem is gradually improving to make this a plausible goal. Intermediate languages like Ko and extensions like Knative can play a role in bridging the gap. Users will focus on their data and reasoning, and we leave the rest to Arthur C. Clarke’s third law.

Summary

Imagine a plugin model of data pipeline components, with container integration to wrap and modularize software components, and transparent integration with data sources. Some data circuit components look like transformers and diodes, while some look like service flux capacitors. Some integrated services will even have their own external service dependencies, provided by third parties. This adds to the complexity of scaling, but it should not add to the complexity for the user. It all gets quite complicated in the utility-driven future. Koalja wraps itself around these branching data paths and embraces their diversity with a friendly face.
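As a loose illustration of what a plugin model might feel like, here is a small Python sketch. It is an assumption-laden toy, not Koalja's design: the `REGISTRY`, the `component` decorator, the `run` function, and the component names are all hypothetical. Components register themselves by name, and a pipeline is just a list of names wired together, much like plugging parts into a breadboard.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical registry: component name -> function transforming a data stream.
REGISTRY: Dict[str, Callable[[Iterable], Iterable]] = {}


def component(name: str):
    """Register a function as a pluggable pipeline component."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register


@component("parse")
def parse(lines):
    # Turn text records into integers.
    return (int(x) for x in lines)


@component("keep_even")
def keep_even(nums):
    # Acts a bit like a diode: only some values pass through.
    return (n for n in nums if n % 2 == 0)


@component("square")
def square(nums):
    # Acts a bit like a transformer: every value is rescaled.
    return (n * n for n in nums)


def run(wiring: List[str], source: Iterable) -> list:
    """Feed the source through the named components, in the order given."""
    data = source
    for name in wiring:
        data = REGISTRY[name](data)
    return list(data)


result = run(["parse", "keep_even", "square"], ["1", "2", "3", "4"])
print(result)
```

Rewiring the pipeline means editing the list of names, not the components themselves, which is the reversible, low-risk property the breadboard analogy is after.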

In the next chapter, we’ll talk about how data move around a virtual circuit.