#10: Swallow the spider to catch the fly

When infrastructure gets in the way

Published in Aljabr · 6 min read · Oct 30, 2018

The first commandment of data is: Let not infrastructure get out of hand and dominate your data process! You don’t want to swallow an entire horse just to catch a small fly by starting an entire multi-server cloud just to process a few numbers. At the same time, in any data project, you likely need to think about future scaling, so that you can progress to asking and answering questions at scale. These are the promises of serverless computing, but in a more general sense than is sometimes implied. However processes are defined, there has to be flexibility and scalable modularity. In this post, we look at how Koalja uses Kubernetes and KoCircuit to keep a lid on complexity.

“…We need a standard model for shrink-wrapping data processing with pipelines…and the solution should not be tied to a particular language.”

Sometimes you want to work on your laptop, but for the big stuff you need more power. Using cloud is non-trivial, and a headache to even think about. What to do? Although we live in an age where developers have begun to develop primarily for other developers, we can’t forget about the ordinary experience of mere mortal users. What we need is a standard model for shrink-wrapping data processing with pipelines that can address these issues. And the solution should not be tied to a particular language.

There are excellent frameworks, e.g. Akka, TensorFlow, and so on, that are designed for particular languages and for particular tasks, but we want to think beyond these horizons…

Kubernetes is perhaps the first cloud platform to be thought through as a distributed system, rather than as a collection of useful components. Using containers, it retains the pluggability of a component system, but adds features like namespaces and coordination layers. It can be the basis of a “server free” abstraction, or a point-to-point chain of services. As with many systems that came before it, its idea of replacing free imperative control with declarative composition falls short of reality at some point. Complexity turns declarations into a structureless nightmare: death by a thousand YAML files. Simplicity multiplied by complexity equals complexity, unless you can renormalize it away with new abstractions.

Canute on the beach, when the walls fell

A lot of coordination is needed to run things on a distributed system, even for simple cases. Luckily, Kubernetes is now available on laptops (minikube/kubeadm) and in the cloud as a service, and it comes with many of the tools needed for this coordination.

Suppose you want to command waves of data flooding in, like King Canute ordering back the tides, then capture them in “a pipe” and irrigate your services with CSV goodness. Say you want to feed the data first into one filter that cleans up the noisy CSV, and then pipe the result into another stage that adds up the totals, like the sequence in the figure below.
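Stripped of all infrastructure, the two stages described above amount to very little logic. The following is a minimal sketch in Python, assuming a made-up two-column CSV layout (a name and an amount); the function names and the sample data are illustrative assumptions, not part of Koalja.

```python
import csv
import io


def clean(rows):
    """Stage 1: keep only rows with exactly two fields and a numeric amount."""
    for row in rows:
        if len(row) != 2:
            continue  # blank or malformed line
        name, amount = row[0].strip(), row[1].strip()
        try:
            yield name, float(amount)
        except ValueError:
            continue  # amount was not a number


def total(records):
    """Stage 2: add up the amounts for each name."""
    totals = {}
    for name, amount in records:
        totals[name] = totals.get(name, 0.0) + amount
    return totals


# Hypothetical noisy input: one bad amount, one blank line, one extra column.
noisy_csv = "north,10\nsouth,abc\nnorth,2.5\n\nsouth,4,extra\nsouth,7\n"

reader = csv.reader(io.StringIO(noisy_csv))
result = total(clean(reader))
print(result)
```

The fly, in other words, is a couple of dozen lines; everything else in the discussion is about where and how those lines get run.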

The implementation of this pipeline in Kubernetes is through declarations in YAML. Even for this simplest of cases, the manual incantations needed to set up a Kubernetes cluster might look as complicated as this:
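To give a feel for the scale involved, here is a rough stand-in, not the listing from the original post. It builds the kind of objects a hand-written setup for two stages would typically declare (one Deployment and one Service per stage) as plain Python dictionaries and prints them as JSON, which Kubernetes accepts alongside YAML. The stage names, image names, and port are hypothetical.

```python
import json


def deployment(name, image, replicas=1):
    """Build a minimal apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def service(name, port):
    """Build a minimal v1 Service manifest that routes to the stage's pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


# Hypothetical stages and images for the clean-then-total pipeline.
stages = [
    ("csv-cleaner", "example.org/csv-cleaner:0.1"),
    ("csv-totals", "example.org/csv-totals:0.1"),
]

manifests = []
for name, image in stages:
    manifests.append(deployment(name, image))
    manifests.append(service(name, 8080))

rendered = json.dumps(manifests, indent=2)
print(f"{len(manifests)} objects, {len(rendered.splitlines())} lines of JSON")
```

Even this sketch leaves out storage, configuration, and the wiring of one stage's output to the next stage's input, all of which add further declarations.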

That’s quite a mouthful, and not one easily spoken by a king. It could be acceptable as an internal representation, intended for developers, as an intermediate stage in a hidden execution, but it’s not a desirable working format for data scientists. No one even wants to type this in. It’s error-prone, and it’s hard to see the wood for the trees. Let’s face it: even developers would balk at something much more complex than this, and it gets rapidly more complex as we add parts.

Choosing the right concepts

Kubernetes was designed as a simple resource layer, not as a programming coordination layer. It was designed initially to support a model of stateless services, running on top of shared storage, for the web. That’s a very popular model for web processing, but not all cloud workloads are well served by this model. Kubernetes itself doesn’t offer much in the way of abstractions on top of this. As with nearly all declarative systems, it leaves too much of the story to additional scripting, leading to a mixed model.

To add more types of workload, in Kubernetes, one has to create a Custom Resource Definition (CRD), which then integrates with the general API server mechanisms and APIs. Many technology projects now offer CRDs for Kubernetes to make the deployment of their products and services easier.

Aljabr’s goal is to elevate the user experience above these details. Our assembler language KoCircuit strips away control flow aspects of pipelines and interfaces to Kubernetes, reducing complexity at a low level. Our upcoming pipeline service Koalja extends the out-of-the-box algorithms provided by Kubernetes, in a smart way, to keep the tangible benefits of the stateless model (such as auto-elastic replication, boundary management, and self-healing) without limiting users to a stateless processing model. Koalja implements its features as CRDs and operators within the Kubernetes ecosystem, but does this transparently.

A CRD (Custom Resource Definition) is an interface that gives Kubernetes developers a way to define new resource types for custom controllers to manage. Not every operation falls into the basic categories of deploy, test, scale, etc. By creating new CRDs, developers can create complex imperative processes, while presenting the user with only simple declarative parameters.

An Operator is an application-specific custom controller. It allows basic Kubernetes functionality to be tailored and extended with domain knowledge about a specific application.

Kubernetes has built-in controllers of this kind for ReplicaSets, DaemonSets, and more.
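The split between simple declarative parameters and a hidden imperative process can be sketched as a reconcile loop, the pattern that controllers and operators generally follow. What follows is a toy in-memory model in Python, not the Kubernetes API: `PipelineSpec`, `reconcile`, and `apply_actions` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PipelineSpec:
    """The declarative parameters a user would write (hypothetical resource)."""
    name: str
    workers: int


def reconcile(spec, running):
    """Compare desired and observed state; return the imperative steps needed."""
    actions = []
    # Too few workers: start the missing ones.
    for i in range(len(running), spec.workers):
        actions.append(("start", f"{spec.name}-{i}"))
    # Too many workers: stop the surplus.
    for worker in running[spec.workers:]:
        actions.append(("stop", worker))
    return actions


def apply_actions(actions, running):
    """Carry out the steps against the toy in-memory 'cluster'."""
    running = list(running)
    for verb, worker in actions:
        if verb == "start":
            running.append(worker)
        else:
            running.remove(worker)
    return running


spec = PipelineSpec(name="totals", workers=3)
running = ["totals-0"]

steps = reconcile(spec, running)
running = apply_actions(steps, running)
print(steps)
print(running)

# Once observed state matches the declaration, there is nothing left to do.
print(reconcile(spec, running))
```

The user only ever touches `PipelineSpec`; the sequence of starts and stops is worked out by the controller each time the observed state drifts from the declaration.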

Smarter linkage

A simple way to improve on the abstractions of a resource layer is to address processes as simple graphs. In our current proto-cloud age, the tendency is for developers to throw piles of half-finished complexity at users on a “take it or leave it” basis. Their abstractions are often designed for their own specific needs and internal processes. Maturity in cloud systems can only come by abstracting away these quirks, and eliminating the details of implementations as far as possible. At Aljabr, we believe that Smart DAGs (formed from Smart Tasks and Smart Links) are the answer to this commoditization.
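As a minimal sketch of what addressing a process as a simple graph can mean, the Python below describes a pipeline as tasks (nodes) and links (edges) and lets the standard library's `graphlib` work out the execution order. The task names and the `run` helper are hypothetical and say nothing about how Koalja's Smart Tasks and Smart Links are actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def ingest(inputs):
    """Source task: produce raw, possibly noisy values."""
    return ["3", "x", "4"]


def clean(inputs):
    """Keep only values that parse as non-negative integers."""
    return [int(v) for v in inputs["ingest"] if v.isdigit()]


def total(inputs):
    """Add up the cleaned values."""
    return sum(inputs["clean"])


# Tasks are the nodes of the graph; links map each task to its upstream tasks.
tasks = {"ingest": ingest, "clean": clean, "total": total}
links = {"clean": {"ingest"}, "total": {"clean"}}


def run(tasks, links):
    """Run every task after its upstream tasks, passing their results along."""
    results = {}
    order = list(TopologicalSorter(links).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in links.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


order, results = run(tasks, links)
print(order)
print(results["total"])
```

Here the user states only what depends on what; ordering, and in a real platform placement and scaling, is left to whatever executes the graph.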

Passing data around sounds like a straightforward enough task, but developers are not usually thinking about the underlying resource bottlenecks. That ought to be taken care of by the platform, but current platforms can’t do this because they don’t have sufficient insight into their payloads.

Smart links speak many tongues: intermediate data may be stored in a variety of structured or unstructured formats, concealing a variety of technologies for “publishing” and “subscribing” to stashes and flows, each with policy-based access rights:

File stream semantics — READ, WRITE, APPEND
Query semantics — INSERT INTO, SELECT FROM
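One way to picture how a link could hide these two dialects is a pair of backends behind a single publish/subscribe interface. The sketch below is an assumption-laden illustration in Python using only the standard library: `FileLink`, `TableLink`, `relay`, and the `records` table are hypothetical names, and access policy is left out entirely.

```python
import os
import sqlite3
import tempfile


class FileLink:
    """Link backed by a file: publish is an APPEND, subscribe is a READ."""

    def __init__(self, path):
        self.path = path

    def publish(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(record + "\n")

    def subscribe(self):
        with open(self.path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]


class TableLink:
    """Link backed by a SQL table: publish is INSERT INTO, subscribe is SELECT FROM."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, record TEXT)"
        )

    def publish(self, record):
        self.conn.execute("INSERT INTO records (record) VALUES (?)", (record,))

    def subscribe(self):
        rows = self.conn.execute("SELECT record FROM records ORDER BY id")
        return [row[0] for row in rows]


def relay(link, records):
    """Task-side code: identical whichever backend sits behind the link."""
    for record in records:
        link.publish(record)
    return link.subscribe()


data = ["a,1", "b,2"]

with tempfile.TemporaryDirectory() as tmp:
    file_out = relay(FileLink(os.path.join(tmp, "stash.txt")), data)

table_out = relay(TableLink(sqlite3.connect(":memory:")), data)

print(file_out == table_out)
```

The task code in `relay` never learns which storage technology it is talking to, which is the property the list above is pointing at.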

Smart linkage can help to smooth over these differences for users, and handle the potential resource bottlenecks implicit in each. We’ll return to this later in the series.

Data-intensive processing is, by implication, a network-intensive process, even more than it is a CPU-intensive process. This is where a smart platform can help. Forget about the underlying structure of microservices, pods, deployments, and service mesh. Those are issues for a platform to virtualize away, not expose to the user.

Smart links can handle sampling and scaling issues, as well as security and integrity boundaries, and the coordination between different stages. In Koalja, using a smart link abstraction allows us to abstract away a lot of issues, including: instrumentation and tracing (process observability), auto-build, artifact caching, forensic reanalysis, and data consensus coordination between parallel tasks. We are going to be unpacking these ideas in this blog over the coming weeks.

Platforms all the way down

Building abstractions on top of Kubernetes is a path to multi-cloud processing that offers a number of advantages over today’s Do-It-Yourself coding. If we can reduce interactions between services and tasks to the level of drawing a labelled graph, then managing real world processes will finally come of age. Users should not live in fear that their code will not run in a few years’ time due to changes in infrastructure and its API layers. Smart tasks and links that are aware of the kind of Wide Area Distributed Environment they must inhabit in our cloud future can shift the paradigm towards a standard model for data processing that will outlast technology shifts, and deliver cloud as the utility it needs to become.