#6: Steering the cloud simply and predictably

Enter Kubernetes and the container cloud

Aljabr, Inc.
Sep 24, 2018


With the age of bare-metal computing waning (for most), process control is being reinvented behind scaled cloud abstractions. Users no longer need to think about the machinery of infrastructure, only about the workflows that run on top. This presents a big opportunity to bring simplicity to a non-technical audience, but we are still some way from getting there. Today’s systems are too complex and fragile to tolerate misuse. In this post, we move on from reviewing the past and dip a toe into the present to ask: what does process management look like now? Is it yet suitable for bringing workflow pipelining to the masses?

Kubernetes is probably the first of a generation of worthy abstractions for versioned work scheduling in the cloud. It turns clusters of containers back into something resembling a single machine (albeit a multithreaded one). Kubernetes has a vast number of distracting APIs that can be scripted to perform calculations, carry out administrative tasks, and form execution pipelines, but it is manifestly low level: it was expressly designed for and by cloud “DevOps” engineers, not developer end-users. From a process safety perspective, interestingly, Kubernetes has inherited some of the concepts of desired-end-state computation and self-healing, making it like a faster CFEngine optimized for container execution… So, where does that leave us?

For data processing, having a platform like Kubernetes is a great starting point. It can bring business stability to processes, but it offers none of the semantics or simplicity that naturally lend themselves to pipeline processing. The lessons of CFEngine and Make showed that describing workflows safely in a distributed system, without a bigger abstraction to draw on, can be extremely cumbersome and unwieldy. This has led most users to give up on that approach; many reverted to manual orchestration methods, using runbooks and remote shell invocations that let them think step by step and procedurally. But that throws out the baby with the bath water. Automation is an essential element of scalable predictability.

Humans vs. robots in production

The more mechanical and predictable a process needs to be, the more it benefits from automation. This is a key strategy in quality assurance. A desired end-state model adds a notion of safety and industrial assurance. Reintroducing humans as manual controllers in the coordination and execution of a scaled pipeline is not really a sustainable approach, but it is still a very popular one. When processes run quickly, human involvement becomes a source of random error, and perhaps even an accident waiting to happen. This is an argument for tools like Kubernetes, but some forget to also apply it to the industrialized applications that run on top. We need better abstractions to reach a comfortable level of quality assurance.

Cloud complexity is larger than human cognition, and cloud response times are far faster than human reactions. Mistakes, leading to virtual explosions of processes running out of control, have brought down entire datacentres, and more will doubtless be covered up in the future because of the poor use of tools. This is not to say that humans are redundant. We are the source of intent and judgement. Moreover, some processes, such as the training and tuning of data, change only slowly and benefit from the kind of perspective that AI techniques can only dream of.

Data processing technology has to concern itself with simple quality assurance issues, especially in the face of scale and parallelism.

“Kollapsing” a cluster

The first step along the safe path to scaling workflows is automation by scripting. Some may consider this to be the final step, but we think that would be a grave mistake. Scripts do help us to document and speed up the typing of directives, but at worst they scale catastrophe to a new level. At Aljabr we have invented a new kind of scripting language (called Ko, based on previous-generation work at DARPA and Google) and a number of tools, including Koalja (more soon), designed to represent workflows as a functional abstraction, exposing the Kubernetes API and allowing us to script Kubernetes with predictability and repeatability.

Our new functional scripting language Ko is a proof of concept, implemented in Go. It presents a functional model for describing concurrent workflows. Its vocabulary of function calls can be any code written in Go, meaning that it can expose any Go API for scripting. For example, the Kubernetes API is exposed, allowing Ko to act as a macro-level controller for Kubernetes. Ko can therefore be used to wrap an API of higher-level pipeline abstractions for Kubernetes workflows (whether task-oriented or data-aware/oriented) as computations described by DAGs.

Using Ko is like using the Kubernetes CLI kubectl on the command line, but with data capture and a DAG control flow superposed on top. Ko addresses some of the biggest pain points in running on top of a Kubernetes virtual machine: the matching of types in APIs, the instantiation of containers, type inference, concurrency, and deadlock-free synchronization. What Ko does not yet do is shield the user from unnecessary interior workings of Kubernetes, such as pods, namespaces, and labels. Yet these are the very concepts that make Kubernetes hard to operate in many situations.

Here is some simple Ko code to start a database, in this case, CockroachDB:

StartCockroachDB() {
    // create the tutorial namespace, then deploy the CockroachDB nodes into it
    ns: TutorialNamespace()
    deploy: startCockroachDBNodes(
        after: CreateNamespace(namespace: ns)
        namespace: ns
    )
    // wait for the pods, run the one-time cluster init, and wait for it to finish
    podsReady: waitForCockroachDBPodsReady(after_: deploy)
    initStarted: initCockroachDB(after_: podsReady)
    initDone: waitForCockroachDBInitDone(after_: initStarted)
    return: initDone
}

It is plain to see that this is not like starting a shell script in a CLI. Kubernetes adds layers of concepts that get in the way. Starting a program (above) is complicated enough, but clearing up after a Kubernetes deployment is far worse (more Ko code below):

StopCockroachDB() {
    namespace: TutorialNamespace()
    // delete stateful sets
    statefulSets: ListStatefulSetsByLabelSelector(
        namespace: namespace
        labelSelector: "app=cockroachdb"
    )
    deletedStatefulSets: DeleteStatefulSets(namespace: namespace, statefulSets: statefulSets)
    // delete pods
    pods: ListPodsByLabelSelector(
        after_: deletedStatefulSets
        namespace: namespace
        labelSelector: "app=cockroachdb"
    )
    deletedPods: DeletePods(namespace: namespace, pods: pods)
    // delete services
    services: ListServicesByLabelSelector(
        after_: deletedPods
        namespace: namespace
        labelSelector: "app=cockroachdb"
    )
    deletedServices: DeleteServices(namespace: namespace, services: services)
    // delete jobs
    jobs: ListJobsByLabelSelector(
        after_: deletedServices
        namespace: namespace
        labelSelector: "app=cockroachdb"
    )
    deletedJobs: DeleteJobs(namespace: namespace, jobs: jobs)
    // delete budgets
    podDisruptionBudgets: ListPodDisruptionBudgetsByLabelSelector(
        after_: deletedJobs
        namespace: namespace
        labelSelector: "app=cockroachdb"
    )
    deletedPodDisruptionBudgets: DeletePodDisruptionBudgets(
        namespace: namespace
        podDisruptionBudgets: podDisruptionBudgets
    )
    // done
    done: (
        after_: deletedPodDisruptionBudgets
        msg: (cockroachdb_deleted: true)
    )
    return: Show(done.msg)
}

The complexity involved in cleaning up is legendary! This is not Ko’s fault; it is inherent in Kubernetes clustering at the toolkit level. Clearly, there is still a long way to go before Infrastructure-as-a-Service is simple and easy. It is desirable to reduce this complexity by studying it and factoring it out. One day, this will be possible with new distributed programming languages running atop distributed cloud operating systems, like those imagined in the 1980s and 90s. We are certainly moving in this direction with exciting new projects like Atomist’s SDM, Metaparticle, Ballerina, Pulumi, and others.

This is already a step forward, but it sits at the level of assembler code: something a simpler view could be compiled into. What would that simpler view be? Now the discussion becomes steeped in culture and personality. Most would agree that it should be some kind of descriptive specification, but what? A pipeline description should not contain all the details; it should speak in patterns. Container clusters offer a way to manage modularity that evolves quickly. A first step is to factor out obvious redundancy and to wrap interior detail, as sketched below. There is a zoo of “proof of concept” technologies (some aforementioned) exploring this space, but none of them seems to have hit a “goldilocks” sweet spot.
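
To make the factoring idea concrete, here is a hypothetical Ko-style helper, written in the same style as the examples above; deletePodsByLabel is an invented name, not an existing function, and analogous wrappers could cover stateful sets, services, jobs, and disruption budgets:

// Hypothetical sketch only: deletePodsByLabel bundles the recurring
// "list by label, then delete" pattern from StopCockroachDB into one step.
deletePodsByLabel(after_, namespace, labelSelector) {
    pods: ListPodsByLabelSelector(
        after_: after_
        namespace: namespace
        labelSelector: labelSelector
    )
    return: DeletePods(namespace: namespace, pods: pods)
}

With a family of such wrappers, the clean-up function above would collapse to a handful of one-line steps, and the repeated namespace and label selector arguments would be written once.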

“Komposing” a storyline

Scripting, even more than programming, is about telling simple storylines. How the ordering of operations is evaluated is a sore point for many users. Whether the order of declaration matches the order of execution matters a lot to intuitive understanding. Like CFEngine, Ko’s evaluation order is linear to a first approximation, but it can be overridden by dependencies and recursion, and programs can be modularized into functional units. These are the basics of modular thinking, without becoming too opinionated.
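
As a minimal hypothetical sketch in the same style as the examples above (fetchRecords and processRecords are invented step names): it is the after_ dependency, not the textual position of each line, that guarantees the processing step waits for the fetch step to complete.

// Hypothetical sketch: the dependency, not the line order, enforces sequencing.
FetchThenProcess() {
    fetched: fetchRecords()
    processed: processRecords(after_: fetched)
    return: processed
}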

A pipeline is a collection of cooperating component stages. If the components plug into an abstraction that feeds them defaults, like a namespace, much of the explicit plumbing seen above can be factored away. Name-based networking offers some insights into how things can be simplified. A simple model of this was proposed in 2015 as “workspaces” and was approximately mapped onto Kubernetes concepts.
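
A hypothetical sketch of that idea, again in the Ko style shown above (the stage functions are invented for illustration): the namespace is resolved once, workspace-style, and fed to every stage as a shared default instead of being re-declared inside each one.

// Hypothetical sketch: runIngestStage, runTransformStage, and runPublishStage
// are invented stages; the shared namespace plays the role of a workspace default.
RunTutorialPipeline() {
    ws: TutorialNamespace()
    ingested: runIngestStage(namespace: ws)
    transformed: runTransformStage(after_: ingested, namespace: ws)
    published: runPublishStage(after_: transformed, namespace: ws)
    return: published
}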

Users don’t tend to like having to do setup work without seeing results. A compromise might be to offer a command line API with safe defaults; changing those defaults would require some setup work rather than command line switches. Another possibility is simply to expose APIs on the command line, but this feels fundamentally unsafe, particularly when such large resources can be called into play in the cloud. Safety guardrails were common in CFEngine, and users hated them even when they prevented major mistakes, because they always felt too conservative. Choosing an appropriate level of safety rules is always difficult.

A safe approach would keep all the checks and guard-rails defined in policy, but allow users to select bundles of commands or “promises” to be kept in a sequence, using a simple method name, and switches that could enable or disable predefined features. In this way, simple names are symbolic of very precise and regulated trusted behaviors.
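
One hypothetical way such a bundle might look in Ko-style code (Method1, verifyPreconditions, and applyPipelineStage are invented names): the user-facing method name stands for a fixed, pre-checked sequence, and a switch toggles an optional feature without bypassing the checks themselves.

// Hypothetical sketch: a simple method name wraps a guarded sequence of promises;
// the verbose switch only changes reporting, never the safety checks.
Method1(verbose) {
    checked: verifyPreconditions()
    applied: applyPipelineStage(after_: checked, verbose: verbose)
    return: applied
}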

So users could simply issue commands…

run-pipeline --name invocation1 --list namespace::pipeline
run-pipeline -n invocation2 -l method1,method2

…while being sure that the platform has their back. This was the approach used by CFEngine’s cf-runagent, with moderate popularity. In this way, no commands get past the basic safety checks, but one can manually intervene within a batch schedule. This was the idea behind workspaces, with a “convention over configuration” philosophy to reduce manifest complexity.

The techniques for safe distributed execution are well known, and well tested by distributed configuration tools. Modernizing these approaches for the faster, more responsive world of containerized workloads, while avoiding runaway complexity, is the challenge for the cloud era. There is, however, more to say about how the components in a pipeline fit together, and we will return to that in the next post…
