#7: Data in, story out

Making the data flow

Aljabr, Inc.
Oct 3, 2018

In post #6, we looked at what pipeline control might look like on top of a cloud-native operating environment like Kubernetes, at the level of process-oriented containers. But probably the most important feature of a pipeline is the processing of some inputs into predictable and inspectable outputs. In this post, we ask: how do we get data into and out of a pipeline? Whether we are talking about containerized code for execution, or I/O to be fed in from a source, there are two aspects to consider: 1) What is the user experience? and 2) What actually happens behind the scenes? Across the successive stages, we want what comes out of one stage to be fed into the next, but usually only if it passes certain quality criteria.

Whatever kind of pipeline you have, process artifacts are produced and then handed off, stage by stage. If the inputs are infinitesimal changes, like single events in a stream, the output may be as simple as a single database update. If there is a new version of software to be compiled, the result might be a new package, ready for consumption.

Finding the right moment

In practice, there is likely to be a batch of aggregated changes handled together, for economy of scale. A pipeline never outputs anything faster than data come in, and usually slower: even though our instinct is to want everything as fast as possible, some results can only be computed after a certain time has elapsed. Take the case of continuous delivery pipelines:

  • It’s not necessary to commit every separate change to a version control system.
  • It’s not necessary to rebuild every committed change.
  • It’s not necessary to deploy every rebuild.

In each case, we wait until enough related changes have accumulated, then repackage them as a new kind of “unit” before processing, like filling up boxes, containers, or even blocks in a blockchain. Whatever we choose as the right moment to move on to the next stage, it is a policy decision, as the sketch below suggests. Efficiency will always tend to favor batch processing. Even for raw data capture, where we imagine capturing every event from some source, there is often a need for some kind of filtering or selection, which adds an effective aggregation stage.
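To make the policy aspect concrete, here is a minimal sketch in Python (the names batches, max_size, and max_wait are ours, not any particular framework's) of an aggregation stage that releases a unit either when it is full or when enough time has passed:

import time

def batches(events, max_size=100, max_wait=5.0):
    """Group an event stream into batches, releasing a batch when it
    reaches max_size or when max_wait seconds have elapsed. Both
    thresholds are policy choices, not technical necessities."""
    batch, deadline = [], time.monotonic() + max_wait
    for event in events:
        batch.append(event)
        if len(batch) >= max_size or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait
    if batch:
        yield batch  # flush whatever remains when the stream ends

Tuning max_size and max_wait is precisely the kind of policy decision described above.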

This is exactly the story of packetization in computer networks, only applied to high-level data instead of low-level data. That should tell us that there is no fundamental difference between a stream and a packetized batch process; what counts is how you parse the data at each end.

Batch sizes are important in several ways. If the coarse grain size for data is too small, we waste resources and court instability; if it is too large, important changes may be missed, or arrive too late.

This applies especially in machine learning, where too much training on detail may result in a network that recognizes only what it was trained on (with little ability to generalize), while too little training may miss important cases.

Data science is not an exact science!
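Inexact or not, the batch-size knob itself is concrete enough. In a minimal sketch in plain NumPy (the function name minibatches is invented here), it shows up as an explicit parameter of even the simplest training loop:

import numpy as np

def minibatches(X, y, batch_size):
    """Yield successive random (X, y) slices of a training set.
    batch_size is the trade-off knob discussed above: too small wastes
    resources and courts instability, too large can delay or drown out
    important changes."""
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]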

Processing at the edge

The size of data batches is one issue. Another, even more important issue, is getting the data into a pipeline in the first place. Centralized data collection is okay for some applications, but the future of data pipelining is to process at the edge of the network, where candidate data can be selected immediately to avoid unnecessary and expensive transportation. This means processing at the very source of the data. It is particularly relevant for the Internet of Things and in smart environments, where data typically originate from sensors and local processes.

The amount of data collectable in these scenarios is far too large to be transported uncritically across the network to a cloud-based data center. With some edge processing, however, filtering and batching become plausible at the source itself. The more marshaling of data that happens before transport, the more efficient the processing can be. This places some requirements on the management of the infrastructure: managing resources outside a single data center is a much bigger headache, involving subtle timing issues and coordination complexities, which we'll return to later.
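As a hedged illustration of what that marshaling might look like at a sensor gateway (plain Python; edge_summary and its threshold are invented for this sketch): select candidate readings locally, then ship only a compact summary:

def edge_summary(readings, threshold=0.5):
    """Filter and aggregate raw sensor readings at the source, so that
    only a small summary crosses the network, not every event."""
    selected = [r for r in readings if r > threshold]  # local candidate selection
    if not selected:
        return None  # nothing worth transporting
    return {
        "count": len(selected),
        "mean": sum(selected) / len(selected),
        "max": max(selected),
    }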

The challenges of input scaling mean that each stage has data acceptance decisions to make:

  • Whether a receiver accepts data offered by a remote source (downstream).
  • Whether a source accepts a request for data from a receiver (upstream).
  • Whether the data from the source are appropriate or acceptable (semantic validation).

The decision to accept data into a pipeline may be about efficiency, but it is also about integrity, and even self-protection (e.g. all services need to protect against DDoS attacks and malicious intent). Security is a real concern, even for applications that are not traditionally considered at risk. There are various ways to handle this, using push- and pull-based processing; the optimal choices depend on the timescales involved in the data source and the pipeline's topology. These decisions are sometimes taken for granted, but that is something we cannot afford in a distributed system. Everything is more complicated at scale.
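Read as code, the three decisions become guard clauses at the mouth of each stage. A minimal sketch, assuming hypothetical collaborators (rate_limiter, validate, and the allow-list are all invented here):

TRUSTED_SOURCES = {"sensor-gateway-1", "build-bot"}  # hypothetical allow-list

def accept(source, record, rate_limiter, validate):
    """Guard clauses for one stage; each check mirrors one of the
    acceptance decisions listed above."""
    if not rate_limiter.allow(source):  # receiver-side throttling (self-protection)
        return False
    if source not in TRUSTED_SOURCES:   # integrity: who is offering the data?
        return False
    return validate(record)             # semantic validation of the payload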

Language barriers

Kubernetes can provide a few basic protections against malicious submission, between the interior and exterior networking of a container cluster, but if services are deployed across wide areas, or even across different clouds, then this has to be handled explicitly.

A more obvious criterion for accepting data is that we should understand what message the data events convey. Pipeline stages pass data in various formats. The data types sent and received have to be covered by a standard lingua franca in order to be understood (something like a protocol buffer or standard data format, possibly with a GraphQL-like selection of offered types). It is also assumed that each stage of the pipeline is granted access to the offered data. Remember that in a distributed cluster, there is no automatic right to read data produced by a separate process.
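In Python terms, a minimal sketch of such a shared format check, assuming an agreed record shape rather than a real protocol-buffer schema, might look like:

EXPECTED_FIELDS = {"id": int, "timestamp": float, "payload": str}  # the agreed lingua franca

def understood(record):
    """Reject any record that does not speak the shared format."""
    return all(
        key in record and isinstance(record[key], ftype)
        for key, ftype in EXPECTED_FIELDS.items()
    )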

To summarize, we can think of pipelines as a series of edges at which processing takes place to accept inputs and transform them. Where does the pipe truly start? Questions like these are going to mean a lot in the future…

A statement of work

Behind the scenes, protocols and software will take care of the implementation of message passing. Meanwhile, workflow architects need to express their intent to connect components into a pipeline, according to some kind of policy. There are different approaches to this. The cleanest is perhaps what we would do in the case of mathematical processing, where the affected quantities are simple arrays of numbers. Every programmer understands the notation:

output = stage_function(input),

…usable in succession for each of the stages. Whether this notation makes sense or not in other cases (or is even implementable), e.g. in continuous delivery builds, remains to be discussed.
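In the straightforward case, chaining that notation is just function composition. A toy sketch with placeholder stage names (ingest, transform, and publish are invented for illustration):

def ingest(raw):       # stage 1: parse raw input into numbers
    return [float(x) for x in raw.split(",")]

def transform(data):   # stage 2: some computation over the batch
    return [x * x for x in data]

def publish(data):     # stage 3: produce the final artifact
    return sum(data)

output = publish(transform(ingest("1,2,3")))  # each stage feeds the next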

Another approach is to use a simple data specification such as YAML or JSON. These are not really languages; they are data formats. Asking users to jump through hoops to make entry simple for developers is not a user-friendly act, but it has nevertheless become a popular norm.

Pipeline:
  Name:   my_pipe
  Input:  /path/file1
  Output: /path/file2
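A few lines of Python (using the common PyYAML package; the field names simply match the sketch above) show how thin such a wrapper really is:

import yaml  # PyYAML

spec = yaml.safe_load("""
Pipeline:
  Name:   my_pipe
  Input:  /path/file1
  Output: /path/file2
""")

pipe = spec["Pipeline"]
print(pipe["Name"], ":", pipe["Input"], "->", pipe["Output"])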

Let’s now look at some examples of this user experience from data processing applications.

Google’s TensorFlow

Google’s highly popular machine learning library, with its Python front end, is perhaps the easiest data processing sequence to understand from a programming perspective. This is partly because the data representations are clearly defined from the beginning, whereas in other cases they aren’t. A TensorFlow program has calls that look something like this (TensorFlow example):

import numpy as np
import tensorflow as tf

tf.enable_eager_execution()
tfe = tf.contrib.eager

x = tfe.Variable(np.random.randn())
# . . .
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
grad = tfe.implicit_gradients(mean_square_fn)
for step in range(num_steps):
    optimizer.apply_gradients(grad(linear_regression, train_X, train_Y))

In this approach, we pass data by variable reference. The references point to huge tensor arrays, stored in some undefined manner. But for all intents and purposes, this looks like a straightforward single-machine computation. We see policy choices (eager execution) alongside transformations and apparent data assignments. The code is fairly clean and minimal, with the details well hidden. This is only possible because the initial problem is quite well defined.

Apache Airflow

For comparison, Airflow is a tool for implementing task-driven pipelines. It uses an opinionated, Python-only front-end description to build a directed acyclic graph (DAG), which in turn represents the process. For this example, only some dummy shell commands are used:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# default_args is assumed to be defined earlier, as in the Airflow tutorial
dag = DAG('tutorial', default_args=default_args)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)

There is no explicit data transfer, only execution details. In fact, data handling is completely out of band, which makes it mysterious: there is only a partially ordered set of jobs to be executed. The code here is opaque, muddling shell commands together with pipeline wrapping and context variables. The precedence relationships of the DAG have to be added explicitly. This is not unlike the CFEngine example in post #4, which was not explicitly designed to make pipelines easy, so it became clumsy. Such clumsiness would be excusable in a tool not designed for pipelining, but Airflow is, so we can still do better.

Because Python ordering is already literal, we might wonder why the explicit declaration of dependencies is needed. The point is that the language is implicitly serial by design, so it would otherwise serialize everything unnecessarily; by declaring dependencies, the executor may be able to parallelize some parts of the pipeline. So we should view this requirement positively!

Ko

Running on top of Kubernetes is a goal for many. Kubernetes can simplify a number of issues, but without automation it's a headache with many moving parts. At Aljabr, we have developed Ko, our own scripting language, written in Go and tailored to the task of scripting Kubernetes. We can compare the previous examples to a Ko pipeline. Consider an example:

pipe(d1, d2) {
    g1: G1(d1: d1, d2: d2)
    g2: G2(g1)
    g3: G3(g2)
    return: Sum(left: g1, right: g3)
}

The example above defines a function-like object with two parameter inputs. The lines are executed in parallel unless subsequent lines depend on one another, in which case Ko will arrange the serialization. Type recognition is handled by inference, to keep the burden of formalism to a minimum. Recursion of new function objects is allowed, and the return value is manifestly declared. The result is simple and intuitive for a programmer, though it might not be for other data science personas in industry.

Ko is skeletal in its form, expressing only the bare bones relationships at face value. There are parameters for passing previous stage data into “functions” that process them and return results, and data types can be used to conceal implicit details of the data relationships. Here, none of the behavioral details relating to cloud instances or services are explicit. This means they must be set out of band as part of a policy, so the simplicity is partly illusory, though desirable. We will explore this in more detail in future posts.

Where to go from here?

In each of the example cases above, the clarity of process expression is obviously colored by the native syntax of the languages used to embed the descriptions. This has advantages and disadvantages. It’s not hard to see that there are plenty of concrete issues around the design of pipelines: how to connect them, how to define them, how to make them safe. But there are also cultural issues. In the next post, we’ll make a few proposals!
