#2: The point of departure is not to return
Getting started by scripting
There is a whole zoo of possible workflow technologies, each revealing different aspects of the problem of data processing. In the next few posts, we’ll unpack some of their salient points to see where the future is heading. We begin, of course, with the classic shell script, the traditional language of system processing. It has never become obsolete, but many attempts have been made to improve upon it, often fighting our innate need to do something quick and dirty.
Here is a simple programming workflow as a pipeline script:
# Ballistic build method
gcc -c hello.c
gcc -c world.c
gcc -o hello hello.o
gcc -o world world.o
It produces the following intended output, but not before a lot of working output to build the C code.
Shell scripts have added control flow, such as loops, decisions or subroutines, in order to handle cases that fall outside simple checklists, but they were not intended for substantial programming. Scripts are only meant for processing batches of commands, for making job control flow explicit, to avoid typing mistakes and repetition. They employ predetermined commands or programs, in which the change logic is hidden inside the commands.
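As a rough illustration of those control-flow additions, the same build could be written with a loop, a decision, and a subroutine. This is only a sketch: the `CC` variable and the helper names `build_one` and `build_all` are assumptions introduced here, not part of the original script.

```shell
#!/bin/sh
# Sketch: the ballistic build rewritten with a subroutine, a decision, and a loop.
# CC, build_one, and build_all are illustrative names, not from the original script.
CC="${CC:-gcc}"

# Subroutine: compile and link one program, given its base name.
build_one() {
    prog="$1"
    # Decision: skip programs whose source file is missing.
    if [ ! -f "$prog.c" ]; then
        echo "skip: $prog.c not found" >&2
        return 1
    fi
    "$CC" -c "$prog.c" && "$CC" -o "$prog" "$prog.o"
}

# Loop: apply the same steps to every program named on the checklist.
build_all() {
    failed=0
    for name in "$@"; do
        build_one "$name" || failed=1
    done
    return "$failed"
}

build_all hello world || echo "some builds did not complete" >&2
```

The change logic is still hidden inside `gcc`; the script only makes the order of the jobs, and what happens when a step cannot run, explicit.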
In a fuller programming model, we expose more of the details of logic and workflow. Less is presumed in “software” as we understand it today. For example, Python has become the de facto standard for data processing. Python programs can express computation with layers of detail, decisions, subroutines, loops, and more.
Both forms of workflow are easy enough to execute on a single computer, but making them work in the distributed cloud requires many new steps: user registration, credentials, access, decisions about when and where the executions will take place, and more. Remote shell access allows us to log onto cloud machines, but this is increasingly frowned upon for its lack of scalability and error-prone nature. More often, these days, we want to package workloads as containers or file bundles to be executed hands-free. We interact programmatically, by scripting through APIs rather than by keyboard, or we even automate the entire process using inboxes and outboxes for the inputs and outputs.
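The inbox and outbox idea can be sketched on a single machine with plain directories. This is a minimal sketch under stated assumptions: the directory names (`inbox`, `outbox`, `done`) and the `process_job` placeholder are hypothetical, and a real deployment would replace them with whatever storage and workload the platform provides.

```shell
#!/bin/sh
# Sketch of hands-free execution: jobs arrive as files in an inbox and
# results appear in an outbox. Directory names and process_job are
# hypothetical placeholders, not taken from any particular platform.
INBOX="${INBOX:-./inbox}"
OUTBOX="${OUTBOX:-./outbox}"
DONE="${DONE:-./done}"

# Placeholder for the real workload: here, upper-case the input text.
process_job() {
    tr '[:lower:]' '[:upper:]' < "$1"
}

# One pass over the inbox; a scheduler or outer loop would call this repeatedly.
drain_inbox() {
    mkdir -p "$INBOX" "$OUTBOX" "$DONE"
    for job in "$INBOX"/*; do
        [ -f "$job" ] || continue    # empty inbox: the glob matched nothing
        base=$(basename "$job")
        if process_job "$job" > "$OUTBOX/$base.tmp"; then
            mv "$OUTBOX/$base.tmp" "$OUTBOX/$base"    # publish only complete results
            mv "$job" "$DONE/$base"                   # mark the input as consumed
        else
            rm -f "$OUTBOX/$base.tmp"
            echo "failed: $base" >&2
        fi
    done
}

drain_inbox
```

Writing to a temporary name and renaming it afterwards means a consumer of the outbox never sees a half-written result, which matters once nobody is watching the process at a keyboard.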
Data might start out as files, databases, logs or journals, or even be fed directly into an open socket from sensors (e.g. in the case of IoT). Why should it matter? Whenever new technologies are advanced, we get obsessed with the branding of the innovation (S3, Docker, Kubernetes, etc.), but if we are going to make processes easy again, we need to handle these decisions more transparently. Those brands need to go away.
At one level, nothing has really changed in the way we think about control flow and job execution. But in reality, the shift towards infrastructure and software as services has led to a fat layer of dependencies inserting themselves between user and outcome. The time is therefore ripe to revisit the separation of concerns: how can we take away the manual assembly of parts and solve repeated job execution on top of the new platform of the distributed cloud?
The answers will depend on the kind of pipelines, as we mentioned in the previous post, but there are still many options to consider, and many approaches. So before we get there, it’s useful to look over some of the history of solving the problem.