#1: Introduction to Aljabr: Pipelines and Workflows
Workflows are often automated process sequences that transform an input into an output. Historically, workflows have accounted for the bulk of industrial processes and business pipelines around the world. Our goal at Aljabr is to apply them so that it is easy for you to build software, to roll out services, to analyze data streams, or to train machine learning algorithms. The sky’s the limit.
Data processing used to be more straightforward (if not always easy) before cloud computing. You put your numbers in, you got your numbers out. With the cloud, big or small data, and a bunch of new tools, the complexity is piling up! First, there is all the plumbing of the cloud to contend with, and then there are all the new ways of doing data processing. That’s a lot of moving parts!
Our goal, in this blog, is to investigate and explore the development and concepts of workflow and data processing from top to bottom — and to discuss how we can scale and dramatically simplify these patterns for everyone.
Pipes and Workflows
There are two main kinds of IT workflows:
- Build and deployment pipelines
- Data processing applications
They are both different and similar.
Deployment pipelines are usually the domain of DevOps practitioners, IT administrators, and systems engineers. Data processing is usually the domain of programmers and business analysts. In both cases, you start with an initial state (the state of a system, or a bunch of data) and you want to change it into something more valuable.
In a deployment, the workflow tends to look like a runbook, i.e., a series of commands to be executed as a script, in a certain order.
In data processing, the workflow looks like a computer program, taking variables and computing a result:
result1 = function(input)
result2 = function(result1)
In both cases, there may be several stages. In both cases, there is an input and an output at each stage. In a command script, the inputs and outputs are often implicit, or specified as parameters to the commands. In a computer program, the inputs and outputs are explicit variables.
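To make the contrast concrete, here is a minimal sketch of the same two-stage idea written in both styles. Everything in it is hypothetical and chosen only for illustration: the `system` dictionary stands in for a real machine, and the step and stage names (`install_package`, `start_service`, `parse`, `total`) are not part of Aljabr or any real tool.

```python
# Runbook style: each step acts on a shared system state, so its inputs
# and outputs are implicit (whatever the step happens to read or change).
# The `system` dict is a hypothetical stand-in for a real machine.
system = {"packages": [], "service_running": False}

def install_package():
    system["packages"].append("webserver")

def start_service():
    # Implicitly depends on the previous step having already run.
    system["service_running"] = "webserver" in system["packages"]

runbook = [install_package, start_service]
for step in runbook:
    step()  # executed strictly in the listed order

# Program style: each stage takes an explicit input and returns an
# explicit output, which becomes the input of the next stage.
def parse(raw):
    return [int(x) for x in raw.split(",")]

def total(numbers):
    return sum(numbers)

result1 = parse("1,2,3")
result2 = total(result1)
```

In the runbook version, swapping the order of the two steps silently changes the outcome, because the dependency between them is hidden in the shared state. In the program version, the dependency is visible in the code: `total` cannot run until `result1` exists.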
These two types of workflows differ subtly in viewpoint. A command-based pipeline frames the problem as a sequence of operations lined up to be performed on a stationary patient (the system, or some data), while a data-processing pipeline frames the data as goods on a kind of conveyor belt, transformed stage by stage as they pass along a production line.
What unifies these two kinds of workflows is a simple flow of transformations from start to finish. Internally, there might be implicit recursion, function calls, or dependencies on external services, but there is also a simple focus on outcomes, or business logic, that tells a story from start to finish. This is what makes pipelines so important.
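One way to picture that start-to-finish flow is as a list of stages, each feeding its output to the next. The sketch below is only an illustration under our own assumptions: `run_pipeline` and the three stages are hypothetical names, not an Aljabr API.

```python
from functools import reduce

def run_pipeline(stages, data):
    """Pass data through each stage in order; each output feeds the next stage."""
    return reduce(lambda value, stage: stage(value), stages, data)

# Hypothetical stages, chosen only to illustrate the flow.
def clean(rows):
    # Trim whitespace and drop empty rows.
    return [r.strip() for r in rows if r.strip()]

def to_numbers(rows):
    return [float(r) for r in rows]

def average(values):
    return sum(values) / len(values)

pipeline = [clean, to_numbers, average]
result = run_pipeline(pipeline, [" 1.0", "", "3.0 "])
```

The list `pipeline` reads like the story of the computation, while the details of each stage stay tucked away inside its function.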
Some examples of processes that could be handled as data processing pipelines:
- Machine learning: training, testing and deployment
- Continuous delivery build pipelines, including testing, integration and more
- Integrated form submission, combining multiple systems
The list of cases is almost limitless, and while not all examples have to be handled as pipelines, the principles cover a wide range of applications.
In the coming posts, we will cover some traditional approaches to running pipelines, and explore how these might be improved upon and perhaps reimagined for the cloud-native era!