Published in Aljabr

#1: Introduction to Aljabr: Pipelines and Workflows

Workflows are sequences of processes, often automated, that transform an input into an output. Historically, workflows account for the bulk of industrial processes and business pipelines around the world. Our goal at Aljabr is to apply them so that it is easy for you to build software, to roll out services, to analyze data streams, or to train machine learning algorithms. The sky’s the limit.

Data processing used to be more straightforward (if not always easy) before cloud computing. You put your numbers in, you got your numbers out. With the cloud, big or small data, and a bunch of new tools, the complexity is piling up! First, there is all the plumbing of the cloud to contend with, and then there are all the new ways of doing data processing. That’s a lot of moving parts!

Our goal in this blog is to explore the concepts and development of workflows and data processing from top to bottom, and to discuss how we can scale and dramatically simplify these patterns for everyone.

Pipes and Workflows

There are two main kinds of IT workflows:

  • Build and deployment pipelines
  • Data processing applications

The two are different in some ways and similar in others.

Deployment pipelines are usually the domain of DevOps practitioners, IT administrators, and systems engineers. Data processing is usually the domain of programmers and business analysts. In both cases, you start with an initial state (the state of a system, or a bunch of data) and you want to change it into something more valuable.

In a deployment, the workflow tends to look like a runbook, i.e. a series of commands to be executed as a script, in a certain order:

# Run
Command 1
Command 2

In data processing, the workflow looks like a computer program, taking variables and computing a result:

result1 = function(input)
result2 = function(result1)

In both cases, there may be several stages, and each stage has an input and an output. In a command script, the inputs and outputs are often implicit, or specified as parameters to the commands. In a computer program, the inputs and outputs are explicit variables.
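As a minimal sketch of this contrast, assuming Python and entirely hypothetical names (`RUNBOOK`, `run_runbook`, `clean`, `summarize`), the first half runs a runbook as an ordered list of placeholder commands, and the second half chains functions whose inputs and outputs are explicit variables:

```python
import subprocess
import sys

# Runbook style: an ordered list of commands. Inputs and outputs are
# implicit (side effects, exit codes) or passed as command parameters.
# These commands are placeholders that only print a message.
RUNBOOK = [
    [sys.executable, "-c", "print('command 1')"],
    [sys.executable, "-c", "print('command 2')"],
]

def run_runbook(commands):
    """Run each command in order; check=True stops at the first failure."""
    outputs = []
    for cmd in commands:
        done = subprocess.run(cmd, check=True, capture_output=True, text=True)
        outputs.append(done.stdout.strip())
    return outputs

# Program style: each stage takes an explicit input and returns an
# explicit output that feeds the next stage.
def clean(values):
    """Drop missing entries."""
    return [v for v in values if v is not None]

def summarize(values):
    """Return the arithmetic mean."""
    return sum(values) / len(values)

if __name__ == "__main__":
    print(run_runbook(RUNBOOK))
    result1 = clean([1, None, 2, 3])
    result2 = summarize(result1)
    print(result2)
```

In the runbook half, nothing in the code says what each command consumes or produces; in the program half, `result1` and `result2` make the hand-off between stages visible.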

These two types of workflows differ subtly in viewpoint. A command-based pipeline frames the problem as a sequence of operations performed on a stationary patient (the system, or some data). A data-processing pipeline frames the data as a product on a kind of conveyor belt, transformed as it passes through the stages of a production line:

Two relative ways of viewing workflows: 1) stationary work operated on by a sequence of operators, or 2) a flow of work along a conveyor belt.
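The two viewpoints can be sketched in Python as follows. This is an illustration under assumptions, not anything taken from Aljabr's software: the names `apply_operators`, `conveyor`, `stage`, `add_tax`, and `round_total` are hypothetical. In the first view, a sequence of operators visits one stationary piece of work; in the second, items stream lazily through chained generator stages.

```python
def add_tax(order):
    """Return a copy of the order with 25% added to its total."""
    return {**order, "total": order["total"] * 1.25}

def round_total(order):
    """Return a copy of the order with its total rounded to cents."""
    return {**order, "total": round(order["total"], 2)}

# View 1: stationary work. One piece of state stays put while the
# operators are applied to it one after another.
def apply_operators(state, operators):
    for op in operators:
        state = op(state)
    return state

# View 2: conveyor belt. Items flow lazily through a chain of stages.
def stage(func, items):
    for item in items:
        yield func(item)

def conveyor(items, funcs):
    for func in funcs:
        items = stage(func, items)
    return items

if __name__ == "__main__":
    orders = [{"id": 1, "total": 10.0}, {"id": 2, "total": 20.0}]
    steps = [add_tax, round_total]
    print([apply_operators(o, steps) for o in orders])
    print(list(conveyor(orders, steps)))
```

Both calls produce the same orders; what changes is whether the code is organized around the operators or around the flow of items.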

What unifies these two kinds of workflows is a simple flow of transformations from start to finish. There might be implicit recursion, function calls, or dependencies on external services under the hood, but there is also a simple focus on outcomes, or business logic, that tells a story from start to finish. This is what makes pipelines so important.

Examples

Some examples of processes that could be handled as data processing pipelines:

  • Machine learning: training, testing and deployment
  • Continuous delivery build pipelines, including testing, integration and more
  • Integrated form submission, combining multiple systems

The list of cases is almost limitless, and while not all examples have to be handled as pipelines, the principles cover a wide range of applications.

Stay tuned

In the coming posts, we will cover some traditional approaches to running pipelines, and explore how these might be improved upon and perhaps reimagined for the cloud-native era!
