Building ML Pipelines

What is a DAG?

John Aven
Hashmap, an NTT DATA Company
6 min read · Nov 19, 2020

In modern computing solutions, the concept of a DAG, or Directed Acyclic Graph, is central. The term has become something of a buzzword, but understanding what DAGs are, how they are used in computing, and where they show up in data science and machine learning is more than buzz. In short, a DAG describes a sequence of execution steps in a complex, non-recurring computation.

How often do you come across the need to create a DAG in machine learning?

Every. Single. Day.

Machine learning can be constructively defined to be “the art of building DAGs to treat and transform data using a sequence of advanced mathematical transforms to generate a repeatable formulation that solves a problem for new data.” In Data Science and Machine Learning, a pipeline or workflow is nothing but a DAG. Note that this is not the only place where DAGs are found in Data Science/Machine Learning.

The point is, as you build your ML code, you need to orchestrate your workflow. There is little reason to do this manually. Unfortunately, not everyone is aware of the tools available. The first step is to understand what you are building, a DAG, and this article aims to help you down that path. So let’s get started!

DAGs Defined

So, why do we call them DAGs? A DAG is a Directed Acyclic Graph — a mathematical abstraction of a pipeline. Let’s break this down a bit, though.

A graph is a collection of vertices (or points) and edges (or lines) that indicate connections between the vertices.

Simple Graph

A directed graph is a graph in which each edge points in a direction, from one vertex to another. Two edges may exist between the same pair of vertices, but they will then point in opposite directions. A directed graph is often referred to as a digraph.

A simple digraph
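To make the difference between a graph and a digraph concrete, here is a minimal sketch in plain Python. The vertex names and the two helper functions are hypothetical and exist only for illustration: an undirected edge is stored as an unordered pair, a directed edge as an ordered one.

```python
# Hypothetical four-vertex example; the vertex names are illustrative only.
undirected_edges = {frozenset(pair) for pair in [("a", "b"), ("b", "c"), ("c", "d")]}

# In the digraph, b -> c and c -> b are two separate edges pointing in
# opposite directions between the same pair of vertices.
directed_edges = {("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")}


def connected(u, v):
    """Undirected: the order of the endpoints does not matter."""
    return frozenset((u, v)) in undirected_edges


def points_to(u, v):
    """Directed: an edge u -> v says nothing about v -> u."""
    return (u, v) in directed_edges


print(connected("b", "a"))   # True: same edge as a-b
print(points_to("b", "a"))   # False: only a -> b exists
```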

A graph is cyclic if it contains one or more cycles, where a cycle is a path along edges (following their direction, in a digraph) that returns to the vertex it started from without reusing an edge. A graph is acyclic when it contains no cycles.

Digraph with a highlighted cycle
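One common way to look for a cycle in a digraph is a depth-first search that remembers which vertices are on the path currently being explored. The sketch below is one possible implementation, not something taken from a particular library. The function name `find_cycle` and the adjacency-dict layout (vertex mapped to the vertices it points to) are assumptions made for this example.

```python
def find_cycle(successors):
    """Return one directed cycle as a list of vertices, or None if there is none.

    `successors` maps each vertex to the vertices its outgoing edges point to.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {v: WHITE for v in successors}
    path = []

    def visit(v):
        color[v] = GREY
        path.append(v)
        for w in successors.get(v, ()):
            state = color.get(w, WHITE)
            if state == GREY:
                # w is already on the current path, so the path loops back to it.
                return path[path.index(w):] + [w]
            if state == WHITE:
                found = visit(w)
                if found:
                    return found
        color[v] = BLACK
        path.pop()
        return None

    for v in list(successors):
        if color[v] == WHITE:
            found = visit(v)
            if found:
                return found
    return None


cyclic = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
acyclic = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

print(find_cycle(cyclic))    # ['a', 'b', 'c', 'a']
print(find_cycle(acyclic))   # None
```

The recursive form is fine for small examples; a very deep graph would need an iterative version to stay within Python's recursion limit.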

Therefore, a directed acyclic graph or DAG is a directed graph with no cycles. A rather simple concept once you can put some definition to it. And look at that — you came here to read about DAGs in ML Pipelines, and you got a mini-lesson in Graph Theory!

Simple DAG
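A practical consequence of having no cycles is that the vertices of a DAG can always be placed in an order where every edge points forward, which is exactly what a pipeline runner needs. As a rough sketch, Python's standard-library `graphlib` module (available from Python 3.9) can compute such an order and raises `CycleError` when the graph is not a DAG. The helper name `execution_order` and the sample graph are assumptions for this example. Note that `graphlib` expects each vertex mapped to its predecessors, not its successors.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(predecessors):
    """Return a valid vertex order for a DAG, or None if the graph has a cycle.

    `predecessors` maps each vertex to the set of vertices that must come first.
    """
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


# Diamond-shaped DAG: a feeds b and c, which both feed d.
dag = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
order = execution_order(dag)
print(order)  # starts with 'a' and ends with 'd'

# Two vertices that depend on each other form a cycle, so no order exists.
print(execution_order({"a": {"b"}, "b": {"a"}}))  # None
```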

So, in the end, the reason that the DAG terminology is used is not just because it is cool to say DAG, but because it describes the nature of a workflow in cases where you NEVER look back to previous steps.

Now, you may be saying… “We loop back all the time in machine learning; the model training step is fraught with it when you are optimizing, recurrent neural networks loop back on themselves, and so on!” — You are correct. A DAG is not a universal descriptor of a pipeline when you zoom in, but it is a descriptor at the high level of execution; it describes the steps being taken — aka the pipeline.
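The distinction between the loop inside a step and the acyclic pipeline around it can be sketched with a toy example. Everything here is hypothetical: the step names, the toy data for y = 3x, and the one-parameter model are invented for illustration. The training step iterates many times internally, yet the pipeline that calls it is still a DAG of four steps, each executed once.

```python
from graphlib import TopologicalSorter


def load(_inputs):
    # Toy data for y = 3x; stands in for a real data source.
    return [(float(x), 3.0 * x) for x in range(1, 6)]


def scale(inputs):
    data = inputs["load"]
    biggest = max(x for x, _ in data)
    return [(x / biggest, y) for x, y in data]


def train(inputs):
    # The loop lives inside this single step: plain gradient descent on y = w * x.
    data = inputs["scale"]
    w = 0.0
    for _ in range(500):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= 0.1 * grad
    return w


def evaluate(inputs):
    data, w = inputs["scale"], inputs["train"]
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


steps = {"load": load, "scale": scale, "train": train, "evaluate": evaluate}

# The pipeline itself: each step mapped to the steps it depends on. No cycles.
depends_on = {
    "load": set(),
    "scale": {"load"},
    "train": {"scale"},
    "evaluate": {"scale", "train"},
}

results = {}
for name in TopologicalSorter(depends_on).static_order():
    # Each step runs exactly once and only sees the outputs of its dependencies.
    results[name] = steps[name]({dep: results[dep] for dep in depends_on[name]})

print(round(results["train"], 6), results["evaluate"] < 1e-9)
```

From the orchestrator's point of view, `train` is a single vertex; how many iterations it performs internally does not change the shape of the graph.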

Tools for & with DAGs

Now, while some folks (and by some, I mean many or most) still build their pipelines manually, doing so is labor-intensive, time-consuming, and hard to repeat!

Building your pipeline can be done with many tools. A few of them are:

  • Apache Airflow: Python-based orchestration platform built for operations teams and, to some extent, data science. One of the best UIs in this space.
Sample ML Pipeline in Airflow
  • Prefect: Programmatic orchestration platform built to be used by developers. It is an emerging tool in this space that is gaining traction.
Sample ML Pipeline in Prefect Core
  • Argo: Workflow orchestration built for the cloud-native space. Declarative, with a simple-to-use UI. See my article here, where I address using Argo to build ML Pipelines.
Sample ML Pipeline in Argo
  • Luigi: Another open-source solution built to be used by developers.
  • Control-M: Popular enterprise workflow management tool from BMC.
  • TensorFlow Extended (TFX): TensorFlow's platform for building and orchestrating production ML pipelines, including Deep Learning pipelines.

These are tools that help you build compute pipelines as DAGs. They each have their own learning curves and cost/benefit tradeoffs. Choosing what is right for you truly depends on your operating environment, enterprise strategy, level of user expertise required, and whether you lean towards Open Source Solutions or commercial solutions.

Now, beyond building pipelines, and just as a matter of curiosity, DAGs are also present in many of the computing solutions:

And so many more places. It turns out that knowing this concept will help you understand the execution of many processes (not just those related to machine learning pipelines).

Now, go out and read more, learn more, and explore the many tools available. Look through your code: are you creating DAGs without being aware of it? If you don’t want to use one of the solutions mentioned, you could even build your own home-grown solution, e.g., by using the networkx library in Python. But guess what? If you are doing that, then consider using Prefect instead; it is low-level and meant to be used at that programmatic level.
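To give a feel for what the core of a home-grown solution might look like, here is a minimal sketch of its scheduling part. It swaps networkx for the standard-library `graphlib` module (Python 3.9+) purely to stay dependency-free. The function name `parallel_batches` and the pipeline step names are hypothetical. The idea is to group steps into batches whose dependencies are already complete, so every step within a batch could run in parallel.

```python
from graphlib import TopologicalSorter


def parallel_batches(depends_on):
    """Group steps into batches; every step in a batch has all its dependencies done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()                        # raises graphlib.CycleError on a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        batches.append(ready)
        sorter.done(*ready)                 # a real runner would execute the steps first
    return batches


# Hypothetical pipeline: two models trained on the same features, then compared.
pipeline = {
    "clean": {"ingest"},
    "features": {"clean"},
    "train_a": {"features"},
    "train_b": {"features"},
    "compare": {"train_a", "train_b"},
}

for batch in parallel_batches(pipeline):
    print(batch)
```

A real orchestrator adds much more on top of this (retries, logging, scheduling, a UI), which is a large part of the case for adopting an existing tool.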

Working with Hashmap

Are you overwhelmed with the concept of a DAG? Which DAG-focused orchestration tool should you adopt, and are you using the right tool for training and deploying your ML pipelines? Hashmap can help you here. Our machine learning and MLOps experts are here to help you on your journey and to bring you and your organization to the next level. Let us help you get ahead of your competition and become truly efficient in your data analytics.

If you’d like assistance along the way, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, data science, MLOps, and various other technology consulting services.

Other Tools and Content You Might Like

To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap, available on Spotify, Apple, Google, and other popular streaming apps.

John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.

