Staying on Course: Data Workflows are Your Friend

Joerg Schnitzbauer
GAMMA — Part of BCG X
Jun 7, 2018

by Joerg Schnitzbauer and Lukasz Bolikowski

Part of our job is feeling sympathy for companies that have wasted millions on missteps made early in large data science projects. The purpose of this column is to share lessons we’ve learned helping businesses build data workflows that avoid such pitfalls, so that they can employ advanced analytics to be conquerors in the business world rather than also-rans.

These days, data is the new oil fueling many of the globe’s most successful businesses. We’ve seen firsthand how important it is to be well-organized with data-driven research, a highly iterative process that involves lots of trial-and-error, requires multiple versions of raw input files, and produces great volumes of intermediate results. The more complex a project, the more difficult it is to track data dependencies and data lineage. This makes it even more essential from the start that data scientists set up a clean code structure and employ elementary data management practices to survive when complexity increases.

As we’ve seen working with companies large and small, neglecting code structure at the outset of a project can introduce compounding technical problems that make scaling into production orders of magnitude harder, if they don’t cause the cancellation of the project altogether.

And in 2018, if you’re not able to routinely employ advanced analytics to have a monetary impact on your business, you can be sure that your competition is.

There’s no single, standard way of proceeding. But there are best practices that make our research easier and basic steps that any data-science team can take when starting on a new journey, as well as specialized tools that can help to rapidly iterate through prototypes while simultaneously promoting a codebase that is ready to scale into production.

Data-centricity

Visuals help. We start each project at a whiteboard, mapping out the steps needed for the journey we’re about to take. Just as any trip is more predictable with a map, a visual representation tells you where you are and where you still need to go. Call it a data map, or a data pipeline: a route from the raw data to the results that help your business grow and maximize profit.

Data science projects, we find, are often more data-driven than algorithm-driven. Hence, it is usually helpful to reflect this data-centricity in our code structure. Algorithms are important, but so are efforts to define the nature (or, even better, the schema) of inputs, outputs and interim data. Before designing algorithms, it’s essential to break the data-processing flow into manageable blocks and to define the APIs between them.

While the schema of input data is often known and fixed, interim and output data need to be defined. Even with a clear vision of the business problem, it is critical at the outset of a project to define the output data (the concrete goal of the project) in detail. This helps not only to structure the codebase, but also ensures that those working on a data-driven project have an even sharper understanding of the business problem. Experience has taught us that it’s always a good idea to engage early on with business stakeholders when proposing and testing output data.
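As a minimal sketch of what this can look like in practice (the column names and types below are hypothetical, not from any real project), the agreed output schema can be written down as code before a single algorithm exists:

```python
import pandas as pd

# Hypothetical output schema: one row per store and week with a demand forecast.
# Writing the schema down first gives the team a concrete target to build toward.
OUTPUT_SCHEMA = {
    "store_id": "int64",
    "week_start": "datetime64[ns]",
    "forecast_units": "float64",
}

def validate_output(df: pd.DataFrame) -> pd.DataFrame:
    """Check that a dataframe matches the agreed output schema."""
    missing = set(OUTPUT_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Output is missing columns: {sorted(missing)}")
    # Cast to the agreed dtypes so downstream consumers can rely on them.
    return df[list(OUTPUT_SCHEMA)].astype(OUTPUT_SCHEMA)
```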

Such a clean structure will already have paid dividends during the proof-of-concept phase. But it will prove even more critical once a project matures and a company starts industrializing its code. At that point, it’s essential to track data dependencies and understand data lineage.

Divide and conquer

After defining the output data schema, the next step is thinking about how we get from A (input) to B (output): in other words, defining the interim data. This can seem daunting, but it becomes manageable if we break the data flow into pieces and build a hierarchical view. We might not yet know how to write an algorithm that gets us from A to B, but maybe it would help to create an interim dataset along the way that we’ll call C. If getting from A to C still does not let us write an understandable piece of code, we add another interim dataset, D (thus, A -> D -> C -> B). Of course, a real data flow can have branches and more complicated dependencies. But by repeating this process of adding interim datasets, we eventually reach a state where the steps between datasets are so small that anyone involved in the project can easily define the tasks that transform one dataset into the next, and a clear picture emerges of the algorithms we need. These “atomic” tasks are where our code and algorithms come in. If it is difficult to describe a data-transforming task in a simple sentence, it is probably a good idea to break it down further.
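A minimal sketch of what such atomic tasks can look like as plain functions, following the A -> D -> C -> B example above (dataset and column names are purely illustrative):

```python
import pandas as pd

# Each atomic task reads well-defined inputs and returns a well-defined output.
# The dataset letters follow the A -> D -> C -> B example above.

def clean_raw(a: pd.DataFrame) -> pd.DataFrame:
    """A -> D: drop obviously invalid rows from the raw input."""
    return a.dropna(subset=["customer_id"])

def aggregate_weekly(d: pd.DataFrame) -> pd.DataFrame:
    """D -> C: aggregate cleaned records to one row per customer and week."""
    return d.groupby(["customer_id", "week"], as_index=False)["amount"].sum()

def score_customers(c: pd.DataFrame) -> pd.DataFrame:
    """C -> B: compute a simple spend score per customer."""
    b = c.groupby("customer_id", as_index=False)["amount"].mean()
    return b.rename(columns={"amount": "avg_weekly_spend"})
```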

Dependencies as a Directed Acyclic Graph

The dependencies between interim datasets are often not linear but rather complex. One task could produce multiple outputs, another may have multiple upstream dependencies. Some interim data may be a requirement for multiple downstream tasks. However, we can bring order to the complexity by realizing that we have created a “Directed Acyclic Graph,” or DAG.

Datasets and tasks are nodes of the graph. The links between them represent input and output connections. In Figure One below, links pointing from a dataset (represented by circles) to a task (represented by squares) show task inputs. Links pointing from a task to a dataset show task outputs. If we have done a proper job, we should find no circular dependencies within the graph — hence the term “acyclic.”

FIGURE ONE

An example of a Directed Acyclic Graph (DAG) for data processing. Round nodes represent datasets and square nodes represent tasks. A, B, E are input datasets, G is an output dataset. Arrows represent input and output of tasks. For example, task g is dependent on datasets C, D and F as inputs and produces G as an output.
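For illustration, here is one lightweight way to write such a dependency graph down in code. Only task g is taken directly from the figure caption; the upstream tasks producing C, D and F from the inputs A, B and E are assumed here purely to make the example complete:

```python
# Each task maps to its input and output datasets. Task g comes from the
# Figure One caption; tasks d, e and f are hypothetical.
TASKS = {
    "d": {"inputs": ["A"],           "outputs": ["C"]},
    "e": {"inputs": ["A", "B"],      "outputs": ["D"]},
    "f": {"inputs": ["E"],           "outputs": ["F"]},
    "g": {"inputs": ["C", "D", "F"], "outputs": ["G"]},
}

def runnable_tasks(available: set) -> list:
    """Return the tasks whose inputs are all available, i.e. ready to run."""
    return [name for name, t in TASKS.items() if set(t["inputs"]) <= available]

# Starting from the raw inputs A, B and E, tasks d, e and f are runnable;
# g becomes runnable only once C, D and F have been produced.
print(runnable_tasks({"A", "B", "E"}))
```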

The Benefits

The beauty of the approach described above is that the interim and output data schemas are decoupled from the algorithmic logic. A task may implement any logic that honors its input and output schemas, much like a function honoring its defined signature in software development.

This insight comes with enormous benefits:

● We have modular code that allows us to quickly replace specific sections with a new algorithm without breaking the entire codebase. In our work, it is critical that we can rapidly iterate through repeated trial and error runs and test our model dynamically. Even if we have to replace larger sections of the DAG, including multiple tasks, we can still rely on the remaining upstream and downstream code. We will always be anchored at least to the input and final output data schema.

● Team collaboration is straightforward and manageable. When we launch a project, we define the DAG and distribute the coding of each task (work packages) among team members. If we need to update the business logic of multiple tasks simultaneously, team members can work on different sections of the DAG at the same time with minimal coordination, because they can rely on the data schema coming into the task they are working on. This works as long as the individual work packages don’t restructure the DAG in conflicting ways; and even then, the DAG can be restructured if required, given adequate coordination among team members.

● By this time, we will have defined the “units” for unit tests. From the get-go, it’s essential to make sure that the code produces correct results. Instead of relying on ad hoc checks, we can write those tests once and benefit from them on every subsequent run (see the test sketch after this list).

● Instead of running the entire workflow every time (which can be time consuming), we can choose to quickly run only the sections we need. This is particularly useful when developing new logic for a particular task, as we can calculate prerequisite datasets once and “freeze” them as inputs for the section we’re working on.

● Increased transparency, especially for non-technical stakeholders. One doesn’t need to be a software developer or data scientist to understand the pipeline and algorithms being employed. At GAMMA, data scientists work side-by-side with management consultants and other generalists on a daily basis. By splitting the code into atomic tasks with clearly defined inputs and outputs, it’s easier to communicate the scope of our work to non-data scientists. Moreover, when intermediate files are in CSV format, for example, our colleagues can inspect them in Excel and gain a deeper understanding of the model, or help debug problems without reading a single line of code!
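As a minimal sketch of such a unit test, using the hypothetical aggregate_weekly task from the earlier example (copied into the test file here to keep it self-contained; in a real project it would be imported from the pipeline code):

```python
import pandas as pd

def aggregate_weekly(d: pd.DataFrame) -> pd.DataFrame:
    """D -> C (copied from the earlier sketch for a self-contained example)."""
    return d.groupby(["customer_id", "week"], as_index=False)["amount"].sum()

def test_aggregate_weekly_sums_amounts_per_customer_and_week():
    d = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "week": ["2018-01", "2018-01", "2018-01"],
        "amount": [10.0, 5.0, 7.0],
    })
    c = aggregate_weekly(d)
    # Customer 1 has two records in the same week; they should be summed to 15.
    assert c.loc[c["customer_id"] == 1, "amount"].iloc[0] == 15.0
    assert len(c) == 2
```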

Useful frameworks

A number of trailblazers have tackled the cumbersome effort of coding pipelines by hand by creating programming frameworks, several of which have matured especially in “Extract, Transform, Load” (ETL) environments. Specifically, we want to call your attention to two very popular frameworks, Luigi and Airflow, and also shamelessly plug two lighter-weight alternatives of our own creation: elkhound and dalymi.

The flagships: Luigi and Airflow

In both Luigi and Airflow, DAGs are written in Python, but tasks may be non-Python operations such as bash scripts, Hive queries, Spark jobs, etc. Luigi ships with a web server that provides an informative user interface, making it easy to visualize the DAG and monitor its execution. Airflow takes this approach one step further: its web server provides not just an informative interface but an interactive one that lets users trigger DAG runs, control the execution of specific tasks, review logs, and more.

The key difference between Luigi and Airflow is how each triggers tasks. Luigi uses a GNU-Make-style dependency model: each task declares one or more output targets (in the simplest case, a local file), and a task is considered “done” once those targets exist. Consequently, Luigi only schedules a task for execution once its required tasks have produced their output. Airflow, on the other hand, records DAG and task executions in a database. This makes it easy to explore logs of past pipeline runs, but it also requires additional effort for setup and maintenance.
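As a minimal sketch of what this Make-style model looks like in Luigi (task names, file paths and columns are hypothetical; see the Luigi documentation for the full API):

```python
import luigi
import pandas as pd

class RawSales(luigi.ExternalTask):
    """Raw input file, assumed to be delivered from outside the pipeline."""
    def output(self):
        return luigi.LocalTarget("data/raw/sales.csv")

class CleanSales(luigi.Task):
    """Hypothetical atomic task: turn raw sales records into a clean interim file."""
    def requires(self):
        return RawSales()

    def output(self):
        # Luigi considers this task "done" once this target exists.
        return luigi.LocalTarget("data/interim/sales_clean.csv")

    def run(self):
        df = pd.read_csv(self.input().path)
        df = df.dropna(subset=["store_id"])
        df.to_csv(self.output().path, index=False)

if __name__ == "__main__":
    # Runs CleanSales with the local scheduler, skipping it if its output already exists.
    luigi.build([CleanSales()], local_scheduler=True)
```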

The lightweights: elkhound and dalymi

Airflow and Luigi are both solid solutions for the challenges of ETL and the execution of mature data-processing pipelines. Yet while they shine in production settings, at GAMMA we often iterate rapidly through model alternatives and test settings, especially during proof-of-concept phases. Out of these specialized requirements, a couple of lightweight frameworks have emerged independently as open-source projects from inside GAMMA: elkhound and dalymi (“data like you mean it”). Both are still in early development, but they already solve workflow challenges by enabling rapid model development while promoting healthy coding practices. They keep our work agile and dynamic without creating the kind of technical debt that would otherwise come due when transitioning into production.

Summary

Regardless of which framework you choose, any data scientist will benefit from structuring their work as atomic tasks with clearly defined inputs, outputs and dependencies. Our advice: use data workflows and be a conqueror of the digital age. Don’t dive rashly into coding and algorithm design, or you risk producing unmaintainable code and failing to answer the central business question. Don’t be afraid to learn and use frameworks; they reduce technicalities so that you can focus on what really matters: the business logic. Structure is your friend: embrace it.

Happy coding!
