At Flatiron, a large number of systems work together to process patient data and de-identify it. One of these systems, the Client Packager, is responsible for the post-processing of our Enhanced Data Mart (EDM) products, which are the datasets that we deliver to cancer researchers. This data system must carefully aggregate, verify, and format the EDMs before they can be shipped off for delivery. As the number of data products we deliver and clients we deliver to grows, we have noticed that the original system design is no longer supporting our current scale of operation.
The heart of the issue lies with the infrastructure. A single AWS EC2 instance is used to execute the Client Packager. While the instance itself is capable, it is shared across several engineering teams for batch processing, continuous integration, and ad-hoc remote execution. Resource contention can become a limiting factor for computationally-intensive jobs. Job status is communicated through Slack and email with application logs remaining on the host. Needless to say, this infrastructure design works fine for a nimble startup, but starts to break down as the frequency and length of jobs increases over time. Current jobs can take up to ten hours to complete.
In an ideal state, we would have individual teams operate their own cluster of machines allowing data pipelines to scale according to each team’s requirements. Teams should manage the security and permissioning of their own systems, enforcing the principle of least privilege. It’s also important that application logs and intermediate data pipeline results be transparent and accessible. In order to achieve these goals and allow our particular data system to scale, we adopted Apache Airflow, an open source task orchestration framework.
In Airflow, data pipelines are defined as a series of tasks and dependencies. Together, they define a workflow that can be represented as a Directed Acyclic Graph (DAG). Each task in Airflow can be independently distributed to a different worker in a cluster allowing workflows to scale. Additionally, the status of each workflow is easily visible through the Airflow UI allowing transparent monitoring of data pipelines. Within Flatiron, Airflow infrastructure is administered such that each team’s cluster is provisioned separately allowing for isolated job scheduling and permissioning.
While there are several benefits to migrating the Client Packager to Airflow, doing so requires overcoming a series of challenges around distributed code execution and message passing.
In order to distribute any process across a cluster of workers we need a way to distribute the source code. Syncing a large codebase and installing the proper execution environment can take several minutes. For longer running processes, this is not a significant amount of time. However, this introduces unnecessary failure points and limits shorter running tasks from being effective. A common solution here is to containerize the application, which is what we have done using Docker. To keep our containers lightweight, we use Bazel, an open source software build tool, to package subsets of code into libraries so that only the necessary dependencies can be included. By organizing the code and execution environment into a single image file, we can quickly distribute our tasks to cluster workers.
Another challenge of distributing our tasks across cluster workers is that these tasks need to pass large datasets to each other. When we were running the Client Packager on a single instance, we could safely assume the local filesystem was always available for persistent reads and writes. Once we moved the workflow onto a cluster, we had to be more deliberate about data communication since tasks that share data no longer share a local filesystem. One solution was to leverage AWS S3 as a remote filesystem to download inputs and upload outputs at the end of each task. This had the advantage of allowing each task access to the data output of any dependencies that ran further up the DAG. Uploading intermediate results to AWS S3 also made it possible to review the data outputs for correctness and completeness.
The refactored Client Packager running on a distributed Airflow cluster drastically improves its scalability. This is just one example of how Flatiron rethinks infrastructure as the company continues to grow.