Set Up and Run CWL-Airflow Workflows With Docker Compose

Tony Tannous · Published in The Startup · Sep 27, 2020

The inspiration for writing this article came after reading about Airflow extensibility using CWL-Airflow.

The latest CWL-Airflow documentation, published by its developers, can be found at https://cwl-airflow.readthedocs.io/en/latest.

With the documentation link above and concepts learned from https://github.com/puckel/docker-airflow, what follows is an outline of setting up a cwl-airflow Docker Compose stack.

A git repo containing the stack components can be found at https://github.com/tonys-code-base/cwl-airflow-stack.

This stack is not intended for use on a public network. The host environment was Windows 10 with WSL2 running Ubuntu 20.04 LTS from the MS Store, with WSL2 integration enabled via Docker Desktop.

Preliminary Notes

The core components are Airflow 1.10.11 and CWL-Airflow 1.2.2. Airflow is configured to run in CeleryExecutor mode with a single worker and a MySQL metadata backend.

Getting CWL workflows with a DockerRequirement specification to execute successfully was somewhat of a trial-and-error exercise, and some workarounds were required to get things working as they should (mostly). The issues were mainly related to the host:container volume mounts of the CWL-Airflow specific folders (cwl_tmp_folder, cwl_outputs_folder, cwl_inputs_folder, cwl_pickle_folder). The absolute paths for these need to be specified within docker-compose.yml and must match the absolute paths within the Airflow containers. By default, CWL-Airflow creates these folders under $AIRFLOW_HOME.

To address this, user airflow was created both on the host and as part of the Docker image, with a $HOME directory on the host identical to $AIRFLOW_HOME within the Airflow containers.
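For illustration only (the exact definitions live in the repo's docker-compose.yml), the idea is that each CWL-Airflow folder is bind-mounted with an identical absolute path on the host and container side, along the lines of:

# Illustrative excerpt only -- not the repo's exact docker-compose.yml.
# Host and container paths for the CWL-Airflow folders are identical.
    volumes:
      - /home/airflow/cwl_tmp_folder:/home/airflow/cwl_tmp_folder
      - /home/airflow/cwl_inputs_folder:/home/airflow/cwl_inputs_folder
      - /home/airflow/cwl_outputs_folder:/home/airflow/cwl_outputs_folder
      - /home/airflow/cwl_pickle_folder:/home/airflow/cwl_pickle_folder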

Setup Host User

On the Linux/Ubuntu host, create a new user airflow with home directory /home/airflow:

$ sudo useradd -ms /bin/bash -d /home/airflow airflow

Add user airflow to sudo & docker groups:

$ sudo usermod -aG sudo airflow
$ sudo usermod -aG docker airflow

Allow airflow to sudo without password prompt by adding the following entry using visudo:

airflow ALL=(ALL) NOPASSWD: ALL

Build Airflow Image with CWL-Airflow Support

Log on to the host as airflow:

$ sudo su - airflow

Clone the compose stack repo to a temp location and move the folders we need from the repo to the target /home/airflow directory:

$ cd /home/airflow
$ mkdir tmp
$ git clone https://github.com/tonys-code-base/cwl-airflow-stack.git /home/airflow/tmp
$ rm -fr /home/airflow/tmp/.git /home/airflow/tmp/.gitignore
$ mv /home/airflow/tmp/* /home/airflow/
$ rm -fr /home/airflow/tmp

Build the Docker image with the tag cwl-airflow-docker:

$ cd /home/airflow/cwl-airflow-docker
$ docker build -t cwl-airflow-docker .

Run the Docker Compose Stack

Bring up the stack:

$ cd /home/airflow
$ docker-compose up -d

The dags folder should contain clean_dag_run.py, which is created as part of CWL-Airflow's initialisation steps.
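A quick way to confirm this from the host (the dags folder sits under /home/airflow, as set up earlier):

$ ls /home/airflow/dags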

Wait until the containers/services are up and running. You can follow/tail the Webserver container logs to check the webserver status using:

$ docker logs --follow webserver
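You can also check that all seven services are up with Docker Compose:

$ cd /home/airflow
$ docker-compose ps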

CWL/Airflow Stack Components

  • Runs in CeleryExecutor mode with a single Worker.
  • Consists of 7 networked services/containers (redis, mysqldb, flower, webserver, scheduler, worker, mysql_admin_portal)
  • Key component version numbers
    Apache-airflow = 1.10.11
    CWL-airflow = 1.2.2
  • CWL DAGs can be triggered via the Airflow Web UI, the native experimental API, the CWL-Airflow API, or the Airflow CLI

Stack Config and Default Credentials

Some of the Airflow stack default configuration parameters and their corresponding values reside within cwl-airflow-docker/config.env. These are initialised by the Entrypoint shell script of the CWL-Airflow image.
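To see what the entrypoint actually exported into a running container, dumping the container environment is a quick, rough check (the exact variable names depend on the image's entrypoint script):

$ docker exec -ti webserver env | grep -i airflow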

Once the stack is up, most of the main services can be accessed from the host machine. The exposed services are listed below (refer to docker-compose.yml in the repo for the full host port mappings).

Airflow/CWL Service

⚠ Notes regarding the APIs:

  • The native Airflow API accepts requests on port 8080 and supports username/password authentication, which is enabled for this stack. Either an Authorization request header of the form Authorization: Basic <credentials>, or a user:password pair, is required to authenticate.
  • The CWL-Airflow API listens for requests on port 8081. This API includes support for the GA4GH Workflow Execution Service schemas (https://github.com/ga4gh/workflow-execution-service-schemas). Authentication is not supported in the current version of CWL-Airflow (1.2.2).

MySQL service/DB

  • Airflow metadata database: airflowdb
    MySQL Username: airflow
    Password: sample

The backend database can be administered/queried via the phpMyAdmin portal (the mysql_admin_portal service), exposed on the port defined in docker-compose.yml.
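Alternatively, you can query the metadata database directly with the mysql client inside the mysqldb container (a quick sketch, assuming the service/container name from the stack listing and the credentials above):

$ docker exec -ti mysqldb \
    mysql -uairflow -psample airflowdb -e "SELECT dag_id, is_paused FROM dag;"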

Flower (Celery Monitoring)

Flower provides a web UI for monitoring the Celery worker; its exposed host port is defined in the stack's docker-compose.yml.

Sample Workflows

Three CWL Workflow examples have been included within the repo contents and should be visible via the Airflow Web Server UI once the stack is up and running. These examples were derived from existing sources and have been transformed to fit into the framework/stack architecture.

  1. snpeff-workflow-dag.py : Derived from Tutorial on making bioinformatics repeatable
  2. alpine-docker-dag.py : Derived from examples at commonwl
  3. dna2protein_dag.py : Derived from repo rabix/bunny

The components and steps involved in running the first of these will be described next.

Executing snpeff DAG/Workflow

The corresponding CWL-Airflow DAG (snpeff-workflow-dag.py), created as shown below with dag_id=snpeff-dag, references the absolute path of the workflow to be executed.

from cwl_airflow.extensions.cwldag import CWLDAG

# CWLDAG wraps the CWL workflow definition as a standard Airflow DAG
dag = CWLDAG(
    workflow="/home/airflow/sample-workflows/snpeff/snpeff-workflow.cwl",
    dag_id="snpeff-dag",
)

The components specific to this workflow (within the repo) are as follows:

/home/airflow
├── dags
│   └── snpeff-workflow-dag.py
├── sample-workflows
│   └── snpeff
│       ├── data
│       │   └── chr22.truncated.nosamples.1kg.vcf.gz
│       ├── snpeff-workflow-inputs.json
│       ├── snpeff-workflow.cwl
│       └── tools
│           ├── Dockerfile
│           ├── gunzip.cwl
│           └── snpeff.cwl

Workflow Components:

  • snpeff-workflow-dag.py: the CWL-Airflow DAG with dag_id = snpeff-dag
  • snpeff-workflow.cwl: the CWL workflow containing the step/input definitions
  • chr22.truncated.nosamples.1kg.vcf.gz: test input file
  • snpeff-workflow-inputs.json: the CWL job (workflow inputs)
  • tools: this folder contains the tools used by the workflow steps (a sketch of a typical tool follows this list)
  1. gunzip.cwl - gunzips the .gz input file
  2. snpeff.cwl - builds and runs a Docker image using the extracted file and a genome string as inputs
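For readers new to CWL, a gunzip CommandLineTool generally looks something like the following. This is a minimal sketch for orientation only, not necessarily the exact contents of gunzip.cwl in the repo.

# Illustrative CWL CommandLineTool: decompress a .gz file to stdout and
# capture the stream as the tool's output file.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gunzip, -c]
inputs:
  compressed_file:
    type: File
    inputBinding:
      position: 1
stdout: $(inputs.compressed_file.nameroot)
outputs:
  uncompressed_file:
    type: stdout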

The visual below shows the snpeff-workflow.cwl workflow as it appears in Rabix Composer. Before porting to a CWL-Airflow DAG, the workflow was tested in Rabix using cwltool as the executor.

The DAG can be viewed via the Airflow Web UI. By default its status is set to “paused/Off”.

You will notice two extra tasks, CWLJobDispatcher and CWLJobGatherer. These are automatically created by the CWL-Airflow framework.

Enable snpeff-dag DAG

To prepare the DAG for execution, unpause it via the Web UI by toggling its status from “Off” to “On”.
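The same can be done from the CLI inside the webserver container (Airflow 1.10 syntax):

$ docker exec -ti webserver airflow unpause snpeff-dag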

There are a number of ways to trigger the DAG and some of these are described next.

Triggering the snpeff-dag/Workflow using the Web UI

To trigger the DAG from the Web UI, we can pass the contents of the CWL job file (snpeff-workflow-inputs.json) as a JSON key value ("job":<value>) using the Trigger DAG JSON Configuration input area.

{
  "job": {
    "infile": {
      "class": "File",
      "path": "/home/airflow/sample-workflows/snpeff/data/chr22.truncated.nosamples.1kg.vcf.gz"
    },
    "genome": "hg19"
  }
}

Below is a sample output from a successful run.

The workflow output components can be found at /home/airflow/cwl_outputs_folder/<dag_id>/*.

For our DAG with dag_id = snpeff-dag, the output will be located at:

/home/airflow/cwl_outputs_folder
└── snpeff-dag
    └── manual__2020-09-23T07_58_45.578501_00_00
        ├── output.vcf
        ├── snpEff_genes.txt
        ├── snpEff_summary.html
        └── workflow_report.json

The first three files listed above (output.vcf, snpEff_genes.txt, snpEff_summary.html) are the workflow output files. The last file, workflow_report.json, contains a report of the files produced along with their corresponding attributes.
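To inspect the report, you can pretty-print it from the host (assuming python3 is available; substitute your own run folder name):

$ python3 -m json.tool \
    /home/airflow/cwl_outputs_folder/snpeff-dag/manual__2020-09-23T07_58_45.578501_00_00/workflow_report.json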

Trigger snpeff-dag using the Native API

As mentioned previously, API authentication is enabled for the stack and in order to make calls to the endpoints, an API user will need to be created.

A custom python script (api_user_setup.py) has been included with the Docker image and can be used to create an API user and HTTP Authorization token. Either of these can be used to authenticate.

Creating an API user

The following example creates an API User: apiuser with corresponding password: apiuser.

$ docker exec -ti webserver \
python api_user_setup.py -api add -u apiuser -p apiuser

...
...

* Ok --> User apiuser has been created.

* Use the following in the RestAPI header to authenticate -->
Authorization: Basic YXBpdXNlcjphcGl1c2Vy

You can test the credentials using the following curl command:

curl -v -X GET -H "Authorization: Basic YXBpdXNlcjphcGl1c2Vy" "http://<host>:<port>/api/experimental/test"

From the script output, you will notice the Authorization header is generated as Authorization: Basic YXBpdXNlcjphcGl1c2Vy, along with a sample curl request for the API test (api/experimental/test) endpoint.

Using the native API, the DAG can be triggered by calling the following endpoint and including either the Authorization request header, or username:password:

api/experimental/dags/<DAG_ID>/dag_runs

Method 1: Trigger DAG Using HTTP Authorization Header

curl -X POST \
  http://localhost:8080/api/experimental/dags/snpeff-dag/dag_runs \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Basic YXBpdXNlcjphcGl1c2Vy' \
  -d "{\"conf\": {\"job\": $(cat /home/airflow/sample-workflows/snpeff/snpeff-workflow-inputs.json)}}"

Method 2: Trigger DAG using Username/Password

curl -X POST -u apiuser:apiuser \
  http://localhost:8080/api/experimental/dags/snpeff-dag/dag_runs \
  -H 'Content-Type: application/json' \
  -d "{\"conf\": {\"job\": $(cat /home/airflow/sample-workflows/snpeff/snpeff-workflow-inputs.json)}}"

This should trigger the DAG and return a response similar to the following.

{"execution_date":"2020-09-23T09:54:32+00:00",
"message":"Created <DagRun snpeff-dag @ 2020-09-23 09:54:32+00:00: manual__2020-09-23T09:54:32+00:00,
externally triggered: True>",
"run_id":"manual__2020-09-23T09:54:32+00:00"}

Trigger snpeff-dag using the CWL-Airflow API

A request including the URL-encoded job can be used to trigger the DAG via the CWL-Airflow API endpoint /dag_runs.

This requires the urlencode utility, which is provided by the gridsite-clients package.

$ sudo apt install gridsite-clients

The command below triggers dag_id=snpeff-dag, passing the URL-encoded dag_run.conf job contents within the HTTP request.

$ curl -X POST "http://localhost:8081/api/experimental/dag_runs?\
dag_id=snpeff-dag&\
conf=\
"$(urlencode {\"job\":\
$(cat /home/airflow/sample-workflows/snpeff/snpeff-workflow-inputs.json)})

Trigger snpeff-dag using Airflow CLI

The following command can be used to trigger the DAG using the Airflow CLI.

$ docker exec -ti webserver airflow trigger_dag \
--conf "{\"job\":$(cat /home/airflow/sample-workflows/snpeff/snpeff-workflow-inputs.json)}" \
snpeff-dag

Troubleshooting Workflows

The example workflow discussed above was modified and tested using Rabix Composer before attempting to create and test the corresponding CWL-Airflow DAG. Rabix was configured to use cwltool as the local executor. Testing the workflow prior to porting to a DAG helps in distinguishing whether errors are related to workflow components or the CWL-Airflow implementation.

Workflows may also be tested with cwltool from the host with --debug enabled:

$ cwltool --debug <WORKFLOW> <JOB>

In the case of our snpeff-dag, this translates to the command:

$ cwltool --debug \
/home/airflow/sample-workflows/snpeff/snpeff-workflow.cwl \
/home/airflow/sample-workflows/snpeff/snpeff-workflow-inputs.json
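If a workflow passes under cwltool but fails inside the stack, the Celery worker logs are often the first place to look (assuming the container name from the stack listing):

$ docker logs --follow worker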

TL;DR

  • Create host user airflow and add it to the docker and sudo groups. All subsequent commands/activities are performed as the new user, airflow
  • Clone repo to tmp location and move relevant contents to /home/airflow
$ cd /home/airflow
$ mkdir tmp
$ git clone https://github.com/tonys-code-base/cwl-airflow-stack.git /home/airflow/tmp
$ rm -fr /home/airflow/tmp/.git /home/airflow/tmp/.gitignore /home/airflow/tmp/README.md
$ mv /home/airflow/tmp/* /home/airflow/
$ rm -fr /home/airflow/tmp
  • Build the Docker image with the tag cwl-airflow-docker
$ cd /home/airflow/cwl-airflow-docker
$ docker build -t cwl-airflow-docker .
  • Bring up the stack
$ cd /home/airflow
$ docker-compose up -d
  • Refer to Stack Config and Default Credentials section of this article for stack credentials
  • Refer to the example workflow and DAG included in the repo
/home/airflow/dags/snpeff-workflow-dag.py
/home/airflow/sample-workflows/snpeff/*
  • Trigger the sample DAG using your preferred method as outlined in the section Executing snpeff DAG/Workflow. For example, to trigger using the CWL-Airflow API, first install gridsite-clients, then call the dag_runs endpoint
$ sudo apt install gridsite-clients
$ curl -X POST "http://localhost:8081/api/experimental/dag_runs?\
dag_id=snpeff-dag&\
conf=\
"$(urlencode {\"job\":\
$(cat /home/airflow/sample-workflows/snpeff/snpeff-workflow-inputs.json)})
  • Workflow outputs can be located at:
/home/airflow/cwl_outputs_folder/<dag_id>/*

Final Notes

Most of the workflows I’ve stumbled across while learning CWL have been in the bioinformatics space. As a developer working with Airflow, and without a bioinformatics background, the search for a neat approach to building reusable components led to the discovery of CWL and CWL-Airflow. Personally, I can see immediate benefit in using CWL/CWL-Airflow outside the bioinformatics space.

CWL-Airflow continues to evolve. To gain a deeper understanding of CWL/CWL-Airflow, refer to the following links:

Latest Official CWL-Airflow Documentation:
https://cwl-airflow.readthedocs.io/en/latest/

CWL-Airflow Official Git repo:
https://github.com/Barski-lab/cwl-airflow

CWL-Airflow Official Python package:
https://pypi.org/project/cwl-airflow/

Common Workflow Language portal:
https://www.commonwl.org

CWL Patterns: https://rabix.io/cwl.html
