Use Google Cloud Batch for Running WDLs

Kopal Garg
Feb 22, 2024 · 4 min read
Transition from Google’s Cloud Life Sciences API to Google Cloud Batch

Google’s Cloud Life Sciences API is deprecated and scheduled to be turned down on July 8, 2025. For those relying on it, Google Cloud Batch combined with Cromwell offers a powerful alternative for batch-processing workflows.

Here’s a streamlined guide to get started with Google Cloud Batch using Cromwell:
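
If you’re starting from a fresh project, a minimal setup sketch first (assuming the gcloud CLI is installed; the project and bucket names below are the same placeholders used throughout this guide):

# Point gcloud at the project used throughout this guide (placeholder name)
gcloud config set project example-project

# Enable the APIs the Batch backend relies on
gcloud services enable batch.googleapis.com compute.googleapis.com storage.googleapis.com

# Create application-default credentials for Cromwell to pick up
gcloud auth application-default login

# Create the output bucket referenced by the config below
gsutil mb -l us-central1 gs://org-batch-output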

Preparing Your Workflow

  • Workflow Definition (WDL File): hello.wdl contains a simple task to greet an addressee and write the greeting to a file (a quick way to validate it locally appears after this list).
task hello {
  String addressee

  command {
    touch test.txt
    echo "Hello ${addressee}! Welcome to Cromwell" > test.txt
  }
  output {
    File out = "test.txt"
  }
  runtime {
    docker: "us-central1-docker.pkg.dev/example-project/example-org/example-image:7841b2e"
    memory: "1 GB"
    cpu: 1
  }
}

workflow wf_hello {
  String addressee
  call hello {
    input:
      addressee = addressee
  }
}
  • Inputs File: hello.inputs specifies the input for the workflow, such as the addressee's name.
{
  "wf_hello.hello.addressee": "Kopal"
}
  • Configuration File: google.conf configures Cromwell to use Google Cloud Batch: it defines the auth scheme (the `google` stanza, which the backend settings reference by name), the Google project, the output bucket, and the Batch backend itself.

google {
  application-name = "cromwell"

  # Auth schemes; the GCPBATCH backend below refers to these by name
  auths = [
    {
      name = "application-default"
      scheme = "application_default_credentials"
    }
  ]
}

backend {
  default = GCPBATCH

  providers {
    GCPBATCH {
      actor-factory = "cromwell.backend.google.batch.GcpBatchBackendLifecycleActorFactory"
      config {
        # Google project
        project = "example-project"

        # Base bucket for workflow executions
        root = "gs://org-batch-output"

        # Polling for completion backs off gradually for slower-running jobs.
        # This is the maximum polling interval (in seconds):
        maximum-polling-interval = 600

        # Optional Docker Hub credentials. Can be used to access private Docker images.
        dockerhub {
          # account = ""
          # token = ""
        }

        # Optional configuration to use a high-security network (Virtual Private Cloud) for running jobs.
        # See https://cromwell.readthedocs.io/en/stable/backends/Google/ for more details.
        # virtual-private-cloud {
        #   network-label-key = "network-key"
        #   auth = "application-default"
        # }

        # Global pipeline timeout
        # Defaults to 7 days; max 30 days
        # batch-timeout = 7 days

        genomics {
          # A reference to an auth defined in the `google` stanza at the top. This auth is used to create
          # Batch jobs and manipulate auth JSONs.
          auth = "application-default"

          # Alternative service account to use on the launched compute instance.
          # NOTE: If combined with service account authorization, both that service account and this service account
          # must be able to read and write to the 'root' GCS path.
          compute-service-account = "default"

          # Location to submit jobs to Batch and store job metadata.
          location = "us-central1"

          # Specifies the minimum file size for `gsutil cp` to use parallel composite uploads during delocalization.
          # Parallel composite uploads can result in a significant improvement in delocalization speed for large files,
          # but they may introduce complexities in downloading such files from GCS; please see
          # https://cloud.google.com/storage/docs/gsutil/commands/cp#parallel-composite-uploads for more information.
          #
          # If set to 0, parallel composite uploads are turned off. The default Cromwell configuration turns off
          # parallel composite uploads; this sample configuration turns it on for files of 150M or larger.
          parallel-composite-upload-threshold = "150M"
        }

        filesystems {
          gcs {
            # A reference to a potentially different auth for manipulating files via engine functions.
            auth = "application-default"
            # Google project which will be billed for the requests
            project = "google-billing-project"

            caching {
              # When a cache hit is found, the following duplication strategy will be followed to use the cached outputs.
              # Possible values: "copy", "reference". Defaults to "copy".
              # "copy": Copy the output files.
              # "reference": DO NOT copy the output files but point to the original output files instead.
              #   Will still make sure that all the original output files exist and are accessible before
              #   going forward with the cache hit.
              duplication-strategy = "copy"
            }
          }
        }

        default-runtime-attributes {
          cpu: 1
          failOnStderr: false
          continueOnReturnCode: 0
          memory: "2048 MB"
          bootDiskSizeGb: 10
          # Allowed to be a String, or a list of Strings
          disks: "local-disk 10 SSD"
          noAddress: false
          preemptible: 0
          zones: ["us-central1-a"]
        }
      }
    }
  }
}
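
Before submitting anything to Batch, it’s worth validating the WDL and inputs locally. A quick sketch using womtool (assuming you’ve downloaded womtool-86.jar from the same Cromwell release page):

# Check the WDL for syntax errors
java -jar womtool-86.jar validate hello.wdl

# Print a template inputs JSON to compare against hello.inputs
java -jar womtool-86.jar inputs hello.wdl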

Running Your Workflow with Cromwell

java -Dconfig.file=google.conf -jar cromwell-86.jar run hello.wdl -i hello.inputs

This command uses Cromwell to run your hello.wdl workflow, specifying hello.inputs as the input file and google.conf for configuration.
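
While the workflow runs, you can watch the underlying Batch jobs and, once it finishes, inspect the outputs in the bucket. A sketch, assuming the default Cromwell execution-directory layout under the configured root:

# List Batch jobs submitted to the configured location
gcloud batch jobs list --location=us-central1

# Browse the workflow's execution directory (one subdirectory per workflow run ID)
gsutil ls gs://org-batch-output/wf_hello/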

The Issue with the Latest Cromwell Release

However, when attempting to run workflows with the most recent Cromwell release at the time (86), users encountered an “invalid Docker spec” error caused by a change in how the GCP Batch backend processed mount paths. This issue, detailed in a GitHub thread, prevented workflows from executing successfully.

Workaround and Solution

To circumvent this, the Cromwell team provided fixes in subsequent development versions, addressing the Docker spec issue. We found success by pulling the latest Cromwell Docker image, which includes these fixes:

Pull the Latest Cromwell Docker Image:

docker pull broadinstitute/cromwell:latest
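
Since latest is a moving target, you may want to pin the exact image you tested by digest; a sketch:

# Record the digest of the image you just pulled
docker images --digests broadinstitute/cromwell

# Later runs can then reference broadinstitute/cromwell@sha256:<digest> instead of :latest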

Execute the Workflow:

docker run -it --rm \
  -v $(pwd)/batch:/batch \
  --entrypoint /bin/bash broadinstitute/cromwell:latest \
  -c "java -Dconfig.file=/batch/config/google.conf -jar /app/cromwell.jar run /batch/wdls/hello.wdl -i /batch/hello.inputs"
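
One caveat: Cromwell inside the container still needs Google credentials to talk to Batch and GCS. A sketch of one way to provide them, assuming application-default credentials already exist on the host (via gcloud auth application-default login) and that the container runs as root (the image’s default):

docker run -it --rm \
  -v $(pwd)/batch:/batch \
  # Mount the host's gcloud config so application-default credentials resolve inside the container
  -v $HOME/.config/gcloud:/root/.config/gcloud:ro \
  --entrypoint /bin/bash broadinstitute/cromwell:latest \
  -c "java -Dconfig.file=/batch/config/google.conf -jar /app/cromwell.jar run /batch/wdls/hello.wdl -i /batch/hello.inputs"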

Until the next Cromwell release, this approach ensures smooth workflow execution on Google Cloud Batch, leveraging the latest developments and community support.
