Cromwell “Hello GCP”

Hatem Nawar
Google Cloud - Community
Oct 3, 2021

Background

Cromwell is one of the most popular workflow engines for bioinformaticians. This tutorial helps researchers and IT teams set up a basic deployment that follows best practices and is production ready. It also discusses how to select where to run your workflows and which location to use for the Life Sciences API.

This tutorial also includes some useful tricks, such as using IAP to access the Cromwell API server securely without a public IP address, and enabling Private Google Access and creating a NAT gateway so that worker VMs can use private IP addresses.

Architecture

This guide follows the same architecture outlined in https://cloud.google.com/architecture/genomic-data-processing-reference-architecture

Key Variables and decisions

Before you start your deployment, you will need to decide on a few things, such as where the Cromwell server will run and which type of SQL instance to use. Below is a list of variables and values that we will use across the deployment and in the config file. Think carefully about where to create these resources, as placement can affect performance.

  1. Project id <project-id>
  2. Life Sciences API location <location>
  3. Zone to create the DB, Cromwell server, etc. <zone>
  4. Zones for worker VMs <zones>
  5. VM used for Cromwell server <cromwell-vm>
  6. MySQL instance <cromwell-db>
  7. MySQL username <db-username> and password <db-password>
  8. GCS Storage Bucket <gcs-bucket>
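To make the later commands easier to copy and adapt, the values above can be captured as shell variables. This is just a sketch; every value is a placeholder or an example you should replace with your own:

```shell
# Placeholder/example values — replace with your own before running anything.
export PROJECT_ID="<project-id>"            # GCP project id
export LOCATION="us-central1"               # Life Sciences API location (example)
export ZONE="us-central1-a"                 # zone for the DB and Cromwell server (example)
export ZONES="us-central1-a us-central1-b"  # zones for worker VMs (example)
export CROMWELL_VM="cromwell-server"        # Cromwell server VM name
export CROMWELL_DB="cromwell-db"            # Cloud SQL instance name
export DB_USER="<db-username>"              # MySQL username
export GCS_BUCKET="<gcs-bucket>"            # Cloud Storage bucket name

# Point gcloud at the project used for the rest of the tutorial.
gcloud config set project "$PROJECT_ID"
```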

Provisioning MySQL

There are many advantages to running Cromwell with MySQL, such as being able to run in server mode and submit multiple jobs, sharing output, viewing timing charts, resuming failed pipelines, etc.

The instance type and disk size depend on the number of parallel pipelines you expect to run. In this tutorial I start with an n1-standard-1 instance type and a 20 GB SSD disk, which I found to be more than enough to run a few parallel pipelines. You can always change this later, but it will require a restart.

  1. Browse to Cloud SQL and click Create Instance
  2. Choose MySQL; you may need to enable the API
  3. Change to Single Zone availability unless you need high availability
  4. Update the Region/Zone to match the Cromwell server location
  5. Click Show Configuration Options
  6. Update the Machine Type
  7. Update Storage and resize if needed
  8. Expand Connections, uncheck Public IP, and select Private IP
  9. Select the network where the server will be running, usually default
  10. You may need to set up a private services access connection if you have not done this before for this VPC: click Enable API, select ‘Use an automatically allocated IP range’, click Continue, then Create Connection
  11. Click Create Instance
  12. Once the database has been created, click on the instance, then Databases, and create a new database “cromwell”
  13. Note the private IP address <db-ipaddress> as it will be used later in the config file
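The console steps above can also be scripted with gcloud. This is a rough sketch, assuming the default network, the placeholder names from the variables list, and that private services access (step 10) is already configured; some of these flags may require the gcloud beta component depending on your SDK version:

```shell
# Create a single-zone MySQL instance with a private IP only.
gcloud sql instances create cromwell-db \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-1 \
  --storage-type=SSD --storage-size=20GB \
  --availability-type=zonal \
  --zone=<zone> \
  --network=default --no-assign-ip

# Create the database Cromwell will use.
gcloud sql databases create cromwell --instance=cromwell-db

# Look up the private IP address <db-ipaddress> for the config file.
gcloud sql instances describe cromwell-db \
  --format='value(ipAddresses[0].ipAddress)'
```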

Creating NAT and configuring Private Google Access and Firewall

Private Google Access

  1. In VPC networks, select the subnet for the <region> where the Cromwell server and worker nodes will be provisioned
  2. Click ‘EDIT’
  3. Change ‘Private Google Access’ to ‘On’
  4. Click Save
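The equivalent gcloud command, a sketch assuming the default subnet in your region:

```shell
# Turn on Private Google Access for the subnet used by Cromwell and the workers.
gcloud compute networks subnets update default \
  --region=<region> \
  --enable-private-ip-google-access
```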

Add Firewall rule to allow access from IAP IP range

This firewall will be used later to allow traffic to Identity Aware Proxy service for secure access.

  1. Click on Firewall
  2. Click ‘Create a Firewall rule’
  3. Add appropriate name e.g. allow-cromwell-iap-access
  4. Make sure the correct network is selected
  5. In Target tags add “cromwell-iap”
  6. In source filters add the CIDR range “35.235.240.0/20”
  7. Select TCP and add 8000
  8. Click Create
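The same rule as a gcloud command, a sketch assuming the default network and the name suggested above:

```shell
# Allow IAP's published source range to reach port 8000 on VMs tagged cromwell-iap.
gcloud compute firewall-rules create allow-cromwell-iap-access \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8000 \
  --source-ranges=35.235.240.0/20 \
  --target-tags=cromwell-iap
```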

Create Cloud NAT

  1. Click on Network Services→Cloud NAT
  2. Click ‘CREATE NAT GATEWAY’
  3. Add name e.g. cromwell-nat
  4. Select network (usually default) and region <region>
  5. Click on Cloud Router, then Create New Router
  6. Add a router name, e.g. cromwell-nat-router
  7. Click Create to create the router, then Create again to create the NAT gateway

Create Service account

  1. Click on IAM & Admin→Service Accounts
  2. Click “Create Service Account”
  3. Enter Service account name e.g “cromwell-sa”
  4. Add appropriate description
  5. Click “Create and Continue”
  6. Add the following roles:
     • Cloud SQL Admin
     • Cloud Life Sciences Workflows Runner
     • Service Usage Consumer
     • Storage Object Admin
  7. Click Done
  8. Click on the service account, then select the “KEYS” tab
  9. Click Add Key → Create new key
  10. Click Create, download the key to a secure location, and rename it to credentials.json
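The service account, role bindings, and key can also be created with gcloud; a sketch using the name suggested above:

```shell
# Create the service account Cromwell will run workflows as.
gcloud iam service-accounts create cromwell-sa \
  --display-name="Cromwell service account"

SA="cromwell-sa@<project-id>.iam.gserviceaccount.com"

# Grant the four roles listed above.
for role in roles/cloudsql.admin \
            roles/lifesciences.workflowsRunner \
            roles/serviceusage.serviceUsageConsumer \
            roles/storage.objectAdmin; do
  gcloud projects add-iam-policy-binding <project-id> \
    --member="serviceAccount:${SA}" --role="$role"
done

# Create and download a key file with the name the tutorial expects.
gcloud iam service-accounts keys create credentials.json --iam-account="$SA"
```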

Create Storage Bucket

Create a regional Standard Cloud Storage bucket; it should be placed in the same region as the worker nodes. You can also consider a dual-region or multi-region bucket if the workers will be in multiple regions.

Enable Life Sciences API

In the search box at the top, search for Life Sciences API, select the API, then click the Enable button.

Provisioning Cromwell Server

I usually start with an e2-standard-4 instance and later adjust the size based on the recommendations in the console. Smaller instances will also work.

  1. Create new VM cromwell-server
  2. Set the zone to <zone>
  3. Select appropriate size based on expected number of workflows, you can start with e2-standard-4
  4. Update Access scope to Allow full access to all Cloud APIs
  5. Expand Networking and add a network tag “cromwell-iap”
  6. Expand the network interface and change External IP address to ‘None’
  7. Click Create
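The console steps above map roughly onto a single gcloud command; a sketch with placeholder zone:

```shell
# Create the Cromwell server VM: no external IP, IAP network tag, and full
# API access scope (equivalent to "Allow full access to all Cloud APIs").
gcloud compute instances create cromwell-server \
  --zone=<zone> \
  --machine-type=e2-standard-4 \
  --no-address \
  --tags=cromwell-iap \
  --scopes=cloud-platform
```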

Deploying and configuring Cromwell

Downloading cromwell, jdk and other tools

  1. SSH to the Cromwell server from the console or using gcloud
  2. Run the following
sudo apt update
sudo apt install -y git wget openjdk-11-jdk

3. Check the latest cromwell release and download the jar file

wget https://github.com/broadinstitute/cromwell/releases/download/69/womtool-69.jar
wget https://github.com/broadinstitute/cromwell/releases/download/69/cromwell-69.jar

4. Rename the downloaded file

mv cromwell-69.jar cromwell.jar

Creating conf file

I created a sample conf file that addresses many issues that may not be well documented.

  1. Download the sample conf file
  2. Edit the following values:
     • project-id
     • service account email
     • storage-bucket
     • location and endpoint-url if not using us-central1
     • zones
     • database IP and password
  3. Upload the conf file and the credentials.json created earlier to the VM; for more information about transferring files to a VM, check out this guide.
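For orientation, the values above slot into a Life Sciences (PAPIv2) backend configuration roughly shaped like this. This is a trimmed sketch, not the full sample file, and the key names and class paths should be checked against the Cromwell documentation for your version:

```hocon
include required(classpath("application"))

google {
  application-name = "cromwell"
  auths = [{
    name = "service-account"
    scheme = "service_account"
    json-file = "credentials.json"
  }]
}

backend {
  default = "PAPIv2"
  providers.PAPIv2 {
    actor-factory = "cromwell.backend.google.pipelines.v2beta.PipelinesApiLifecycleActorFactory"
    config {
      project = "<project-id>"
      root = "gs://<gcs-bucket>/cromwell-execution"
      genomics {
        auth = "service-account"
        endpoint-url = "https://lifesciences.googleapis.com/"
        location = "<location>"
      }
      filesystems.gcs.auth = "service-account"
      default-runtime-attributes.zones = "<zones>"
    }
  }
}

database {
  profile = "slick.jdbc.MySQLProfile$"
  db {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://<db-ipaddress>/cromwell?rewriteBatchedStatements=true"
    user = "<db-username>"
    password = "<db-password>"
  }
}
```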

Testing the configuration and running hello.wdl

This will run Cromwell in standalone mode. This is helpful to detect any issues with the configuration and debug them before running in server mode.

  1. Create a file hello.wdl with the following content
workflow myWorkflow {
  call myTask
}

task myTask {
  command {
    echo "hello GCP"
  }
  runtime {
    docker: "ubuntu:latest"
  }
  output {
    String out = read_string(stdout())
  }
}

2. Run the workflow. You can also check the console and observe as a worker VM is created to run the task. You should see “hello GCP” in the output.

java -Dconfig.file=PAPIv2.conf -jar cromwell.jar run hello.wdl

Creating and starting cromwell service

  1. Using your favourite text editor, create the file /etc/systemd/system/cromwell.service as root, for example:
sudo vi /etc/systemd/system/cromwell.service

2. Copy and paste the example below; adjust the heap memory size (Xmx**G) based on your VM shape

[Unit]
Description=Cromwell Server
After=network.target
[Service]
User=root
Group=root
Restart=always
TimeoutStopSec=10
RestartSec=5
WorkingDirectory=/home/<linux_user>
ExecStart=/usr/bin/java -Xmx10G -Dconfig.file=PAPIv2.conf -jar cromwell.jar server
[Install]
WantedBy=multi-user.target

3. Exit and save

4. Reload linux daemons and start the service

sudo systemctl daemon-reload
sudo systemctl start cromwell.service

5. Make it start automatically with server reboots

sudo systemctl enable cromwell.service

6. You can check the service status by running

sudo systemctl status cromwell.service

Type ‘q’ to quit

Submitting your hello.wdl

Now that we have a server running, we will run hello.wdl again, but this time using submit instead of run. This submits the WDL to the server already running as a service instead of starting a new instance of Cromwell.

java -Dconfig.file=PAPIv2.conf -jar cromwell.jar submit hello.wdl

Using IAP to connect to cromwell server

The following requires that you have the Google Cloud SDK installed on your device. If you are using a Chromebook you can do this in the Linux shell (Crostini); note that ChromeOS automatically forwards port 8080 to the Linux VM.

From a terminal with gcloud configured for the correct project, run

gcloud compute start-iap-tunnel <cromwell-vm> 8000 --local-host-port=localhost:8080 --zone=<zone>

Now open your web browser and browse to localhost:8080; that should open the Swagger page of your Cromwell server. Now let’s also try accessing it from the CLI:

curl -X GET "http://localhost:8080/api/workflows/v1/query" -H "accept: application/json"

This should return a list of workflows that have already run, including the hello.wdl we ran earlier. You can check the other available APIs, such as the workflow timing endpoint, which is very useful when debugging or optimizing a workflow.
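For example, through the same tunnel (the workflow id below is a hypothetical placeholder; take one from the query response):

```shell
# List workflows the server knows about.
curl -s "http://localhost:8080/api/workflows/v1/query" \
  -H "accept: application/json"

# Fetch the timing diagram for a specific workflow by id.
curl -s "http://localhost:8080/api/workflows/v1/<workflow-id>/timing"
```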

Other best practices

Running in Europe

  • Make sure your resources and endpoints are all in the EU (or your desired region), including the Cromwell VM, Cloud SQL instance, bucket, and Life Sciences API location; this helps improve performance and reduce latency.
  • Many of the reference files are hosted in the US; this can increase localization time and may increase cost for files hosted with Requester Pays, as you will pay for egress from the US to the EU.
  • Ensure endpoint-url is configured correctly to match the location

Using Container registry

  • Using Docker Hub or other public registries may increase overhead, especially since some of these services throttle repeated pull requests for the same container. Pushing the container to the nearest regional Container Registry can give your workflow a performance boost.
  • Also note that pulling containers over Cloud NAT can incur NAT charges if they are pulled from an external registry or if Private Google Access is not configured.
  • As with the reference files, you may incur additional costs when pulling containers from remote regions.

Cleaning up Execution Directory

Execution directories can grow large very quickly; make sure you copy the output to a separate location and clean up intermediate files when they are no longer needed.

You can also set the delete_intermediate_output_files flag in the configuration file to true, which will delete any files not referenced in the final output of the workflow. Check out the Cromwell documentation for more information about this and other Google-specific options.

Hatem Nawar is a Google Cloud Customer Engineer focusing on Life Sciences & Genomics.