Long-running job with Cloud Workflows

guillaume blaquiere
Google Cloud - Community
6 min read · Mar 21, 2022

Serverless services are designed to run real-time and interactive workloads: websites, REST APIs,… They are not really designed for long-running jobs that last hours, days, or weeks. There is no dedicated serverless service for those, and VMs are still the best place to run such workloads.

Recently, Mete Atamel released a solution with Cloud Workflows in a Google Cloud blog post. The solution is a great combination of Cloud Workflows, a serverless product, and Compute Engine.

However, it was not totally satisfying for me.

Indeed, Mete’s solution lets the orchestrator (Cloud Workflows) choose how long the workload runs on the VM. His solution:

  • Must know the duration of the workload: if the workload finishes earlier, VM time is wasted for nothing. If it finishes later, it is stopped before the end (and the result can be wrong)
  • Must implement a web server to start and stop the workload
  • Must open a port to the internet to allow Cloud Workflows to interact with the web server

How do you run a long-running job without any idea of the workload duration and without a web server?

And, in addition, without opening VM ports to the wild internet.

The callback solution

Mete’s idea is good: the combination of Cloud Workflows and Compute Engine is the key, but the end of the job has to be controlled by the workload itself. For that, Cloud Workflows proposes a useful feature named callbacks.

This feature is based on the webhook pattern.

  • You generate a URL
  • You provide the URL to the client
  • You wait for the client to call that callback URL

You can add a timeout on the wait and customize the HTTP verb (GET, POST, PUT, DELETE) that you want to listen to. POST is useful to receive data in addition to the callback notification.

Long-running job with callback

So, my prototype on GitHub leverages that callback feature and runs a container on a VM.

Beyond the trivial example that I share, the principle and the workflow logic are what matter here; you can reuse and customize them for more powerful solutions!

Creation of the callback

The callback creation requires no special trick. Just be aware of the HTTP verb that you expect when the callback is called, especially if you want to receive a body at the same time as the callback notification (don’t use GET in that case).

- create_callback:
    call: events.create_callback_endpoint
    args:
        http_callback_method: "GET"
    result: callback_details

Create the Compute Engine instance

The Compute Engine instance creation doesn’t present any challenge either. To understand the required parameters faster, I used the Compute Engine creation UI in the console and its View equivalent code feature,

then adapted the JSON parameters to the YAML definition in the workflow.

However, we don’t want to create just a plain VM. We have to customize the startup to run the workload and, when it is finished, to call the Cloud Workflows callback.

So, we need a startup script!
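For reference, here is a minimal sketch of what the VM creation step can look like with the instances.insert connector. It is not the exact code from my repository: the machine type, the network, and the variable names (instanceName, sourceImage, dockerCommand, startupScript, serviceAccountEmail) are illustrative, but the metadata items carry the pieces described in the next section.

- create_vm:
    call: googleapis.compute.v1.instances.insert
    args:
        project: ${projectId}
        zone: ${zone}
        body:
            name: ${instanceName}
            machineType: ${"zones/" + zone + "/machineTypes/e2-small"}
            disks:
                - boot: true
                  autoDelete: true
                  initializeParams:
                      sourceImage: ${sourceImage} # a COS image, see the get_latest_image note below
            networkInterfaces:
                - network: "global/networks/default"
            serviceAccounts:
                - email: ${serviceAccountEmail} # default or custom service account
                  scopes:
                      - "https://www.googleapis.com/auth/cloud-platform"
            metadata:
                items:
                    - key: "docker-command"
                      value: ${dockerCommand} # the command to run on the VM
                    - key: "callback-url"
                      value: ${callback_details.url} # the callback created above
                    - key: "startup-script"
                      value: ${startupScript} # built later in the workflow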

Startup script focus

The startup script is the core of that solution, and the most challenging part given the Cloud Workflows constraints.

So, here is what we need:

  • To know the command to run. We can hardcode it, but I prefer passing it as a parameter of the VM, in the metadata attributes
  • To know the callback URL to call. Here again, nothing hardcoded: it is also passed as a VM metadata attribute
  • To be allowed to call the Cloud Workflows callback. So, we have to get an access token from the metadata server and add it to the Authorization header of the callback’s curl call.

In terms of security, if you use the Compute Engine default service account, you have to scope your VM correctly. I used the cloud-platform scope to avoid any problem (you can skip the scopes if you use a custom service account). In addition, the service account used (default or custom) must have the Workflows Invoker role granted on it.
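For instance, the role can be granted with a command like this one (PROJECT_ID and SA_EMAIL are placeholders for your project and the VM’s service account):

# Grant the Workflows Invoker role to the VM's service account
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_EMAIL" \
    --role="roles/workflows.invoker"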

So, let’s start. Getting the metadata attributes isn’t a challenge. A call to the metadata server is enough, like this:

# Docker Command
curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/docker-command
# Callback URL
curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/callback-url

Getting the access token is much more challenging. Of course, the metadata server provides the /service-accounts/default/token endpoint, but the answer is in JSON.

curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token

To parse JSON nicely in bash, we can use jq.
However, to run containers efficiently and securely, I chose to use a Container-Optimized OS (COS) image. For security reasons, and to reduce the attack surface, only a few binaries are installed on it, and it’s not possible to install additional packages.

Note: in my code, you can see an additional step named get_latest_image that automatically selects the latest valid COS image, to stay up to date and increase security.

But a solution exists: you can use toolbox. It’s a sandboxed Linux environment in which we can install jq!

toolbox apt-get install -y jq

Then, extract the access token with jq inside toolbox:

toolbox bash -c "curl -s -H \"Metadata-Flavor: Google\" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | jq -r .access_token"
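Finally, the callback can be called with all the pieces. Here is a sketch of what the end of the startup script can look like (the variable names match the illustrative snippets above; since the callback was created with the GET verb, a plain GET call is enough):

# Run the workload (e.g. a full "docker run ..." command stored in the docker-command attribute)
$DOCKER_COMMAND

# Get an access token for the VM's service account, extracted with jq inside toolbox
TOKEN=$(toolbox bash -c "curl -s -H \"Metadata-Flavor: Google\" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token | jq -r .access_token")

# Notify Cloud Workflows that the job is done
curl -s -H "Authorization: Bearer $TOKEN" "$CALLBACK_URL"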

So now, we have all the building blocks. The second challenge is to write the startup script in Cloud Workflows in a clean and readable manner.

Initially, I chose a multiline expression to elegantly build the string. There were several challenges:

  • The colon : isn’t allowed in an expression. You must wrap the whole expression in single quotes '.
  • Because you have to use double quotes " for the strings in the script lines, and single quotes ' around the whole expression, you start to play and to twist your brain with the backslash \ escape character… A nightmare
  • And when I finally succeeded, the service told me that I was limited to 400 characters per expression…

So, I dropped that first option and used an old-school one. As you can see in the assign step of my code, I defined my script line by line. It’s easy to read, and I don’t have to use the backslash \ escape character.

- scriptCreation:
    assign:
        - scriptLine000: '#! /bin/bash'
        - scriptLine001: '$(curl -H "Metadata-Flavor: Google" ...'
        - scriptLine002: 'toolbox apt-get install -y jq'
        - scriptLine003: 'TOKEN=$(toolbox bash -c "curl ...'
        - scriptLine004: 'curl -H "Authorization: Bearer ...'

And finally, the startup script definition is simply a concatenation of the script lines separated by the newline character \n.

- key: "startup-script"
value: ${scriptLine000 + "\n" +
scriptLine001 + "\n" +
scriptLine002 + "\n" +
scriptLine003 + "\n" +
scriptLine004
}

Note: that expression limitation tells us that this is not the optimal solution. Such a long script should be stored elsewhere, on Cloud Storage for instance, and then loaded as-is, or used directly through the startup-script-url metadata key.
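As an illustration, the metadata item could simply point to a script stored in a bucket (the bucket and object names here are hypothetical). Keep in mind that the VM’s service account must be able to read that object.

- key: "startup-script-url"
  value: "gs://my-bucket/long-running-job-startup.sh"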

Wait for the callback

Waiting for the callback is also a simple step. However, during tests, I recommend using a short timeout to automatically cancel the workflow after a few minutes. It saves you from manually canceling the workflow when there are bugs in your tests.

- await_callback:
    call: events.await_callback
    args:
        callback: ${callback_details}
        timeout: 25920000 # 300 days. Max 365
    result: callback_request
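The result contains the details of the HTTP request received on the callback endpoint. For instance, if you switch to a POST callback to get data back from the workload, a step logging the received body could look like this sketch (the step name is mine, not from the repository):

- log_callback:
    call: sys.log
    args:
        text: ${"Callback received with body " + json.encode_to_string(callback_request.http_request.body)}
        severity: "INFO"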

Delete the VM

Deleting the VM is a call to the Compute Engine API through the delete connector. That piece of code is a pure copy/paste from Mete’s code!

- delete_vm:
    call: googleapis.compute.v1.instances.delete
    args:
        instance: ${instanceName}
        project: ${projectId}
        zone: ${zone}

You can find the full code and instructions in my GitHub repository if you want to try it yourself!

Serverless long-running future

Thanks to Cloud Workflows and Compute Engine, you have a fully managed solution, orchestrated by a serverless service.

  • You don’t have to manage the VM
  • You don’t have to patch the VM
  • You always have the latest version of the COS image
  • You pay only while the VM runs: it is created automatically and deleted when the workload ends.

This time, you can run workloads that take hours, days, weeks, or seconds, without knowing their duration ahead of time. That can be perfect for an ML training job, for instance!

Finally, you can run any workload you want, not only a container, if you customize the command to run and the Compute Engine image to use.
