Using Terraform to trigger Dataflow? Here is how to pass any Dataflow runner option to your job using Flex Templates

Israel Herraiz
Jan 25 · 4 min read

Among the many different methods you can use to trigger Dataflow jobs, Terraform is a popular one. Terraform uses the Dataflow API to trigger jobs, which means that you can only launch templates. But thanks to Dataflow Flex Templates, this does not actually limit the kind of job you can launch. If your code is not a template, just wrap it with a launcher container and use Flex Templates.

However, the picture is not entirely rosy. If you compare the options available for deploying a Dataflow pipeline with what the Terraform resources offer (google_dataflow_job for regular templates, google_dataflow_flex_template_job for Flex Templates), you will notice a gap between what the Dataflow runner can accept as options and what you can pass using Terraform.

How can we bridge this gap?

There are several ways to pass options to your Dataflow pipeline; the command line is not the only one. You can always set the runner options programmatically. For that, you can use the class DataflowPipelineOptions (in Java) or GoogleCloudOptions (in Python).

How do you use those classes to set Dataflow options? The problem is that many of those Dataflow options are not exposed through the Terraform resource. But you can set them from your code using DataflowPipelineOptions or GoogleCloudOptions.

In Python, we can use view_as to do a "casting" to GoogleCloudOptions:

In Java, we need to make some more changes.

First, if you are using the default pom.xml file provided by the Apache Beam Quickstart, you will need to make sure that the Dataflow dependency is available at compile time. Make sure to include it in your <dependencies> section:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <version>${beam.version}</version>
</dependency>

Once you have that dependency, you should be able to use DataflowPipelineOptions in your pipeline. You can cast your regular PipelineOptions to DataflowPipelineOptions and set the Dataflow parameters:
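A minimal sketch of that cast might look like the following. The specific options set here (Streaming Engine, the worker count) are illustrative assumptions, not a required set:

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyPipeline {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // "Cast" the generic options to the Dataflow-specific view.
    // Like view_as in Python, as() returns a view over the same object.
    DataflowPipelineOptions dataflowOptions = options.as(DataflowPipelineOptions.class);
    dataflowOptions.setEnableStreamingEngine(true);
    dataflowOptions.setMaxNumWorkers(10);

    Pipeline p = Pipeline.create(options);
    // ... add your transforms here ...
    p.run();
  }
}
```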

With those examples, you can set any of the Dataflow runner options included in this table: https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-dataflow-pipeline-options

Yes, great, but all the code snippets have the Dataflow options hardcoded, and you are wondering how to set those at runtime.

If you are writing a regular template, you can't.

With a regular template, your pipeline options will be ValueProviders, and you would need to "get" them (calling the get method) before the pipeline is created. That will give you an error: get only works once your pipeline is already running.

But worry not, as you can solve this problem with Dataflow Flex Templates.

In a Flex Template you can have "normal" runtime options (not ValueProvider), and you can just write your code as a "normal" job, without having to deal with anything related to templates. Once you have written your code, you need to wrap it with a launcher container.

The Dataflow documentation has an example of how to transform your code to a Flex Template, with all the details about the containers provided by Google: https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates

With a Flex Template, you could expose a custom option, such as --i-want-streaming-engine, get its value, and then pass it to the Dataflow options object.

Then, in Terraform, you would specify that option in your parameters section:
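For example, with the google_dataflow_flex_template_job resource it could look like the sketch below. The job name and container spec path are placeholders, and the option key is written with underscores, on the assumption that this is how the pipeline code declares it:

```hcl
resource "google_dataflow_flex_template_job" "my_job" {
  provider                = google-beta
  name                    = "my-flex-template-job"                # placeholder
  container_spec_gcs_path = "gs://my-bucket/templates/spec.json"  # placeholder

  parameters = {
    # Custom pipeline option; the pipeline code relays it to the
    # corresponding Dataflow runner option
    i_want_streaming_engine = "true"
  }
}
```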

You could update your options to read your custom option, and then relay its value to the corresponding Dataflow option.

For instance, in Python, we could do something like the following:

And the equivalent in Java would be:
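A sketch of the Java equivalent, under the same assumptions (a boolean custom option declared on a custom options interface):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyPipeline {

  /** Custom options, exposed as regular Flex Template parameters. */
  public interface MyOptions extends PipelineOptions {
    @Description("Set to true to enable the Streaming Engine")
    @Default.Boolean(false)
    Boolean getIWantStreamingEngine();

    void setIWantStreamingEngine(Boolean value);
  }

  public static void main(String[] args) {
    MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);

    // Relay the custom option to the corresponding Dataflow runner option
    options.as(DataflowPipelineOptions.class)
        .setEnableStreamingEngine(options.getIWantStreamingEngine());

    Pipeline p = Pipeline.create(options);
    // ... add your transforms here ...
    p.run();
  }
}
```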

Dataflow changes rapidly, with new features and options appearing frequently. Not all of these options are immediately available in the Terraform resources and modules used with your Dataflow templates. However, you can always set those options programmatically from your code.

If you are using regular Dataflow templates, these runner options have to be hardcoded in your source code. That's because the pipeline options are only available through value providers, whose values cannot be read when the pipeline is created, only after it has been launched.

But this is not a problem if you use Dataflow Flex Templates. Just add a normal custom option to your pipeline, and write some code to relay its value to the Dataflow options object. Then set your custom options in the parameters section of your google_dataflow_flex_template_job, and you will be able to use any Dataflow option, whether or not it is exposed through Terraform.

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Israel Herraiz, Strategic Cloud Engineer @ Google