Dataflow Flex Templates

Neil Kolban
Google Cloud - Community
Nov 6, 2022

With Dataflow Flex Templates, we can define a Dataflow pipeline that can be launched from the Cloud Console, from gcloud or through a REST API call. The REST API in turn opens up the possibility of starting a job from Cloud Scheduler or many other mechanisms.

When I sat down to use Dataflow Flex Templates, I read the documentation and there appeared to be quite a few parts involved in getting them going. As I studied further, the model and the requirements eventually fell into place. This article is an attempt to simplify and illustrate what one needs to do to get Flex Templates going.

Let us first review what we would need to do to start a Dataflow job without a Flex Template.

A developer writes a Beam pipeline in Java (it could be another language, but for this discussion we'll focus exclusively on Java). The developer then runs the pipeline. They could run it on their laptop for test/dev, or in a Compute Engine instance or some other GCP environment for production. When the pipeline app is launched, it builds an Apache Beam execution graph that is then sent to Dataflow, and Dataflow executes the graph using its Dataflow Runner engine.

For development and test this workflow is fine, but for production there are some issues. First, in order to execute the Dataflow job, the execution graph has to be sent to Dataflow, and that is done by running the pipeline app in an environment that includes a JVM and other libraries. This means that the pipeline (as written by the developer) may have one set of dependencies, but when others come to run the pipeline, those dependencies must be exactly replicated: the version of the JVM, the versions of the Beam SDKs and more. That opens up the opportunity for a mismatch. For a CI/CD pipeline this is a real problem: if the developer checks their pipeline code into a source code repository, there is no assurance that the test and prod teams will have the same compile and execution environments.

Next, consider users or operations staff. Imagine you are told to run a Dataflow job. You now have to assemble all the prerequisites needed to launch it yourself. If you are a user, you may ask “Why can’t I just pick the job from a list and start it?”.

These are just a couple of the puzzles that Flex Templates solves.

Now let us see the model of Flex Templates. I am going to use a progressive disclosure technique for the architecture to aid us in our understanding. Let’s start at the final results and we’ll work backwards.

Flex Templates create two artifacts for us that allow us to launch a Dataflow job: a Docker image and a JSON file stored in Google Cloud Storage (GCS).

These are shown as Docker Image and Template File in the following diagram:

Let us now break these down so we can clearly see what is happening. We will start with the core notion that a user wants to run a Dataflow Job. As such, the user will send a request to Google Cloud API Services to say “Start my Dataflow Job”. The way the user actually does this is by naming a JSON file, which we call the Template File, that exists in a GCS bucket. We will cover how this file is created later. For now, assume that the Template File points to a Docker image that lives in the Artifact Registry (again, where that Docker Image comes from will be discussed later).
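To make this concrete, here is a minimal sketch of what such a Template File might contain. The bucket, project, repository and image names are placeholders, and the real file can carry additional fields (for example parameter metadata):

gsutil cat gs://my-bucket/templates/my-template.json
{
  "image": "us-central1-docker.pkg.dev/my-project/myrepo/my-pipeline:latest",
  "sdkInfo": {
    "language": "JAVA"
  }
}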

The user has just declared “Run my Dataflow Job that is described in this file on GCS”. Google Cloud looks inside the file and finds the identity of a Docker Image. Google Cloud then spins up a Compute Engine instance, which it calls the launcher, and runs that Docker Image on the instance. The Docker Image contains the compiled code of the Apache Beam pipeline that the developer originally wrote. When the launcher instance starts running, the Docker Image runs, builds the Dataflow execution graph and sends it to Dataflow for execution of the Beam pipeline.

Pause here and contemplate this story. If the developer has packaged their Beam pipeline as a Docker Image, then wherever it is used (test, prod) it will always be exactly the same code, packaged with the correct dependencies. This resolves some of the original puzzles. In addition, to use the pipeline the user need only know the identity of a file in a GCS bucket; when they launch the job, that is all they need to specify.

The user can launch a Dataflow job using:

  • the gcloud dataflow flex-template run command
  • the projects.locations.flexTemplates.launch REST API method
  • the Google Cloud Console

All three take the GCS template file as input.
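As a sketch, a gcloud launch might look like the following; the job name, bucket and region are placeholders:

gcloud dataflow flex-template run "my-flex-job" \
  --template-file-gcs-location "gs://my-bucket/templates/my-template.json" \
  --region "us-central1"

The equivalent REST call posts the same template file location to the flexTemplates.launch method (again, the project, region and paths are placeholders):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"launchParameter": {"jobName": "my-flex-job", "containerSpecGcsPath": "gs://my-bucket/templates/my-template.json"}}' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/flexTemplates:launch"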

Now we get to turn our attention to the missing pieces of our story. We haven’t yet described what is contained in the template file or how it is created nor have we described how to properly build the Docker Image.

While the Docker Image can be built by hand, we won’t be describing that here. Instead we will describe the simplest way. Google provides a command called:

gcloud dataflow flex-template build

This command takes as input:

  • The GCS path to the template file that will be created
  • The base Docker Image on top of which the new Docker Image will be built
  • The artifact repository that will hold the built Docker Image
  • The pre-compiled code and dependencies for the Beam pipeline

When the command is run, it will create both the Docker Image and the corresponding template file.
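As a sketch, a build invocation might look like the following; the bucket, project, repository, jar and main class names are placeholders for your own values:

gcloud dataflow flex-template build gs://my-bucket/templates/my-template.json \
  --image-gcr-path "us-central1-docker.pkg.dev/my-project/myrepo/my-pipeline:latest" \
  --sdk-language "JAVA" \
  --flex-template-base-image JAVA11 \
  --jar "target/my-pipeline-bundled-1.0.jar" \
  --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.MyPipeline"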

The overall diagram looks as follows:

By now you are hopefully getting a feel for the architecture model, which we have split into two segments.

The developer wants to create a Docker Image and a Template File that will eventually be used to launch Dataflow jobs. The users need only know the name of the Template File, which points to the Docker Image that will be used to launch the eventual Dataflow job.

We will now turn our attention to the JAR that the developer builds and that will be packaged into the Docker Image. This one is tricky: the JAR must contain not only the compiled Beam code but also all of its prerequisites, at exactly the versions used during the build. Google supplies a sample GitHub project that contains a Maven pom.xml file for this purpose. Sadly, I am not any kind of a Maven expert so I can’t explain how it works … but from a usage standpoint, we execute:

mvn package

and a file is created in the target folder that is called a “fat” or “uber” jar. It is large and contains everything needed for execution.

There is much more that we could say about Flex Templates including:

  • How to pass parameters to the pipeline
  • How to specify default values for the runtime parameters
  • How to work with languages other than Java

… but we have enough now to get us going. What follows is a walk-through of the recipe to get Flex Templates operational.

1. Create a project

We create a GCP project for our tests. You can re-use an existing one if you desire.

2. Enable services

In our test, since we created a new project, none of the services are pre-enabled, so we enable the ones we need (a gcloud sketch follows the list):

  • Compute Engine
  • Dataflow
  • Artifact Registry
  • Cloud Build
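If you prefer the command line, the equivalent gcloud invocation is roughly:

gcloud services enable \
  compute.googleapis.com \
  dataflow.googleapis.com \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com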

3. Create a GCS bucket

We need to create a GCS bucket that will hold our Flex Template GCS file.

gs://kolban-dataflow6-tmp
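A bucket like the one above can be created with the following command; the region is an assumption, pick whichever region you intend to run Dataflow in:

gsutil mb -l us-central1 gs://kolban-dataflow6-tmp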

4. Create a VPC network

I recommend not using the default VPC network (even if you have one); instead I created a new VPC network called myvpc.
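A sketch of the command (auto subnet mode is an assumption; a custom-mode network with your own subnets works just as well):

gcloud compute networks create myvpc --subnet-mode=auto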

5. Create a service account for the worker

When Dataflow runs, it creates Compute Engine instances that run as workers. These must run as a service account, so here we create a new service account for them. I called mine “worker”.
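For example:

gcloud iam service-accounts create worker --display-name "Dataflow worker"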

6. Grant the worker Dataflow Worker

To be able to perform the role of a Dataflow worker, the newly created service account must be granted the Dataflow Worker role.
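A sketch of the grant, with PROJECT_ID standing in for your project:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member "serviceAccount:worker@PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/dataflow.worker"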

7. Create a docker repo

A Docker Image will be created and we need a repo in which to store it. We called this repo “myrepo”.
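A sketch of the command; the location is an assumption:

gcloud artifacts repositories create myrepo \
  --repository-format docker \
  --location us-central1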

8. Grant the worker Artifact Registry Reader on the repo

The worker service account is the service account that the launcher runs as. It must have permission to read from the Artifact Registry repo, so we grant it the reader role.
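A sketch of the grant; PROJECT_ID and the location are placeholders:

gcloud artifacts repositories add-iam-policy-binding myrepo \
  --location us-central1 \
  --member "serviceAccount:worker@PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/artifactregistry.reader"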

9. Clone our sample project

We clone our sample Github project:

git clone https://github.com/kolban-google/flex-templates

10. Change into the cloned github project

cd flex-templates

11. Edit the Makefile and change the following variables (a sketch follows the list)

  • PROJECT_ID
  • BUCKET_NAME
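In the Makefile these are plain variable assignments near the top of the file; a sketch, with placeholder values:

PROJECT_ID=my-project-id
BUCKET_NAME=kolban-dataflow6-tmp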

12. Run the code builder

I am assuming that you have Java 11 and Maven installed in your environment.

Run make build-code

This will compile the code and build the fat jar.

13. Run the Flex Template builder

Run make build-flex

This will run the gcloud command that will build the Flex Template. It will consume the fat jar and build both the Docker Image and the Flex Template file in the GCS bucket.

14. Run the Flex Template

Run make run-flex

This will submit a job to Dataflow using the Flex Template.
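Under the covers this issues a gcloud launch similar to the one shown earlier. If you were launching by hand, a sketch that also wires in the worker service account and the VPC created above might look like this (the template path, region and PROJECT_ID are placeholders):

gcloud dataflow flex-template run "my-flex-job" \
  --template-file-gcs-location "gs://kolban-dataflow6-tmp/templates/my-template.json" \
  --region "us-central1" \
  --service-account-email "worker@PROJECT_ID.iam.gserviceaccount.com" \
  --network "myvpc"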

And finally … a Video illustrating some of the concepts of this article:

