How to build and execute a Dataflow Flex Template (Java)

Abhishek Pratap Singh
Google Cloud - Community
Jun 17, 2024

Dataflow templates allow you to package your Dataflow pipeline for deployment. Dataflow supports two kinds of templates — classic and flex. For building your own template, it’s recommended to use Flex templates.

This tutorial provides a guide and code snippets for building and executing a basic Flex template in Java. We will be building and executing the sample getting-started pipeline from the open-source java-docs-samples repo.


Pre-requisites

Pre-requisite 1: Set gcloud auth credentials

export PROJECT_ID=<GCP_PROJECT_ID>
gcloud config set project ${PROJECT_ID}
gcloud auth login
gcloud auth application-default login
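
You can confirm that the expected project and account are active before moving on:

gcloud config get-value project
gcloud auth list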

Pre-requisite 2: Create an Artifact Registry repo for storing the template image

export REPO_NAME=<ARTIFACT_REGISTRY_REPO_NAME>
export LOCATION=<ARTIFACT_REGISTRY_REGION>
gcloud artifacts repositories create ${REPO_NAME} \
--repository-format=docker \
--location=${LOCATION}
gcloud auth configure-docker ${LOCATION}-docker.pkg.dev

This repo will contain the Docker image of the template.
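
To double-check that the repository exists, you can describe it:

gcloud artifacts repositories describe ${REPO_NAME} --location=${LOCATION}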

Pre-requisite 3: Create a Cloud Storage bucket for storing the template specification file

export BUCKET_NAME=<GCS_BUCKET_NAME>
gcloud storage buckets create gs://${BUCKET_NAME}

This bucket will hold the specification file, which stores the template path (the path of the template’s Docker image).
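
For reference, the specification file that gets generated later (in Step 4) is a small JSON document. Its exact fields depend on your gcloud version, but it is roughly of this shape (illustrative values only):

{
  "image": "LOCATION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:latest",
  "sdkInfo": {
    "language": "JAVA"
  },
  "metadata": {
    "name": "Template name",
    "parameters": []
  }
}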

Pre-requisite 4: Make sure Maven is installed and added to PATH

brew install maven
mvn --version
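
If you are not on macOS, Maven is also available through most Linux package managers, for example on Debian/Ubuntu:

sudo apt-get install maven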

Building and Executing a Dataflow Flex template

As mentioned earlier, we will be building and executing the sample getting-started pipeline. This pipeline simply writes “1, 2, 3, 4” to a Cloud Storage bucket.
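
For context, this is roughly what the sample pipeline does. The snippet below is an abridged sketch rather than the exact source (see the java-docs-samples repo for the real code), but the class name and the output parameter match the sample:

package com.example.dataflow;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class FlexTemplateGettingStarted {

  // Runtime parameter declared in metadata.json and passed with --parameters output=...
  public interface Options extends PipelineOptions {
    @Validation.Required
    String getOutput();

    void setOutput(String value);
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Create a bounded collection containing 1, 2, 3, 4.
        .apply(Create.of(1, 2, 3, 4))
        // Convert each element to its string form.
        .apply(MapElements.into(TypeDescriptors.strings()).via((Integer x) -> Integer.toString(x)))
        // Write the results under the path given by --output.
        .apply(TextIO.write().to(options.getOutput()));

    pipeline.run();
  }
}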

Step 1: Create an output Cloud Storage bucket

export OUTPUT_BUCKET=<OUTPUT_GCS_BUCKET_NAME>
gcloud storage buckets create gs://${OUTPUT_BUCKET}

Step 2: Clone the sample repository and build the pipeline JAR

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
cd java-docs-samples/dataflow/flex-templates/getting_started
mvn clean package

Running mvn clean package will create a folder named target containing an uber JAR (named flex-template-getting-started-1.0.jar).
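
You can quickly confirm that the JAR is in place:

ls target/flex-template-getting-started-1.0.jar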

Step 3: Choose a name for the specification file and export it as a variable

export SPECIFICATION_FILE=<SPECIFICATION_FILE_NAME>

Step 4: Build the template

gcloud dataflow flex-template build gs://${BUCKET_NAME}/${SPECIFICATION_FILE}.json \
--image-gcr-path ${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${SPECIFICATION_FILE}:latest \
--sdk-language "JAVA" \
--flex-template-base-image JAVA11 \
--metadata-file "metadata.json" \
--jar "target/flex-template-getting-started-1.0.jar" \
--env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.dataflow.FlexTemplateGettingStarted"

The above command will perform the following activities:

  • Using the pipeline JAR and the main class (specified with the --jar and --env FLEX_TEMPLATE_JAVA_MAIN_CLASS flags), it will create a Docker image of our pipeline and upload it to the Artifact Registry repo (the repo location is specified with the --image-gcr-path flag)
  • Create a specification file in the Cloud Storage bucket (that we created in Pre-requisite 3). This file will contain the path of our template’s Docker image and related metadata; you can inspect it with the command shown after this list.
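
To inspect the specification file that the build just wrote, print it from the bucket:

gcloud storage cat gs://${BUCKET_NAME}/${SPECIFICATION_FILE}.json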

The metadata.json file contains the list of all the parameters that the pipeline needs. These parameters will be passed at run time (using the --parameters flag) when we execute the pipeline.

In our case, the metadata.json file contains only one parameter named output.
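
For illustration, a metadata.json declaring a single output parameter looks roughly like this (the labels and help text in the sample may be worded differently):

{
  "name": "Getting started",
  "description": "Flex template getting started example.",
  "parameters": [
    {
      "name": "output",
      "label": "Output destination",
      "helpText": "Path and filename prefix for the output files.",
      "isOptional": false,
      "regexes": [
        "^gs:\\/\\/[^\\n\\r]+$"
      ]
    }
  ]
}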

Step 5: Execute the pipeline

gcloud dataflow flex-template run "dataflow-sample-flex-pipeline-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location gs://${BUCKET_NAME}/${SPECIFICATION_FILE}.json \
--parameters output=gs://${OUTPUT_BUCKET}/output- \
--region ${LOCATION}

The above command will execute the pipeline and start a Dataflow job (with a name starting with dataflow-sample-flex-pipeline).

Monitor the job from the Dataflow console and from the OUTPUT_BUCKET. Once the job succeeds, output files will appear in that Cloud Storage bucket containing the values 1, 2, 3, and 4 written by the pipeline.
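
You can also check the job and its output from the command line:

gcloud dataflow jobs list --region ${LOCATION}
gcloud storage cat gs://${OUTPUT_BUCKET}/output-*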

