How to build and execute a Dataflow Flex Template (Java)
Dataflow templates let you package a Dataflow pipeline for deployment. Dataflow supports two kinds of templates: classic and Flex. For building your own templates, Google recommends Flex Templates.
This tutorial provides a guide and code snippets for building and executing a basic Flex Template in Java. We will build and execute the sample getting-started pipeline from the open-source java-docs-samples repository.
Prerequisites
Prerequisite 1: Set gcloud auth credentials
export PROJECT_ID=<GCP_PROJECT_ID>
gcloud config set project ${PROJECT_ID}
gcloud auth login
gcloud auth application-default login
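Every later command interpolates ${PROJECT_ID}, so it is worth checking early that it was actually exported. This is a generic shell idiom, not a gcloud feature:

```shell
# Warn early if PROJECT_ID was not exported, since every later
# command interpolates it into resource names.
if [ -n "${PROJECT_ID:-}" ]; then
  status="Active project: ${PROJECT_ID}"
else
  status="PROJECT_ID is not set; run: export PROJECT_ID=<GCP_PROJECT_ID>"
fi
echo "${status}"
```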
Prerequisite 2: Create an Artifact Registry repository for storing the template image
export REPO_NAME=<ARTIFACT_REGISTRY_REPO_NAME>
export LOCATION=<ARTIFACT_REGISTRY_REGION>
gcloud artifacts repositories create ${REPO_NAME} \
--repository-format=docker \
--location=${LOCATION}
gcloud auth configure-docker ${LOCATION}-docker.pkg.dev
This repository will contain the Docker image of the template.
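The image path passed later to the --image-gcr-path flag follows a fixed naming convention. A quick sketch with placeholder values (lowercase variables are used here so the exported tutorial variables stay untouched):

```shell
# Illustrative placeholder values only; substitute your own.
# Shows the <region>-docker.pkg.dev/<project>/<repo>/<image>:<tag> convention.
location="us-central1"
project_id="my-project"
repo_name="my-repo"
image_path="${location}-docker.pkg.dev/${project_id}/${repo_name}/my-template:latest"
echo "${image_path}"
```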
Prerequisite 3: Create a Cloud Storage bucket for storing the template specification
export BUCKET_NAME=<GCS_BUCKET_NAME>
gcloud storage buckets create gs://${BUCKET_NAME}
This bucket will contain the specification file, which records the template path (the path of the template's Docker image).
Prerequisite 4: Make sure Maven is installed and added to PATH
brew install maven
mvn --version
Building and Executing a Dataflow Flex template
As mentioned earlier, we will be building and executing the sample getting-started pipeline. This pipeline simply writes “1, 2, 3, 4” to a Cloud Storage bucket.
Step 1: Create an output Cloud Storage bucket
export OUTPUT_BUCKET=<OUTPUT_GCS_BUCKET_NAME>
gcloud storage buckets create gs://${OUTPUT_BUCKET}
Step 2: Clone the sample repository and build the pipeline JAR
git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
cd java-docs-samples/dataflow/flex-templates/getting_started
mvn clean package
Running mvn clean package creates a folder named target containing an uber JAR named flex-template-getting-started-1.0.jar.
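Before building the template, it is worth confirming the uber JAR was actually produced (the path below follows the sample's pom.xml; this check is a generic shell sketch, not part of the official tutorial):

```shell
# Confirm the uber JAR from `mvn clean package` exists before the
# template build step, which references it via the --jar flag.
jar_path="target/flex-template-getting-started-1.0.jar"
if [ -f "${jar_path}" ]; then
  echo "Found ${jar_path}"
else
  echo "Missing ${jar_path}; re-run mvn clean package"
fi
```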
Step 3: Choose a name for the specification file and export it as a variable
export SPECIFICATION_FILE=<SPECIFICATION_FILE_NAME>
Step 4: Build the template
gcloud dataflow flex-template build gs://${BUCKET_NAME}/${SPECIFICATION_FILE}.json \
--image-gcr-path ${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${SPECIFICATION_FILE}:latest \
--sdk-language "JAVA" \
--flex-template-base-image JAVA11 \
--metadata-file "metadata.json" \
--jar "target/flex-template-getting-started-1.0.jar" \
--env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.dataflow.FlexTemplateGettingStarted"
The above command performs the following activities:
- Using the pipeline JAR and the main class (specified with the --jar and --env FLEX_TEMPLATE_JAVA_MAIN_CLASS flags), it builds a Docker image of our pipeline and pushes it to the Artifact Registry repository (specified with the --image-gcr-path flag).
- It creates a specification file in the Cloud Storage bucket (created in Prerequisite 3). This file contains the path of our template's Docker image and related metadata.
The metadata.json file lists all the parameters the pipeline needs. These parameters are passed at run time (using the --parameters flag) when we execute the pipeline. In our case, metadata.json contains a single parameter named output.
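For orientation, an illustrative metadata.json for a pipeline with a single output parameter might look like the following. The field values here are assumptions for illustration, not the sample repository's exact file; the overall shape (name, description, a parameters array with name/label/helpText) follows the Flex Template metadata format:

```json
{
  "name": "Getting started",
  "description": "Flex Template pipeline that writes sample data to Cloud Storage.",
  "parameters": [
    {
      "name": "output",
      "label": "Output file prefix",
      "helpText": "Cloud Storage path prefix for the output files.",
      "regexes": ["^gs:\\/\\/.+$"]
    }
  ]
}
```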
Step 5: Execute the pipeline
gcloud dataflow flex-template run "dataflow-sample-flex-pipeline-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location gs://${BUCKET_NAME}/${SPECIFICATION_FILE}.json \
--parameters output=gs://${OUTPUT_BUCKET}/output- \
--region ${LOCATION}
The above command executes the pipeline, starting a Dataflow job whose name begins with dataflow-sample-flex-pipeline.
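The backtick expression in the run command appends a timestamp so each submission gets a unique job name. The same thing in the modern $( ) command-substitution syntax:

```shell
# Build a unique job name by appending the current date and time,
# e.g. dataflow-sample-flex-pipeline-20240101-120000
job_name="dataflow-sample-flex-pipeline-$(date +%Y%m%d-%H%M%S)"
echo "${job_name}"
```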
Monitor the job from the Dataflow console and check the OUTPUT_BUCKET. Once the job succeeds, a file is created in that Cloud Storage bucket containing the pipeline's output: the numbers 1 through 4.
References:
- Dataflow templates
- Build and run a Flex Template
- Sample getting-started pipeline
- gcloud dataflow flex-template build command documentation
- gcloud dataflow flex-template run command documentation