CI CD for Dataflow Java with Flex Templates and Cloud Build

Mazlum Tosun
Google Cloud - Community
10 min read · Mar 13, 2023

1. Explanation of the use case presented in this article

This article presents a complete use case with a CI CD pipeline that will :

  • Build the Apache Beam Java job
  • Launch the unit tests
  • Deploy the Apache Beam / Dataflow job
  • Give the possibility to launch the job

All these steps will be orchestrated with Cloud Build.

The deployment of the Beam and Dataflow job is based on Dataflow Flex Templates, a way to standardize the deployment of Dataflow pipelines based on a Docker image.

There are two approaches to deploy a Flex Template :

  • Use a Dockerfile with all the dependencies installed in the container, then create the spec file in a Cloud Storage bucket. In this case the flex-template build command only creates the spec file in GCS.
  • Do not use a Dockerfile and let the flex-template build command generate the Docker image and create the spec file in GCS.

In the official Google Cloud documentation, we didn’t see an example with a Dockerfile for the Java SDK. The Dockerfile method is interesting because it gives more flexibility to create the environment with all the needed dependencies, which is why we wanted to show both methods.

I also created a video on this topic on my GCP Youtube channel. Feel free to subscribe to the channel to support my work for the Google Cloud community :

English version

French version

2. Structure of the project

  • The src folder contains the Beam Java job
  • The config folder contains the Flex Template config and metadata
  • The scripts folder contains the Shell scripts to deploy and run the Dataflow job and Template
  • dataflow-run-tests.yaml Cloud Build file to launch unit tests
  • dataflow-deploy-template.yaml Cloud Build file to generate and build the Flex Template Docker image and create the spec file in Cloud Storage, without a Dockerfile
  • dataflow-deploy-template-dockerfile-all-dependencies.yaml Cloud Build file to build the Flex Template Docker image and create the spec file in Cloud Storage. This example uses a Dockerfile with all the needed dependencies installed in the container
  • dataflow-run-job.yaml Cloud Build file to run the template and Dataflow job
  • Dockerfile used by the Cloud Build file with a Dockerfile and all the dependencies installed in the container. The image is published in Artifact Registry
  • pom.xml the Maven pom file to build the Beam project, run unit tests and generate the fat jar

3. Set environment variables

Before creating the Cloud Build Triggers, set the following environment variables in your Shell session :

export PROJECT_ID={{project_id}}
export LOCATION={{location}}

export REPO_NAME=internal-images
export IMAGE_NAME="dataflow/team-league-java"
export IMAGE_TAG=latest
export METADATA_FILE="config/metadata.json"
export METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json"
export SDK_LANGUAGE=JAVA
export FLEX_TEMPLATE_BASE_IMAGE=JAVA11
export JAR=target/teams-league-0.1.0.jar
export FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp"
export JOB_NAME="team-league-java"

export TEMP_LOCATION=gs://mazlum_dev/dataflow/temp
export STAGING_LOCATION="gs://mazlum_dev/dataflow/staging"
export SA_EMAIL={{your_sa_email}}
export INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_teams_stats_raw.json"
export SIDE_INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_team_slogans.json"
export TEAM_LEAGUE_DATASET=mazlum_test
export TEAM_STATS_TABLE=team_stat
export JOB_TYPE=team_league_java_ingestion_job
export FAILURE_OUTPUT_DATASET=mazlum_test
export FAILURE_OUTPUT_TABLE=job_failure
export FAILURE_FEATURE_NAME=team_league
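
Since the scripts later in this article fail when a variable is missing, it can help to check the session before creating the Triggers. The following is a minimal sketch, not part of the repository : require_vars is a hypothetical helper name, and it relies on bash indirect expansion.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the original repository): returns 0 when every
# variable name passed as an argument is set and non-empty, 1 otherwise.
require_vars() {
  local missing=0
  local name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or an empty string when it is unset.
    if [ -z "${!name:-}" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, require_vars PROJECT_ID LOCATION REPO_NAME IMAGE_NAME IMAGE_TAG || exit 1 stops the session early and lists every variable that still needs to be exported.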

4. Flex Template with a Dockerfile and the dependencies installed in the container

4.1 The Dockerfile

The Flex Template is based on a Docker image that starts the Dataflow job.

In this example, we use a Dockerfile and install all the dependencies in the container.

Note that the Flex Template container built using the Dockerfile is used only to create a job graph and start the Dataflow job. The packages installed in Flex Template containers are not available in the Beam container.

FROM maven:3.8.6-openjdk-11-slim AS build
ADD . /app
WORKDIR /app
USER root
RUN mvn clean package -Dmaven.test.skip=true

FROM gcr.io/dataflow-templates-base/java11-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY --from=build /app/target/teams-league-0.1.0.jar ${WORKDIR}/target/

ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/target/teams-league-0.1.0.jar"

ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]

The first stage builds the needed dependencies and creates the fat jar.

The second stage copies the fat jar from the first stage to the working directory of the second stage.

We need to specify :

  • The Java Main class, with the FLEX_TEMPLATE_JAVA_MAIN_CLASS env variable
  • The Java classpath pointing to the fat jar, with FLEX_TEMPLATE_JAVA_CLASSPATH

4.2 Build the Dockerfile with all the dependencies and create the Flex Template spec file with Cloud Build

Here is the use case diagram of this method :

In this case, there is a Dockerfile with all the dependencies installed in the container.

The dataflow-deploy-template-dockerfile-all-dependencies.yaml Cloud Build file builds the image from the Dockerfile, publishes it to Artifact Registry and creates the spec file in GCS :

steps:
  - name: google/cloud-sdk:420.0.0-slim
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        ./scripts/build_image_with_dockerfile.sh \
        && ./scripts/create_flex_template_spec_file_gcs.sh
    env:
      - 'PROJECT_ID=$PROJECT_ID'
      - 'LOCATION=$LOCATION'
      - 'REPO_NAME=$_REPO_NAME'
      - 'IMAGE_NAME=$_IMAGE_NAME'
      - 'IMAGE_TAG=$_IMAGE_TAG'
      - 'METADATA_TEMPLATE_FILE_PATH=$_METADATA_TEMPLATE_FILE_PATH'
      - 'SDK_LANGUAGE=$_SDK_LANGUAGE'
      - 'METADATA_FILE=$_METADATA_FILE'

The step runs in the gcloud SDK Docker image and executes two bash scripts :

build_image_with_dockerfile.sh :

#!/usr/bin/env bash

set -e
set -o pipefail
set -u

echo "#######Building Dataflow Flex Template Docker image with all the dependencies installed inside"

gcloud builds submit --tag "$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$IMAGE_TAG" .

This script builds the Docker image and publishes it to Artifact Registry with the gcloud builds submit command.
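
The same image URI is used by both scripts, so it can be convenient to compose it in one place. The sketch below only illustrates how the URI is assembled from the variables of this article ; artifact_registry_uri is a hypothetical function name, not something provided by gcloud or by the repository.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the Artifact Registry URI of a Docker image.
# Arguments: location, project id, repository name, image name, image tag.
artifact_registry_uri() {
  local location="$1" project_id="$2" repo="$3" image="$4" tag="$5"
  echo "${location}-docker.pkg.dev/${project_id}/${repo}/${image}:${tag}"
}
```

Called as artifact_registry_uri "$LOCATION" "$PROJECT_ID" "$REPO_NAME" "$IMAGE_NAME" "$IMAGE_TAG", it prints the value passed to --tag above and to --image in the next script.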

create_flex_template_spec_file_gcs.sh :

#!/usr/bin/env bash

set -e
set -o pipefail
set -u

echo "#######Creating image and spec file with flex-template-build"

gcloud dataflow flex-template build "$METADATA_TEMPLATE_FILE_PATH" \
--image "$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$IMAGE_TAG" \
--sdk-language "$SDK_LANGUAGE" \
--metadata-file "$METADATA_FILE"

⚠️ This script only creates the Flex Template spec file in the Cloud Storage bucket.

The path is given by the METADATA_TEMPLATE_FILE_PATH env variable, for example :

METADATA_TEMPLATE_FILE_PATH=gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json

This Cloud Build job is launched from the main branch on the given Github repository, via a manual Trigger :

gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="deploy-dataflow-template-team-league-java-dockerfile" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-deploy-template-dockerfile-all-dependencies.yaml" \
--substitutions _REPO_NAME="internal-images",_IMAGE_NAME="dataflow/team-league-java",_IMAGE_TAG="latest",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_SDK_LANGUAGE="JAVA",_METADATA_FILE="config/metadata.json" \
--verbosity="debug"

  • External variables are passed to Cloud Build as substitutions
  • In the YAML file, env variables are then created with these substitutions as values
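
The long --substitutions value can also be generated from the environment variables exported in section 3. This is a minimal sketch under that assumption ; build_substitutions is a hypothetical helper, and it does not handle values containing commas, which would need the escaping rules described in gcloud topic escaping.

```shell
#!/usr/bin/env bash

# Hypothetical helper: for each variable name given as an argument, emits
# _NAME=value and joins the pairs with commas, the format expected by
# the --substitutions flag. Values containing commas are not supported.
build_substitutions() {
  local result=""
  local name
  for name in "$@"; do
    # ${result:+,} adds a comma only when result is already non-empty.
    result+="${result:+,}_${name}=${!name}"
  done
  echo "$result"
}

# Possible usage (assumes the variables are exported as in section 3):
#   --substitutions "$(build_substitutions REPO_NAME IMAGE_NAME IMAGE_TAG)"
```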

5. Flex Template without a Dockerfile and generation of the Docker image with flex-template command

Here is the use case diagram of this method :

In this case, there is no Dockerfile : the Docker image and the spec file are generated by the flex-template build command.

The dataflow-deploy-template.yaml Cloud Build file is used :

steps:
  - name: maven:3.8.6-openjdk-11-slim
    script: |
      mvn clean package
  - name: google/cloud-sdk:420.0.0-slim
    args: [ './scripts/build_image_and_create_spec_file_with_template_build.sh' ]
    env:
      - 'PROJECT_ID=$PROJECT_ID'
      - 'LOCATION=$LOCATION'
      - 'REPO_NAME=$_REPO_NAME'
      - 'IMAGE_NAME=$_IMAGE_NAME'
      - 'IMAGE_TAG=$_IMAGE_TAG'
      - 'METADATA_TEMPLATE_FILE_PATH=$_METADATA_TEMPLATE_FILE_PATH'
      - 'SDK_LANGUAGE=$_SDK_LANGUAGE'
      - 'FLEX_TEMPLATE_BASE_IMAGE=$_FLEX_TEMPLATE_BASE_IMAGE'
      - 'METADATA_FILE=$_METADATA_FILE'
      - 'JAR=$_JAR'
      - 'FLEX_TEMPLATE_JAVA_MAIN_CLASS=$_FLEX_TEMPLATE_JAVA_MAIN_CLASS'

The fat jar is generated in the first step and given to the flex-template command.

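⚠️ The fat jar should be built with the same Java version as the one used by the Flex Template (base Docker image and Template creation). Java 11 is used everywhere in this example : the maven:3.8.6-openjdk-11-slim build image and the JAVA11 Flex Template base image.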
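
To guard against a version mismatch, the build could compare the Java major version of the build image with the one expected by the base image. The function below is only a sketch of the parsing part ; java_major_version is a hypothetical helper, and it assumes version strings in the usual "11.0.18" or legacy "1.8.0_292" formats.

```shell
#!/usr/bin/env bash

# Hypothetical helper: extracts the Java major version from a version string.
# Handles the modern format ("11.0.18") and the legacy one ("1.8.0_292").
java_major_version() {
  local version="$1"
  # Legacy versions start with "1.": drop that prefix first.
  if [[ "$version" == 1.* ]]; then
    version="${version#1.}"
  fi
  # Keep everything before the first ".", "_" or "-".
  echo "${version%%[._-]*}"
}
```

The version string itself could be read with something like java -version 2>&1 | awk -F'"' '/version/ {print $2}', then compared with 11 before calling the flex-template build command.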

The second step executes the script build_image_and_create_spec_file_with_template_build.sh :

#!/usr/bin/env bash

set -e
set -o pipefail
set -u

echo "#######Creating Dataflow Flex Template"

gcloud dataflow flex-template build "$METADATA_TEMPLATE_FILE_PATH" \
--image-gcr-path "$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$IMAGE_TAG" \
--sdk-language "$SDK_LANGUAGE" \
--flex-template-base-image "$FLEX_TEMPLATE_BASE_IMAGE" \
--metadata-file "$METADATA_FILE" \
--jar "$JAR" \
--env FLEX_TEMPLATE_JAVA_MAIN_CLASS="$FLEX_TEMPLATE_JAVA_MAIN_CLASS"

  • Unlike in the previous section, this command generates the Docker image
  • The fat jar is given with the jar parameter : target/teams-league-0.1.0.jar
  • The flex-template-base-image parameter is mandatory to generate the Docker image : JAVA11
  • The FLEX_TEMPLATE_JAVA_MAIN_CLASS env variable must be passed : fr.groupbees.application.TeamLeagueApp
  • The other parameters play the same role as in the previous section : image-gcr-path (the counterpart of image), sdk-language and metadata-file
  • The command also creates the Flex Template spec file in GCS

This Cloud Build job is launched from the main branch on the given Github repository, via a manual Trigger :

gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="deploy-dataflow-template-team-league-java" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-deploy-template.yaml" \
--substitutions _REPO_NAME="internal-images",_IMAGE_NAME="dataflow/team-league-java",_IMAGE_TAG="latest",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_SDK_LANGUAGE="JAVA",_FLEX_TEMPLATE_BASE_IMAGE="JAVA11",_METADATA_FILE="config/metadata.json",_JAR="target/teams-league-0.1.0.jar",_FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp" \
--verbosity="debug"

6. The Beam job

The src folder contains the Beam job built by the pom.xml file.

This is the same job that I presented at the Beam Summit :

7. Run unit tests

The dataflow-run-tests.yaml Cloud Build file launches the unit tests with Maven :

steps:
  - name: maven:3.8.6-openjdk-11-slim
    script: |
      mvn test

  • The step runs in the official Maven image.
  • The mvn test command executes the unit tests.

This Cloud Build job is launched when changes are pushed to any branch of the Github repository, via an automatic Trigger :

gcloud beta builds triggers create github \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="launch-dataflow-unit-tests-team-league-java" \
--repo-name=dataflow-java-ci-cd \
--repo-owner=tosun-si \
--branch-pattern=".*" \
--build-config=dataflow-run-tests.yaml \
--include-logs-with-status \
--verbosity="debug"

8. Run the Flex Template and Dataflow job

The dataflow-run-job.yaml Cloud Build file runs the Flex Template and the job :

steps:
  - name: google/cloud-sdk:420.0.0-slim
    args: [ './scripts/run_dataflow_job.sh' ]
    env:
      - 'PROJECT_ID=$PROJECT_ID'
      - 'LOCATION=$LOCATION'
      - 'JOB_NAME=$_JOB_NAME'
      - 'METADATA_TEMPLATE_FILE_PATH=$_METADATA_TEMPLATE_FILE_PATH'
      - 'TEMP_LOCATION=$_TEMP_LOCATION'
      - 'STAGING_LOCATION=$_STAGING_LOCATION'
      - 'SA_EMAIL=$_SA_EMAIL'
      - 'INPUT_FILE=$_INPUT_FILE'
      - 'SIDE_INPUT_FILE=$_SIDE_INPUT_FILE'
      - 'TEAM_LEAGUE_DATASET=$_TEAM_LEAGUE_DATASET'
      - 'TEAM_STATS_TABLE=$_TEAM_STATS_TABLE'
      - 'JOB_TYPE=$_JOB_TYPE'
      - 'FAILURE_OUTPUT_DATASET=$_FAILURE_OUTPUT_DATASET'
      - 'FAILURE_OUTPUT_TABLE=$_FAILURE_OUTPUT_TABLE'
      - 'FAILURE_FEATURE_NAME=$_FAILURE_FEATURE_NAME'

  • The step runs in the official cloud-sdk image
  • It sets the env variables needed for the run part
  • The values of the env variables are substitutions
  • This step executes the run_dataflow_job.sh Shell script :

#!/usr/bin/env bash

set -e
set -o pipefail
set -u

echo "#######Run the Dataflow Flex Template pipeline"

gcloud dataflow flex-template run "$JOB_NAME-$(date +%Y%m%d-%H%M%S)" \
--template-file-gcs-location "$METADATA_TEMPLATE_FILE_PATH" \
--project="$PROJECT_ID" \
--region="$LOCATION" \
--temp-location="$TEMP_LOCATION" \
--staging-location="$STAGING_LOCATION" \
--parameters serviceAccount="$SA_EMAIL" \
--parameters inputJsonFile="$INPUT_FILE" \
--parameters inputFileSlogans="$SIDE_INPUT_FILE" \
--parameters teamLeagueDataset="$TEAM_LEAGUE_DATASET" \
--parameters teamStatsTable="$TEAM_STATS_TABLE" \
--parameters jobType="$JOB_TYPE" \
--parameters failureOutputDataset="$FAILURE_OUTPUT_DATASET" \
--parameters failureOutputTable="$FAILURE_OUTPUT_TABLE" \
--parameters failureFeatureName="$FAILURE_FEATURE_NAME"

This command uses the following parameters :

  • The Dataflow job name, suffixed with the current date and time
  • template-file-gcs-location : the Flex Template Json spec file, given by METADATA_TEMPLATE_FILE_PATH=gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json
  • project, region, temp-location and staging-location are the usual options for a Dataflow job
  • It’s worth noting that every custom pipeline option is passed with the parameters flag, as a key=value pair
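
When the list of custom options grows, the repeated --parameters flags can be generated from a list of key=value pairs. This is a sketch only : build_parameter_args and the PARAM_ARGS array are hypothetical names, not part of the repository scripts.

```shell
#!/usr/bin/env bash

# Hypothetical helper: fills the global array PARAM_ARGS with one
# "--parameters" flag followed by one key=value pair per argument.
build_parameter_args() {
  PARAM_ARGS=()
  local pair
  for pair in "$@"; do
    PARAM_ARGS+=(--parameters "$pair")
  done
}

# Possible usage in run_dataflow_job.sh (assumption, not the original script):
#   build_parameter_args "jobType=$JOB_TYPE" "teamStatsTable=$TEAM_STATS_TABLE"
#   gcloud dataflow flex-template run "$JOB_NAME-$(date +%Y%m%d-%H%M%S)" \
#     --template-file-gcs-location "$METADATA_TEMPLATE_FILE_PATH" \
#     "${PARAM_ARGS[@]}"
```

Using an array keeps each key=value pair as a single argument, even when a value contains spaces.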

This Cloud Build job is launched from the main branch on the given Github repository, via a manual Trigger :

gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="run-dataflow-template-team-league-java" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-run-job.yaml" \
--substitutions _JOB_NAME="team-league-java",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_TEMP_LOCATION="gs://mazlum_dev/dataflow/temp",_STAGING_LOCATION="gs://mazlum_dev/dataflow/staging",_SA_EMAIL="sa-dataflow-dev@gb-poc-373711.iam.gserviceaccount.com",_INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_teams_stats_raw.json",_SIDE_INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_team_slogans.json",_TEAM_LEAGUE_DATASET="mazlum_test",_TEAM_STATS_TABLE="team_stat",_JOB_TYPE="team_league_java_ingestion_job",_FAILURE_OUTPUT_DATASET="mazlum_test",_FAILURE_OUTPUT_TABLE="job_failure",_FAILURE_FEATURE_NAME="team_league" \
--verbosity="debug"

9. Focus on Cloud Build Triggers

In this part we will show the generated Cloud Build Triggers in the console.

Before creating Triggers, we need to connect Cloud Build to a Github repository :

Then run a gcloud command to create a Trigger, as shown in the previous sections. Example for the manual Trigger that deploys the Flex Template :

gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="deploy-dataflow-template-team-league-java" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-deploy-template.yaml" \
--substitutions _REPO_NAME="internal-images",_IMAGE_NAME="dataflow/team-league-java",_IMAGE_TAG="latest",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_SDK_LANGUAGE="JAVA",_FLEX_TEMPLATE_BASE_IMAGE="JAVA11",_METADATA_FILE="-",_JAR="target/teams-league-0.1.0.jar",_FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp" \
--verbosity="debug"

10. Useful links

All the code shared in this article is accessible from my Github repository :

Conclusion

Flex Template is a very interesting feature to standardize the deployment of Dataflow jobs. It’s based on a Docker image and a Cloud Storage bucket.

No matter the Beam language and sdk, the way to deploy is the same.

It is also very important to correctly understand the two possible approaches :

  • One with a Dockerfile and all the dependencies installed in the container
  • The other with the flex-template command to generate the Docker image and spec file in GCS

We used Cloud Build in this example for its simplicity, its lightweight aspect and its serverless approach, but the tasks, logic and gcloud commands could easily be reused in other CI CD tools like Gitlab CI or Github Actions.

In another article, we will present the same use case with the Apache Beam Python SDK, to confirm the benefits of the standardization offered by Flex Template.

If you like my articles and want to see my posts, follow me on :

- Medium
- Twitter
- LinkedIn

Mazlum Tosun
Google Cloud - Community

GDE Cloud | Head of Data & Cloud GroupBees | Data | Serverless | IAC | Devops | FP