CI CD for Dataflow Java with Flex Templates and Cloud Build
1. Explanation of the use case presented in this article
The goal of this article is to show a complete use case with a CI/CD pipeline that can:
- Build an Apache Beam Java job
- Run the unit tests
- Deploy the Apache Beam / Dataflow job
- Make it possible to launch the job
All these steps will be orchestrated with Cloud Build.
The deployment of the Beam and Dataflow job is based on Dataflow Flex Templates, a way to standardize the deployment of Dataflow pipelines based on a Docker image.
There are two approaches to deploy a Flex Template:
- Use a Dockerfile with all the dependencies installed in the container, then create the spec file in a Cloud Storage bucket. In this case the flex-template build command only creates the spec file in GCS.
- Do not use a Dockerfile and let the flex-template build command generate the Docker image and create the spec file in GCS.
In the official Google Cloud documentation, we did not see an example with a Dockerfile for the Java SDK. The Dockerfile method is very interesting because it gives more flexibility to create the environment with all the needed dependencies, which is why we wanted to show both methods.
I also created a video on this topic on my GCP YouTube channel. Feel free to subscribe to the channel to support my work for the Google Cloud community:
English version
French version
2. Structure of the project
- The src folder contains the Beam Java job
- The config folder contains the Flex Template config and metadata
- The scripts folder contains the Shell scripts to deploy and run the Dataflow job and template
- dataflow-run-tests.yaml: Cloud Build file to launch unit tests
- dataflow-deploy-template.yaml: Cloud Build file to generate and build the Flex Template Docker image and create the spec file in Cloud Storage, without a Dockerfile
- dataflow-deploy-template-dockerfile-all-dependencies.yaml: Cloud Build file to build the Flex Template Docker image and create the spec file in Cloud Storage. This example contains a Dockerfile with all the needed dependencies installed in the container
- dataflow-run-job.yaml: Cloud Build file to run the template and Dataflow job
- Dockerfile: used by the Cloud Build file with a Dockerfile and all the dependencies installed in the container. The image is published in Artifact Registry
- pom.xml: the Maven pom file to build the Beam project, run unit tests and generate the fat jar
3. Set environment variables
Before creating the Cloud Build triggers, set the following environment variables in your shell session:
export PROJECT_ID={{project_id}}
export LOCATION={{location}}
export REPO_NAME=internal-images
export IMAGE_NAME="dataflow/team-league-java"
export IMAGE_TAG=latest
export METADATA_FILE="config/metadata.json"
export METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json"
export SDK_LANGUAGE=JAVA
export FLEX_TEMPLATE_BASE_IMAGE=JAVA11
export JAR=target/teams-league-0.1.0.jar
export FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp"
export JOB_NAME="team-league-java"
export TEMP_LOCATION=gs://mazlum_dev/dataflow/temp
export STAGING_LOCATION="gs://mazlum_dev/dataflow/staging"
export SA_EMAIL={{your_sa_email}}
export INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_teams_stats_raw.json"
export SIDE_INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_team_slogans.json"
export TEAM_LEAGUE_DATASET=mazlum_test
export TEAM_STATS_TABLE=team_stat
export JOB_TYPE=team_league_java_ingestion_job
export FAILURE_OUTPUT_DATASET=mazlum_test
export FAILURE_OUTPUT_TABLE=job_failure
export FAILURE_FEATURE_NAME=team_league
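Since the trigger commands in the next sections rely on these variables, it can help to check that none of them is missing before going further. The following is a minimal sketch, not part of the original repository: the missing_vars function name is an assumption made for illustration, and only a few variable names are checked in the usage example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original repository): prints the names of
# the given variables that are unset or empty in the current shell session.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Print the names without the leading space.
  echo "${missing# }"
}

# Usage: check a few of the variables exported above.
result="$(missing_vars PROJECT_ID LOCATION REPO_NAME IMAGE_NAME IMAGE_TAG)"
if [ -n "$result" ]; then
  echo "Missing variables: $result"
else
  echo "All required variables are set"
fi
```

The list passed to the function can be extended with any other variable from this section.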
4. Flex Template with a Dockerfile and the dependencies installed in the container
4.1 The Dockerfile
The Flex Template is based on a Docker image that starts the Dataflow job.
In this example, we use a Dockerfile and install all the dependencies in the container.
Note that the Flex Template container built using the Dockerfile is used only to create a job graph and start the Dataflow job. The packages installed in Flex Template containers are not available in the Beam container.
FROM maven:3.8.6-openjdk-11-slim AS build
ADD . /app
WORKDIR /app
USER root
RUN mvn clean package -Dmaven.test.skip=true
FROM gcr.io/dataflow-templates-base/java11-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
COPY --from=build /app/target/teams-league-0.1.0.jar ${WORKDIR}/target/
ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="${WORKDIR}/target/teams-league-0.1.0.jar"
ENTRYPOINT ["/opt/google/dataflow/java_template_launcher"]
The first stage builds the needed dependencies and creates the fat jar.
The second stage copies the fat jar from the first stage to its own working directory.
We need to specify:
- The Java main class with the FLEX_TEMPLATE_JAVA_MAIN_CLASS env variable
- The Java classpath, pointing to the fat jar, with the FLEX_TEMPLATE_JAVA_CLASSPATH env variable
4.2 Build the Dockerfile with all the dependencies and create the Flex Template spec file with Cloud Build
Here is the use case diagram of this method :
In this case, there is a Dockerfile with all the dependencies installed in the container.
The dataflow-deploy-template-dockerfile-all-dependencies.yaml Cloud Build file builds the Docker image from the Dockerfile, publishes it to Artifact Registry and creates the spec file in GCS:
steps:
  - name: google/cloud-sdk:420.0.0-slim
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        ./scripts/build_image_with_dockerfile.sh \
        && ./scripts/create_flex_template_spec_file_gcs.sh
    env:
      - 'PROJECT_ID=$PROJECT_ID'
      - 'LOCATION=$LOCATION'
      - 'REPO_NAME=$_REPO_NAME'
      - 'IMAGE_NAME=$_IMAGE_NAME'
      - 'IMAGE_TAG=$_IMAGE_TAG'
      - 'METADATA_TEMPLATE_FILE_PATH=$_METADATA_TEMPLATE_FILE_PATH'
      - 'SDK_LANGUAGE=$_SDK_LANGUAGE'
      - 'METADATA_FILE=$_METADATA_FILE'
The step is based on the gcloud SDK Docker image and executes two bash scripts:
build_image_with_dockerfile.sh:
#!/usr/bin/env bash
set -e
set -o pipefail
set -u
echo "#######Building Dataflow Flex Template Docker image with all the dependencies installed inside"
gcloud builds submit --tag "$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$IMAGE_TAG" .
This script builds the Docker image and publishes it to Artifact Registry with the gcloud builds submit command.
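The image tag passed to gcloud builds submit follows the Artifact Registry naming pattern used in the script. As an illustration only, the sketch below assembles that URI from its parts; the image_uri function name and the europe-west1 and my-project values are assumptions, not names from the original repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assembles the Artifact Registry image URI from
# $1 location, $2 project id, $3 repository, $4 image name and $5 tag,
# using the same pattern as the build script above.
image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with placeholder location and project values (assumptions).
image_uri "europe-west1" "my-project" "internal-images" "dataflow/team-league-java" "latest"
# prints europe-west1-docker.pkg.dev/my-project/internal-images/dataflow/team-league-java:latest
```

The same URI has to be given to the flex-template build command, otherwise the spec file would point to an image that does not exist.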
create_flex_template_spec_file_gcs.sh:
#!/usr/bin/env bash
set -e
set -o pipefail
set -u
echo "#######Creating image and spec file with flex-template-build"
gcloud dataflow flex-template build "$METADATA_TEMPLATE_FILE_PATH" \
--image "$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$IMAGE_TAG" \
--sdk-language "$SDK_LANGUAGE" \
--metadata-file "$METADATA_FILE"
⚠️ This script only creates the Flex Template spec file in the Cloud Storage bucket.
The path is given by the METADATA_TEMPLATE_FILE_PATH env variable, for example:
METADATA_TEMPLATE_FILE_PATH=gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json
This Cloud Build job is launched from the main branch of the given GitHub repository, via a manual trigger:
gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="deploy-dataflow-template-team-league-java-dockerfile" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-deploy-template-dockerfile-all-dependencies.yaml" \
--substitutions _REPO_NAME="internal-images",_IMAGE_NAME="dataflow/team-league-java",_IMAGE_TAG="latest",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_SDK_LANGUAGE="JAVA",_METADATA_FILE="config/metadata.json" \
--verbosity="debug"
- External values are passed to Cloud Build as substitutions (the entries prefixed with an underscore, such as _REPO_NAME)
- In the YAML file, env variables are then created with these substitutions as values
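The trigger commands in this article hard-code the substitution values, although the same values were exported as environment variables in section 3. As a sketch under that assumption, the hypothetical build_substitutions helper below (not part of the original repository) joins name and value pairs into the comma-separated list expected by the --substitutions flag. Values that contain a comma are not handled by this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns NAME VALUE pairs into the comma-separated
# _NAME=VALUE list expected by the --substitutions flag.
build_substitutions() {
  out=""
  while [ "$#" -ge 2 ]; do
    # User-defined substitutions are prefixed with an underscore.
    pair="_$1=$2"
    if [ -z "$out" ]; then
      out="$pair"
    else
      out="$out,$pair"
    fi
    shift 2
  done
  echo "$out"
}

# Example with two of the values used in this article.
build_substitutions REPO_NAME "internal-images" IMAGE_TAG "latest"
# prints _REPO_NAME=internal-images,_IMAGE_TAG=latest
```

The result could then be passed as --substitutions "$(build_substitutions REPO_NAME "$REPO_NAME" IMAGE_TAG "$IMAGE_TAG")" so that the trigger and the shell session share the same values.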
5. Flex Template without a Dockerfile and generation of the Docker image with flex-template command
Here is the use case diagram of this method :
In this case, there is no Dockerfile: the Docker image and the spec file are generated by the flex-template build command.
The dataflow-deploy-template.yaml Cloud Build file is used:
steps:
  - name: maven:3.8.6-openjdk-11-slim
    script: |
      mvn clean package
  - name: google/cloud-sdk:420.0.0-slim
    args: [ './scripts/build_image_and_create_spec_file_with_template_build.sh' ]
    env:
      - 'PROJECT_ID=$PROJECT_ID'
      - 'LOCATION=$LOCATION'
      - 'REPO_NAME=$_REPO_NAME'
      - 'IMAGE_NAME=$_IMAGE_NAME'
      - 'IMAGE_TAG=$_IMAGE_TAG'
      - 'METADATA_TEMPLATE_FILE_PATH=$_METADATA_TEMPLATE_FILE_PATH'
      - 'SDK_LANGUAGE=$_SDK_LANGUAGE'
      - 'FLEX_TEMPLATE_BASE_IMAGE=$_FLEX_TEMPLATE_BASE_IMAGE'
      - 'METADATA_FILE=$_METADATA_FILE'
      - 'JAR=$_JAR'
      - 'FLEX_TEMPLATE_JAVA_MAIN_CLASS=$_FLEX_TEMPLATE_JAVA_MAIN_CLASS'
The fat jar is generated in the first step and passed to the flex-template build command.
⚠️ The fat jar should be built with the same SDK and Java version as the one used in the Flex Template (Docker image and template creation). JAVA11 is used everywhere.
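One way to guard against a version mismatch is to read the major Java version of the build environment before packaging. The sketch below is only an illustration: the java_major_version function is hypothetical, and it assumes the usual first line printed by java -version, where the version appears between double quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extracts the major Java version from a line such as
# 'openjdk version "11.0.16" 2022-07-19' (first line of `java -version`).
java_major_version() {
  # Keep only the quoted version string.
  version="$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')"
  # Legacy scheme: 1.8.0_292 means Java 8, so drop the leading "1.".
  case "$version" in
    1.*) version="${version#1.}" ;;
  esac
  # The major version is everything before the first dot.
  echo "${version%%.*}"
}

java_major_version 'openjdk version "11.0.16" 2022-07-19'
# prints 11
```

In a build step, the input line could come from java -version 2>&1 | head -n 1, and the result could be compared with the 11 expected by the JAVA11 base image.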
The second step executes the build_image_and_create_spec_file_with_template_build.sh script:
#!/usr/bin/env bash
set -e
set -o pipefail
set -u
echo "#######Creating Dataflow Flex Template"
gcloud dataflow flex-template build "$METADATA_TEMPLATE_FILE_PATH" \
--image-gcr-path "$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$IMAGE_NAME:$IMAGE_TAG" \
--sdk-language "$SDK_LANGUAGE" \
--flex-template-base-image "$FLEX_TEMPLATE_BASE_IMAGE" \
--metadata-file "$METADATA_FILE" \
--jar "$JAR" \
--env FLEX_TEMPLATE_JAVA_MAIN_CLASS="$FLEX_TEMPLATE_JAVA_MAIN_CLASS"
- This command generates the Docker image, unlike in the previous section
- The fat jar is given with the jar parameter: target/teams-league-0.1.0.jar
- The flex-template-base-image parameter is mandatory to generate the Docker image: JAVA11
- The FLEX_TEMPLATE_JAVA_MAIN_CLASS env variable must be passed: fr.groupbees.application.TeamLeagueApp
- All the other parameters are mandatory, like in the previous section: image-gcr-path, sdk-language, metadata-file
- The command also creates the Flex Template spec file in GCS
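Because the jar parameter points to a file produced by the previous Maven step, a small guard can make a missing artifact fail early with a clear message. This is a minimal sketch with a hypothetical check_jar function, not something taken from the original scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns a non-zero status when the fat jar expected
# by the flex-template build command does not exist.
check_jar() {
  if [ ! -f "$1" ]; then
    echo "Fat jar not found: $1" >&2
    return 1
  fi
  echo "Fat jar found: $1"
}

# Usage: falls back to the jar path used in this article when JAR is unset.
check_jar "${JAR:-target/teams-league-0.1.0.jar}" || echo "Run 'mvn clean package' first"
```

Placed at the top of the deployment script, such a check would stop the build before any call to gcloud.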
This Cloud Build job is launched from the main branch of the given GitHub repository, via a manual trigger:
gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="deploy-dataflow-template-team-league-java" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-deploy-template.yaml" \
--substitutions _REPO_NAME="internal-images",_IMAGE_NAME="dataflow/team-league-java",_IMAGE_TAG="latest",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_SDK_LANGUAGE="JAVA",_FLEX_TEMPLATE_BASE_IMAGE="JAVA11",_METADATA_FILE="config/metadata.json",_JAR="target/teams-league-0.1.0.jar",_FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp" \
--verbosity="debug"
6. The Beam job
The src folder contains the Beam job, built with the pom.xml file.
This is the same job I presented at the Beam Summit:
7. Run unit tests
The dataflow-run-tests.yaml Cloud Build file runs the unit tests with Maven:
steps:
  - name: maven:3.8.6-openjdk-11-slim
    script: |
      mvn test
- The step is based on the official maven image
- The mvn test command executes the unit tests
This Cloud Build job is launched when changes are pushed to the GitHub repository on any branch, via an automatic trigger:
gcloud beta builds triggers create github \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="launch-dataflow-unit-tests-team-league-java" \
--repo-name=dataflow-java-ci-cd \
--repo-owner=tosun-si \
--branch-pattern=".*" \
--build-config=dataflow-run-tests.yaml \
--include-logs-with-status \
--verbosity="debug"
8. Run the Flex Template and Dataflow job
The dataflow-run-job.yaml Cloud Build file runs the Flex Template and the job:
steps:
  - name: google/cloud-sdk:420.0.0-slim
    args: [ './scripts/run_dataflow_job.sh' ]
    env:
      - 'PROJECT_ID=$PROJECT_ID'
      - 'LOCATION=$LOCATION'
      - 'JOB_NAME=$_JOB_NAME'
      - 'METADATA_TEMPLATE_FILE_PATH=$_METADATA_TEMPLATE_FILE_PATH'
      - 'TEMP_LOCATION=$_TEMP_LOCATION'
      - 'STAGING_LOCATION=$_STAGING_LOCATION'
      - 'SA_EMAIL=$_SA_EMAIL'
      - 'INPUT_FILE=$_INPUT_FILE'
      - 'SIDE_INPUT_FILE=$_SIDE_INPUT_FILE'
      - 'TEAM_LEAGUE_DATASET=$_TEAM_LEAGUE_DATASET'
      - 'TEAM_STATS_TABLE=$_TEAM_STATS_TABLE'
      - 'JOB_TYPE=$_JOB_TYPE'
      - 'FAILURE_OUTPUT_DATASET=$_FAILURE_OUTPUT_DATASET'
      - 'FAILURE_OUTPUT_TABLE=$_FAILURE_OUTPUT_TABLE'
      - 'FAILURE_FEATURE_NAME=$_FAILURE_FEATURE_NAME'
- The step is based on the official cloud-sdk image
- It sets the env variables needed for the run part
- The values of the env variables are substitutions
- This step executes the run_dataflow_job.sh Shell script:
#!/usr/bin/env bash
set -e
set -o pipefail
set -u
echo "#######Run the Dataflow Flex Template pipeline"
gcloud dataflow flex-template run "$JOB_NAME-$(date +%Y%m%d-%H%M%S)" \
--template-file-gcs-location "$METADATA_TEMPLATE_FILE_PATH" \
--project="$PROJECT_ID" \
--region="$LOCATION" \
--temp-location="$TEMP_LOCATION" \
--staging-location="$STAGING_LOCATION" \
--parameters serviceAccount="$SA_EMAIL" \
--parameters inputJsonFile="$INPUT_FILE" \
--parameters inputFileSlogans="$SIDE_INPUT_FILE" \
--parameters teamLeagueDataset="$TEAM_LEAGUE_DATASET" \
--parameters teamStatsTable="$TEAM_STATS_TABLE" \
--parameters jobType="$JOB_TYPE" \
--parameters failureOutputDataset="$FAILURE_OUTPUT_DATASET" \
--parameters failureOutputTable="$FAILURE_OUTPUT_TABLE" \
--parameters failureFeatureName="$FAILURE_FEATURE_NAME"
This command uses the following parameters:
- The Dataflow job name
- template-file-gcs-location: the Flex Template JSON file, METADATA_TEMPLATE_FILE_PATH=gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json
- All the other parameters are the usual pipeline options for a Dataflow job
- It’s worth noting that all the custom pipeline options are preceded by parameters
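The job name given to the run command is the base name followed by a timestamp, which keeps each launch unique. The sketch below reproduces that naming and adds a sanity check; both function names are hypothetical, and the check assumes that a job name may only contain lowercase letters, digits and hyphens, starting with a letter and ending with a letter or digit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: appends a timestamp to the base job name, using the
# same date format as the run script above.
unique_job_name() {
  echo "$1-$(date +%Y%m%d-%H%M%S)"
}

# Hypothetical check, assuming job names are limited to lowercase letters,
# digits and hyphens, start with a letter and end with a letter or digit.
is_valid_job_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z]([-a-z0-9]*[a-z0-9])?$'
}

name="$(unique_job_name "team-league-java")"
if is_valid_job_name "$name"; then
  echo "Job name looks valid: $name"
else
  echo "Job name looks invalid: $name"
fi
```

With this check, a base name such as team_league_java would be reported as invalid before the job is submitted, because of the underscores.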
This Cloud Build job is launched from the main branch of the given GitHub repository, via a manual trigger:
gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="run-dataflow-template-team-league-java" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-run-job.yaml" \
--substitutions _JOB_NAME="team-league-java",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_TEMP_LOCATION="gs://mazlum_dev/dataflow/temp",_STAGING_LOCATION="gs://mazlum_dev/dataflow/staging",_SA_EMAIL="sa-dataflow-dev@gb-poc-373711.iam.gserviceaccount.com",_INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_teams_stats_raw.json",_SIDE_INPUT_FILE="gs://mazlum_dev/team_league/input/json/input_team_slogans.json",_TEAM_LEAGUE_DATASET="mazlum_test",_TEAM_STATS_TABLE="team_stat",_JOB_TYPE="team_league_java_ingestion_job",_FAILURE_OUTPUT_DATASET="mazlum_test",_FAILURE_OUTPUT_TABLE="job_failure",_FAILURE_FEATURE_NAME="team_league" \
--verbosity="debug"
9. Focus on Cloud Build Triggers
In this part, we show the generated Cloud Build triggers in the console.
Before creating triggers, we need to connect Cloud Build to a GitHub repository:
Then run a gcloud command to create a trigger (see the previous sections). Example for the deployment of the Flex Template with a manual trigger:
gcloud beta builds triggers create manual \
--project=$PROJECT_ID \
--region=$LOCATION \
--name="deploy-dataflow-template-team-league-java" \
--repo="https://github.com/tosun-si/dataflow-java-ci-cd" \
--repo-type="GITHUB" \
--branch="main" \
--build-config="dataflow-deploy-template.yaml" \
--substitutions _REPO_NAME="internal-images",_IMAGE_NAME="dataflow/team-league-java",_IMAGE_TAG="latest",_METADATA_TEMPLATE_FILE_PATH="gs://mazlum_dev/dataflow/templates/team_league/java/team-league-java.json",_SDK_LANGUAGE="JAVA",_FLEX_TEMPLATE_BASE_IMAGE="JAVA11",_METADATA_FILE="config/metadata.json",_JAR="target/teams-league-0.1.0.jar",_FLEX_TEMPLATE_JAVA_MAIN_CLASS="fr.groupbees.application.TeamLeagueApp" \
--verbosity="debug"
10. Useful links
All the code shared in this article is available in my GitHub repository:
Conclusion
Flex Template is a very interesting feature to standardize the deployment of Dataflow jobs. It’s based on a Docker image and a Cloud Storage bucket.
No matter the Beam language and SDK, the way to deploy is the same.
It is also very important to correctly understand the two possible approaches:
- One with a Dockerfile and all the dependencies installed in the container
- The other with the flex-template build command to generate the Docker image and the spec file in GCS
We used Cloud Build in this example for its simplicity, lightweight aspect and serverless approach, but the tasks, logic and gcloud commands could easily be reused in other CI/CD tools like GitLab CI or GitHub Actions.
In another article, we will present the same use case with the Apache Beam Python SDK, to confirm the benefits of the standardization offered by Flex Template.