How to use GCP Cloud Build to trigger a GCP Composer Dataflow job
I work as a data engineer at a fintech company. My manager assigned me the task of deploying a data pipeline on Google Cloud Platform.
The pipeline looks like the picture below.
general idea of data pipeline
When there is a pull request in the GitHub repository, GitHub triggers Cloud Build through a webhook that was set up beforehand. Cloud Build simply runs the Google Cloud SDK command gsutil rsync to sync the GitHub repo to the matching DAG folder in the Composer environment's Google Cloud Storage bucket.
In Cloud Build, I also added another command that triggers a job in Composer's Airflow once the sync to Cloud Storage has finished.
This Airflow job runs a simple Dataflow job, and the data is eventually saved to Google BigQuery.
There are lots of articles about syncing a GitHub repo to the DAG folder in Cloud Storage. This article focuses on how to use Cloud Build to trigger a Composer Airflow job.
Sync the GitHub repo to Google Cloud Storage through Cloud Build
You can easily use the web interface to set up a trigger for GitHub. However, most settings should live in YAML and JSON files in your repo. Here is my repo's cloudbuild.yaml file:
steps:
- name: gcr.io/cloud-builders/gsutil
  id: Sync github repo to DAGs folder
  args: ["-m", "rsync", "-r", "-d", "./dags", "gs://to/my/DAGfolder/in/google/storage"]
The equivalent command in the Google Cloud SDK is
gsutil -m rsync -r -d ./dags gs://to/my/DAGfolder/in/google/storage
Make sure to set up the _GSC_BUCKET environment variable, and check that the source folder (./dags) is in the current directory when the step runs.
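I got an error when I put rsync first in args. When I put -m first, it worked. The reason is that -m is a top-level gsutil option, so it has to come before the rsync subcommand, while -r and -d are options of rsync itself and come after it.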
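The ordering rule can be sketched in shell. -m is a top-level gsutil option (it runs operations in parallel), so it goes before the rsync subcommand, while -r and -d belong to rsync and follow it. The helper name build_rsync_cmd is made up for illustration, and the bucket path is the same placeholder as above; the function only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a gsutil rsync command with
# the options in the order gsutil expects.
build_rsync_cmd() {
  src="$1"   # local source folder, e.g. ./dags
  dst="$2"   # destination gs:// URL (placeholder)
  # Top-level option (-m) first, then the subcommand, then its own options.
  printf 'gsutil -m rsync -r -d %s %s\n' "$src" "$dst"
}

build_rsync_cmd ./dags gs://to/my/DAGfolder/in/google/storage
```

The words of the printed command map one-to-one onto the args list of the Cloud Build step, which is why "-m" has to be the first element there.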
Use Cloud Build to trigger GCP Composer
The official Cloud Build documentation does not say much about triggering Composer.
It only covers GKE, Cloud Run, App Engine, Cloud Functions, and Firebase.
I figured it could still be achieved, because I can trigger Composer from the Google Cloud SDK. In the YAML file, I added a step like this:
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: 'gcloud'
  args: ["composer", "environments", "run", "composer-cluster-name", "--location=your-location", "--project=project-ID", "trigger_dag", "--", "AIRFLOW_ID"]
This YAML is equivalent to the following Google Cloud SDK command:
gcloud composer environments run composer-cluster-name --location=your-location --project=project-ID trigger_dag -- AIRFLOW_ID
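As a sketch, the same command can be wrapped in a small shell function to try it outside Cloud Build. The function name, the DRY_RUN switch, and the argument order are assumptions for illustration; the environment name, location, project, and DAG ID are the same placeholders used above. One thing to watch: trigger_dag is the Airflow 1 CLI subcommand, and on Airflow 2 environments the equivalent is dags trigger, so the function takes the Airflow major version as an optional last argument.

```shell
#!/bin/sh
# Hypothetical wrapper around `gcloud composer environments run`.
# With DRY_RUN=1 (the default here) it only prints the command.
trigger_dag_cmd() {
  env_name="$1"; location="$2"; project="$3"; dag_id="$4"
  airflow_major="${5:-1}"   # assumed default: Airflow 1, as in the article
  if [ "$airflow_major" -ge 2 ]; then
    subcmd="dags trigger"   # Airflow 2 CLI
  else
    subcmd="trigger_dag"    # Airflow 1 CLI
  fi
  # $subcmd is left unquoted on purpose so "dags trigger" splits into two words.
  set -- gcloud composer environments run "$env_name" \
    "--location=$location" "--project=$project" $subcmd -- "$dag_id"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

trigger_dag_cmd composer-cluster-name your-location project-ID AIRFLOW_ID
```

Setting DRY_RUN=0 makes the function actually run gcloud, which requires an authenticated Cloud SDK and real values in place of the placeholders.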
When I triggered this build, it failed with an error.
It seems I had not set up the right permissions for Cloud Build.
Because Cloud Build uses a service account to access other services, go to the IAM page under IAM & Admin and add the Composer User role to cloudbuild-service-account@cloudbuild.gserviceaccount.com.
After that, the problem was solved.
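The same role can also be granted from the command line instead of the IAM page. This is only a sketch under assumptions: roles/composer.user is the role ID behind the Composer User role, the function name and the DRY_RUN switch are made up for illustration, and the project ID and service account below are placeholders to replace with your own values.

```shell
#!/bin/sh
# Hypothetical helper: grants the Composer User role to the Cloud Build
# service account. With DRY_RUN=1 (the default here) it only prints the command.
grant_composer_user() {
  project="$1"          # project ID (placeholder)
  service_account="$2"  # Cloud Build service account email (placeholder)
  set -- gcloud projects add-iam-policy-binding "$project" \
    "--member=serviceAccount:$service_account" \
    "--role=roles/composer.user"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

grant_composer_user project-ID cloudbuild-service-account@cloudbuild.gserviceaccount.com
```

Printing the command first makes it easy to double-check the member and role before changing the project's IAM policy.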
I did not cover the Dataflow and BigQuery parts. If you have questions, feel free to leave a comment.