Schedule Your Dataflow Batch Jobs With Cloud Scheduler

Zhong Chen
Feb 6, 2020 · 3 min read


I am glad that this tutorial has received a lot of attention from readers, so I have also published it on the GCP community tutorials site.

Cloud Dataflow is a managed service for handling both streaming and batch jobs. Streaming jobs only need to be launched once, and you don't have to worry about operating them afterwards. Batch jobs, however, usually need to be triggered on a schedule or when certain conditions are met.

In this post, I will show you how to leverage Cloud Scheduler to schedule your Dataflow batch jobs. You can find all the code in this repo.

First things first: to run your Dataflow jobs on a regular basis, you need to build Dataflow templates. Follow the instructions to create your templates and save them in a GCS bucket.

Upload Dataflow templates in a GCS bucket
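
If you manage your infrastructure with Terraform, the bucket that holds the templates can be defined there too. Here is a minimal sketch, assuming the zhong-gcp bucket name used later in this post; the resource name and location are placeholders you would adjust.

# GCS bucket that stores the Dataflow templates and temp files.
# "zhong-gcp" matches the gcsPath and tempLocation used in the scheduler below;
# bucket names are global, so replace it with your own.
resource "google_storage_bucket" "dataflow_templates" {
  name     = "zhong-gcp"
  location = "US"
}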

Once your templates are ready, you can set up Cloud Scheduler jobs to trigger them.

Set up your Cloud Scheduler job

If you use Terraform, here is an example of how to define a scheduler job.

data "google_project" "project" {}
resource "google_cloud_scheduler_job" "scheduler" {
name = "scheduler-demo"
schedule = "0 0 * * *"
# This needs to be us-central1 even if the app engine is in us-central.
# You will get a resource not found error if just using us-central.
region = "us-central1"


http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
oauth_token {
service_account_email = google_service_account.cloud-scheduler-demo.email
}

# need to encode the string
body = base64encode(<<-EOT
{
"jobName": "test-cloud-scheduler",
"parameters": {
"region": "${var.region}",
"autoscalingAlgorithm": "THROUGHPUT_BASED",
},
"environment": {
"maxWorkers": "10",
"tempLocation": "gs://zhong-gcp/temp",
"zone": "us-west1-a"
}
}
EOT
)
}
}
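
The scheduler authenticates to the Dataflow API with an OAuth token for the google_service_account.cloud-scheduler-demo service account referenced above. A minimal sketch of that service account and its IAM binding is shown below; the roles/dataflow.admin role is my assumption, and a narrower role such as roles/dataflow.developer may be enough for your setup.

# Service account that Cloud Scheduler uses to call the Dataflow API.
resource "google_service_account" "cloud-scheduler-demo" {
  account_id   = "cloud-scheduler-demo"
  display_name = "Cloud Scheduler Dataflow demo"
}

# Grant the service account permission to launch Dataflow jobs.
# roles/dataflow.admin is assumed here; adjust it to fit your own policies.
resource "google_project_iam_member" "scheduler_dataflow_admin" {
  project = var.project_id
  role    = "roles/dataflow.admin"
  member  = "serviceAccount:${google_service_account.cloud-scheduler-demo.email}"
}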

Afterwards, you are all set! Run the scheduler and watch it trigger your Dataflow job.

See the status of your jobs

Recap

  1. It is feasible to trigger a Dataflow batch job directly from Cloud Scheduler. It is easy and fast, and there is no need to use a Cloud Function for that.
  2. Cloud Scheduler jobs need to be created in the same region as your App Engine app. In your Terraform script, make sure you assign the right value to the region field: you need to use us-central1 if your App Engine app lives in us-central.
  3. Use the regional endpoint to specify the region of your Dataflow job. If you don’t explicitly set the location in the request, jobs will be created in the default region (us-central1).

I hope you find this post useful!



Zhong Chen

Big Data and Analytics Consultant @ Google GCP. Helping customers modernize their big data infrastructure.