Streaming Data Generator Dataflow Flex Template on GCP

Prathap Reddy
Google Cloud - Community
3 min read · Jul 16, 2020

We are excited to announce the launch of a new Dataflow Flex Template, Streaming Data Generator, which supports continuously writing high volumes of JSON messages to a Google Cloud Pub/Sub topic, BigQuery, or Cloud Storage, in various formats (JSON/AVRO/PARQUET) depending on the destination. In this blog post, we will briefly discuss the need for the template and how to use it.

Flex Templates

Before diving into the details of the Streaming Data Generator template, let me introduce Dataflow templates at a very high level:

The primary goal of templates is to package Dataflow pipelines as reusable artifacts that can be launched through various channels (UI / CLI / REST API), so that different teams can run them simply by supplying the required pipeline parameters. In the first version of templates (known as traditional templates), pipelines were staged on GCS and could be launched from the Console, the gcloud command, or other services such as Cloud Scheduler/Functions.
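For example, a traditional template staged on GCS can be launched with a single gcloud command. The sketch below uses the publicly hosted Word_Count template; the job name, region, and output bucket are placeholders for illustration:

# Launch a traditional (classic) template staged on GCS.
# Job name, region, and output bucket are placeholder values.
gcloud dataflow jobs run my-wordcount-job \
--gcs-location=gs://dataflow-templates/latest/Word_Count \
--region=us-central1 \
--parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://<your-bucket>/wordcount/output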

However, traditional templates have certain limitations:
1. Lack of support for dynamic DAGs
2. A large number of I/O connectors do not support runtime parameters

Flex templates overcome these limitations. A Flex template packages the Dataflow pipeline code along with its application dependencies as a Docker image and stages the image in Google Container Registry (GCR). The Flex template launcher service then launches the pipeline with the supplied parameters, in much the same way a user launches a pipeline during development.
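To illustrate, building a Flex Template image and spec file typically looks like the following sketch; the jar name, main class, image path, and metadata file are placeholders rather than values taken from this particular template:

# Package pipeline code and dependencies into a Docker image and
# write the template spec file to GCS. All paths are placeholders.
gcloud dataflow flex-template build gs://<your-bucket>/templates/my-template.json \
--image-gcr-path="gcr.io/<your-project>/samples/my-pipeline:latest" \
--sdk-language="JAVA" \
--flex-template-base-image=JAVA11 \
--jar="target/my-pipeline-bundled.jar" \
--env=FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.MyPipeline" \
--metadata-file="metadata.json"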

What is Streaming Data Generator?

The Streaming Data Generator template publishes fake JSON messages based on a user-provided schema, at a specified rate (messages per second), to a Google Cloud Pub/Sub topic, BigQuery, or Cloud Storage. The JSON Data Generator library used by the pipeline supports various faker functions that can be associated with schema fields. The pipeline also exposes configuration parameters to control the number of messages published per second (i.e. QPS) and the resources required (via autoscaling).
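As an illustration, a schema file might look like the sketch below; the field names and bucket path are made up, and the faker functions shown (uuid, ipv4, username, integer, bool) are among those supported by the library. The schema is staged on Cloud Storage so the pipeline can read it:

# Create an example schema that uses faker functions and stage it on GCS.
# Field names and the bucket path are placeholders for illustration.
cat > gameevent.json <<'EOF'
{
  "eventId": "{{uuid()}}",
  "ipv4": "{{ipv4()}}",
  "username": "{{username()}}",
  "score": {{integer(0, 100)}},
  "completed": {{bool()}}
}
EOF
gsutil cp gameevent.json gs://<your-bucket>/schemas/gameevent.json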

A few possible use cases are:

  1. Simulate large-scale real-time event publishing to a Pub/Sub topic to measure and determine the number and size of consumers required to process published events.
  2. Generate synthetic data to a BigQuery table or a Cloud Storage bucket to evaluate performance benchmarks or serve as a proof of concept.

The possibilities are endless in terms of the types of events you can generate.

How to run the pipeline?

The pipeline can be launched either from the Cloud Console or via the gcloud command.

To launch from the Cloud Console:

  1. Go to the Dataflow page in the Cloud Console.
  2. Click Create job from template.
  3. Select Streaming Data Generator from the Dataflow template drop-down menu.
  4. Enter a job name.
  5. Enter the required parameters, such as the schema location and QPS.
  6. Enter additional parameters such as sink details (depending on the sink type), output encoding, autoscalingAlgorithm, maxNumWorkers, etc.
  7. Click Run job.

To launch a pipeline that publishes messages to Cloud Pub/Sub using gcloud, use the command below:

gcloud beta dataflow flex-template run ${JOB_NAME} \
--project=${YOUR_PROJECT_ID} \
--region=${REGION_NAME} \
--template-file-gcs-location=gs://dataflow-templates/latest/flex/Streaming_Data_Generator \
--parameters \
schemaLocation=${SCHEMA_LOCATION},\
qps=${QPS},\
topic=${PUBSUB_TOPIC}
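Other sinks can be targeted by swapping the sink-specific parameters. As a rough sketch, assuming the template's sinkType and outputTableSpec parameters and placeholder project/dataset/table values, writing to BigQuery might look like:

# Sketch: same template, BigQuery sink. sinkType and outputTableSpec are
# assumed parameter names; replace the placeholders with your own values.
gcloud beta dataflow flex-template run ${JOB_NAME} \
--project=${YOUR_PROJECT_ID} \
--region=${REGION_NAME} \
--template-file-gcs-location=gs://dataflow-templates/latest/flex/Streaming_Data_Generator \
--parameters \
schemaLocation=${SCHEMA_LOCATION},\
qps=${QPS},\
sinkType=BIGQUERY,\
outputTableSpec=${YOUR_PROJECT_ID}:${DATASET}.${TABLE}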
