New Updates on Pub/Sub to BigQuery Dataflow Templates from GCP

Authors: Theodore Siu, Sameer Abhyankar

We are pleased to announce several new features in the Cloud Pub/Sub to BigQuery template, including support for subscriptions (!!!) as well as some error handling improvements. We detail these updates below.

Subscriptions y’all!

In the past, the Pub/Sub to BigQuery Dataflow template only supported reading messages from Pub/Sub topics via the inputTopic parameter. We have created a second Dataflow template that reads messages from Pub/Sub subscriptions via the inputSubscription parameter. The two templates appear on the Dataflow console under the CREATE JOB FROM TEMPLATE button as “Cloud Pub/Sub Subscription to BigQuery” and “Cloud Pub/Sub Topic to BigQuery”, and the code for generating both can be found on GitHub. One caveat of using subscriptions over topics: each message on a subscription is delivered only once, whereas a topic can fan out to multiple subscriptions and thus be read multiple times. The subscription template therefore cannot support multiple concurrent pipelines reading the same subscription, since those pipelines would split the messages between them instead of each receiving a full copy.

Pub/Sub to BigQuery templates are now delineated between subscriptions and topics

Once a message is read, whether from a subscription or a topic, the rest of the pipeline behaves mostly the same.

  1. The user specifies an existing BigQuery table for the input messages to land in, using the outputTableSpec parameter. As with all Dataflow pipelines, users also need to specify a Google Cloud Storage bucket location for writing/staging temp files.
  2. (Optional) If users want to modify their messages before BigQuery insertion, they may include a JavaScript transform file in Google Cloud Storage. The file location is specified using the javascriptTextTransformGCSPath parameter, and the name of the function within the JavaScript file is specified using the javascriptTextTransformFunctionName parameter.
  3. A dead-letter BigQuery table is automatically created to catch messages that fail for various reasons, including message schemas that do not match the BigQuery table schema, malformed JSON, and messages that throw errors in the JavaScript transform function. With our latest update we catch such errors more robustly; in the past, many of these errors would retry endlessly. Users can also specify their own dead-letter table using the outputDeadletterTable parameter (see GitHub for the dead-letter schema).
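As a sketch of step 2, a transform file might look like the following. The template passes each Pub/Sub message payload to the named function as a JSON string and expects a JSON string back; the function name (transform) and the field names here are hypothetical, and Dataflow's JavaScript engine expects ES5-style code.

```javascript
/**
 * Hypothetical UDF for the file referenced by javascriptTextTransformGCSPath.
 * Receives each Pub/Sub message payload as a JSON string and must return a
 * JSON string whose fields match the BigQuery table schema.
 */
function transform(inJson) {
  // JSON.parse throws on malformed JSON, which routes the message
  // to the dead-letter table described in step 3.
  var obj = JSON.parse(inJson);

  // Example modification: rename a field to match the table schema
  // (userId -> user_id; both names are hypothetical).
  obj.user_id = obj.userId;
  delete obj.userId;

  // Example enrichment: stamp the message with a processing time.
  obj.processed_at = new Date().toISOString();

  return JSON.stringify(obj);
}
```

To use it, you would upload the file to a Google Cloud Storage bucket and set javascriptTextTransformGCSPath to its gs:// path and javascriptTextTransformFunctionName to transform.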
Example template setup for a Pub/Sub Subscription to BigQuery Dataflow pipeline with an optional JavaScript function for Pub/Sub message modifications

Future Dataflow Pipelines + Features

Our Dataflow templates can be accessed on the Dataflow UI under the CREATE JOB FROM TEMPLATE button, or found open sourced on GitHub for further customization as users need (see this link for the Cloud Pub/Sub to BigQuery code). Keep a lookout for continued efforts on Dataflow templates from the GCP team.