Automate the execution of BigQuery queries with Cloud Workflows

Handle repetitive queries with scheduled invocations and combine simple queries into complex DML statements by using Cloud Workflows instead of BigQuery Scripting.

SQL pipelines! In this article, we will explore how to use Workflows connectors to build BigQuery query pipelines: execute a query, retrieve the query results, loop through the result set, and launch subsequent queries for each row.

You can also reuse these techniques for other business logic to build a SQL pipeline that combines several consecutive steps.

Introduction

We are going to look at a problem that is currently not supported out of the box by BigQuery and is frequently asked about on Stack Overflow.

Is there a way to run scheduled, dynamic DML queries on wildcard tables, such as a Firebase Analytics dataset?

People often want to clean up unused events from their dataset to get smaller tables and lower storage costs.

By trying to execute something like:
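(the WHERE clause below is illustrative; the point is the wildcard table reference)

```sql
DELETE FROM `my-project.analytics_1234567890.*`
WHERE event_name = 'user_engagement'
```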

We get:
DML over table my-project.analytics_1234567890.* is not supported

As you can see, there is a * (star) in the table reference. DML statements over wildcard tables are not supported, although the same query runs fine on a single table. Let's automate this.

To turn this into a working solution, we will use Cloud Workflows.

What is Cloud Workflows?

In a nutshell, Workflows lets you connect services together: anything that has a public API.

  • Workflow orchestration (step engine as a service)
  • Integrate any Google Cloud API, SaaS API, or private APIs
  • Out of the box authentication support for Google Cloud products
  • Fully managed service — requires no infrastructure or capacity planning
  • Serverless with Pay-per-use pricing model
  • Declarative workflow language using YAML syntax
  • Developer friendly built-in error handling with retries

Workflow Steps for our problem

  1. Run a query to get the table names
  2. Loop for each table to launch a task for it
  3. Execute the parameterized task for each result

Main Workflow definition

Run a dynamic query and subsequent queries for each result.
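The complete, working definition is in the gist linked at the end of the article; below is a sketch of the main workflow, where the dataset name and the __TABLES__ filter are illustrative:

```yaml
main:
  steps:
    - get_tables:
        # List the daily event tables of the dataset (illustrative filter).
        call: BQ_Query
        args:
          query: >-
            SELECT table_id
            FROM `my-project.analytics_1234567890.__TABLES__`
            WHERE table_id LIKE 'events_%'
        result: tables
    - process_tables:
        # Run the parameterized DML task for every returned table name.
        call: BQ_Results_LoopItems
        args:
          items: ${tables.rows}
        result: outcome
    - finished:
        return: ${outcome}
```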

Our main workflow calls the BQ_Query and BQ_Results_LoopItems subworkflows. Let's see how they are defined.

Workflow to execute a query and return results
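Below is a sketch of the BQ_Query subworkflow, built on the BigQuery jobs.query connector; the project ID is read from the built-in environment variable:

```yaml
BQ_Query:
  params: [query]
  steps:
    - run_query:
        # Connector call: runs the query and waits for the job to finish.
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          body:
            useLegacySql: false
            query: ${query}
        result: query_result
    - return_result:
        return: ${query_result}
```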

As you may notice, the call action is a Workflows connector; more about its parameters can be found in the Cloud Workflows connector documentation and in the BigQuery API reference.

Workflow to loop over a result set and run a task
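Below is a sketch of the BQ_Results_LoopItems subworkflow; the step names match the description that follows:

```yaml
BQ_Results_LoopItems:
  params: [items]
  steps:
    - init:
        assign:
          - i: 0                           # current row index
    - check_condition:
        switch:
          - condition: ${len(items) > i}
            next: process_item
        next: exit_loop
    - process_item:
        # The first column of the current row is the table name.
        call: BQ_Task
        args:
          table_id: ${items[i].f[0].v}
    - iterate:
        assign:
          - i: ${i + 1}
        next: check_condition
    - exit_loop:
        return: ${i}                       # number of processed tables
```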

This is a standard foreach loop template that you may have seen in Cloud Workflows. In the process_item step we call BQ_Task, a subworkflow that gets executed for each row of the result set.

BQ_Task requires the table_id parameter, which is the table name taken from the result set.

The previous step's output looks like this JSON structure:
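(a trimmed, illustrative sketch of the jobs.query response; the row values are just examples)

```json
{
  "kind": "bigquery#queryResponse",
  "schema": {
    "fields": [ { "name": "table_id", "type": "STRING" } ]
  },
  "rows": [
    { "f": [ { "v": "events_intraday_20210511" } ] },
    { "f": [ { "v": "events_intraday_20210512" } ] }
  ],
  "totalRows": "2",
  "jobComplete": true
}
```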

To extract the table names, we need to use the ${items[i].f[0].v} path in the JSON to get a table name such as events_intraday_20210511.

If the query returns multiple columns, you can address the additional columns by their zero-based index. So to get the 2nd column from the result set you need to use ${items[i].f[1].v}, where "1" is the column index.

We are almost done; we still need BQ_Task, the part that is specific to our problem:
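Here is a sketch, assuming we want to delete a single event type from each daily table; the dataset name and the WHERE clause are illustrative:

```yaml
BQ_Task:
  params: [table_id]
  steps:
    - run_dml:
        # Reuse BQ_Query; the DML now targets one concrete table.
        call: BQ_Query
        args:
          query: ${"DELETE FROM `my-project.analytics_1234567890." + table_id + "` WHERE event_name = 'user_engagement'"}
        result: dml_result
    - return_result:
        return: ${dml_result}
```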

As you can see, it takes table_id as a parameter and uses DML syntax on a single table, which runs successfully on BigQuery. You may also notice that it reuses the BQ_Query subworkflow.

Deploying and executing Cloud Workflows

To deploy your workflow, you need the YAML file, which is linked at the end of the article. You can deploy it using the Cloud Console, the API, or the gcloud command-line utility.
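For example, with gcloud (the workflow name, file name, and location are placeholders):

```
# Deploy the workflow from the YAML definition
gcloud workflows deploy bq-sql-pipeline \
  --source=workflow.yaml \
  --location=us-central1

# Trigger an execution manually
gcloud workflows execute bq-sql-pipeline --location=us-central1
```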

We recommend using VS Code, where you can set up the GCP Project Switcher extension and define IDE tasks to automate deploying, executing, and describing executions.

Once your workflow is deployed, you need to trigger an execution whenever you want to run the pipeline. This could be a Cloud Scheduler job, a manual execution triggered by your script, or an event-based setup using Eventarc or a Cloud Function trigger.

When you set up the invoker service account, you need to grant it permissions to run BigQuery jobs.
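For example, granting the BigQuery Job User role to the service account the workflow runs as (the account name and project are placeholders; depending on your DML statement you may also need data-level permissions on the dataset):

```
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:workflows-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"
```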

Wrap Up

If you want to check it out, here are some links:

Feel free to reach out to me on Twitter @martonkodok or read my previous posts on medium/@martonkodok

Complete YAML workflow definition:

https://gist.github.com/pentium10/e9ce523bd0be0c83e3d3cdb12026ace3
