Automate the execution of BigQuery queries with Cloud Workflows
Run repetitive queries on a schedule and combine simple queries into complex DML statements by using Cloud Workflows instead of BigQuery scripting.
SQL pipelines! In this article, we will explore how to use Workflows connectors to build BigQuery query pipelines: execute a query, retrieve its results, loop through the resultset, and launch subsequent queries for each row.
You can also reuse these techniques for other business logic to build SQL pipelines that combine several consecutive steps.
Introduction
We are going to start from a problem that is not supported out of the box by BigQuery and is frequently asked about on Stack Overflow.
Is there a way to run scheduled, dynamic DML queries on wildcard tables, such as a Firebase Analytics dataset?
Often people want to clean up unused events from the dataset to get smaller tables and lower storage costs.
By trying to execute something like:
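Here is a representative DELETE that removes an unused event across the whole dataset (the event name is just a placeholder for whichever unused event you want to purge):

```sql
-- Attempt to purge one unused event across all tables of the dataset;
-- BigQuery rejects DML over wildcard tables.
DELETE
FROM `my-project.analytics_1234567890.*`
WHERE event_name = 'user_engagement'
```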
We get: `DML over table my-project.analytics_1234567890.* is not supported`
As you can notice, there is a * (star) in the table reference. DML statements over wildcard tables are not supported, but the same query runs fine on a single table. Let's automate this.
To turn this into a working solution, we will use Cloud Workflows.
What is Cloud Workflows?
In a nutshell, Workflows allows you to connect services together: anything that has a public API.
- Workflow orchestration (step engine as a service)
- Integrate any Google Cloud API, SaaS API, or private APIs
- Out of the box authentication support for Google Cloud products
- Fully managed service — requires no infrastructure or capacity planning
- Serverless with Pay-per-use pricing model
- Declarative workflow language using YAML syntax
- Developer friendly built-in error handling with retries
Workflow Steps for our problem
- Run a query to get the table names
- Loop through the resultset and launch a task for each table
- Execute the parameterized task for each result
Main Workflow definition
Run a dynamic query and subsequent queries for each result.
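The complete definition is linked at the end of the article; a condensed sketch of the main workflow looks roughly like this (the metadata query, project, and dataset names are illustrative and should be adapted to your own dataset):

```yaml
main:
  steps:
    # Run a dynamic query that returns the table names we want to process.
    - get_tables:
        call: BQ_Query
        args:
          query: >-
            SELECT table_id
            FROM `my-project.analytics_1234567890.__TABLES__`
            WHERE table_id LIKE 'events_%'
        result: queryResult
    # Hand the resultset rows to the loop subworkflow.
    - loop_tables:
        call: BQ_Results_LoopItems
        args:
          items: ${queryResult.rows}
        result: loopResult
    - return_result:
        return: ${loopResult}
```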
Our main workflow definition relies on the `BQ_Query` and `BQ_Results_LoopItems` subworkflows. Let's see how they are defined.
Workflow to execute a query and return results
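A minimal sketch of `BQ_Query`, assuming the query runs in the workflow's own project (read from the built-in GOOGLE_CLOUD_PROJECT_ID environment variable):

```yaml
BQ_Query:
  params: [query]
  steps:
    # Call the BigQuery connector to run a standard SQL query synchronously.
    - run_query:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          body:
            query: ${query}
            useLegacySql: false
        result: queryResult
    # Return the full query response, including the rows field.
    - return_results:
        return: ${queryResult}
```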
As you may notice, the call action is a Workflows connector; more about its parameters can be found in the Cloud Workflows connector syntax documentation and in the BigQuery API reference.
Workflow to Loop a resultset and run a task
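Sketched with the classic index-based foreach pattern, the loop subworkflow could look like this (step and variable names other than `process_item` and `BQ_Task` are illustrative):

```yaml
BQ_Results_LoopItems:
  params: [items]
  steps:
    - init:
        assign:
          - i: 0
          - results: []
    # Keep looping while there are unprocessed rows left.
    - check_condition:
        switch:
          - condition: ${len(items) > i}
            next: process_item
        next: exit_loop
    # Run the parameterized task for the current row's table name.
    - process_item:
        call: BQ_Task
        args:
          table_id: ${items[i].f[0].v}
        result: task_result
    - assign_result:
        assign:
          - results: ${list.concat(results, task_result)}
          - i: ${i + 1}
        next: check_condition
    - exit_loop:
        return: ${results}
```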
This is a standard foreach loop template that you may have seen in Cloud Workflows. In the `process_item` step we call `BQ_Task`, a subworkflow that gets executed for each item of the resultset.
`BQ_Task` requires the `table_id` parameter, which is the table name taken from the resultset.
The previous step's output looks like this JSON structure:
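Trimmed to the relevant fields, the connector response has this shape (table names taken from the example dataset):

```json
{
  "jobComplete": true,
  "kind": "bigquery#queryResponse",
  "totalRows": "2",
  "rows": [
    { "f": [ { "v": "events_intraday_20210511" } ] },
    { "f": [ { "v": "events_intraday_20210512" } ] }
  ]
}
```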
To extract the table names, we use the `${items[i].f[0].v}` JSON path, which returns a table name such as events_intraday_20210511.
If the query returns multiple columns, you can address them by their zero-based index. To get the second column from the resultset, use `${items[i].f[1].v}`, where 1 is the column index.
We are almost done; we only need the `BQ_Task` subworkflow, which holds the logic specific to our problem:
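A sketch of it, with the project, dataset, and event name as placeholders for your own cleanup logic:

```yaml
BQ_Task:
  params: [table_id]
  steps:
    # Build a single-table DML statement from the table name and run it
    # through the BQ_Query subworkflow defined above.
    - run_dml:
        call: BQ_Query
        args:
          query: ${"DELETE FROM `my-project.analytics_1234567890." + table_id + "` WHERE event_name = 'user_engagement'"}
        result: dml_result
    - return_result:
        return: ${dml_result}
```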
As you can see, it takes `table_id` as a parameter and uses DML syntax on a single table, which runs successfully on BigQuery. You may also notice that it reuses the `BQ_Query` subworkflow.
Deploying and executing Cloud Workflows
To deploy your workflow, you need the YAML file, which you will find at the end of the article. You can deploy it using the Cloud Console, the API, or the gcloud command-line utility.
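For example, with gcloud (the workflow name and region are just examples):

```bash
# Deploy the workflow from its YAML definition
gcloud workflows deploy bq-sql-pipeline \
  --source=workflow.yaml \
  --location=us-central1

# Execute it and wait for the result
gcloud workflows run bq-sql-pipeline --location=us-central1
```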
We recommend using VS Code, where you can set up the GCP Project Switcher extension and define IDE tasks to automate deploying, executing, and describing executions.
Once your workflow is deployed, you need to trigger an execution whenever you want to run the pipeline. This could be a Cloud Scheduler job, a manual execution triggered by your script, or an event observer built with Eventarc or a Cloud Functions trigger.
When you set up the invoker service account, you need to grant it permissions to run BigQuery jobs.
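For example (the service account and project are placeholders): roles/bigquery.jobUser lets the account create query jobs, and roles/bigquery.dataEditor lets the DML statements modify table data.

```bash
# Allow the workflow's service account to run BigQuery jobs
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:workflow-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Allow the DML statements to modify table data
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:workflow-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"
```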
Wrap Up
In the meantime, if you want to explore further, here are some related links:
- Using Cloud Workflows to load Cloud Storage files into BigQuery
- Introducing “GCP Project Switcher” Extension for VS Code — My Weekend Project
- Run shell commands and orchestrate Compute Engine VMs with Cloud Workflows
- Firestore Backups the easy way with Cloud Workflows
Feel free to reach out to me on Twitter @martonkodok or read my previous posts on medium/@martonkodok
Complete YAML workflow definition.
https://gist.github.com/pentium10/e9ce523bd0be0c83e3d3cdb12026ace3