BigQuery Stored Procedures to Automate Snapshots at Dataset Level

Rohan Paithankar
Google Cloud - Community
3 min read · Oct 30, 2023

Written by Rohan Paithankar & Bihag Kashikar

With the growing use of data warehouse automation tools, the focus of data engineers is shifting towards DataOps: designing robust pipelines that deliver enhancements without the fear of breaking downstream analytics workflows. An important part of this is snapshotting datasets before pushing changes, so that they can be restored if something breaks.

Google’s white-paper on Continuous Data Integration introduces snapshot/clone as one of the best practices in the deployment workflow.

With the above as the motivation, we began work on one such use-case in BigQuery but found that it only offers snapshots at the table level. Our challenge was to snapshot a defined number of datasets, each with a varying number of tables, with the option to automate it as part of the deployment workflow. To overcome this challenge, we leveraged stored procedures in BigQuery.

Solution

BigQuery provides system-defined metadata views called INFORMATION_SCHEMA for every dataset. We utilise the TABLES view to obtain the list of tables in a particular dataset and assign it to a variable tables. Refer to the query below:

SET tables = (SELECT ARRAY_AGG(table_name) FROM `project_id.dataset_id.INFORMATION_SCHEMA.TABLES` WHERE table_type = 'BASE TABLE');
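The TABLES view also exposes a table_type column, which is why the query above filters on it: views and snapshots appear in the same listing as base tables. A quick way to inspect what the loop will iterate over (with the same placeholder names):

SELECT table_name, table_type
FROM `project_id.dataset_id.INFORMATION_SCHEMA.TABLES`
ORDER BY table_name;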

We then iterate through the tables array using a simple for-loop:

FOR table IN (SELECT * FROM UNNEST(tables) AS value)
DO
  -- loop body: create one snapshot per table (built up below)
END FOR;

For every value in the list of tables, we:

  • Create a fully qualified snapshot table name snapshot_name in a target snapshot dataset snapshots_dataset_id.
SET snapshot_name=CONCAT("project_id.snapshots_dataset_id.",table.value,"_SNAPSHOT_",CAST(CURRENT_DATE() AS STRING));
  • Obtain the fully qualified table name table_name from the loop variable table.value.
SET table_name=CONCAT("project_id.dataset_id.",table.value);
  • Run a dynamic SQL query using EXECUTE IMMEDIATE to pass the above variables as parameters.
EXECUTE IMMEDIATE FORMAT("""
CREATE SNAPSHOT TABLE `%s`
CLONE `%s`
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP, INTERVAL 14 DAY))""",snapshot_name,table_name);

Note: The expiration timestamp has been set to 14 days from the snapshot creation timestamp but can be modified as per the use case.
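For instance, keeping snapshots for only 7 days requires changing nothing but the interval; this sketch assumes the same snapshot_name and table_name variables as above:

EXECUTE IMMEDIATE FORMAT("""
CREATE SNAPSHOT TABLE `%s`
CLONE `%s`
OPTIONS (
-- shorter retention window; pick whatever suits your rollback needs
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP, INTERVAL 7 DAY))""",snapshot_name,table_name);

Omitting the OPTIONS clause altogether makes the snapshot fall back to the target dataset's default table expiration.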

Final stored procedure:

CREATE OR REPLACE PROCEDURE `project_id.dataset_id.snapshot_procedure_name`()
BEGIN

DECLARE tables ARRAY<STRING>;
DECLARE snapshot_name STRING;
DECLARE table_name STRING;

-- Get the list of tables in the dataset as an array
SET tables = (SELECT ARRAY_AGG(table_name) FROM `project_id.dataset_id.INFORMATION_SCHEMA.TABLES` WHERE table_type = 'BASE TABLE');

-- Iterate through all the tables in the dataset
FOR table IN (SELECT * FROM UNNEST(tables) as value)

DO

-- Set table & snapshot names
SET snapshot_name=CONCAT("project_id.snapshots_dataset_id.",table.value,"_SNAPSHOT_",CAST(CURRENT_DATE() AS STRING));
SET table_name=CONCAT("project_id.dataset_id.",table.value);

-- Create Snapshot
EXECUTE IMMEDIATE FORMAT("""
CREATE SNAPSHOT TABLE `%s`
CLONE `%s`
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP, INTERVAL 14 DAY))""",snapshot_name,table_name);

END FOR;

END;
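Once created, the procedure can be run manually, from a scheduled query, or from any client that can submit SQL, with a single CALL statement:

CALL `project_id.dataset_id.snapshot_procedure_name`();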

We can also parameterise the stored procedure to make it generic and callable from a CI/CD pipeline, for instance. Here is the parameterised version:

CREATE OR REPLACE PROCEDURE `project_id.dataset_id.snapshot_procedure_name`(project_id STRING, dataset_id STRING, snapshot_dataset_id STRING)
BEGIN
DECLARE tables ARRAY<STRING>;
DECLARE snapshot_name STRING;
DECLARE info_schema STRING;
DECLARE table_name STRING;
-- Create the fully qualified name
SET info_schema = CONCAT(project_id,'.',dataset_id,'.','INFORMATION_SCHEMA.TABLES');
-- Get the list of tables in the dataset as an array
EXECUTE IMMEDIATE FORMAT("""SELECT ARRAY_AGG(table_name) FROM `%s` WHERE table_type = 'BASE TABLE'""",info_schema)
INTO tables;

FOR table IN (SELECT * FROM UNNEST(tables) as value)
DO
-- Create fully qualified names for snapshot and table
SET snapshot_name=CONCAT(project_id,'.',snapshot_dataset_id,'.',table.value,'_',CAST(CURRENT_DATE() AS STRING));
SET table_name=CONCAT(project_id,'.',dataset_id,'.',table.value);
-- Create Snapshot
EXECUTE IMMEDIATE FORMAT("""
CREATE SNAPSHOT TABLE `%s`
CLONE `%s`
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP, INTERVAL 14 DAY))""",snapshot_name,table_name);

END FOR;
END;
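A pipeline step can then snapshot any dataset before deploying changes to it, for example by submitting a CALL statement through the bq command-line tool. The argument values below are placeholders:

CALL `project_id.dataset_id.snapshot_procedure_name`('project_id', 'dataset_id', 'snapshots_dataset_id');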

A parameterised stored procedure to restore the tables from snapshots would look something like this:

CREATE OR REPLACE PROCEDURE `project_id.dataset_id.restore_procedure_name`(project_id STRING, dataset_id STRING, snapshot_dataset_id STRING)
BEGIN
DECLARE tables ARRAY<STRING>;
DECLARE snapshot_name STRING;
DECLARE table_name STRING;
DECLARE latest_snapshot_date STRING;
DECLARE info_schema STRING;
DECLARE snapshot_info_schema STRING;
-- Create the fully qualified name
SET info_schema = CONCAT(project_id,'.',dataset_id,'.','INFORMATION_SCHEMA.TABLES');
-- Create the fully qualified name
SET snapshot_info_schema = CONCAT(project_id,'.',snapshot_dataset_id,'.','INFORMATION_SCHEMA.TABLE_SNAPSHOTS');
-- Find date of latest snapshot
EXECUTE IMMEDIATE FORMAT("""SELECT CAST(MAX(DATE(snapshot_time)) AS STRING) FROM `%s`""",snapshot_info_schema)
INTO latest_snapshot_date;
-- Get the list of tables in the main dataset as an array
EXECUTE IMMEDIATE FORMAT("""SELECT ARRAY_AGG(table_name) FROM `%s` WHERE table_type='BASE TABLE'""",info_schema)
INTO tables;
FOR table IN (SELECT * FROM UNNEST(tables) as value)
DO
-- Create fully qualified names for snapshot and table
SET snapshot_name=CONCAT(project_id,'.',snapshot_dataset_id,'.',table.value,"_",latest_snapshot_date);
SET table_name=CONCAT(project_id,'.',dataset_id,'.',table.value);

-- Restore from Snapshot
EXECUTE IMMEDIATE FORMAT("""
CREATE OR REPLACE TABLE `%s`
CLONE `%s`""",table_name,snapshot_name);

END FOR;
END;
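If a deployment breaks something downstream, rolling every table back to its latest snapshot is then a single call, again with placeholder arguments:

CALL `project_id.dataset_id.restore_procedure_name`('project_id', 'dataset_id', 'snapshots_dataset_id');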

This is just one of the possible solutions to automate dataset-level snapshots, and it works well for:

  • A limited number of datasets, each with a growing number of tables.
  • Teams that are SQL-oriented.
  • Scheduled execution of snapshot and restore.

To conclude, the above solution utilises only a single service, i.e. BigQuery, and hence has less operational overhead. It won't be a surprise if Google Cloud soon releases dataset-level snapshots out of the box.

A shoutout to Franklin White & Preetika Bhateja, who have proposed a more loosely coupled, event-driven solution using Pub/Sub and Cloud Functions, described in their article.
