Dynamically Insert Data from One Table to Another in Google BigQuery

Abhik Saha
4 min read · Apr 24, 2023


Imagine a scenario in which you need to transfer data from tables in your development/testing dataset to tables in your production dataset, but you lack permission to create new tables in the production dataset.

This is not an ideal setup: Google recommends using separate projects to isolate your test and production data. But for simplicity, we will use a single project with two different datasets.

image credits: dataedo.com

Since you do not have create-table access in your prod dataset, a simple CTAS (Create Table As Select) statement will not work. CTAS also poses a risk: it can change the schema of the prod table on which other measures and downstream processes depend, causing those processes to fail. You might also lose the existing data already residing in those prod tables.
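For illustration, the disallowed CTAS route would look like the statement below (using the example tables that appear later in this article). Even with the right permissions, it would silently replace the prod table’s schema and data with whatever the SELECT produces:

CREATE OR REPLACE TABLE `project-name.prod_dataset.all_orders_table` AS
SELECT *
FROM `project-name.manual_input.all_orders_table`;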

The only option is to use INSERT statements, and these can get exceptionally long for fact and measure tables.

If there are many tables whose data you need to copy, you will also have to write a lot of INSERT statements. That increases the chances of error in your code, and it takes an awfully long time.
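Written by hand, even a single table copy means spelling out every column on both sides; the column names below are purely hypothetical:

INSERT INTO `project-name.prod_dataset.all_orders_table`
  (ORDER_ID, ORDER_DATE, CUSTOMER_ID, ORDER_AMOUNT) -- ...and so on, for every column
SELECT ORDER_ID, ORDER_DATE, CUSTOMER_ID, ORDER_AMOUNT
FROM `project-name.manual_input.all_orders_table`;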

To overcome this, we can generate the inserts dynamically using BigQuery INFORMATION_SCHEMA and execute them inside an “EXECUTE IMMEDIATE” block.
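At its simplest, EXECUTE IMMEDIATE runs any SQL string assembled at runtime. A minimal sketch, reusing one of the example tables from the procedure below:

EXECUTE IMMEDIATE FORMAT(
  "SELECT COUNT(*) AS row_count FROM `%s.%s.%s`",
  'project-name', 'manual_input', 'all_orders_table');

The procedure below combines this idea with INFORMATION_SCHEMA to build and run one truncate-and-insert pair per table. Let me demonstrate with the example: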

CREATE OR REPLACE PROCEDURE `project-name.prod_dataset.sp_all_orders_table`()
BEGIN
DECLARE log_date STRING;
DECLARE custom_message STRING;
DECLARE v_project_name STRING;
DECLARE v_dataset_name STRING;
DECLARE v_table_name STRING;
DECLARE v_truncate_query STRING;
DECLARE final_query STRING;

BEGIN
SET custom_message = 'Error during dynamic inserts';

FOR source_tables IN
(
  /* A simple CTE listing the two identical tables whose data
     we need to copy. You can use a config table in its place. */
  WITH base_tables AS
  (
    SELECT 'project-name' AS project_name, 'manual_input' AS dataset_name, 'all_orders_table' AS table_name UNION ALL
    SELECT 'project-name' AS project_name, 'manual_input' AS dataset_name, 'hockey_data' AS table_name
  )
  SELECT project_name, dataset_name, table_name FROM base_tables
)
DO

  SET v_project_name = source_tables.project_name;
  SET v_dataset_name = source_tables.dataset_name;
  SET v_table_name = source_tables.table_name;
  SET v_truncate_query = FORMAT("""TRUNCATE TABLE `%s.%s.%s`;""", v_project_name, 'prod_dataset', v_table_name);

  -- Surface the generated TRUNCATE statement; useful for debugging, safe to remove.
  SELECT v_truncate_query;

  SET final_query = FORMAT("""INSERT INTO `%s.%s.%s` (%s) SELECT %s FROM `%s.%s.%s`;""",
    v_project_name,
    'prod_dataset',
    v_table_name,
    -- Destination column list, ordered by ordinal_position inside STRING_AGG
    -- (an ORDER BY in a plain subquery would not guarantee the aggregation order).
    (SELECT STRING_AGG(UPPER(column_name), ',' ORDER BY ordinal_position)
     FROM `project-name.prod_dataset.INFORMATION_SCHEMA.COLUMNS`
     WHERE table_name = v_table_name),
    -- Matching select list, SAFE_CASTing each column to the destination data type.
    (SELECT STRING_AGG(UPPER(CONCAT('SAFE_CAST(', column_name, ' AS ', data_type, ') AS ', column_name)), ',' ORDER BY ordinal_position)
     FROM `project-name.prod_dataset.INFORMATION_SCHEMA.COLUMNS`
     WHERE table_name = v_table_name),
    v_project_name,
    v_dataset_name,
    v_table_name);

  -- Surface the generated INSERT statement as well.
  SELECT final_query;

  EXECUTE IMMEDIATE v_truncate_query;
  EXECUTE IMMEDIATE final_query;

END FOR;

EXCEPTION WHEN ERROR THEN

SET log_date = CAST(CURRENT_DATETIME() AS STRING);

EXECUTE IMMEDIATE
  "INSERT INTO `project-name.manual_input.error_log_table` (LOG_DATE, PROJECT_NAME, DATASET_NAME, PROCEDURE_NAME, ERROR_STATEMENT_TEXT, ERROR_MESSAGE, CUSTOM_MESSAGE) VALUES (?,?,?,?,?,?,?)"
  USING log_date, 'project-name', 'prod_dataset', 'sp_all_orders_table', @@error.statement_text, @@error.message, custom_message;

END;
END;
The procedure runs successfully
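If you want to try it yourself, the procedure is invoked with a standard CALL statement:

CALL `project-name.prod_dataset.sp_all_orders_table`();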

Code Explanation

  1. It is a simple procedure in which a FOR loop reads the project name, dataset name, and table name from a CTE. We store these three inputs in three separate variables.
  2. Then we build the query that truncates the destination table and run it with a dynamic “EXECUTE IMMEDIATE” statement. This step is optional; since our prod dataset is refreshed daily, we can choose to truncate it.
  3. Then we form the dynamic insert statements. To do this, we read the prod dataset’s columns ordered by ordinal_position and ‘STRING_AGG’ them together with commas (,). We build the select list the same way, but there we ‘SAFE_CAST’ each column to prevent a type mismatch in case a value from the source does not match the destination table’s schema.
How the inserts look
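For the hockey_data table, the generated pair of statements would look roughly like this; the column names and types are hypothetical, since the real ones come from your table’s actual schema:

TRUNCATE TABLE `project-name.prod_dataset.hockey_data`;
INSERT INTO `project-name.prod_dataset.hockey_data` (PLAYER_NAME, TEAM, GOALS)
SELECT SAFE_CAST(PLAYER_NAME AS STRING) AS PLAYER_NAME,
       SAFE_CAST(TEAM AS STRING) AS TEAM,
       SAFE_CAST(GOALS AS INT64) AS GOALS
FROM `project-name.manual_input.hockey_data`;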

Please note that I am reading only the destination table’s columns, not the source’s. This prevents a mismatch if you later add new columns to your source tables.

4. Then we execute the truncate and insert statements consecutively.

5. This for loop continues until it exhausts all the tables in the source CTE.

6. To know more about the error handling block, read the article here. A possible schema for the error log table it writes to is sketched after this list.

7. The dynamic inserts are now complete.
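The logging insert in the exception handler assumes an error_log_table already exists. A minimal sketch of a matching table, assuming all-STRING columns, could be:

CREATE TABLE IF NOT EXISTS `project-name.manual_input.error_log_table`
(
  LOG_DATE STRING,             -- failure time, stored as a string by the procedure
  PROJECT_NAME STRING,
  DATASET_NAME STRING,
  PROCEDURE_NAME STRING,
  ERROR_STATEMENT_TEXT STRING, -- populated from @@error.statement_text
  ERROR_MESSAGE STRING,        -- populated from @@error.message
  CUSTOM_MESSAGE STRING
);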

Why Production Access is Restricted

Developers are typically not given write access to production tables for several reasons:

  1. Security: Production databases contain sensitive and critical data that needs to be protected from unauthorized access or changes. By restricting write access to production tables, organizations can minimize the risk of accidental or intentional data breaches, data loss, or corruption.
  2. Governance: Production databases are subject to various regulatory and compliance requirements that dictate who can access, modify, or delete data. By limiting write access to authorized personnel only, organizations can ensure compliance with these regulations and prevent potential legal or financial liabilities.
  3. Testing: Developers need a safe and controlled environment to test their code and validate its functionality before it goes into production. By creating separate development and testing environments that mirror the production environment, developers can experiment with different scenarios, troubleshoot issues, and refine their code without affecting live data.
  4. Collaboration: By separating development and production environments, developers can work independently on their tasks without interfering with each other’s work or causing conflicts. This allows for better collaboration and coordination among team members and facilitates the use of version control and other software development best practices.

Overall, restricting write access to production tables is a best practice that helps organizations protect their data, ensure compliance, and promote collaboration and testing in a safe and controlled environment.

Follow me on LinkedIn and Medium to get more content like this!
